11 Comments

Everything here makes sense; great summary. The only question I have concerns the controllers. My understanding is that they are constantly evaluating incoming traffic and then optimizing the system in various ways. I wonder how these controllers are able to compile and process so much data concurrently. I'd be interested to know if the paper includes any information about how the controllers do this. Is it based on aggregate data being constantly produced into view models, some sort of append-only log (e.g. Kafka), or maybe just best-effort estimates from periodic sampling? (At that huge a scale, sampling could statistically provide decently accurate approximations of the underlying reality.)

Thomas, I'm so glad you asked! I actually had a bunch of notes about the controllers because I went down a rabbit hole with them. The paper itself doesn't describe them much, unfortunately, and whatever notes I did have for them didn't make it into the article. Next time, I'll link to a separate snippet or page that can contain this info without taking up more article space.

One of the controllers is the Configerator, written about here: https://research.facebook.com/publications/holistic-configuration-management-at-facebook/

To answer your question: the controllers don't necessarily process the data themselves.

The paper says: "To ensure fault tolerance, the central controllers at the top of Figure 6 remain separate from the critical path of function execution. The controllers continuously optimize the system by periodically updating key configuration parameters, which are consumed by critical-path components like workers and schedulers. Since these configurations are cached by the critical-path components, they continue to execute functions using the current configurations even if the central controllers fail. However, when the central controllers are down, XFaaS would not be reconfigured in response to workload changes. For example, the traffic matrix for cross-region function dispatching won’t be updated even if the actual traffic has shifted. Typically, this does not lead to significant imbalance immediately and can withstand central controller downtime for tens of minutes."
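To make that caching pattern concrete, here's a minimal sketch of how a critical-path component might consume controller-published configs. All the names (`CachedConfig`, `fetch_fn`, `refresh_secs`) are my own invention, not from the paper; the point is just that a controller outage leaves the component serving from its last-known config.

```python
import time


class CachedConfig:
    """Hypothetical sketch: periodically refresh config from a central
    controller, but keep serving the cached copy if the controller fails."""

    def __init__(self, fetch_fn, refresh_secs=60):
        self._fetch = fetch_fn            # call out to the central controller
        self._refresh_secs = refresh_secs
        self._cached = None
        self._last_refresh = 0.0

    def get(self):
        now = time.monotonic()
        if now - self._last_refresh >= self._refresh_secs:
            try:
                # Periodically pull fresh parameters from the controller.
                self._cached = self._fetch()
                self._last_refresh = now
            except Exception:
                # Controller down: keep executing with the cached config.
                # The system just can't be *re*configured in the meantime.
                pass
        return self._cached
```

A worker would call `get()` on its critical path and never block on the controller being healthy, which matches the paper's claim that the system tolerates tens of minutes of controller downtime.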

For the Global Traffic Conductor: "The Global Traffic Conductor (GTC) in Figure 6 maintains a near-real-time global view of the demand (pending function calls) and supply (capacity of worker pools) across all regions. It periodically computes a traffic matrix where its element 𝑇𝑖𝑗 specifies the fraction of function calls that the schedulers in region 𝑖 should pull from region 𝑗. To compute the traffic matrix, the GTC starts by setting ∀𝑖,𝑇𝑖𝑖 = 1 and ∀𝑖 ≠ 𝑗,𝑇𝑖𝑗 = 0, meaning that all schedulers will only pull from DurableQs in their local region. However, this might lead to the workers in certain regions becoming overloaded. The GTC calculates the shift of traffic in those overloaded regions to their nearby regions until no region is overloaded or all regions are equally loaded. The GTC periodically distributes a new traffic matrix to all schedulers in all regions via the Configuration Management System in Figure 6. The schedulers then follow the traffic matrix to retrieve function calls from DurableQs in different regions."

From these snippets, we see that the controllers don't process each data item; instead, they run periodically and operate mostly on aggregate data. They push out configurations that are consumed by various parts of the system.
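For intuition, the GTC's traffic-matrix computation could be sketched roughly like this. The paper only describes it at a high level, so the function names, the greedy rebalancing heuristic, and the step size below are my assumptions (the real system shifts load to "nearby" regions; here I just pick the least-loaded one).

```python
def compute_traffic_matrix(demand, capacity, step=0.05, max_iters=1000):
    """Hypothetical sketch of the GTC's periodic computation.
    demand[j]:   pending function calls originating in region j
    capacity[i]: worker-pool capacity of region i
    Returns T where T[i][j] is the fraction of region j's calls that
    schedulers in region i should pull (columns sum to 1)."""
    n = len(demand)
    # Start with T_ii = 1 and T_ij = 0: every region pulls only
    # from the DurableQs in its own region.
    T = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    def load(i):
        # Effective load on region i's workers under the current matrix.
        return sum(T[i][j] * demand[j] for j in range(n)) / capacity[i]

    for _ in range(max_iters):
        loads = [load(i) for i in range(n)]
        hot = max(range(n), key=lambda i: loads[i])
        cool = min(range(n), key=lambda i: loads[i])
        if loads[hot] <= 1.0 or loads[hot] - loads[cool] < 1e-6:
            break  # no region overloaded, or loads are (nearly) equal
        # Shift a small slice of the hot region's local traffic to the
        # least-loaded region (stand-in for "nearby regions").
        delta = min(step, T[hot][hot])
        T[hot][hot] -= delta
        T[cool][hot] += delta
    return T
```

The resulting matrix would then be distributed to all schedulers via the Configuration Management System, exactly as in the quoted passage.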

Thanks so much for your question again. I'm always happy to answer and it makes my day to get great questions like yours.

> XFaaS has achieved a daily average CPU utilization of 66%, much better than the industry average.

What are the industry averages?

> These configs are cached in the components themselves, so if the central controllers fail, the system keeps running just fine (but it cannot be reconfigured).

I liked all the details about how they made the system resilient. Thanks for writing this article!

Great question! I'm glad you asked. The authors of the paper actually didn't say what the industry averages were, presumably because such figures are proprietary and anecdotal.

They did say that, based on the information they had, it was many "magnitudes higher" than public cloud averages.

Interestingly, this made me realize that I should qualify the competitive claims in the article: they were made by the authors of the paper, not by me. :)

I see, no worries. Just a question out of curiosity.

> many "magnitudes higher"

Wow, that makes sense, since their private cloud has some unique advantages, e.g. running different user workloads in the same process.

When will the paper be publicly available?

I'm not sure! But when it is publicly available, I'll re-comment here (so you get a notification) and also link it in the article directly.

All good, one of the authors was able to share it!

Just saw it too, thanks! I've added it into the article now.

Directing function call traffic across regions is high level stuff!

Amazing article, Leonardo: super detailed and technical, but somehow still understandable :)

Thanks
