Systems Papers - Defcon: Preventing Overload with Graceful Feature Degradation
Hello new and old subscribers!
This is one of my first posts on Substack - if you have any feedback, I would love to hear it! Feel free to respond to this email with what is on your mind, or with any interesting papers you have found recently.
This week’s paper is Defcon: Preventing Overload with Graceful Feature Degradation.
The Defcon paper describes how Meta built a system that lets it turn off features when its products are under heavy load, an approach called “graceful degradation”.
The implementation allows oncallers to respond to near-overload scenarios by gradually disabling features that are not critical to the business - for example, it is preferable to disable a “user status” feature than to have all of Instagram go down.
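To make the idea concrete, here is a minimal sketch of gradual degradation in Python. It is not Meta's implementation: the `Feature` class, the `criticality` scale, and the per-feature load relief numbers are all hypothetical, chosen only to illustrate turning off the least critical features first until load is back under a threshold.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    criticality: int  # hypothetical scale: higher means more important to the business
    enabled: bool = True


def degrade(features, load, threshold, relief):
    """Disable the least critical features one at a time until the
    estimated load is at or below the threshold.

    `relief` maps a feature name to the (assumed) fraction of load
    that is freed when that feature is turned off.
    """
    disabled = []
    for feature in sorted(features, key=lambda f: f.criticality):
        if load <= threshold:
            break  # enough headroom, stop degrading
        if feature.enabled:
            feature.enabled = False
            load -= relief.get(feature.name, 0.0)
            disabled.append(feature.name)
    return disabled, load
```

With a core feed, a suggestions module, and a “user status” feature, a load of 0.95 against a threshold of 0.80 would switch off user status and then suggestions, leaving the feed untouched.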
Defcon is also used to understand how much capacity a feature uses - by periodically turning a feature off, it is possible to visualize the differences in CPU and memory usage that correlate with the feature being off (this also connects closely to the ideas in Flux, another recently published Meta paper on capacity management).
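As a rough illustration of the capacity measurement idea, the sketch below estimates a feature's cost as the difference in mean CPU utilization between periods when it was on and periods when it was off. The function name and the sample format are my own assumptions, and a naive difference of means stands in for whatever analysis the paper actually uses; memory usage could be compared the same way.

```python
from statistics import mean


def estimate_feature_cost(samples):
    """Estimate a feature's CPU cost from (enabled, cpu_utilization) samples.

    Returns mean utilization while the feature was on minus mean
    utilization while it was off. Hypothetical helper, not from the paper.
    """
    on = [cpu for enabled, cpu in samples if enabled]
    off = [cpu for enabled, cpu in samples if not enabled]
    if not on or not off:
        raise ValueError("need samples from both the on and off states")
    return mean(on) - mean(off)
```

In practice the on and off periods would need to see comparable traffic for the difference to mean anything, which is one reason to toggle the feature periodically rather than once.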
The paper review is best enjoyed on my blog.
Discussion on:
Hacker News: https://news.ycombinator.com/item?id=36864764
Until next time,
Micah