Systems Papers - Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems
Hello new and old subscribers!
This is one of my first posts on Substack - if you have any feedback, I would love to hear it! Feel free to respond to this email with what is on your mind or if you have found any interesting papers recently.
This week’s paper is Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems.
The Perseus paper won a best paper award at FAST 2023 (the USENIX Conference on File and Storage Technologies) and describes a system for detecting fail-slow instances in Alibaba storage clusters. Fail-slow is a failure mode in which hardware fails non-obviously: rather than stopping outright, it keeps running with degraded performance, potentially getting steadily worse over time. At scale, this category of problem is extremely difficult to detect and can dramatically impact tail latency.
The paper review is best enjoyed on my blog.
Until next time,
Micah