Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
newsletter.micahlerner.com
Hi everyone, Micah here. This week’s paper is one of several papers I’ll be reading from 2023’s Symposium on Operating Systems Principles (SOSP). Enjoy!