Micah Learns
Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
Jan 27
•
Micah Lerner
6
March 2024
[Paper Review] ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta
Hi everyone,
Mar 29, 2024
•
Micah Lerner
10
1
[Paper Review] A Cloud-Scale Characterization of Remote Procedure Calls
Hi everyone,
Mar 5, 2024
•
Micah Lerner
2
January 2024
Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
Hi everyone,
Jan 31, 2024
•
Micah Lerner
1
[Paper Review] XFaaS: Hyperscale and Low Cost Serverless Functions at Meta
Hi everyone,
Jan 24, 2024
•
Micah Lerner
2
[Paper Review] Efficient Memory Management for Large Language Model Serving with PagedAttention
Thanks for reading Micah Learns!
Jan 11, 2024
•
Micah Lerner
7
[Paper Review] Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications
Thanks for reading Micah Learns!
Jan 3, 2024
•
Micah Lerner
2
December 2023
2023 and looking forward to 2024
A tradition of mine is to write a year-end reflection, whether or not it makes it onto the blog - in this case it did :)
Dec 27, 2023
•
Micah Lerner
2
July 2023
Systems Papers - Defcon: Preventing Overload with Graceful Feature Degradation
Hello new and old subscribers!
Jul 25, 2023
1
Systems Papers - live tomorrow!
Hello new and old subscribers!
Jul 5, 2023
June 2023
Systems Papers - Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale
Hello new and old subscribers!
Jun 29, 2023
Systems Papers - TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators
Hello new and old subscribers!
Jun 7, 2023