Continuous Checkpointing in Orbax and MaxText: Halves Checkpoint Gaps, Saves Hours on TPU Failures
On a dual-slice v5p-128 TPU cluster training Llama 3.1 70B, continuous checkpointing slashed P50 intervals from 100 steps to under 50—without tanking goodput. Here's why this async trick rewrites large-scale LLM training.
theAIcatchup · Apr 10, 2026 · 4 min read
⚡ Key Takeaways
Continuous checkpointing halves P50 intervals on v5p TPUs, slashing lost work on failures.
Async saves avoid DCN bottlenecks in multi-slice runs, scaling better than fixed schedules.
Orbax's policy flexibility—from min intervals to custom preservation—fits any training scale.