Avoid redundant Megatron empty_cache and GC cleanup by FurtherAI · Pull Request #730 · OpenPipe/ART

FurtherAI · 2026-06-17T20:27:22Z

Summary

Remove redundant gc.collect() and torch.cuda.empty_cache() calls from the Megatron service, job, and step cleanup paths.
Keep CUDA cache clearing centralized for colocated weight offload only, after weights are offloaded.
Avoid dedicated-mode cleanup syncs between training jobs.
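
For reviewers, here is a minimal sketch of the cleanup policy described above. It is not the code in this PR: `OffloadCleanup` and its method names are hypothetical, and the cache-clearing call is injected so the sketch runs without a GPU. The only real API assumed is `torch.cuda.empty_cache()`, which is called only if PyTorch with CUDA is available.

```python
from typing import Callable, Optional


def _default_empty_cache() -> None:
    """Release cached CUDA blocks if PyTorch with CUDA is available."""
    try:
        import torch
    except ImportError:
        return  # no PyTorch installed: nothing to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


class OffloadCleanup:
    """Hypothetical helper: the single place that clears the CUDA cache."""

    def __init__(
        self,
        colocated: bool,
        empty_cache: Optional[Callable[[], None]] = None,
    ) -> None:
        self.colocated = colocated
        self._empty_cache = empty_cache or _default_empty_cache
        self.clears = 0  # number of cache clears performed so far

    def after_weight_offload(self) -> None:
        # Colocated mode hands GPU memory to another consumer, so cached
        # blocks are released once, right after the weights are offloaded.
        if not self.colocated:
            return  # dedicated mode: keep the allocator cache warm
        self._empty_cache()
        self.clears += 1

    def after_step(self) -> None:
        # Intentionally a no-op: no per-step gc.collect() or empty_cache().
        return

    def after_job(self) -> None:
        # Intentionally a no-op: no cleanup sync between training jobs.
        return
```

Keeping one call site makes it easy to see that dedicated mode never clears the cache, while colocated mode clears it exactly once per offload.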

Benchmarked Megatron throughput at max sequence length with CP4/EP4 Qwen3.5; the run completed without OOM, and throughput improved over previous results.

FurtherAI added 3 commits June 17, 2026 19:48

Optimize Megatron CUDA cache cleanup

de6b6d7

Remove Megatron colocated GC cleanup

f62bff0

Fix Megatron shared prefix gc import

86e12a6

FurtherAI marked this pull request as ready for review June 17, 2026 20:57

FurtherAI merged commit 81d8f9e into main Jun 17, 2026
5 checks passed