Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters

Abstract

The breakthrough of foundation models has made foundation model fine-tuning (FMF) workloads prevalent in modern GPU datacenters. However, existing schedulers tailored for model training do not account for the unique characteristics of FMs, making them inefficient at handling FMF workloads. To bridge this gap, we propose Ymir, a scheduler that improves the efficiency of FMF workloads in GPU datacenters. Ymir leverages the shared FM backbone architecture to expedite FMF workloads in two ways: (1) Ymir estimates the task transferability among different FMF workloads and automatically merges workloads that share the same FM into one, improving cluster-wide efficiency via transfer learning. (2) Ymir reuses the fine-tuning runtime across FMF workloads to reduce the otherwise significant context-switch overhead. We conduct 32-GPU physical experiments and 240-GPU trace-driven simulations to validate the effectiveness of Ymir. Ymir reduces the average job completion time by up to 4.3× compared with existing state-of-the-art schedulers. It also promotes scheduling fairness by fully exploiting task transferability. Supplementary materials are available on our project website: https://sites.google.com/view/ymir-project.
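The abstract describes these two mechanisms only at a high level; the minimal Python sketch below illustrates the general idea, not Ymir's actual algorithm or interfaces. The names `FMFJob`, `transferability`, and `load_runtime`, as well as the merge threshold, are illustrative assumptions: jobs are grouped by their shared FM backbone and greedily merged when a (stubbed) transferability score clears the threshold, and a warm runtime is cached per backbone so a context switch does not have to reload the model.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FMFJob:
    job_id: str
    backbone: str  # shared foundation model, e.g. "vit-b-16" (illustrative)
    task: str      # downstream fine-tuning task


def transferability(task_a: str, task_b: str) -> float:
    """Placeholder transferability score in [0, 1]; Ymir derives this from
    the workloads themselves, so this stub is purely for illustration."""
    return 1.0 if task_a == task_b else 0.5


def merge_by_backbone(pending: list[FMFJob], threshold: float = 0.8) -> list[list[FMFJob]]:
    """Group jobs that share an FM backbone, then greedily merge jobs whose
    pairwise transferability clears the threshold, so one fine-tuning run
    can serve several downstream tasks via transfer learning."""
    by_backbone: dict[str, list[FMFJob]] = defaultdict(list)
    for job in pending:
        by_backbone[job.backbone].append(job)

    merged: list[list[FMFJob]] = []
    for jobs in by_backbone.values():
        groups: list[list[FMFJob]] = []
        for job in jobs:
            for group in groups:
                if all(transferability(job.task, j.task) >= threshold for j in group):
                    group.append(job)
                    break
            else:
                groups.append([job])
        merged.extend(groups)
    return merged


_runtime_cache: dict[str, dict] = {}  # backbone -> warm fine-tuning runtime


def load_runtime(backbone: str) -> dict:
    """Stand-in for the expensive path: spawning workers and loading the
    FM checkpoint onto GPUs."""
    return {"backbone": backbone}


def get_runtime(backbone: str) -> dict:
    """Reuse a warm runtime for the same FM instead of tearing it down and
    reloading weights on every context switch."""
    if backbone not in _runtime_cache:
        _runtime_cache[backbone] = load_runtime(backbone)
    return _runtime_cache[backbone]


if __name__ == "__main__":
    jobs = [FMFJob("j1", "vit-b-16", "cifar10"),
            FMFJob("j2", "vit-b-16", "cifar10"),
            FMFJob("j3", "llama-7b", "summarization")]
    # j1 and j2 share a backbone and transfer well, so they merge; j3 runs alone.
    print(merge_by_backbone(jobs))
```

With the placeholder score, the two ViT jobs merge into a single fine-tuning run while the LLaMA job stays separate; in the actual system, the transferability estimate and the merge decision come from Ymir's scheduler rather than a fixed heuristic.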

Publication
Proceedings of the ACM International Conference on Supercomputing 2024 (ICS'24)
Weiming Zhuang
Research Scientist

My current research interests include vision foundation models, federated learning, computer vision, and machine learning systems.