




Speaker: Ruiting Zhou (周睿婷), Associate Professor, Wuhan University

Title: Online Scheduling of Heterogeneous Distributed Machine Learning Jobs


Abstract: Distributed machine learning (ML) has played a key role in today’s proliferation of AI services. A typical model of distributed ML partitions the training dataset over multiple worker nodes that update model parameters in parallel, adopting a parameter server architecture. ML training jobs are typically resource elastic: their completion time varies with the resource configuration. A fundamental problem in a distributed ML cluster is how to exploit the demand elasticity of ML jobs and schedule them with different resource configurations, such that resource utilization is maximized and average job completion time is minimized. To address it, we propose an online scheduling algorithm that decides the execution time window and the number and type of concurrent workers and parameter servers for each job upon its arrival, with the goal of minimizing the weighted average completion time. Our online algorithm consists of (i) an online scheduling framework that iteratively groups unprocessed ML training jobs into a batch, and (ii) a batch scheduling algorithm that configures each ML job to maximize the total weight of scheduled jobs in the current iteration. Our online algorithm guarantees a good parameterized competitive ratio with polynomial time complexity. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers in today’s AI cloud systems.


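Speaker bio: Ruiting Zhou is an associate researcher and master's supervisor at the School of Cyber Science and Engineering, Wuhan University. Zhou received a Ph.D. in computer science from the University of Calgary, Canada, in 2018, and was a postdoctoral researcher in the Department of Computer Science and Engineering at The Chinese University of Hong Kong. Research interests include optimization algorithms for cloud computing, 5G, the Internet of Things, machine learning, and federated learning. Zhou has published or had accepted 30 papers in top conferences and journals, including IEEE INFOCOM, ACM MOBIHOC, ICPP, IEEE/ACM ToN, IEEE TPDS, IEEE JSAC, and IEEE TMC. Zhou served as chair of the review committee for the IEEE INFOCOM Workshop ICCN in 2019, 2020, and 2021, and is a reviewer for a number of top conferences and journals, including IEEE MSN, Globecom, ICC, IEEE JSAC, TMC, and IEEE/ACM ToN. Awards include a Best Student Paper nomination at ACM GREENMETRICS 2016, a Best Paper nomination at the China Turing Conference (中國圖靈大會) 2019, and the Best Paper Award at BIGCOM 2020. Zhou has been selected for provincial-level talent programs, including the Wuhan Yellow Crane Talent program for outstanding young talent and the Jiangsu Province Distinguished Expert program for Northern Jiangsu Development.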
