🎓 Ph.D. Dissertation (2026)

ARCHITECTURES AND PROTOCOLS FOR ORCHESTRATION OF DISTRIBUTED LEARNING SYSTEMS

Author: Andrea Pinto

Institution: Saint Louis University

Advisor: Dr. Flavio Esposito

Abstract

Machine learning (ML) systems are rapidly growing in size and complexity. Computation has scaled through specialized accelerators and distributed clusters, but the network is now a performance bottleneck that limits model convergence speed. Resource orchestration must therefore account for synchronization delays and bandwidth contention, which increasingly dominate training time. Existing solutions are insufficient because they often orchestrate compute and network resources in isolation.

Both data centers and edge environments experience this strain. In this dissertation, we show that compute and network resources must be optimized jointly, and we develop joint orchestration and networking solutions for both data centers and the edge.

In data centers, existing orchestration systems lack network-aware controllers. To address this, we introduce Plebiscito, a joint resource- and network-aware scheduler. Plebiscito optimizes job placement and manages network resources natively. We show that this approach reduces network contention, significantly improving job completion time and cluster throughput for ML workloads.

Deploying distributed ML at the edge presents complementary challenges: edge devices have constrained compute and unstable connectivity. Motivated by Federated Learning (FL), we introduce systems to accelerate training and optimize network utilization. First, we exploit the programmable 5G stack for in-network aggregation: offloading parameter aggregation to the base station (gNB) in the radio access network cuts latency and improves time-to-accuracy. Second, we develop Split Federated Learning (SFL), a hybrid approach that partitions models across devices and servers to improve computation efficiency. Finally, we use Large Language Models (LLMs) to improve the selection of FL clients. These models adapt to dynamic conditions in real time, and we show that they outperform rigid traditional heuristics by leveraging client state and data availability.

Ultimately, these complementary mechanisms improve distributed ML performance. These results offer new network-aware orchestration solutions for both large-scale clusters and diverse wireless edge networks.

System Architecture

Orchestration architecture of distributed learning systems.