🏆 Best Paper Award (Honorable Mention)

Plebiscito: A Decentralized, Bandwidth-Aware Architecture for Distributed Learning Workloads

Authors: Andrea Pinto, Stefano Galantino, Fulvio Risso, & Flavio Esposito
Venue: IFIP Networking

Abstract

As large-scale AI training increasingly relies on distributed GPU clusters, data-center network bandwidth has become a critical bottleneck. Existing systems often overlook real-time link utilization during job placement, leading to suboptimal scheduling decisions that exacerbate congestion and increase Job Completion Time (JCT).

To address this gap, we introduce Plebiscito, a policy-based architecture that enables bandwidth-aware job placement in distributed AI training clusters. Using a distributed max-consensus auction protocol, nodes autonomously bid on incoming jobs based on local resource availability and network conditions.
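As a rough illustration of the idea of a max-consensus auction, the sketch below simulates nodes that each compute a local bid and then repeatedly exchange the best bid they know of with their neighbors until everyone agrees on the winner. This is not the Plebiscito implementation: the `Node` fields, the `local_bid` scoring rule (leftover bandwidth headroom), the synchronous rounds, and the tie-break on node ID are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    free_gpus: int
    free_bandwidth_gbps: float


def local_bid(node, gpus_needed, bandwidth_needed_gbps):
    """Hypothetical bid: 0 if the job does not fit, else bandwidth headroom + 1."""
    if node.free_gpus < gpus_needed:
        return 0.0
    if node.free_bandwidth_gbps < bandwidth_needed_gbps:
        return 0.0
    return node.free_bandwidth_gbps - bandwidth_needed_gbps + 1.0


def max_consensus_auction(nodes, neighbors, gpus_needed, bandwidth_needed_gbps):
    """Return, for every node, the (bid, winner_id) pair it ends up believing in.

    `neighbors` maps a node ID to the IDs it can talk to directly.
    On a connected graph, len(nodes) - 1 synchronous rounds are enough
    for the maximum bid to reach every node.
    """
    # Each node starts out knowing only its own bid.
    best = {
        n.node_id: (local_bid(n, gpus_needed, bandwidth_needed_gbps), n.node_id)
        for n in nodes
    }
    for _ in range(len(nodes) - 1):
        updated = {}
        for node_id, known in best.items():
            # Keep the largest (bid, node_id) pair among self and neighbors;
            # ties on the bid are broken in favor of the higher node ID.
            heard = [best[peer] for peer in neighbors[node_id]]
            updated[node_id] = max([known] + heard)
        if updated == best:  # nothing changed, so consensus has been reached
            break
        best = updated
    return best
```

In this toy version a winning bid of `0.0` would mean that no node could host the job, which a caller could treat as an allocation failure. No central scheduler is involved: each node only ever reads the state of its direct neighbors.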

We formulate this as a network utility maximization problem and prove that our decentralized algorithm achieves a (1 - 1/e) optimality bound. Experiments on a Kubernetes-based prototype and large-scale, trace-driven simulations show that Plebiscito reduces JCT, improves bandwidth utilization, and lowers allocation failure rates compared to bandwidth-agnostic baselines.
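If the (1 - 1/e) bound is read as the usual multiplicative approximation guarantee, it can be written as follows. The symbols are notation assumed here for illustration and are not taken from the paper: U is the network utility being maximized, x^alg is the allocation produced by the decentralized algorithm, and x^* is an optimal allocation.

```latex
\[
  U\!\left(x^{\mathrm{alg}}\right) \;\ge\; \left(1 - \frac{1}{e}\right) U\!\left(x^{\star}\right),
  \qquad 1 - \frac{1}{e} \approx 0.632
\]
```

Under this reading, the decentralized allocation would be guaranteed to reach at least about 63% of the optimal utility.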

Index Terms—Distributed Training, Resource Orchestration, Network Management, Data Center Systems, Bandwidth-Aware Allocation, Distributed Auctions.

System Architecture

Figure: Plebiscito system architecture. A decentralized architecture for bandwidth-aware job placement using distributed auctions.