Quantifying and Optimizing Performance of Distributed Deep Learning with Cloud Storage Buckets
Cloud platforms provide a powerful, low-cost environment for running distributed deep learning workloads. As problems scale up, the methods of storing and loading training data become a significant concern. While cloud storage buckets seem like a cost-effective option for large data storage, their bandwidth limitations can impose a non-trivial performance overhead for distributed training. We propose two approaches to compensate for this bandwidth limitation: caching and pre-fetching. Our project quantifies the performance and cost of these approaches, and discusses their usefulness in existing cloud-based distributed deep learning systems. With these approaches, we achieve performance close to that of storing data on each node, and potentially lower cost, especially for models with long training times—all while storing only a fraction of the data on disk at a time.
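The abstract does not specify an implementation, but the combination it describes—pre-fetching upcoming objects from a bucket in the background while caching only a bounded fraction of them on local disk—can be sketched as follows. This is a minimal illustration, not the project's actual code: `fetch_from_bucket` is a hypothetical stand-in for a slow bucket read, and the class names and parameters are assumptions for the example.

```python
import os
import queue
import shutil
import tempfile
import threading

def fetch_from_bucket(key):
    # Hypothetical stand-in for downloading an object from a cloud
    # storage bucket; in practice this is the slow, bandwidth-limited step.
    return f"payload-for-{key}".encode()

class PrefetchingCache:
    """Prefetch upcoming objects into a bounded local disk cache.

    At most `capacity` objects are kept on disk at once, mirroring the
    idea of storing only a fraction of the dataset locally.
    """

    def __init__(self, keys, capacity=4):
        self.keys = list(keys)
        self.cache_dir = tempfile.mkdtemp(prefix="bucket-cache-")
        self.ready = {k: threading.Event() for k in self.keys}
        self.todo = queue.Queue()
        for k in self.keys:
            self.todo.put(k)
        self.slots = threading.Semaphore(capacity)  # bounds disk usage
        self.worker = threading.Thread(target=self._prefetch, daemon=True)
        self.worker.start()

    def _path(self, key):
        return os.path.join(self.cache_dir, key)

    def _prefetch(self):
        # Background thread: fetch ahead of the consumer whenever a
        # cache slot is free, so training rarely waits on the network.
        while True:
            try:
                key = self.todo.get_nowait()
            except queue.Empty:
                return
            self.slots.acquire()
            with open(self._path(key), "wb") as f:
                f.write(fetch_from_bucket(key))
            self.ready[key].set()

    def read(self, key):
        """Block until `key` is cached, return its bytes, then evict it."""
        self.ready[key].wait()
        with open(self._path(key), "rb") as f:
            data = f.read()
        os.remove(self._path(key))   # evict after use (sequential epoch)
        self.slots.release()         # free the slot for the prefetcher
        return data

    def close(self):
        shutil.rmtree(self.cache_dir, ignore_errors=True)
```

The evict-after-read policy suits a single sequential pass over sharded training data; a real system revisiting shards across epochs would need a retention policy (e.g. LRU) instead.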
Permanent link to this page: https://digital.wpi.edu/show/xs55mf90p