Quantifying and Optimizing Performance of Distributed Deep Learning with Cloud Storage Buckets

Cloud platforms provide a powerful, low-cost environment for running distributed deep learning workloads. As problems scale up, how training data is stored and loaded becomes a significant concern. While cloud storage buckets appear to be a cost-effective option for storing large datasets, their bandwidth limitations can impose a non-trivial performance overhead on distributed training. We propose two approaches to compensate for this bandwidth limitation: caching and pre-fetching. Our project quantifies the performance and cost of these approaches and discusses their usefulness in existing cloud-based distributed deep learning systems. With these approaches, we achieve performance close to that of storing the data on each node, at a potentially lower cost (especially for models with long training times), while keeping only a fraction of the data on disk at a time.
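
To make the two approaches concrete, below is a minimal sketch of how a shard-level disk cache with background pre-fetching from a cloud storage bucket might look. It assumes the google-cloud-storage client library and a one-shard-per-file dataset layout; `BUCKET_NAME`, `CACHE_DIR`, `PrefetchingShardCache`, and the tuning constants are hypothetical illustrations, not names or values taken from the project itself.

```python
# Sketch of caching + pre-fetching for training shards stored in a
# cloud storage bucket. All names and constants here are hypothetical.
import os
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage  # assumes google-cloud-storage is installed

BUCKET_NAME = "my-training-data"   # hypothetical bucket name
CACHE_DIR = "/tmp/shard_cache"     # local scratch space on each node
MAX_CACHED = 4                     # shards kept on disk at any one time
PREFETCH_DEPTH = 2                 # shards downloaded ahead of training


class PrefetchingShardCache:
    """Serves shards from a local disk cache, downloading upcoming shards
    in the background so training reads from disk instead of the bucket.
    Intended to be driven by a single training-loop thread."""

    def __init__(self, shard_keys):
        self.shard_keys = shard_keys
        self.bucket = storage.Client().bucket(BUCKET_NAME)
        self.pool = ThreadPoolExecutor(max_workers=PREFETCH_DEPTH)
        self.inflight = {}            # key -> Future for a pending download
        self.cached = OrderedDict()   # key -> local path, in LRU order
        os.makedirs(CACHE_DIR, exist_ok=True)

    def _download(self, key):
        path = os.path.join(CACHE_DIR, key.replace("/", "_"))
        self.bucket.blob(key).download_to_filename(path)
        return path

    def _schedule(self, key):
        # Start a background download unless it is cached or in flight.
        if key not in self.inflight and key not in self.cached:
            self.inflight[key] = self.pool.submit(self._download, key)

    def _evict_oldest(self):
        # Drop least-recently-used shards to cap local disk usage.
        while len(self.cached) > MAX_CACHED:
            _, old_path = self.cached.popitem(last=False)
            os.remove(old_path)

    def get(self, index):
        """Return a local path for shard `index`; blocks only if the
        pre-fetcher has not finished downloading it yet."""
        key = self.shard_keys[index]
        self._schedule(key)
        # Kick off downloads of the next few shards while the caller
        # trains on this one.
        for step in range(1, PREFETCH_DEPTH + 1):
            if index + step < len(self.shard_keys):
                self._schedule(self.shard_keys[index + step])
        if key in self.inflight:
            self.cached[key] = self.inflight.pop(key).result()
        self.cached.move_to_end(key)
        self._evict_oldest()
        return self.cached[key]
```

A training loop would call `get(i)` for each shard index in its schedule. If `PREFETCH_DEPTH` is chosen so that a shard downloads faster than the previous shard trains, the bucket's bandwidth limit stays hidden behind computation while only `MAX_CACHED` shards ever occupy local disk, which is the trade-off the abstract describes.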

Last modified
  • 05/25/2021
Identifier
  • E-project-031021-125018
  • 5611
Year
  • 2021
Date created
  • 2021-03-10


Permanent link to this page: https://digital.wpi.edu/show/xs55mf90p