Quantifying and Optimizing Performance of Distributed Deep Learning with Cloud Storage Buckets
Cloud platforms provide a powerful, low-cost environment for running distributed deep learning workloads. As problems scale up, the methods of storing and loading training data become a significant concern. While cloud storage buckets seem like a cost-effective option for large data storage, their bandwidth limitations can impose a non-trivial performance overhead for distributed training. We propose two approaches to compensate for this bandwidth limitation: caching and pre-fetching. Our project quantifies the performance and cost of these approaches, and discusses their usefulness in existing cloud-based distributed deep learning systems. With these approaches, we achieve performance close to that of storing data on each node, and potentially lower cost, especially for models with long training times—all while storing only a fraction of the data on disk at a time.
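The abstract does not specify an implementation, but the combination it describes—pre-fetching upcoming objects from a bucket in the background while caching only a bounded fraction of them on local disk—can be sketched as follows. This is a minimal illustration, not the project's actual code: `fetch_from_bucket` is a hypothetical stand-in for a slow bucket read, and the class names and parameters are assumptions for the example.

```python
import os
import queue
import shutil
import tempfile
import threading

def fetch_from_bucket(key):
    # Hypothetical stand-in for downloading an object from a cloud
    # storage bucket; in practice this is the slow, bandwidth-limited step.
    return f"payload-for-{key}".encode()

class PrefetchingCache:
    """Prefetch upcoming objects into a bounded local disk cache.

    At most `capacity` objects are kept on disk at once, mirroring the
    idea of storing only a fraction of the dataset locally.
    """

    def __init__(self, keys, capacity=4):
        self.keys = list(keys)
        self.cache_dir = tempfile.mkdtemp(prefix="bucket-cache-")
        self.ready = {k: threading.Event() for k in self.keys}
        self.todo = queue.Queue()
        for k in self.keys:
            self.todo.put(k)
        self.slots = threading.Semaphore(capacity)  # bounds disk usage
        self.worker = threading.Thread(target=self._prefetch, daemon=True)
        self.worker.start()

    def _path(self, key):
        return os.path.join(self.cache_dir, key)

    def _prefetch(self):
        # Background thread: fetch ahead of the consumer whenever a
        # cache slot is free, so training rarely waits on the network.
        while True:
            try:
                key = self.todo.get_nowait()
            except queue.Empty:
                return
            self.slots.acquire()
            with open(self._path(key), "wb") as f:
                f.write(fetch_from_bucket(key))
            self.ready[key].set()

    def read(self, key):
        """Block until `key` is cached, return its bytes, then evict it."""
        self.ready[key].wait()
        with open(self._path(key), "rb") as f:
            data = f.read()
        os.remove(self._path(key))   # evict after use (sequential epoch)
        self.slots.release()         # free the slot for the prefetcher
        return data

    def close(self):
        shutil.rmtree(self.cache_dir, ignore_errors=True)
```

The evict-after-read policy suits a single sequential pass over sharded training data; a real system revisiting shards across epochs would need a retention policy (e.g. LRU) instead.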
Permanent link to this page: https://digital.wpi.edu/show/xs55mf90p