
MiseEnPlace: Fine- and Coarse-grained Approaches for Mobile-Oriented Deep Learning Inference


Deep learning is becoming a ubiquitous component of mobile applications. However, using deep learning models on mobile devices faces several core challenges. Chief among these is that the high accuracy of deep learning models is enabled by high resource demands, which are inherently at odds with constrained mobile resources. While computation offloading is a common technique, access to remote resources is subject to highly variable network conditions. Further, effectively managing resources to serve these models in the cloud is difficult due to disparities in resource demands and widely varying model popularity. Taken together, these challenges make supporting deep learning inference for mobile applications, whether on-device or in-cloud, an important and interesting problem. In this thesis, I argue that addressing these challenges from a mobile-oriented perspective can lead to improved performance, both in terms of tighter latency bounds and higher accuracy. I approach deep learning inference as a mobile-oriented task, enabling adaptation to resource constraints, network variation, and the demands of a diverse workload. More concretely, I do this by focusing on individual requests, adapting their execution to enable timely responses, and considering the impact of model resource needs when serving inferences. Finally, I introduce a middleware system for deep learning inference that schedules inferences across on-device and in-cloud resources to improve response latency and decrease monetary serving costs.

My thesis has three core components. In the first component, I address how to improve the response latency and accuracy of individual inference requests. In PieSlicer (Chapter 3), by characterizing and modeling input data preparation, I demonstrate that it is possible to dynamically select the pre-execution workflow to reduce response latency. In MDInference (Chapter 4), I introduce an approach that considers a set of similar models and adapts execution to satisfy a time budget, allowing serving systems to meet a specific response latency while improving accuracy whenever possible. Together, these two works reduce response latency and use this reduction to improve the accuracy of inferences for resource-constrained mobile devices.

In the second component, I address the resource ramifications of serving deep learning models for diverse mobile demands. To do this, in CremeBrulee (Chapter 5) I introduce model-level caching, where deep learning models are treated as cacheable objects. This is motivated by a close analysis of the characteristics of deep learning workloads and the high resource demands of deep learning models; the analysis demonstrates that, contrary to common belief, memory is a key limiting resource. By managing deep learning models as objects that can be removed from memory based on the characteristics of both the models and the workload, we can greatly improve resource utilization. I further demonstrate a simple eviction-based approach that dramatically improves memory utility over alternatives.

The final component is LayerCake (Chapter 6), a system that leverages both on-device and in-cloud resources to support execution across a wide range of latency and accuracy targets. Drawing on both sets of resources makes it possible to take advantage of high-accuracy models in the cloud when possible while avoiding network transfer time when necessary to meet latency targets. Overall, LayerCake attains latency targets with high accuracy across a range of input requirements while decreasing the cost of using cloud-based resources.
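To picture the budget-driven model selection that MDInference describes, below is a minimal Python sketch of the general idea: given a per-request time budget and the time already spent, pick the most accurate model whose estimated latency still fits. The model names, latency estimates, and accuracy numbers are hypothetical illustrations, not values from the thesis.

    from dataclasses import dataclass

    @dataclass
    class ModelOption:
        name: str
        est_latency_ms: float  # estimated inference latency (hypothetical)
        accuracy: float        # validation accuracy (hypothetical)

    # A hypothetical family of similar models trading accuracy for speed.
    CANDIDATES = [
        ModelOption("small", 20.0, 0.71),
        ModelOption("medium", 55.0, 0.76),
        ModelOption("large", 140.0, 0.80),
    ]

    def pick_model(budget_ms: float, elapsed_ms: float) -> ModelOption:
        """Return the highest-accuracy model whose estimated latency fits
        the remaining budget; fall back to the fastest model if none fits."""
        remaining = budget_ms - elapsed_ms
        feasible = [m for m in CANDIDATES if m.est_latency_ms <= remaining]
        if not feasible:
            return min(CANDIDATES, key=lambda m: m.est_latency_ms)
        return max(feasible, key=lambda m: m.accuracy)

    # Example: a 100 ms budget with 30 ms already spent on input preparation
    # leaves 70 ms, so the "medium" model is selected.
    print(pick_model(budget_ms=100.0, elapsed_ms=30.0).name)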
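In the same spirit, the following sketch illustrates model-level caching as described for CremeBrulee: models live within a fixed memory budget and are evicted by a score over model size, recency, and popularity. The scoring formula, sizes, and interface are assumptions for illustration only, not the system's actual design.

    import time

    class ModelCache:
        """Toy model-level cache: models as cacheable objects in limited memory."""

        def __init__(self, capacity_mb: float):
            self.capacity_mb = capacity_mb
            self.used_mb = 0.0
            self.entries = {}  # name -> (size_mb, last_used, hits)

        def _utility(self, item):
            # Hypothetical score: favor small, recently used, popular models.
            _, (size_mb, last_used, hits) = item
            age_s = time.monotonic() - last_used
            return hits / (size_mb * (1.0 + age_s))

        def _evict_until_fits(self, needed_mb: float) -> None:
            # Remove lowest-utility models until the new one fits in memory.
            while self.used_mb + needed_mb > self.capacity_mb and self.entries:
                victim, (size_mb, _, _) = min(self.entries.items(),
                                              key=self._utility)
                del self.entries[victim]
                self.used_mb -= size_mb

        def get(self, name: str, size_mb: float) -> str:
            if name in self.entries:  # hit: model is already resident
                size, _, hits = self.entries[name]
                self.entries[name] = (size, time.monotonic(), hits + 1)
                return f"hit:{name}"
            self._evict_until_fits(size_mb)  # miss: make room, then "load"
            self.entries[name] = (size_mb, time.monotonic(), 1)
            self.used_mb += size_mb
            return f"load:{name}"

    cache = ModelCache(capacity_mb=500)
    print(cache.get("resnet50", 100))   # load:resnet50
    print(cache.get("bert-base", 440))  # load:bert-base (evicts resnet50 to fit)
    print(cache.get("resnet50", 100))   # load:resnet50 (it was evicted)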
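Finally, the placement choice LayerCake makes can be pictured with a small sketch: use a cloud-hosted, higher-accuracy model when the estimated network transfer plus cloud execution fits the latency target, and fall back to on-device execution otherwise. The decision rule and all latency figures below are illustrative assumptions, not the system's actual scheduler.

    def choose_placement(budget_ms: float, cloud_ms: float,
                         network_ms: float) -> str:
        # Prefer the higher-accuracy cloud model when transfer + cloud
        # execution fit the latency target; otherwise stay on-device to
        # avoid the network transfer entirely.
        return "cloud" if cloud_ms + network_ms <= budget_ms else "device"

    # A slow network pushes execution on-device; a fast one allows the cloud.
    print(choose_placement(budget_ms=80.0, cloud_ms=30.0, network_ms=120.0))
    print(choose_placement(budget_ms=200.0, cloud_ms=30.0, network_ms=120.0))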

Identifier
  • etd-77586
Year
  • 2022
Date created
  • 2022-09-12
Last modified
  • 2023-11-06


Permanent link to this page: https://digital.wpi.edu/show/j3860b18x