Towards an End-to-End Training Data Management System for Machine Learning Models

A successful machine learning application requires powerful models and high-quality training data. Powerful machine learning models that achieve promising results on benchmark tasks have been widely developed. However, techniques and tools to help practitioners prepare and evaluate high-quality training data remain limited. This dissertation introduces an end-to-end training data management system that tackles three related challenges.

1. Raw datasets collected in real applications may contain low-quality samples, such as anomalies caused by sensor issues. To tackle this, I develop ELITE, a semi-supervised anomaly detection algorithm. While requiring users to label only an extremely small set of samples, ELITE significantly improves on state-of-the-art deep anomaly detection models: our experiments on public benchmark datasets show that it achieves up to a 30% improvement in ROC-AUC score over the state-of-the-art techniques. This project was published at KDD 2021.

2. In practice, it is often infeasible to manually label a sufficient number of training samples for modern large-scale machine learning models. To minimize labeling effort by domain experts, I propose LANCET, a label propagation system that automatically propagates manually annotated labels to similar unlabeled data objects. LANCET addresses three challenges in an integrated framework: (1) which data samples to ask humans to label, (2) how to propagate labels to other samples automatically, and (3) when to stop labeling. Our experiments on diverse public datasets demonstrate that LANCET consistently outperforms state-of-the-art methods by a large margin, up to a 30 percentage point increase in accuracy. This project was published at VLDB 2021.

3. Training data preparation tools, including those above, still inevitably generate erroneous training data. To tackle this, I develop MetaStore, a system that helps data scientists curate deep learning models' training data based on the per-sample gradients produced during model training. MetaStore supports efficient gradient-based analytics query execution with three key components: (1) a lightweight gradient collector, (2) a compact gradient store, and (3) an efficient gradient analytics engine. Our experiments demonstrate that MetaStore outperforms alternative baseline methods by 4 to 578x in storage cost and by 2 to 1000x in running time. This project has been submitted to VLDB 2023.
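The label propagation task LANCET automates (challenge 2 above) can be illustrated with a generic graph-based scheme: labels diffuse from the few human-annotated samples to nearby unlabeled samples over a similarity graph. The sketch below is a minimal Zhu-Ghahramani-style implementation under assumed conventions, not LANCET's actual algorithm; the function name `propagate_labels` and parameters such as `sigma` are illustrative only.

```python
import numpy as np

def propagate_labels(X, y, n_iters=50, sigma=1.0):
    """Iterative label propagation on a dense similarity graph.

    X: (n, d) feature matrix; y: (n,) integer labels, -1 for unlabeled.
    Returns a predicted label for every sample.
    """
    n = len(X)
    # RBF similarity between all pairs of samples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)  # row-normalized transition matrix

    labeled = y >= 0
    classes = np.unique(y[labeled])
    F = np.zeros((n, len(classes)))       # per-class label mass
    F[labeled, np.searchsorted(classes, y[labeled])] = 1.0

    for _ in range(n_iters):
        F = P @ F                          # diffuse label mass to neighbors
        F[labeled] = 0.0                   # clamp the human-annotated labels
        F[labeled, np.searchsorted(classes, y[labeled])] = 1.0

    return classes[F.argmax(axis=1)]
```

With one labeled point per cluster, the remaining points in each cluster inherit that cluster's label after a few iterations. A real system such as LANCET must additionally decide which samples to send to annotators and when propagated labels are trustworthy enough to stop, which this sketch does not address.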

Identifier
  • etd-87831
Year
  • 2023
Date created
  • 2023-01-18
Source
  • etd-87831
Last modified
  • 2023-09-28


Permanent link to this page: https://digital.wpi.edu/show/vx021j36v