Addressing Imbalanced Data in Machine Learning: Methods and Challenges

Philpot, Robert; Jahnz, Joshua

Student Work

Imbalanced datasets are common in machine learning. When some classes are underrepresented, models tend to be biased toward the majority classes and to predict minority classes poorly. This project examines strategies for mitigating that bias, focusing on methods that improve model accuracy and fairness across different data distributions. We explore ten techniques, including Nearest Neighbor Guidance (NNGuide), Parameter-Efficient Long-Tailed Recognition (PEL), and ensemble learning combined with data augmentation strategies such as SMOTE. Each method was tested on widely used datasets, including CIFAR-10, CIFAR-100, and ImageNet-LT, and evaluated with metrics such as AUROC and F1 score. Our findings highlight the strengths and limitations of each approach and guide the selection of a technique based on the characteristics of the dataset. The results contribute to both the theory and the practice of handling class imbalance and support more robust and equitable machine learning applications. The study underscores the need for approaches tailored to the class distribution at hand.
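To make two of the ideas named in the abstract concrete, the following is a minimal sketch of SMOTE-style oversampling and of the F1 score, written in plain Python. It is not the authors' implementation. The function names `smote_like_oversample` and `f1_score`, the parameter `k`, and the toy data are assumptions introduced only for illustration, and the report itself may well have relied on library implementations.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Illustrative sketch only (hypothetical helper, not a library API)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 95 majority labels (0) and 5 minority labels (1).
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100  # a classifier that ignores the minority class

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
minority_f1 = f1_score(y_true, always_majority)

# Four minority points on the unit square, oversampled with six new points.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority_points, n_new=6, k=2)
```

On this toy data the always-majority classifier is right on 95 of 100 samples, yet its F1 score on the minority class is 0.0 because it never finds a true positive. That gap is why metrics such as F1 and AUROC are preferred over plain accuracy on imbalanced data.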

This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.
