Contextual Outlier Detection from Heterogeneous Data Sources

Yan, Yizhou

Etd

Public

This dissertation focuses on detecting contextual outliers from heterogeneous data sources. Modern sensor-based applications, such as Internet of Things (IoT) applications and autonomous vehicles, generate huge amounts of heterogeneous data, including not only structured multivariate data points but also more complex types such as time-stamped sequence data and image data. Detecting outliers in such data sources is critical for diagnosing and fixing malfunctioning systems, preventing cyber attacks, and saving human lives. The outlier detection techniques in the literature are typically unsupervised algorithms with predefined logic, such as using the probability density at each point to detect outliers. Our analysis of modern applications reveals that this rigid probability density-based methodology has severe drawbacks: objects with low probability density are not necessarily outliers, while objects with relatively high probability density might in fact be abnormal. In many cases, determining the outlierness of an object requires taking into account the context in which the object occurs. Within this scope, my dissertation focuses on four research innovations: techniques and systems for scalable contextual outlier detection from multi-dimensional data points, contextual outlier pattern detection from sequence data, contextual outlier image detection from image data sets, and an integrative end-to-end outlier detection system capable of automatic outlier detection, outlier summarization, and outlier explanation.

1. Scalable Contextual Outlier Detection from Multi-dimensional Data. Mining contextual outliers from big datasets is a computationally expensive process because of the complex recursive kNN search used to define the context of each point. In this research, leveraging the power of distributed compute clusters, we design distributed contextual outlier detection strategies that optimize the key factors determining the efficiency of local outlier detection, namely localizing the kNN search while still ensuring load balance.

2. Contextual Outlier Detection from Sequence Data. For big sequence data, such as messages exchanged between devices and servers and log files measuring complex system behaviors over time, outliers typically occur as a subsequence of symbolic values (a sequential pattern) in which each individual value may itself be completely normal. However, existing sequential pattern mining semantics tend to misclassify outlier patterns as typical patterns because they ignore the context in which the pattern occurs. In this dissertation, we present new context-aware pattern mining semantics and then design efficient mining strategies to support these new semantics. In addition, we develop methodologies that continuously extract these outlier patterns from sequence streams.

3. Contextual Outlier Detection from Image Data. An image classification system not only needs to accurately classify objects from target classes, but also should safely reject unknown objects that belong to classes not present in the training data. Here, the training data defines the context of the classifier, and unknown objects correspond to contextual image outliers. Although existing Convolutional Neural Networks (CNNs) achieve high accuracy when classifying known objects, the sum operation on the multiple features produced by the convolutional layers causes an unknown object to be classified into a target class with high confidence even if it matches some key features of that class only by chance. In this research, we design an Unknown-aware Deep Neural Network (UDN for short) to detect contextual image outliers. The key idea of UDN is to enhance existing CNNs to support a product operation that models the product relationship among the features produced by the convolutional layers. This way, missing a single key feature of a target class greatly reduces the probability of assigning an object to this class. To further improve the performance of UDN at detecting contextual outliers, we propose an information-theoretic regularization strategy that incorporates the objective of rejecting unknowns into the learning process of UDN.

4. An End-to-end Integrated Outlier Detection System. Although numerous detection algorithms have been proposed in the literature, no single approach brings the wealth of these alternative algorithms to bear in an integrated infrastructure that supports versatile outlier discovery. In this work, we design the first end-to-end outlier detection service that integrates outlier-related services, including automatic outlier detection, outlier summarization and explanation, and human-guided outlier detector refinement, within one integrated outlier discovery paradigm.

Experimental studies, including performance evaluations and user studies, conducted on benchmark outlier detection datasets and real-world datasets including Geolocation, Lighting, MNIST, CIFAR, and log file datasets confirm both the effectiveness and efficiency of the proposed approaches and systems.
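
The contrast drawn in part 3 between sum-style and product-style aggregation of features can be illustrated with a toy calculation. The sketch below is not the UDN architecture: the function names, the per-feature match scores, and the use of a plain average and a plain product are all assumptions made purely for illustration.

```python
import math


def sum_style_score(feature_matches):
    """Average the per-feature match scores (a stand-in for sum aggregation)."""
    return sum(feature_matches) / len(feature_matches)


def product_style_score(feature_matches):
    """Multiply the per-feature match scores (a stand-in for product aggregation)."""
    return math.prod(feature_matches)


# Hypothetical match scores in [0, 1] for four key features of one target class.
known_object = [0.9, 0.9, 0.9, 0.9]        # matches every key feature
unknown_object = [0.95, 0.95, 0.95, 0.05]  # matches three features, misses one

for name, scores in [("known", known_object), ("unknown", unknown_object)]:
    print(
        f"{name}: sum-style={sum_style_score(scores):.3f}, "
        f"product-style={product_style_score(scores):.3f}"
    )
```

Under these assumed scores, the object that misses one key feature keeps most of its averaged score but loses most of its product score, which is the effect the abstract attributes to the product operation.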

Creator