Big Time Series Analytics Using a Distributed Infrastructure

Alghamdi, Noura

Public Deposited

Over the last few decades, the explosion of the internet-of-things (IoT) has led to unprecedented growth in time series data. This growth gives rise to three phenomena. First, data is generated at an explosive speed and volume, resulting in big time series. Second, the lifespan of the generated data may span months or years, producing exceedingly long time series. Third, a single device often produces a sequence of intermittent time series separated by time gaps, i.e., interconnected time series. Systems that process such complex data scalably must leverage indexing techniques. Unfortunately, state-of-the-art indexing techniques lack the functionality, scalability, and accuracy required to process such big, long, interconnected time series data. In this dissertation, we thus focus on the following open research themes.

1. Indexing and Querying Long Time Series. We propose a lightweight distributed indexing framework, called ChainLink, that supports approximate similarity search for full and subsequence matching over TB-scale datasets of time series objects composed of several hundred data points. As a foundation of ChainLink, we design a novel hashing technique that, as we demonstrate experimentally, ensures a compact structure, efficient search and comparison, and efficient index construction.

2. Indexing and Querying Interconnected Time Series Objects. We design a new data model called Time Series Compound (TSC) to handle interconnected time series. We tackle the unique challenges that arise when managing, querying, and analyzing repositories of big TSC objects. Our distributed indexing infrastructure features a TSC-aware representation technique, TSC similarity semantics, and efficient processing strategies. Experimental studies show the effectiveness and efficiency of our solution.

3. Advanced Interconnected Time Series Analytics. As the TSC model inherently captures objects' evolution over time, we propose a new class of queries, called convergence queries. We define their semantics and design innovative query processing strategies for their scalable execution over TB-scale datasets. Experiments confirm that our strategies are not only significantly faster than the baseline solution on TB-scale datasets but also consistently achieve excellent accuracy.

Creator