Efficient XML Stream Processing with Automata and Query Algebra

Jian, Jinhui

Etd

XML stream processing is an emerging technology designed to support declarative queries over continuous streams of data. Interest in this technology is growing because an increasing number of real-world applications, such as monitoring systems for stock, email, and sensor data, need to analyze incoming data streams. There are, however, several open challenges. First, we must develop efficient techniques for pattern matching over the nested tag structure of XML as the data streams in token by token. Second, we must develop query optimization techniques that cope with complex user queries given only incomplete knowledge of the source data. Considered separately, automata models have been shown by several recent works to be well suited to the first problem, while algebraic query models are regarded as an appropriate foundation for the second. The question remains, however, how best to combine the two models into an overall effective system. This thesis aims to fill exactly this gap. We propose a unified query framework that augments automata-style processing with algebra-based query optimization capabilities. We use the automata model to handle the token-oriented streaming XML data and the algebraic model to support set-oriented optimization techniques. The framework is designed in two layers. The logical layer provides a uniform abstraction across the two models, so that optimization techniques can be applied uniformly in either model through query rewriting. The physical layer, in turn, allows us to refine implementation details after logical-layer optimization. We have successfully applied this framework in the Raindrop stream processing system. We have identified several trade-offs regarding which query functionality should be realized in which query model, and we have developed novel optimization techniques to exploit these trade-offs. For example, a query rewrite rule can flexibly push a pattern match down into the automata model when the optimizer decides that doing so is more efficient. To deal with incomplete knowledge of the source data, we have also developed novel techniques to monitor data statistics, based on which the optimizer can choose the optimal query plan at runtime. Our experimental study confirms that considerable performance gains are achieved when these optimization techniques are applied in our system.

Creator