Home > Database > Robust Query Processing in Database Systems*

Robust Query Processing in Database Systems*

1.2 Query Execution Plans

The important point to note in the above formulation is that several decisions are left unspecified by the user. For instance, the sequence in which the tables are combined – candidate sequences could be ((STUDENT ⋈ REGISTER) ⋈ COURSE) or ((REGISTER ⋈ COURSE) ⋈ STUDENT). The result of the joins would be identical in both cases because the ⋈ operator is, from a mathematical perspective, both associative and commutative – however, the effort involved in the two executions could be vastly different. Moreover, even after a sequence has been decided, we still have to decide the implementation mechanism for each of the logical join operators. A spectrum of join algorithms has been developed in the literature, including NESTED-LOOPS JOIN, SORT-MERGE JOIN, HASH JOIN, etc., and a choice from among these has to be made for each logical join.

All the above decisions have to be made by the database engine, which is responsible for identifying and executing the most efficient means to process the user query. These two steps are implemented by the query optimizer and the query executor components within the core of the engine, and over the past half century, research on the design and implementation of these components has been a foundational topic for both the academic and industrial database communities.

The alternative strategies considered by the query optimizer, in its quest to find the ideal, are referred to as “plans” in the database jargon. They are usually costed and compared in terms of their estimated query response times. It is important to note here that optimization is a mandatory exercise since the difference between the cost of the best plan and a random choice could be in orders of magnitude. The role of query optimizers has become especially critical in recent times due to the high degree of query complexity characterizing current decision-support applications, as exemplified by the industry-standard TPC-H and TPC-DS performance benchmarks [59] [60] .

1.3 Query Optimization Challenges

In spite of the long-standing research mentioned above, the query processing domain has largely remained a “black art”. This is due to a variety of well-documented complexities and challenges [8] [9] , discussed in Section 2. In fact, even as recently as 2014, a highly-respected industry veteran had this extremely pessimistic statement to make in a blog post [34]: The wonder isn’t “Why did the optimizer pick a bad plan?” Rather, the wonder is “Why would the optimizer ever pick a decent plan?”! Similar sentiments have been expressed by other academic and industrial experts, including: “Query optimizers do a terrible job of producing reliable, good plans (for complex queries) without a lot of hand tuning.” [16] , and “Almost all of us who have worked on query optimization find the current state of the art unsatisfactory with known big gaps in the technology.” [10]

The scale of performance degradation faced by database queries arising out of the poor choices of execution plans, can be huge – often in orders of magnitude – as compared to an oracular ideal that magically knows the ideal plan for optimal query processing. As a case in point, when Query 19 of the TPC-DS benchmark is executed on contemporary database engines, the worst-case slowdown, relative to the hypothetical oracle, can exceed a million! [17] Therefore, it is self-evident that the current design of database systems is extremely brittle, with large-scale performance variations arising out of inadequate design methodologies. In a nutshell, they substantively fail to provide robust performance.

Finally, apart from the obvious negative impact on user productivity and satisfaction, there are also financial implications of this performance degradation – for instance, it is reported in [49] that the lack of robustness can contribute as much as a third to the total cost of ownership for a database system.

1.4 Recent Research Advances

In the midst of this gloom and doom, the good news is that in recent times, there have been a host of exciting research advances, which collectively promise to provide strong foundations for designing the next generation of query engines. The expectation is that these advances will eventually organically support robust query processing (RQP), relegating to the past the above-mentioned cynicism on this bedrock feature. Many of the new ideas owe their genesis to a series of influential and well-attended Dagstuhl Seminars in Germany on the topic of Robust Query Processing over the last decade (2010 [49] , 2012 [50] , 2017 [51] ). Further, they have arisen from research teams located at diverse locations across the world, including the US, Europe and Asia.

1.5 Organization

The remainder of this article is organized as follows: In Section 2, we outline the basics of the query
optimization process. Then in Section 3, we present the recently developed PlanBouquettechnique
which provides performance guarantees and takes an important step towards addressing the RQP
problem. Finally, in Section 4, we outline future research avenues that can be explored in the RQP
context.

2 Query Optimization

Given its industrial relevance, it is not surprising that a rich body of literature has developed on database query processing. A sampling of this prior work is given in the Bibliography – in particular, we direct the reader to the comprehensive and readable surveys presented in [19] [8] [9] . In this section, we will only provide an intuitive overview of the process to serve as motivation for explicitly incorporating robustness in database engine design.

Given a generic SQL query that requires combining information across N relations, there are in principle N! different permutations of the combination sequence, implying that the strategy search space is at least exponential in the query size. The automated identification of efficient plans from the humongous search space is the responsibility of the database query optimizer.

Plans are typically comprised of a tree of data processing operators that are evaluated in a bottom-up paradigm. A sample plan is shown in Figure 3 for the example query of Figure 2, where the STUDENT and REGISTER relations are first combined with a NESTED-LOOP join operator, and this intermediate result is then combined with the COURSE relation using a HASH-JOIN operator. The plan is evaluated in a bottom-up traversal, beginning with the leaf nodes, and working our way up to the root, which provides the final results. Further, the bracketed numbers within each operator node indicate the estimated aggregate processing costs incurred until this stage in the evaluation. While operator costs could be defined with regard to various measures, including execution time, output latency, or resource consumption, we will assume in this article that the costs are indicative of execution time.

Modern query optimizers each have their own “secret sauce” to identify the best (i.e. cheapest/fastest) plan for answering these declarative SQL queries. However, the de-facto standard underlying strategy that is present in all these systems, and which was pioneered by the System R project at IBM Research [42] , is the following:

1. First apply a variety of heuristics to restrict the exponential plan search space to a manageable size. For instance, the early database systems only considered left-deep trees, wherein at least one of the relations in each join is a base user relation appearing in the database schema.
2. Next, estimate with a cost model and a dynamic-programming-based processing algorithm, the efficiency of each of these candidate plans.
3. Finally, choose the plan with the lowest estimated cost.

Pages ( 2 of 9 ): « Previous1 2 34 ... 9Next »

Leave a Comment:

Your email address will not be published. Required fields are marked *