A pictorial representation of the canonical query optimization framework is shown in Figure 5. The input is the declarative SQL query and the output is the optimal (i.e. cheapest/fastest) execution plan. The core processing is implemented by the dynamic programming-based search algorithm, which leverages the fact that the globally optimal plan can be incrementally built up using the locally optimal solution for each operator. Further, there are two essential models that serve as inputs to this process, namely, the cardinality estimation model and the cost estimation model, which are described below.
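The dynamic-programming search described above can be sketched as follows. This is a minimal illustration in the spirit of Selinger-style bottom-up join enumeration, not the algorithm of any particular optimizer; the `card` cardinality table and the additive cost function are invented for the example.

```python
from itertools import combinations

def best_plan(relations, card):
    """Bottom-up DP over relation subsets: the best plan for a set of
    relations is built from the best plans of its (smaller) subsets.
    relations: list of relation names.
    card: dict mapping a frozenset of relation names to its estimated
          output cardinality (the cardinality model's job)."""
    # Base case: each single relation is its own (zero-cost) scan plan.
    best = {frozenset([r]): (0, r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            for k in range(1, size):
                for left in combinations(subset, k):
                    l = frozenset(left)
                    r = s - l
                    # Toy cost: cost of the two sub-plans plus the sum of
                    # their output cardinalities for the joining step.
                    cost = best[l][0] + best[r][0] + card[l] + card[r]
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, (best[l][1], best[r][1]))
    return best[frozenset(relations)]
```

Because the cost of a join depends only on the sub-plans' costs and cardinalities, the locally optimal solution for each subset suffices to build the globally optimal plan, which is exactly the property the framework exploits.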
2.1 Cardinality Estimation Model
The cardinality model estimates the volume of data, measured in number of database rows, that flows from one operator to the next in the plan tree. As shown in Figure 5, this volume is a function of the data distributions within the relational columns, and the data correlations across the columns. The individual column distributions are usually approximated, in a piece-wise manner, using histogram-based summaries.
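As a concrete illustration of such histogram-based summaries, the following sketch builds an equi-depth histogram over a column and uses it, under a uniform within-bucket assumption, to estimate the row count satisfying a range predicate. The column data, bucket count, and function names are illustrative assumptions, not any engine's actual implementation.

```python
def build_equidepth(values, n_buckets):
    """Return bucket boundaries such that each bucket holds roughly the
    same number of rows (an equi-depth histogram)."""
    vals = sorted(values)
    step = (len(vals) - 1) / n_buckets
    return [vals[round(i * step)] for i in range(n_buckets + 1)]

def estimate_lt(bounds, total_rows, c):
    """Estimate the number of rows with column value < c, interpolating
    linearly inside the bucket that contains c (uniformity assumption)."""
    per_bucket = total_rows / (len(bounds) - 1)
    rows = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if c >= hi:
            rows += per_bucket                 # whole bucket qualifies
        elif c > lo:
            rows += per_bucket * (c - lo) / (hi - lo)  # partial bucket
            break
        else:
            break
    return rows
```

The estimate is exact when the data is uniform within each bucket; skew inside a bucket is precisely where such piece-wise summaries lose accuracy.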
Typically, a host of cardinalities have to be estimated while constructing efficient execution plans for declarative decision-support queries. For instance, consider the example query shown in Figure 4 for enumerating orders of cheap parts (costing less than Rs. 1000) from a manufacturing database – here, the optimizer has to estimate the cardinalities of the three bold-faced constraints:
a filter predicate (p_retailprice < 1000) and two join predicates (part ⋈ lineitem, orders ⋈ lineitem). Unfortunately, in practice, these estimates are often significantly in error with respect to the actual values subsequently encountered during query execution. Such errors, which can run to orders of magnitude in real database environments, arise due to a variety of well-documented reasons, including outdated statistics, coarse summaries, attribute-value independence assumptions, complex user-defined predicates, and error propagation in the query plan tree. Moreover, in industrial environments such as ETL (Extract, Transform, Load) workflows, the statistics may actually be unavailable due to data source constraints, forcing the optimizer to resort to “magic numbers” for the cardinality values (e.g. 0.1R for equality selections on columns of a relation with R rows). The net outcome of these erroneous estimates is that the execution plans recommended by the query optimizer may turn out to be very poor choices at run-time, resulting in grossly inflated query response times.
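The “magic number” fallback and the independence assumption can be illustrated with a toy calculation. The selectivity constants below are invented for the example and are not the actual defaults of any engine.

```python
# Illustrative fixed selectivities used when no statistics are available.
MAGIC = {'=': 0.10, '<': 0.33, 'like': 0.25}

def fallback_rows(base_rows, predicates):
    """Estimate output rows by multiplying a fixed selectivity per
    predicate, i.e. assuming attribute-value independence."""
    est = base_rows
    for op in predicates:
        est *= MAGIC[op]
    return est

# Two equality predicates on a 1,000,000-row table give an estimate of
# 1,000,000 * 0.1 * 0.1 = 10,000 rows. If the two columns are strongly
# correlated, the true count could be closer to 100,000 -- already an
# order-of-magnitude underestimate before any error propagation.
```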
A considerable body of literature exists on proposals to tackle this classical problem. For instance, techniques for improving the statistical quality of the associated meta-data include improved summary structures, feedback-based adjustments, and on-the-fly re-optimization of queries. A complementary approach is to identify robust plans that are relatively less sensitive to estimation errors. While all these prior techniques offer novel formulations, a common limitation is their inability to furnish performance guarantees. That is, we can only hope for good performance, but cannot provide provable bounds in this regard.
2.2 Cost Estimation Model
We now turn our attention to the cost estimation model, which is responsible for estimating the time taken for processing the data at each operator in the plan. As shown in Figure 5, its estimates, which are usually computed on a normalized per-row basis, are dependent on the underlying computing platform and the software implementation of the database engine. The overall cost of each operator is the product of the estimated row cardinalities (as obtained from the cardinality model) and the per-row cost estimates.
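This composition of the two models can be sketched as follows. The operator names and per-row unit costs below are invented for illustration; in a real engine the unit costs would be calibrated for the specific platform and engine implementation.

```python
# Assumed per-row unit costs, in arbitrary time units, as would be produced
# by platform calibration (illustrative values only).
PER_ROW_COST = {'seq_scan': 1.0, 'filter': 0.1, 'hash_join': 2.5}

def plan_cost(operators):
    """Total plan cost: for each operator, the cardinality model's row
    estimate times the cost model's per-row estimate, summed over the plan.
    operators: list of (operator_name, estimated_rows) pairs."""
    return sum(PER_ROW_COST[op] * rows for op, rows in operators)

plan = [('seq_scan', 200_000), ('filter', 200_000), ('hash_join', 50_000)]
# total ≈ 200000*1.0 + 200000*0.1 + 50000*2.5 ≈ 345000 units
```

Because cardinality estimates multiply directly into every operator's cost, an error in the cardinality model contaminates the cost model's output even when the per-row unit costs are perfectly calibrated.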
Just as with the cardinality model, the cost model can also be subject to errors that adversely impact robustness. These errors arise from difficulties in predicting runtime resource availability, modeling complex interactions among database and platform components, catering to heterogeneity in the hardware, and so on. Again, like the cardinality model, the cost model has been the subject of several studies in the prior literature, and has seen renewed interest in the last decade. For instance, one line of work demonstrated the effectiveness of machine learning techniques for predicting query execution time, using both plan-level and operator-level models; the features included optimizer cost estimates, query parameters and the actual runtime statistics. In marked contrast, another research group showed how the existing statistical models themselves, with some degree of augmentation and proper tuning, could produce satisfactory estimates of query execution times with much less computational effort. In their approach, an initial offline profiling phase was used to establish the unit temporal costs of utilizing various system resources. Subsequently, online sampling was employed to estimate the number of usages for a given query. In a series of follow-up papers, stronger statistical models were incorporated in their algorithmic suite to maintain the prediction quality in the presence of uncertainty and concurrency.