Apache Calcite a Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

#Apache Calcite

March 10, 2024

Michael StonebrakerとUğur Çetintemel presented One Size Fits All in 2005. The title refers to the fact that traditional RDBMS had been used to implement a variety of data-centric applications. Tn the paper, they argued that the phrase was no longer applicable and predicted the rise of domain-specific database engines. Indeed, organizations leverage various domain-specific database engines such as column stores and text search engines today.

However, the end of the “One Size Fits All” paradigm brought two problems. First, the developers of each domain-specific database engine have to independently implement similar essential features such as query optimiztion and query interpretaion. Second, data are distributed across heterogeneous databases, and they have to be integrated before processing.

Apache Calcite was developed to solves these problems. Calcite provides SQL processsing features including query intepretation and optimization. Calcite does not contain a storage layer. Istead of containing it, given a SQL query, Calcite interprets and optimizes the query, and sends requests to backend databases. After receving the processed data from backends, Calcite integrates the data and responds to the client. Client can run SQL queries through Calcite to process data in backends as if the data were stored in one RDBMS.

Calcite connects to backend databases to get data to handle instead of holding its own storage. The locations of the backends are specified in files called models. The collection of data in a backend is seen as if a RDBMS schema from the Calcite clients. The following block is an example of a model. factory in the model denotes the class that implements SchemaFactory. The factory creates a Schema object that behaves as a RDMBS schema.

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'sales'
      }
    }
  ]
}

A relational operator is an unit of operation to a set. For instance, filter, project and join are operators. The Table instances that the schema contains implement the table scan operator that provides the data in the table. Calcite traslates a SQL query to an expression tree of relational operators. The physical properties of an operator are called traits, and the optimzier uses traits to evaluates the cost of different alternative plans. One of major traits is calling convention trait that represents the backend where the expression will be executed. For example, the operator with spark convention trait makes the expression run in Apache Spark.

The query operator repeatedly transforms an expression that keep the original semantics but runs at a lower cost by applying transformation rules called planner rules. For example, Calcite processes queries involing tables across multiple backends by pushing down all possible logic to each backend.