Since the introduction to Cube in late 2021 and an update in 2022, the SQL API has gone a long way to support connecting your semantic layer to practically all popular BI and data exploration tools, including Power BI, Tableau, Thoughtspot, Sigma, Superset / Preset, and Metabase.

Today, we’re expanding the capabilities of the SQL API and announcing the public preview of query pushdown, a new feature that brings BI support to the next level. With query pushdown, Cube can understand and execute even the most convoluted and intricate SQL that your BI tool dares to generate.

How SQL API works with query pushdown

Cube’s SQL API implements a wire-compatible Postgres protocol, which allows Cube to be seen as a Postgres database by BI tools. However, there’s no actual Postgres engine in Cube.

Under the hood, Cube uses Apache DataFusion (an open-source query engine) and egg (an open-source e-graph term rewriting library) to analyze incoming SQL queries and find the best query plan out of a wide variety of possible plans to execute.

Without query pushdown, Cube plans the execution of incoming SQL queries as a combination of regular queries (compatible with Cube’s REST API) and post-processing on top of them using DataFusion. Such an approach works great for many SQL queries, both hand-written and generated by BI tools, while benefiting from Cube’s in-memory cache and pre-aggregations.

However, sometimes BI tools would generate SQL that just can’t be decomposed to regular queries with post-processing. In those cases, Cube would output: “Can't detect a Cube query or it may be not supported yet.” It can be due to the presence of a correlated subquery, a complex expression, or a function that is not yet supported by DataFusion, etc.

With query pushdown, there’s a new branch for possible query execution. Cube may decide to transpile an incoming SQL query, in Postgres dialect, to the SQL dialect of the supported upstream data source (e.g., Snowflake, BigQuery, or Redshift). Thanks to e-graphs, all possible execution branches are evaluated in parallel, so that Cube can choose the most optimal query plan.

sql-api-diagram

Some examples of queries that are now supported with query pushdown: complex expressions within SELECT or WHERE clauses, all kinds of window function expressions, aggregations over dimensions, correlated subqueries, and more.

Query pushdown has been in use by self-hosting users of Cube and Cube Cloud customers as well. You can hear from one of the early adopters:

Query pushdown in the SQL API enabled us to create calculated dimensions and measures on the fly in Preset (using the custom SQL functionality). This enabled prototyping of new dashboards without first having to update the data model in Cube.

Guy DawsonGuy DawsonEngineer at Permutive

Security considerations

With query pushdown, Cube provides more flexibility for executing arbitrary SQL queries to your end users and BI tools. However, these SQL queries are not going to your database directly. With or without query pushdown, all security controls that Cube provides still work.

You can apply additional filters in query rewrite, implement multitenancy, and control the visibility of data model entities with public parameters. If members of a cube or a view are hidden (public: false), there’s still no way to access those. Also, any DDL and system SQL statements are ignored by the SQL API.

Please keep in mind that if users have access to any dimension, they would be able to run queries performing arbitrary aggregations on top of that dimension. This effectively exposes your fact tables to end users. However, row-level security controls still apply and probing data outside of the security filters set is impossible. So, even with query pushdown, SQL API still provides very strong access isolation and security for your data.

Ungrouped queries

With query pushdown, the SQL API will execute ungrouped queries differently. This is considered a breaking change. When enabling query pushdown, please check if these changes affect your existing deployments.

Without query pushdown, Cube was treating queries without GROUP BY as aggregating queries with all dimensions implicitly included in GROUP BY. It was convenient for ad-hoc queries, but such semantics were breaking a lot of SQL guarantees.

When you enable query pushdown, queries without GROUP BY would be treated similarly to the REST API queries with the ungrouped: true flag. However, aggregation over nested ungrouped queries would still yield correct results. To drill down into this concept, let’s consider the following data model:

cubes:
- name: orders
sql_table: orders
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: status
sql: status
type: string
- name: created_at
sql: created_at
type: time
measures:
- name: count
type: count
- name: completed_count
type: count
filters:
- sql: "{status} = 'completed'"
- name: completed_percentage
sql: "({completed_count} / NULLIF({count}, 0)) * 100.0"
type: number
format: percent

With query pushdown enabled, let’s run the following query to the SQL API:

SELECT
id,
status,
created_at,
completed_percentage
FROM orders

Such a query is considered an ungrouped query and would result in the following result set:

| id | status | created_at | completed_percentage |
| -- | --------- | ---------- | -------------------- |
| 1 | shipped | 2024-01-01 | 0.0 |
| 2 | completed | 2024-01-01 | 100.0 |
| 3 | completed | 2024-01-02 | 100.0 |

Now, here’s a typical query that various BI tools would generate:

SELECT
DATE_TRUNC('day', created_at) AS created_at__day,
MEASURE(completed_percentage) AS completed_percentage
FROM (
SELECT
id,
status,
created_at,
completed_percentage
FROM orders
) AS inner_query
GROUP BY 1

When SQL API with query pushdown runs this query, it would yield correct results:

| created_at__day | completed_percentage |
| --------------- | -------------------- |
| 2024-01-01 | 50.0 |
| 2024-01-02 | 100.0 |

Here’s how this works. In this particular case, inner_query is not evaluated as a table. Instead, Cube postpones its execution until the wrapping GROUP BY is executed. Then, Cube would use only DATE_TRUNC('day', created_at) as a dimension to evaluate the completed_percentage measure—instead of full set of inner_query columns that includes id , status, and created_at. To make this possible, Cube keeps track of all ungrouped queries and evaluates them only on the first occurrence of GROUP BY query in case there’s one.

Also, with query pushdown, ungrouped queries don’t require primary key dimensions to be present, i.e., the CUBEJS_ALLOW_UNGROUPED_WITHOUT_PRIMARY_KEY configuration option is enabled as well for the best user experience. It means that removing the id column from the query above would not affect the calculations and still yield the same result.

How to enable query pushdown

First of all, please upgrade Cube to the latest available version, at least v0.35.40 or later. We’re actively working on the SQL API, shipping updates and improvements daily.

Then, you can enable query pushdown in the SQL API by setting the CUBESQL_SQL_PUSH_DOWN environment variable to true. Please refer to the documentation to learn more about query pushdown configuration.

What’s next?

Starting today, query pushdown in Cube’s SQL API is available as public preview. Once it becomes a generally available feature, we believe it should be enabled by default. Query pushdown is still in active development, though, and its behavior might change as we move forward.

Please give query pushdown a try, share your feedback on our Slack community, and please feel free to file issues on GitHub if needed.

And—we’re hiring!

Do you think hacking on the SQL API is fun? We have an open engineering position dedicated to it specifically. Please check out the job description if you’re interested in abstract syntax trees, abstract rewriting systems, databases, and Rust.