SQL API

Migration to Postgres protocol

This page documents Cube's SQL API for the Postgres protocol. The MySQL protocol used in the first versions of the Cube SQL API is no longer being developed and will be phased out soon. Please consider migrating to the Postgres protocol.

The Cube SQL API allows querying Cube via Postgres-compatible SQL. It enables the use of BI applications, Python notebooks, reverse ETL tools, and other downstream tools on top of Cube.

The Cube SQL API has been tested with the psql CLI, Apache Superset, Tableau Cloud, Tableau Desktop with the JDBC driver, and Power BI. Please see this GitHub issue for the tools roadmap and to suggest and vote for tools of interest to you.

The first step to get started with the SQL API in Cube Cloud is to create a deployment. You can follow this step-by-step guide on creating a deployment within Cube Cloud.

Once the deployment is ready, click the How to connect link on the Overview page. It will open a modal with instructions on different ways to connect to Cube. Navigate to the SQL API tab and enable the SQL API.

Once it is enabled, you should see a screen like the one below with your connection credentials.

Since the Cube SQL API is Postgres-compatible, please make sure to select Postgres as the database type when connecting from BI tools.

To enable the SQL API, we first need to add a new environment variable:

CUBEJS_PG_SQL_PORT=5432

If you're running Cube with Docker, remember to add a port mapping to the Cube service for CUBEJS_PG_SQL_PORT. Docker Compose example:

services:
  cube_api:
    ...
    ports:
      - 5432:5432 # Cube SQL API

Or, if running Docker from the command line:

docker run -p 4000:4000 -p 5432:5432 \
  -v ${PWD}:/cube/conf \
  -e CUBEJS_DEV_MODE=true \
  -e CUBEJS_PG_SQL_PORT=5432 \
  cubejs/cube

Then, set the credentials used to authenticate with the Cube SQL API:

CUBEJS_SQL_USER=myusername
CUBEJS_SQL_PASSWORD=mypassword

Now, you can start your Cube instance and connect to it via the psql client with the provided credentials:

psql -h 127.0.0.1 --port 5432 -U myusername --password

Under the hood, the SQL API uses Apache DataFusion as its SQL execution engine, which is responsible for query planning and execution. Because the conversion from SQL to a Cube Query is ambiguous, an additional step, called rewriting, is performed before the query is executed. During this step, the query plan is rewritten so that the maximum number of Cube Queries can be detected within it. Overall, rewriting is a seamless process, though there are some practical considerations to keep in mind while querying.

In order to simplify interaction with data tools, every cube is represented as a table. Measures, dimensions, and segments in this table are columns. However, not all queries are possible on these tables. To be valid, a query against a cube table should be:

  1. a valid SQL statement that DataFusion can compile to an initial query plan;
  2. one of the supported query types, so it can be converted to a Cube Query.

Let's discuss some of the supported query types.

There are two types of queries supported against cube tables: aggregated and non-aggregated. Aggregated queries are those with a GROUP BY statement; non-aggregated queries are those without it. Cube Queries issued to your database will always be aggregated, regardless of whether you provide GROUP BY in a query.

In a non-aggregated query, you only need to provide column names in SQL:

SELECT status, count FROM Orders

The same query in aggregated form should always aggregate measure columns, using either a corresponding aggregate function or the special MEASURE() function:

SELECT status, SUM(count) FROM Orders GROUP BY 1
SELECT status, MEASURE(count) FROM Orders GROUP BY 1

If measure columns are not aggregated, a Projection references non-aggregate values error will be thrown. It means there are columns that are neither in the GROUP BY clause nor aggregated. This is the standard SQL GROUP BY consistency check, enforced by the SQL API as well.

Cube supports most simple comparison operators (=, <>, <, <=, >, >=) as well as the IN and LIKE operators. Cube tries to push down all filters into the Cube Query. In some cases, SQL filters aren't available in Cube and are applied in a post-processing phase instead. Time dimension filters will be converted to time dimension date ranges whenever possible.
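
For example, the following query (a sketch against the Orders cube used throughout this page, with illustrative filter values) combines a dimension filter that Cube can push down with a predicate on the created time dimension of the kind Cube can convert to a date range:

SELECT status, MEASURE(count)
FROM Orders
WHERE status IN ('completed', 'shipped')
  AND created >= '2022-01-01'
GROUP BY 1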

Cube tries to push down all ORDER BY statements into the Cube Query. If that isn't possible, ordering is performed in a post-processing phase; in that case, result sets with more than 50,000 rows may produce incorrect results. Please use EXPLAIN to check whether this applies to your query.
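
For instance, to inspect how an ordered query is planned, you could run EXPLAIN on it (an illustrative query; the output is omitted here as it depends on your deployment):

EXPLAIN SELECT status, MEASURE(count) FROM Orders GROUP BY 1 ORDER BY 2 DESC;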

Cube supports LIMIT push-down; however, a limit over 50,000 rows can't be overridden. In future versions, paging and streaming will be used to avoid this limitation.
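
For example, the limit in the following illustrative query is pushed down into the Cube Query:

SELECT status, MEASURE(count) FROM Orders GROUP BY 1 LIMIT 100;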

Consider the following schema:

cube(`Orders`, {
  sql: `SELECT * FROM public.orders`,

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    status: {
      sql: `status`,
      type: `string`,
    },

    created: {
      sql: `created_at`,
      type: `time`,
    },
  },
});

It would be represented as a table in the SQL API with count, status, and created columns.

To get the count of orders grouped by status, we can run the following query:

cube=> SELECT count, status FROM Orders;
 count |   status
-------+------------
 15513 | completed
 14652 | processing
 13829 | shipped
(3 rows)

Cube will automatically apply the GROUP BY clause if it is missing from the query. We can also provide a GROUP BY statement to control how results are grouped. In the following example, we group orders by the month they were created and, within each month, by status:

cube=> SELECT MEASURE(count), status, DATE_TRUNC('month', created) date FROM Orders GROUP BY date, status ORDER BY date ASC;
 measure(Orders.count) |   status   |            date
-----------------------+------------+----------------------------
                    31 | shipped    | 2016-01-01 00:00:00.000000
                    28 | completed  | 2016-01-01 00:00:00.000000
                    28 | processing | 2016-01-01 00:00:00.000000
                    28 | shipped    | 2016-02-01 00:00:00.000000
                    18 | processing | 2016-02-01 00:00:00.000000
                    28 | completed  | 2016-02-01 00:00:00.000000
                    54 | processing | 2016-03-01 00:00:00.000000
                    57 | completed  | 2016-03-01 00:00:00.000000
                    56 | shipped    | 2016-03-01 00:00:00.000000
                    54 | shipped    | 2016-04-01 00:00:00.000000
                    60 | completed  | 2016-04-01 00:00:00.000000
                    43 | processing | 2016-04-01 00:00:00.000000
                    55 | shipped    | 2016-05-01 00:00:00.000000

Querying dimensions is straightforward: simply add any required fields to the SELECT clause.

cube=> SELECT status FROM Orders;
   status
------------
 completed
 processing
 shipped
(3 rows)

Measures can similarly be queried through Cube SQL.

Because measures are already aggregated in Cube, there is no need to apply aggregate functions to them in the SQL API if the query has no GROUP BY statement:

cube=> SELECT count FROM Orders;
 count
-------
 43994
(1 row)

Some BI systems or SQL constraints may require you to apply aggregate functions. To support this, Cube allows aggregate functions on measures as long as they match the measure's type.

The count measure in our example is of type count, which means we can apply the COUNT() aggregate function to it. The query below is equivalent to the one above:

cube=> SELECT COUNT(count) FROM Orders;
 COUNT(Orders.count)
---------------------
               43994
(1 row)

There's also the universal aggregate function MEASURE(), which matches any measure type:

cube=> SELECT MEASURE(count) FROM Orders;
 measure(Orders.count)
-----------------------
                 43994
(1 row)

Let's look at some more measure types:

cube('Orders', {
  ...,

  measures: {
    count: {
      type: `count`,
    },
    distinctCount: {
      sql: `id`,
      type: `countDistinct`,
    },
    approxDistinctCount: {
      sql: `id`,
      type: `countDistinctApprox`,
    },
    minValue: {
      sql: `min_value`,
      type: `min`
    },
    maxValue: {
      sql: `max_value`,
      type: `max`
    },
  },
})

As we can see, we have a mix of measure types in the above schema. To query them, we could use the following SQL statements:

-- Both the following statements are equivalent
SELECT count FROM Orders
SELECT COUNT(*) FROM Orders

-- Count distinct, and count distinct approx
-- Both the following statements are equivalent
SELECT distinctCount FROM Orders
SELECT COUNT(DISTINCT distinctCount) FROM Orders

-- Both the following statements are equivalent
SELECT approxDistinctCount FROM Orders
SELECT COUNT(DISTINCT approxDistinctCount) FROM Orders

-- Both the following statements are equivalent
SELECT minValue FROM Orders
SELECT MIN(minValue) FROM Orders

-- Both the following statements are equivalent
SELECT maxValue FROM Orders
SELECT MAX(maxValue) FROM Orders

Any segments defined in a schema can also be used in Cube SQL queries. Looking at the schema below, we have one segment, isCompleted:

cube('Orders', {
  ...,

  segments: {
    isCompleted: {
      sql: `${CUBE}.status = 'completed'`,
    },
  },
});

Segments must be used as boolean types in Cube SQL queries:

WHERE isCompleted = true
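
For example, a complete query using this segment (an illustrative sketch against the Orders cube, assuming the segment above is defined on it) could look like this:

SELECT MEASURE(count) FROM Orders WHERE isCompleted = true;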

Cube can be configured with a dynamic username and password verification system by setting a checkSqlAuth() function in the cube.js configuration file. This function should verify the username and return an object with the password and security context.

If the password returned from this function matches the one provided in the connection string, the user will be authenticated with the provided security context:

module.exports = {
  checkSqlAuth: async (req, username) => {
    if (username === 'fooUser') {
      return {
        password: 'mypassword',
        securityContext: {},
      };
    }

    throw new Error('Incorrect user name or password');
  },
};

Cube's SQL API can also use the Security Context for Dynamic Schema Creation or the queryRewrite property in your cube.js configuration file.

By default, the SQL API uses the current user's Security Context, but this behaviour can be modified so that certain users are allowed to switch. To do this, first define which user is allowed to change the Security Context:

CUBEJS_SQL_SUPER_USER=admin

If that isn't enough for your use case, you can define custom logic for this check with the canSwitchSqlUser property in your cube.js configuration file.

You can change the security context for a specific query via a virtual __user filter:

SELECT * FROM Orders WHERE __user = 'anotheruser';

SELECT statements only support the following projections:

* to select all dimensions:

SELECT * FROM Orders;

A valid expression for a dimension or measure:

SELECT COUNT(*) FROM Orders;

A valid expression with an alias:

SELECT COUNT(*) AS order_count FROM Orders;

Cube SQL supports most conditional checks for the WHERE clause.

Comparison operators:

WHERE price > 50
WHERE price >= 50 AND price <= 100

Boolean logic:

WHERE isPaid = true
  AND isCompleted = false
  OR isReviewed = false

IN operator:

WHERE status IN ('completed', 'shipped')
WHERE status NOT IN ('processing')

IS NULL:

WHERE completedAt IS NULL
WHERE completedAt IS NOT NULL

LIKE:

WHERE name LIKE 'joe'
WHERE name NOT LIKE 'bloggs'

The SQL API currently does not support joins; we plan to add support in future releases.

As a workaround, it is possible to define a proxy dimension or measure inside the cube:

cube(`Orders`, {
  sql: `SELECT * FROM public.orders`,

  joins: {
    Users: {
      relationship: `belongsTo`,
      sql: `${CUBE}.user_id = ${Users}.id`,
    },
  },

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primaryKey: true,
    },

    // This is a proxy dimension
    user_city: {
      sql: `${Users.city}`,
      type: `string`,
    },
  },
});

cube(`Users`, {
  sql: `SELECT * FROM public.users`,

  measures: {},

  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primaryKey: true,
    },

    city: {
      sql: `city`,
      type: `string`,
    },
  },
});

Now, it is possible to get the count of orders by the user's city with the following query:

cube=> SELECT count, user_city FROM Orders;
 count |   user_city
-------+---------------
  9524 | New York
  9408 | San Francisco
  6360 | Mountain View
  6262 | Seattle
  4393 | Los Angeles
  3183 | Chicago
  3060 | Austin
  1804 | Palo Alto
(8 rows)
