Data Schema Concepts

Cube.js borrows a lot of terminology from data science and OLAP theory, and this document is intended for both newcomers and regular users to refresh their understanding.

We'll use a sample e-commerce database with two tables, orders and line_items to illustrate the concepts throughout this page:

orders

idstatuscompleted_atcreated_at
1completed2019-01-05 00:00:00.0000002019-01-02 00:00:00.000000
2shipped2019-01-17 00:00:00.0000002019-01-02 00:00:00.000000
3completed2019-01-27 00:00:00.0000002019-01-02 00:00:00.000000
4shipped2019-01-09 00:00:00.0000002019-01-02 00:00:00.000000
5processing2019-01-29 00:00:00.0000002019-01-02 00:00:00.000000

line_items

idproduct_idorder_idquantitypricecreated_at
131112752019-01-31 00:00:00.000000
249262482021-01-20 00:00:00.000000
389361972021-11-25 00:00:00.000000
471482232019-12-23 00:00:00.000000
56455752019-04-20 00:00:00.000000

A cube represents a dataset in Cube.js, and is conceptually similar to a view in SQL. Cubes are typically declared in separate files with one cube per file. Within each cube are definitions of dimensions and measures. Typically, a cube points to a single table in your database using the sql property:

cube('Orders', {
  sql: `SELECT * FROM orders`,
});

The sql property of a cube is flexible enough to accommodate more complex SQL queries too:

cube('Orders', {
  sql: `
SELECT
  *
FROM
  orders,
  line_items
WHERE
  orders.id = line_items.order_id
  `,
});

Dimensions represent the properties of a single data point in the cube. The orders table contains only dimensions, so representing them in the Orders cube is straightforward:

cube('Orders', {

  ...,

  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      // Here we explicitly let Cube.js know this field is the primary key
      // This is required for de-duplicating results when using joins
      primaryKey: true
    },
    status: {
      sql: `status`,
      type: `string`
    },
  },
});

The line_items table also has a couple of dimensions which can be represented as follows:

cube('LineItems', {

  ...,

  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      // Again, we explicitly let Cube.js know this field is the primary key
      // This is required for de-duplicating results when using joins
      primaryKey: true
    },
  },
});

Dimensions can be of different types, and you can find them all here.

Time-based properties should be represented as dimensions with type time. Time dimensions allow grouping the result set by a unit of time (e.g. hours, days, weeks). In analytics, this is also known as "granularity".

We can add the necessary time dimensions to both schemas as follows:

cube('Orders', {

  ...,

  dimensions: {

    ...,

    createdAt: {
      sql: `created_at`,
      type: `time`
    },

    completedAt: {
      sql: `completed_at`,
      type: `time`,
    },
  },
});
cube('LineItems', {

  ...,

  dimensions: {

    ...,

    createdAt: {
      sql: `created_at`,
      type: `time`
    },
  },
});

Time dimensions are essential to enabling performance boosts such as partitioned pre-aggregations and incremental refreshes.

Measures represent the properties of a set of data points in the cube. To add a measure called count to our Orders cube, for example, we can do the following:

cube('Orders', {

  ...,

  measures: {
    count: {
      type: `count`,
    },
  },
});

In our LineItems cube, we can also create a measure to sum up the total value of line items sold:

cube('LineItems', {

  ...,

  measures: {
    total: {
      sql: `price`,
      type: `sum`,
    },
  },
})

Measures can be of different types, and you can find them all here.

Joins define the relationships between cubes, which then allows accessing and comparing properties from two or more cubes at the same time. In Cube.js, all joins are LEFT JOINs.

An INNER JOIN can be replicated with Cube.js; when making a Cube.js query, add a filter for IS NOT NULL on the required column.

In the following example, we are left-joining the LineItems cube onto our Orders cube:

cube('Orders', {

  ...,

  joins: {
    LineItems: {
      relationship: `belongsTo`,
      // Here we use the `CUBE` global to refer to the current cube,
      // so the following is equivalent to `Orders.id = LineItems.order_id`
      sql: `${CUBE}.id = ${LineItems}.order_id`,
    },
  },
});

There are three kinds of join relationships:

  • belongsTo
  • hasOne
  • hasMany

More information can be found in the Joins reference documentation.

Segments are filters that are predefined in the schema instead of a Cube.js query. They allow simplifying Cube.js queries and make it easy to re-use common filters across a variety of queries.

To add a segment which limits results to completed orders, we can do the following:

cube('Orders', {
  ...,
  segments: {
    onlyCompleted: {
      sql: `${CUBE}.status = 'completed'`
    },
  },
});

Pre-aggregations are a powerful way of caching frequently-used, expensive queries and keeping the cache up-to-date on a periodic basis. Within a data schema, they are defined under the preAggregations property:

cube('Orders', {

  ...,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.createdAt,
      granularity: 'day',
    },
  },
});

A more thorough introduction can be found in Getting Started with Pre-Aggregations.

Did you find this page useful?