Materialize is a different kind of database: on the surface it presents as Postgres, but internally it maintains the results of a query as a materialized view that stays up-to-date as the data changes. While a traditional database evaluates a query at the moment a SELECT statement is issued, Materialize asks for queries upfront and keeps their results up to date incrementally as new data arrives.

As we’ll discuss more below, this makes Materialize great for scenarios where you need the complex join and transformation capabilities of a warehouse, but you want to serve the results to applications.

What is Cube?

Cube is a headless business intelligence platform that makes data accessible and consistent across every application. Cube provides data modeling, access control, caching, and a variety of APIs (SQL, GraphQL, and REST) on top of any data warehouse, database, or query engine, including Materialize.

If you’re familiar with the concepts of “headless CMS” or “headless ecommerce”, you already have the right mental model for Cube. It’s essentially a “headless BI”—say, open-source Looker with less UI and more API—that delivers consistent data to your BI tools, data apps, and notebooks.

The Integration

Integrating Cube and Materialize was a perfect example of the benefits of building in public and using well-known protocols.

Cube is open source and has an existing PostgreSQL driver. Materialize contributed a lightly modified version of the PostgreSQL driver to help Cube read from Materialize.

Materialize integration

Why are Materialize and Cube useful together?

Data warehouses were built with the end goal of showing people data. They work best for data exploration, analysis, and reporting. But just looking at the data is no longer sufficient: now we’re connecting systems (APIs, applications, tools, and automation) to the data in the warehouse.

When people point systems at data warehouses, they inevitably face performance problems:

  • Not built for lookup queries
  • Not built for low latency and high concurrency
  • Reactivity is slow and time-to-insight is high

Also, there’s always a gap between how the data is stored and how the data is analyzed. Systems that consume the data (e.g., data apps or BI tools) and people working with them are most comfortable dealing with business-level metrics definitions (e.g., “revenue growth MoM by countries”) rather than calculations over the raw data in the warehouse.

With Materialize, you get warehouse-style views of your business data at application-style availability (always up-to-date, low latency queries). Combine that with Cube’s semantic layer and a variety of APIs to get a “live read API for your entire business”.

Show me an example!

Here’s how you can make Materialize work with Cube for fun and profit. Let’s build an application that consumes a live stream of Hacker News stories but only shows the ones that have ranked in the top 3, so you can still have plenty of free time.

Running Materialize and Cube. Both tools have managed cloud-based offerings (Materialize Cloud and Cube Cloud) but, for the sake of simplicity, we can run them in Docker. The following docker-compose.yml file will run Materialize, seed it with data, and then run Cube:

version: '2.2'
services:
  materialize:
    image: materialize/materialized:v0.26.1
    ports:
      - 6875:6875
  seed:
    image: jbergknoff/postgresql-client
    volumes:
      - .:/seed
    entrypoint: ["sh", "seed/seed.sh"]
    depends_on:
      - materialize
  cube:
    image: cubejs/cube:latest
    ports:
      - 4000:4000
    environment:
      - CUBEJS_DEV_MODE=true
      - CUBEJS_DB_TYPE=materialize
      - CUBEJS_DB_HOST=materialize
      - CUBEJS_DB_PORT=6875
      - CUBEJS_DB_NAME=materialize
      - CUBEJS_DB_USER=materialize
      - CUBEJS_API_SECRET=SECRET
    volumes:
      - .:/cube/conf
    depends_on:
      - seed

In the seed.sh file, we’ll use a vanilla PostgreSQL client to connect to the Materialize instance and run three SQL statements: the first creates a source from a live PubNub stream of Hacker News stories, the second transforms the stream into a normalized view with typed columns, and the third creates a materialized view of the stories:

#!/bin/bash
cat > seed.sql << EOL
CREATE SOURCE hn_raw
  FROM PUBNUB
  SUBSCRIBE KEY 'sub-c-c00db4fc-a1e7-11e6-8bfd-0619f8945a4f'
  CHANNEL 'hacker-news';

CREATE VIEW hn AS
  SELECT
    (item::jsonb)->>'link' AS link,
    (item::jsonb)->>'comments' AS comments,
    (item::jsonb)->>'title' AS title,
    ((item::jsonb)->>'rank')::int AS rank
  FROM (
    SELECT jsonb_array_elements(text::jsonb) AS item
    FROM hn_raw
  );

CREATE MATERIALIZED VIEW hn_top AS
  SELECT link, comments, title, MIN(rank) AS rank
  FROM hn
  GROUP BY 1, 2, 3;
EOL

psql -U materialize -h materialize -p 6875 materialize -f ./seed.sql

The only missing piece is Cube’s data model file, placed under schema/HN.js. Cube will read from the live-updated materialized view and expose metrics via its APIs.

cube(`HN`, {
  sql: `SELECT * FROM public.hn_top`,

  refreshKey: {
    every: '1 second'
  },

  measures: {
    count: {
      type: `count`
    },

    countTop3: {
      type: `count`,
      filters: [{
        sql: `${rank} <= 3`
      }]
    },

    bestRank: {
      sql: `rank`,
      type: `min`
    }
  },

  dimensions: {
    link: {
      sql: `link`,
      type: `string`
    },

    comments: {
      sql: `comments`,
      type: `string`
    },

    title: {
      sql: `title`,
      type: `string`
    },

    rank: {
      sql: `rank`,
      type: `number`
    }
  },

  segments: {
    show: {
      sql: `${title} LIKE 'Show HN:%'`
    }
  }
});

You can copy and paste these files, or grab them from GitHub. Now we can run the whole stack with docker compose up.

Exploring the data. Now, if you navigate to Cube’s Playground at localhost:4000, you can instantly run queries against the Materialize instance and its connected stream by picking measures, dimensions, segments, and filters in the UI.

Here’s how you can check how many stories we’ve had in the data stream (30) and how many of them have ever been ranked in the top 3 (no surprise, just three):

First query
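Under the hood, the Playground translates those UI selections into Cube’s JSON query format. A sketch of what the query above looks like in that format (the Playground shows you the exact generated query):

```json
{
  "measures": [
    "HN.count",
    "HN.countTop3"
  ]
}
```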

However, if you wait a few minutes and refresh the query results, you’ll see the data change as new stories are posted to HN and existing stories move up and down the ranks.

Then, here’s how you can combine segments and filters to pick only “Show HN” stories that ranked in the top 3:

Second query

Connecting to APIs. Cube exposes SQL, GraphQL, and REST APIs, each with its own query syntax. For instance, the query directly above is expressed as the following JSON query accepted by the REST API:

{
  "dimensions": [
    "HN.title",
    "HN.rank"
  ],
  "order": {
    "HN.rank": "asc"
  },
  "filters": [{
    "member": "HN.rank",
    "operator": "lte",
    "values": ["10"]
  }],
  "segments": [
    "HN.show"
  ]
}

Let’s run this query against the REST API and extract the result set with the amazing jq utility:

curl localhost:4000/cubejs-api/v1/load -G -s \
--data-urlencode "query=$(cat query.json)" | jq '.data'

REST API usage
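The same request can be issued from application code. A minimal sketch in Node.js that builds the request URL exactly the way the curl command does (the host and port come from the docker-compose file above):

```javascript
// The JSON query shown above, as a plain object.
const query = {
  dimensions: ["HN.title", "HN.rank"],
  order: { "HN.rank": "asc" },
  filters: [{ member: "HN.rank", operator: "lte", values: ["10"] }],
  segments: ["HN.show"]
};

// Cube's REST API expects the query as a URL-encoded `query` parameter,
// which is what `--data-urlencode` produces in the curl invocation.
const url =
  "http://localhost:4000/cubejs-api/v1/load?query=" +
  encodeURIComponent(JSON.stringify(query));

console.log(url);
```

From there, fetching that URL returns the same payload whose `data` field jq extracts above (in dev mode, Cube doesn’t require an authorization header).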

Also, if you are a GraphQL aficionado, you can use the GraphQL API with its alternative syntax:

GraphQL API usage
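As an illustration, a query similar in spirit to the REST one might look like this in Cube’s GraphQL syntax (a sketch; consult the generated GraphQL schema in the Playground for the exact field and argument names):

```graphql
query {
  cube(where: { hn: { rank: { lte: 10 } } }) {
    hn(orderBy: { rank: asc }) {
      title
      rank
    }
  }
}
```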

And there’s one more thing: Cube’s real-time data fetch over WebSockets is a great match for Materialize’s real-time nature. You can use it to feed live data from your streams through Materialize and Cube all the way down to your end users’ front-end applications.

Future improvements. The current implementation of Cube’s Materialize driver makes Cube periodically poll Materialize for updates. A more efficient and Materialize-native way to get data out of Materialize is to subscribe to changes using the TAIL feature. This would require more in-depth changes, which we are exploring.
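For the curious, TAIL can already be used interactively from psql today; for example, this sketch streams every change to the view defined above instead of polling it with repeated SELECTs:

```sql
-- Emit each insertion or deletion to hn_top as it happens,
-- as rows tagged with a timestamp and a diff.
COPY (TAIL hn_top) TO STDOUT;
```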

Try Materialize and Cube for yourself

Feel free to check out Materialize and Cube by running the example above on your machine. The full source code is available on GitHub. Also, please don’t hesitate to join Cube’s Slack community or Materialize’s Slack community to share your feedback.

To give both tools a try in a professional setting, check out Materialize Cloud and Cube Cloud. The future is real-time: let’s build it together with Materialize and Cube.