Today, we're happy to announce Cube's integrations with DuckDB, an extremely potent in-process SQL OLAP database management system, also known as "the most hyped thing in data in 2023 prior to the AI craze", and MotherDuck, a cloud service providing hybrid execution for DuckDB, i.e., offloading some computations to the cloud and merging the results locally.
We see vast advantages for users of Cube, the semantic layer, and DuckDB, the super-fast OLAP database, in using them together. Let's unpack the use cases and explore the benefits.
Built-in query engine
Since DuckDB is an in-process OLAP DBMS, the integration with Cube implies bundling DuckDB together with Cube. Indeed, now every Cube Core or Cube Cloud deployment comes with a built-in and instantly available DuckDB which has the HTTPFS extension installed and loaded by default. With this extension, DuckDB can directly query Parquet, CSV, and JSON files over HTTPS, including files on S3-compatible object storage servers like AWS S3, Google Cloud Storage, and Cloudflare R2.
With built-in DuckDB, your semantic layer now has direct access to semi-structured data in a data lake or object storage without the need for any intermediary tools, e.g., an additional query engine between your data and your semantic layer. You still can do this in complex scenarios, but DuckDB helps get you started in almost no time.
You can also connect DuckDB to MotherDuck and get hybrid execution for your queries, so that MotherDuck will do all the heavy lifting querying the data from remote storage.
Value-add semantic layer
In lots of excitement about DuckDB, one could always spot a bit of scepticism about its intended production use. No doubt DuckDB is a fantastic and super-fast tool for local data exploration and experimentation—but what should be the next step to share the results with the team or stakeholders and put the insights into use? (We know one can always send an Excel sheet over email but it's so ad-hoc and so not 2023, right?)
Now, you can go from a quick analysis in a local DuckDB instance accessible to you and you only to a governed metric in a semantic layer accessible across the whole organization in a single step. As soon as you're happy with the metric definition and the results it yields, you can move the definition into Cube's code-first data model. It will take seconds to enter the development mode, update the data model, test the result, and roll out an updated metric into production without any disruption to existing data consumers.
In the demo video above, a Parquet file with 16 million rows and 250 MB in size, residing in a Google Cloud Storage bucket, is queried using a local DuckDB (left) and MotherDuck (right). You can see that MotherDuck drastically outperforms local DuckDB when querying remote files. You can also see that moving from an analytical query against DuckDB or MotherDuck to a metric defined in the semantic layer and available to all data consumers is a matter of seconds.
Next steps
Just like the MotherDuck team, we'd love to see how you'd put the power of DuckDB and MotherDuck into a production use with Cube:
Pairing the universal semantic layer of Cube with the hybrid execution of MotherDuck allows users to easily go from local development to building code-first data models and consistent metrics that keep everyone in an organization on the same page.
In October, we’re co-hosting a webinar with the MotherDuck team. Please stay tuned for the webinar registration; we'd love to show you live how Cube and DuckDB work together.
DuckDB and MotherDuck integrations are available in Cube Core and Cube Cloud, and you can try them today. Join our Slack community at slack.cube.dev to share your thoughts and opinions about this update.