Core to our business at Simon Data is a data platform that enables our clients to access, operationalize, and centralize their marketing data. The ultimate goal of Simon Data is to give marketers an intuitive, non-technical platform in which to manage the sophisticated data operations required to drive great customer marketing outcomes.
The warehouse is powered by a complex, multi-tenant environment comprising datasets of varying lineages, schemas, and business purposes. Some datasets are common across client accounts — such as data we produce or import through marketing channel integrations — while other datasets are client-specific. While this delivers significant leverage for our customers to orchestrate bespoke end-user impressions, it also introduces a brain-bending taxonomical challenge for engineers looking to drive analytical insights across the entire system.
Since the product’s creation, the platform has seen several analytics products built into it. Early last year, we organized a focused team to consolidate them on a standard foundation. At the same time, we wanted to rethink how we integrated analytics into our application to streamline the development and deployment processes.
We knew the best path forward was to build a framework that could be used both now and in the future for analytics product development on any part of our core platform. To pull this off, we had some goals for the system.
We wanted a seamless development experience, support for querying arbitrary data, fluency in managing various schemas, and an ability to rapidly prototype and develop the user experience. Equally important was the ability to transparently present the queries and transformations used to produce results to facilitate QA by multiple stakeholders.
When we talked with users and business stakeholders, we uncovered several pain points with the existing analytics products.
We started with a holistic review of the analytics products we already had to gather requirements for the new reporting suite and took a close look at how clients were (or were not) using them. Alongside that, we also did a systematic review of technologies we could use to build a new reporting platform.
We looked at many commercial products and open-source projects and narrowed the list down to a handful of contenders, including Apache Kylin and Druid from the open-source space and Looker from the enterprise. Cube.js quickly rose to the top of that list.
What we liked most about Cube.js was that no other offering provided everything we were looking for: a seamless development experience, support for querying arbitrary data across varied schemas, rapid prototyping, and transparency into the queries behind each result.
Marketers interact with Simon Data through our easy-to-use web application. We needed a way to query data from multiple independent data warehouses, each with its own schema, and present the results in a common interface. This was a challenge on two fronts: development and deployment.
First, we had to invest in integrating Cube.js with our infrastructure to provide credentials and schema configuration for each individual client's data warehouse. Cube.js provided hooks we could use, so this was reasonably straightforward.
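Cube.js exposes per-request hooks for exactly this kind of multi-tenant wiring. Below is a hedged sketch of what such a configuration can look like — not our actual setup. `getWarehouseConfig` is a hypothetical stand-in for an internal per-client credential lookup, and all field names are illustrative:

```javascript
// cube.js — hedged sketch of a multi-tenant Cube.js configuration.
const SnowflakeDriver = require('@cubejs-backend/snowflake-driver');

// Hypothetical per-client credential lookup, stubbed for illustration.
const getWarehouseConfig = (clientId) => ({
  account: `acct-${clientId}`,
  username: 'reporting_user',
  password: process.env[`SNOWFLAKE_PASSWORD_${clientId}`],
  warehouse: 'REPORTING_WH',
  database: `CLIENT_${clientId}`,
});

module.exports = {
  // Isolate schema compilation and caches per client account.
  contextToAppId: ({ securityContext }) =>
    `CUBEJS_APP_${securityContext.clientId}`,

  // Hand Cube.js a Snowflake driver wired to that client's warehouse.
  driverFactory: ({ securityContext }) =>
    new SnowflakeDriver(getWarehouseConfig(securityContext.clientId)),
};
```

The `contextToAppId` and `driverFactory` hooks are the ones Cube.js documents for multitenancy; the per-client resolution behind them is where the integration work lives.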
However, we needed to give developers a way to run Cube.js to mirror the production environment while being able to rapidly iterate on schema, query, and frontend component development. This meant building a common application bootstrap and configuration for Cube.js that we could use to launch it in dev, staging, and production environments.
One neat trick we pulled off here was getting Cube.js to seamlessly serve both its Playground application and API queries from our web app when running in development mode. This lets us run everything (both Cube.js and the Simon Data web app) on our laptops, with the web app sending queries to the locally running Cube.js Playground.
Another challenge we faced was safely and securely deploying configurations to production without impacting what was already running. We achieved this by making schemas immutable.
When we want to deploy a new schema or patch an existing one, we create a whole new schema or increment the current schema with the changes. Once we have the latest schema working in our development environment, we can test it out in the staging environment and deploy it to production. We can do all of this in the background, so we only switch the production web application over to the new schema once we have complete confidence in it (see Figures 2 & 3 below).
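The immutability idea can be sketched in a few lines (all names hypothetical): deploys only ever append new schema versions, and cutting a client over is a single pointer swap, so a live schema is never modified in place.

```javascript
// Hedged sketch of immutable, versioned schemas (illustrative names only).
// Deploys append new versions; a client's active pointer moves only after
// the new version passes staging and integration tests.

function deploySchema(registry, name, definition) {
  // Never overwrite: each deploy creates the next numbered version.
  const versions = registry.get(name) || [];
  const version = versions.length + 1;
  versions.push({ version, definition });
  registry.set(name, versions);
  return version;
}

function activateSchema(pointers, client, name, version) {
  // Cutover is an atomic pointer swap; old versions stay intact for rollback.
  pointers.set(client, `${name}_v${version}`);
}

// Example: v1 stays live while v2 is deployed and validated in the background.
const registry = new Map();
const pointers = new Map();
deploySchema(registry, 'reporting', 'defs-v1');
activateSchema(pointers, 'acme', 'reporting', 1);
const v2 = deploySchema(registry, 'reporting', 'defs-v2');
// ...staging and integration tests run against reporting_v2 here...
activateSchema(pointers, 'acme', 'reporting', v2);
```

Because old versions are never touched, a rollback is just pointing the client back at the previous version.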
Our repeatable deploy process sets up the new schema in Snowflake and Cube.js simultaneously, keeping the two configurations strongly consistent.
We define a schema as a set of Snowflake views and Cube.js cubes that are developed and deployed together. To create a schema, we write view templates that map each customer's data in Simon's warehouse onto a standard interface, on top of which the cubes are built (see Figure 1 below). Our developers write these view and cube templates using Jinja from their IDEs and can rapidly iterate on them locally using our end-to-end development environment integration. Once the schema is complete and committed, our deployer tool dynamically renders the templates by filling in variables specific to each customer’s datasets. This pattern was the key to building repeatable schemas and deploying them consistently in our complex multi-tenant environment.
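Concretely, a schema pairs a Snowflake view template with a cube template. The sketch below is illustrative only — the table, column, and Jinja variable names (`client_dataset`, `schema_version`) are hypothetical placeholders for what our deployer fills in per customer:

```javascript
// orders.js.j2 — hypothetical cube template. {{ client_dataset }} and
// {{ schema_version }} are Jinja variables rendered per customer at deploy time.
//
// The matching Snowflake view template (orders_view.sql.j2) would look like:
//   CREATE VIEW {{ client_dataset }}.orders_view_{{ schema_version }} AS
//   SELECT order_id, customer_id, amount, created_at
//   FROM {{ client_dataset }}.raw_orders;

cube('Orders', {
  // The cube reads from the rendered view, so every client gets the same
  // interface regardless of how their underlying tables are shaped.
  sql: `SELECT * FROM {{ client_dataset }}.orders_view_{{ schema_version }}`,

  measures: {
    count: { type: `count` },
    totalAmount: { sql: `amount`, type: `sum` },
  },

  dimensions: {
    createdAt: { sql: `created_at`, type: `time` },
  },
});
```

The view absorbs per-customer differences; the cube stays identical across tenants, which is what makes the pattern repeatable.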
Finally, we created a comprehensive, automated end-to-end integration test suite (written in Python) that exercises each cube across all schemas that are live in Cube.js. This serves two important purposes: it validates newly deployed schemas before we switch clients over to them, and it continuously verifies that schemas already in production keep working.
The integration test runs periodically and also after the deployment of any new schemas. Once the test passes, we can safely switch the web application to the new schema with confidence.
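The suite itself is written in Python, but each probe it issues is an ordinary Cube.js REST query against `/cubejs-api/v1/load`. A hedged sketch of one probe (cube, measure, and dimension names are hypothetical):

```javascript
// Hedged sketch of a single integration-test probe. A query like this is
// issued for each cube in each live schema; the deploy is considered good
// only if every probe returns rows without error.
const probe = {
  measures: ['Orders.count'],            // hypothetical cube + measure
  timeDimensions: [{
    dimension: 'Orders.createdAt',       // hypothetical time dimension
    dateRange: 'Last 7 days',
  }],
  limit: 1,                              // existence check, not a data pull
};
```

Keeping each probe to a single row makes the sweep across every cube and schema cheap enough to run on a schedule.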
Figure 1: Schema is the unit of deployment
Figure 2: Deployment process
Figure 3: Schema migration
We can develop a new schema and deploy it in the background, then switch the Simon web app over when ready.
We delivered three major wins to our clients and business stakeholders with this platform:
Speed. We can develop and ship analytics products without any net-new infrastructure work going forward. Our efforts are focused on what matters: the data and the UI. Integrating the Cube.js Playground into our development stack lets us rapidly build a schema and query in Cube.js and then export that for implementation in the front end, all of which a developer can quickly do on a laptop.
Flexibility. We have a framework that enables us to seamlessly query any data in Snowflake and present it in the web application. The real win here is that we won't have to choose between maintainability and time to market to get this flexibility; we can have both.
Stability (Antifragility). When we're ready to ship a new product feature, we can deploy it with confidence that we will not introduce regressions in existing functionality, because our schemas are immutable. We can also roll out new features to client accounts in a controlled manner, one by one or in batches, and even keep the changes hidden to perform production validation before they are released to end users.
We’ve already built net-new analytics products on this platform that solve fundamental client pain points, proving that the platform works. We’re also starting to realize the velocity gains when iterating on these products to add functionality and refine the user experience.
Now we’re starting to look at consolidating some of the earlier analytics products onto the new framework. Finally, our business stakeholders now have transparency into the calculations that power analytics products in the application, giving them confidence in the data and helping them more easily answer client questions.