BigQuery Public Datasets for COVID-19 Impact Research

In this tutorial we'll explore how to build an analytical application on top of Google BigQuery, a serverless data warehouse, and use a few public datasets to visualize the impact of the COVID-19 pandemic on people's lives.

What is Google BigQuery?

Before you can ask, here's the application we're going to build:

Alt Text

And not only for the United States but for every country.

Now to BigQuery! BigQuery is a serverless big data warehouse available as a part of Google Cloud Platform. It's highly scalable, meaning that it can process tiny datasets as well as petabytes of data in seconds, using more cloud capacity as needed. (However, due to BigQuery's distributed architecture, you can't possibly expect it to have a sub-second query response time.)

BigQuery has a gentle learning curve, in part due to its excellent support for SQL, although (big surprise!) we won't be writing SQL queries in this tutorial.

BigQuery also has a free usage tier: you get up to 1 TB of processed data per month and some free credits to spend on Google Cloud during the first 90 days. You can probably guess that BigQuery is billed by the amount of processed data.

Alt Text

BigQuery web console in Google Cloud, with the most important information being: "Query complete (2.3 sec elapsed, 2.1 GB processed)."

So, let's see what datasets are waiting to be explored! 🦠

What are BigQuery Public Datasets?

BigQuery public datasets are made available without any restrictions to all Google Cloud users. Google pays for the storage of these datasets. You can use them to learn how to work with BigQuery or even build your application on top of them, exactly as we're going to do.

You could query them just if they were your own. However, always pay attention to the license and other relevant info, like update frequency and last update date. Unfortunately, some datasets are really outdated:

Alt Text

So, what to expect? Some of these 212 public datasets are quite interesting:

28 datasets about climate, including the Real-time Air Quality dataset
24 datasets related to developer tools, including GitHub Activity Data
33 datasets marked encyclopedic, including Hacker News dataset
and 33 datasets for COVID-19 research — let's talk about them!

COVID-19 Government Response Tracker. This dataset is maintained by the University of Oxford Blavatnik School of Government. It tracks policy responses to COVID-19 from governments around the world. Basically, all lockdowns, curfews, and workplace closures worldwide are registered in this dataset.

Google Community Mobility Reports. This dataset is maintained by Google. It provides insights into what has changed in people's habits and behavior in response to policies aimed at combating COVID-19. It reports movement trends over time by geography, across different retail and recreation categories, groceries and pharmacies, parks, transit stations, workplaces, and residential.

We can use both datasets to visualize and correlate the time measures against COVID-19 with changes in social mobility. Here's how it might look like:

Alt Text

Want to try it? Here's the live demo you can use right away. Also, the full source code is on GitHub.

For that, we need to create an analytical API over BigQuery and a web application talking to that API. So, let's get hacking! 🦠

How to Create an Analytical API

Why do we need an API in the first place?

The most obvious reason is that BigQuery can't provide a sub-second query response time, meaning that an application that talks directly to BigQuery will have a suboptimal user experience. Also, BigQuery bills you by the amount of transferred data, so if you have a popular app, you might suddenly know about that from a billing alert.

Also, direct interaction with BigQuery means that you'll need to write SQL queries. There's nothing wrong with SQL; it's a great domain-specific language, but having SQL queries all over your codebase smells like a leaky abstraction — your application layers will know about column names and data types in your database.

So, what are we going to do? In this tutorial, we'll use Cube.

Cube is an open-source analytical API platform, and it allows you to create an API over any database, BigQuery included.

Cube provides an abstraction called a "semantic layer," or a "data schema," which encapsulates database-specific things, generates SQL queries for you, and lets you use high-level, domain-specific identifiers to work with data.

Also, Cube has a built-in caching layer that provides predictable, low-latency response query times. It means that an API built with Cube is a perfect middleware between your database and your analytical app.

Let's try it in action.

The first step is to create a new Cube project. Here, I assume that you already have Node.js installed on your machine. Note that you can also use Docker to run Cube. Run in your console:

npx cubejs-cli create bigquery-public-datasets -d bigquery

Now you have your new Cube project in the bigquery-public-datasets folder containing a few files. Let's navigate to this folder.

The second step is to add BigQuery and Google Cloud credentials to the .env file. Cube will pick up its configuration options from this file. Make sure your .env file looks like this:

# Cube environment variables: https://cube.dev/docs/reference/environment-variables

CUBEJS_DB_TYPE=bigquery
CUBEJS_DB_BQ_PROJECT_ID=your-project-id
CUBEJS_DB_BQ_KEY_FILE=./your-key-file-name.json

CUBEJS_DEV_MODE=true
CUBEJS_API_SECRET=SECRET

Here's what all these options mean and how to fill them:

Obviously, CUBEJS_DB_TYPE says we'll be connecting to BigQuery.
CUBEJS_DB_BQ_PROJECT_ID should be set to the identifier of your project in Google Cloud. First, go to the web console of Google Cloud. Create an account if you don't have one. Then go to the new project creation page and create one. Your project identifier is just below the name text field:

Alt Text

CUBEJS_DB_BQ_KEY_FILE should be set to the key file name for your Google Cloud user that will connect to BigQuery. It's better to use a service account, a special kind of Google Cloud account with restricted access. Go to the service account creation page and create one. On the second step, you'll be asked to specify the roles for this service account. The only roles needed for read-only access to public datasets are BigQuery Data Viewer and BigQuery Job User. After the user is created, you need to add a new authentication key — use the ... button on the right to manage the keys for this account and add a new one of JSON type. The key file will be automatically downloaded to your machine. Please put it in the bigquery-public-datasets folder and update your .env file with its name.

Alt Text

The rest of the options configure Cube and have nothing to do with BigQuery. Save your .env file.

The third step is to start Cube. Run in your console:

npm run dev

And that's it! Here's what you should see:

Alt Text

Great, the API is up and running. Let's describe our data! 🦠

How to Define a Data Schema

Before we can explore the data, we need to describe it with a data schema. The data schema is a high-level domain-specific description of your data. It allows you to skip writing SQL queries and rely on Cube query generation engine.

Create two schema files with the following contents: take schema/Measures.js from this file, and schema/Mobility.js from that file. Here is a redacted version of the first file with a few interesting things:

cube(`Measures`, {
  sql: '
    SELECT *
    FROM `bigquery-public-data.covid19_govt_response.oxford_policy_tracker`
  ',

  measures: {
    confirmed_cases: {
      sql: `confirmed_cases`,
      type: `max`
    },

    cancelPublicEvents: {
      sql: `CAST(cancel_public_events AS NUMERIC)`,
      type: `max`
    },
  },

  dimensions: {
    country: {
      sql: `country_name`,
      type: `string`
    },

    date: {
      sql: `TIMESTAMP(${Measures}.date)`,
      type: `time`
    },
  },
});

Note that:

in this data schema, you describe an analytical cube
it contains the data retrieved via a straightforward sql query
you can define measures, i.e., numerical values to be calculated
measures are calculated using various functions, such as max
you can define dimensions, i.e., attributes for which the measures are calculated
dimensions can have different data types, such as string or time
in measure and dimension definitions, you can use BigQuery functions, e.g., CAST(... AS NUMERIC) or TIMESTAMP

And here's a part of another file:

cube(`Mobility`, {
  sql: '
    SELECT *
    FROM `bigquery-public-data.covid19_google_mobility.mobility_report`
  ',

  measures: {

  },

  dimensions: {

  },

  joins: {
    Measures: {
      sql: `${Measures}.country_name = ${Mobility}.country_region AND
            ${Measures}.date = ${Mobility}.date`,
      relationship: `hasOne`,
    }
  }
});

Here you can see that our two cubes, based on different tables from different BigQuery datasets, are joined together with join, where a join condition is provided as an SQL statement. Cube takes care of the rest.

Now we have the data schema in place, and we can explore the data! 🦠

How to Explore the Data

As the console output suggests, let's navigate to localhost:4000 and behold Cube Developer Playground. It provides a lot of features, but we have a clear path to follow. Let's build a query.

To do so, go to the "Build" tab, click "+ Measure," and select a measure. For example, select Measures Confirmed Cases. As you can see, the Measures Date time dimension has been automatically selected, and the chart below displays the count of confirmed COVID-19 cases over time. You can even "+ Filter" by Measures Country, use the "equals" condition, and put your own country's name into the filter field. Looks interesting, right?

Alt Text

Definitely feel free to experiment and try your own queries, measures, dimensions, time dimensions, granularities, and filters.

After that, let's move on and build an analytical app! 🦠

How to Build an Analytical App

It's worth noting that Cube Developer Playground has one more feature to explore.

If you go to the "Dashboard App" tab, you'll be able to generate the code for a front-end application with a dashboard. There're various templates for different frameworks (React and Angular included) and charting libraries there. Still, you can always choose to "create your own," and if you choose a "dynamic" template, you'll be able to compose queries and add charts just like you did.

Alt Text

However, we'll choose a much simpler way to go from zero to a full-fledged analytical app — we'll grab the code from GitHub:

first, download this dashboard-app.zip file
unzip it to your bigquery-public-datasets folder
run yarn and yarn start (or npm install and npm start)

You should be all set! Navigate to localhost:3000 and have a look at this app:

Alt Text

Choose your country and take your time to explore the impact of COVID-19 and how mitigation measures correlate with social mobility.

Let's take Israel. You can clearly see three waves and the "easing" effect of "stay at home" requirements — after they are introduced, every wave spreads with lesser speed.

Alt Text

Let's take Germany. You can see how Germans interact with the rules: after the first "stay at home" requirements are lifted, park activity grows, and after the second "stay at home" requirements are introduced, parks instantly become deserted.

Alt Text

Let's take Singapore. Obviously enough, you can see Singapore doing a great job containing the virus. The third wave is nearly unexistent.

Alt Text

What are your own insights? Please share them in the comments!

And now, let's explore a few crucial parts of this app to understand better how it works and, more specifically, how it retrieves data from Cube API.

First, as you can see from package.json, it's obviously a React app created with the create-react-app utility. It has an index.js as an entry point and the App root component.

Second, it references @cubejs-client/core and @cubejs-client/react packages as dependencies. Here's what you can see in the api.js file:

// Let's use Cube client library to talk to Cube API
import cubejs from '@cubejs-client/core'

// API URL and authentication token are stored in .env file
const cubejsApi = cubejs(
    process.env.REACT_APP_CUBEJS_TOKEN,
    { apiUrl: `${process.env.REACT_APP_API_URL}/cubejs-api/v1` }
);

// The simplest Cube query possible:
// "Hey, Cube, give us a list of all countries."
const countriesQuery = {
    dimensions: [ 'Mobility.country' ]
}

export function loadCountries(callback) {
    // cubejsApi.load returns a promise.
    // Once it's resolved, we can get the result.
    // We can even transform it with tablePivot() or chartPivot()
    cubejsApi
        .load(countriesQuery)
        .then(result => {
            const countries = result
                .tablePivot()
                .map(row => row['Mobility.country'])

            callback(countries)
        })
}

Believe it or not, that's the bare minimum we should know about working with Cube REST API in the front-end apps. You import a client library, you compose your query as a JSON object, you load the result asynchronously, and you do whatever you want with the data.

In this application, the data is visualized with Chart.js, a great data visualization library. However, you can choose any library you're familiar with. And maybe your app will look even better than this one:

Alt Text

And that's all, folks! 🦠 I hope you liked this tutorial 🤗

Here's just a few things you can do in the end:

go to the Cube repo on GitHub and give it a star ⭐️
share a link to this tutorial on social media or with a friend 🙋‍♀️
share your insights about the impact of COVID-19 in Cube community Slack

BigQuery Public Datasets for COVID-19 Impact Research

What is Google BigQuery?

What are BigQuery Public Datasets?

How to Create an Analytical API

How to Define a Data Schema

How to Explore the Data

How to Build an Analytical App

Upgrade your data stack today

More on Google BigQuery

DeWitt clause, or Can you benchmark a database and get away with it

How Gadsme is using Cube for analytics on their in-game advertising platform

Performance capabilities of data warehouses and how Cube can help