From an engineering standpoint, the technology behind Google Analytics was quite sophisticated when it was created. Custom, tailor-made algorithms were implemented for event collection, sampling, aggregation, and storing output for reporting purposes. Back then, shipping such a piece of software required years of engineering time. The big data landscape has changed drastically since then. In this tutorial, we’re going to rebuild an entire Google Analytics pipeline, from data collection to reporting. By using the most recent big data technology available, we’ll see how simple it is to reproduce such software nowadays.
Here’s an analytics dashboard with an embedded tracking code that collects data about its visitors while visualizing it at the same time.
Check out the source code on GitHub. Give it a star if you like it!
How Google Analytics works
If you’re familiar with Google Analytics, you probably already know that every web page tracked by GA contains a GA tracking code. It loads an async script that assigns a tracking cookie to a user if it isn’t set yet. It also sends an XHR for every user interaction, like a page load. These XHR requests are then processed and raw event data is stored and scheduled for aggregation processing. Depending on the total amount of incoming requests the data will also be sampled.
Even though this is a high level overview of Google Analytics essentials, it’s enough to reproduce most of the functionality. Let me show you how.
Your very own GA architecture overview
There are numerous ways of implementing a backend. We’ll take the serverless route because the most important thing about web analytics is scalability: your event processing pipeline scales in proportion to the load, just as Google Analytics does.
We’ll stick with Amazon Web Services for this tutorial. Google Cloud Platform can also be used as they have pretty similar products. Here’s a sample architecture of the web analytics backend we’re going to build.
For the sake of simplicity, we’re only going to collect page view events. The journey of a page view event begins in the visitor’s browser, where an XHR request to an API Gateway is initiated. The request event is then passed to Lambda, where event data is processed and written to a Kinesis Data Stream. Kinesis Firehose uses the Kinesis Data Stream as input and writes the processed files to S3. Athena is used to query those files directly from S3. Cube.js will generate SQL analytics queries and provide an API for viewing the analytics in a browser.
This seems very complex at first, but component decomposition is key. It allows us to build scalable and reliable systems. Let’s start implementing the data collection.
Building event collection using AWS Lambda
To deploy the data collection backend, we’ll use the Serverless Application Framework. It lets you develop serverless applications with minimal code dependencies on cloud providers. Before we start, please ensure Node.js is installed on your machine. Also, if you don’t have an AWS account yet, you’ll need to sign up for free and install and configure the AWS CLI.
To install the Serverless Framework CLI let’s run:
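This installs the CLI globally via npm:

```shell
npm install -g serverless
```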
Now create the event-collection service from a Node.js template:
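One way to do that, using the framework’s built-in `aws-nodejs` template (the `--path` flag also creates the service directory):

```shell
serverless create --template aws-nodejs --path event-collection
```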
This will scaffold the entire directory structure. Let’s cd into the created directory and add the aws-sdk dependency.
Install yarn package manager if you don’t have it:
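Assuming yarn as the package manager (npm works just as well):

```shell
# Install yarn globally if it's not already available
npm install -g yarn

# Add the AWS SDK, which the Lambda handler uses to write to Kinesis
yarn add aws-sdk
```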
We’ll need to update
handler.js with this snippet:
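A minimal sketch of what that handler might look like, assuming a stream named `event-collection` and an `anonymousId` field in the payload (the exact field names and validation are illustrative, not the original code):

```javascript
// handler.js — a sketch of the collect function.
// Assumes a Kinesis Data Stream named "event-collection" exists (created later).
const AWS = require('aws-sdk');
const { promisify } = require('util');

const kinesis = new AWS.Kinesis();
const putRecord = promisify(kinesis.putRecord.bind(kinesis));

const response = (body, status) => ({
  statusCode: status || 200,
  body: JSON.stringify(body),
  headers: {
    // Allow the tracking snippet to call this endpoint from any origin
    'Access-Control-Allow-Origin': '*'
  }
});

module.exports.collect = async (event) => {
  const body = JSON.parse(event.body);
  if (!body.anonymousId) {
    return response({ error: 'anonymousId is required' }, 400);
  }
  await putRecord({
    StreamName: 'event-collection',
    PartitionKey: body.anonymousId,
    // Newline-delimited JSON so Firehose and Athena can split records
    Data: JSON.stringify({
      anonymousid: body.anonymousId,
      url: body.url,
      referrer: body.referrer,
      eventtype: body.eventType,
      created_at: new Date().toISOString()
    }) + '\n'
  });
  return response({ status: 'ok' });
};
```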
As you can see, the only thing this simple function does is write a record into a Kinesis Data Stream named
event-collection. Please note that we’re writing data in newline-delimited JSON format so that Athena and Kinesis Firehose can understand it.
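To make the format concrete: each event becomes one JSON object terminated by a newline, so downstream consumers can split records on `\n` (a tiny illustration, not part of the deployed code):

```javascript
// One JSON object per line; consumers split records on '\n'
const toNdjsonRecord = (event) => JSON.stringify(event) + '\n';

const batch = [{ eventType: 'pageView' }, { eventType: 'pageView' }]
  .map(toNdjsonRecord)
  .join('');
console.log(batch);
```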
Also, we need to modify the
serverless.yml in order to deploy everything. Paste this in:
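A sketch of what that config might look like (the runtime reflects the era of this tutorial, and the IAM statement assumes the `event-collection` stream created below):

```yaml
service: event-collection

provider:
  name: aws
  runtime: nodejs8.10   # current when this was written; pick a newer runtime today
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - kinesis:PutRecord
      Resource:
        - "arn:aws:kinesis:us-east-1:*:stream/event-collection"

functions:
  collect:
    handler: handler.collect
    events:
      - http:
          path: collect
          method: post
          cors: true
```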
This config will deploy the
collect function and assign an API Gateway event trigger to it. It’ll also assign AWS Kinesis Data Stream permissions to the function.
With that, we’re done with writing all the backend code we need for our homemade GA. It’ll be able to handle thousands of incoming events per second. Too much for 2018, isn’t it? :)
Let’s deploy it to AWS:
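Assuming your AWS CLI credentials are configured, deployment is a single command:

```shell
serverless deploy -v
```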
If everything is okay, you’ll get a URL endpoint. Let’s test it with curl:
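With a placeholder for your own endpoint, the test request might look like this (an empty body should be rejected):

```shell
# Replace the URL with the endpoint printed by `serverless deploy`
curl -d '{}' -H 'Content-Type: application/json' \
  https://<your-api-id>.execute-api.us-east-1.amazonaws.com/dev/collect
```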
It should return a
400 status code and an error message that looks like this:
If this is the case, let’s proceed with the Kinesis setup.
AWS Kinesis setup
First of all, we need to create a Kinesis Data Stream called
event-collection. Sign in to your AWS account at console.aws.amazon.com and select the Kinesis service from the menu. By default the Serverless Framework deploys resources to the
us-east-1 region, so we’ll assume the AWS Lambda function was created there and switch regions if necessary before creating the stream.
To create the data stream, we need to set the name to
event-collection and set the number of shards. It can be set to 1 for now. The number of shards defines your event collection throughput. You can find more information about it here.
Once you’re done with the data stream, create a Kinesis Firehose delivery stream.
You should select
event-collection Kinesis stream as a source.
For now, to keep this tutorial simple, we don’t need to process any data. In production you would need to transform it to ORC or Parquet to ensure optimal performance. You can also use this step for event data enrichment, such as IP-to-location lookup.
We’ll be using S3 as a destination. You need to create a new S3 bucket. Please choose whichever name you like, but add an
events suffix as it will contain events.
Here you can choose GZIP compression to save on costs. You will also be prompted to create an IAM role for this delivery stream; just follow the instructions.
That’s it. If you’ve done everything right, try to run your AWS Lambda function again with a real payload.
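A request with a real payload, again with a placeholder endpoint and illustrative field values, might look like:

```shell
curl -d '{"anonymousId":"123","url":"https://example.com","eventType":"pageView"}' \
  -H 'Content-Type: application/json' \
  https://<your-api-id>.execute-api.us-east-1.amazonaws.com/dev/collect
```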
Events should start flowing to your S3 bucket within five minutes. At this point, the event collection is done. Let’s set up querying for analytics.
Setting up AWS Athena for querying analytics
As data starts flowing to S3, we need to describe it with metadata. Athena uses the metadata to understand where to find the data and what structure it has. This is a cumbersome process, but it can easily be done with AWS Glue. Glue is a metadata manager and ETL service from AWS. It also has a crawler concept, which acts like a cron job that analyzes S3 data in order to extract metadata from it.
Navigate to Glue from the services menu and select Databases. Add a new database and name it
aws_web_analytics. Then go to crawlers and choose ‘Add crawler’.
Name the crawler events-crawler and choose the S3 bucket that was just created as the data store:
Create an IAM Role according to the instructions and set it to run hourly:
As an output, select the previously created database:
Once it's created, let’s run it manually. If it’s successful, you should see a table in the
aws_web_analytics database. Let’s try to query it.
Go to Athena from the services menu. Select the
aws_web_analytics database and write a simple query, such as
select * from aws_web_analytics_event_collection. You should get a result like this:
If everything looks fine, we can proceed with building the analytics UI.
Setting up Cube.js to provide analytics for end users
AWS Athena is a great analytics backend, suitable for querying petabytes of data, but like any big data backend it isn’t suitable for direct querying by end users. To provide an acceptable performance-to-cost balance, you should use a caching and pre-aggregation layer on top of it, along with an API for querying analytics. This is exactly what Cube.js does!
First of all, we need to create an IAM user to access Athena from Cube.js. Select IAM from the AWS services menu. Select Users and click the Add User button. Set the user name to
cubejs and enable Programmatic access:
At step 2, select ‘Attach existing policies directly’, then select AmazonAthenaFullAccess and AmazonS3FullAccess:
Move on and create the user. Once it’s created, copy the Access key ID and Secret access key and save them. Please note that you’ll only see the secret once, so don’t forget to store it somewhere.
If you don’t have a Cube.js account yet, sign up for free now. After you’ve signed up, connect Athena as a data source:
You should also create a new, or find an existing, S3 bucket to store the Athena result output within the us-east-1 region. The setup should look like this:
If Athena connected successfully, you’ll be forwarded to the Cube.js schema. Let’s create a new
PageViews file and paste this in:
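A sketch of such a schema file, assuming the Glue crawler produced a table named `aws_web_analytics_event_collection` with columns like `url`, `eventtype`, and `created_at` (adjust all names to whatever Glue actually extracted from your events):

```javascript
cube(`PageViews`, {
  sql: `select * from aws_web_analytics.aws_web_analytics_event_collection`,

  measures: {
    count: {
      type: `count`
    }
  },

  dimensions: {
    url: {
      sql: `url`,
      type: `string`
    },
    eventType: {
      sql: `eventtype`,
      type: `string`
    },
    createdAt: {
      // Raw events store ISO 8601 strings; convert for time-based queries
      sql: `from_iso8601_timestamp(created_at)`,
      type: `time`
    }
  }
});
```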
Please replace the event table name with your own. Save the file and go to Explorer. You can learn more about the Cube.js Schema in the documentation here. If everything works the way it should, you will see the Page Views Count for this week:
Once it works, we’re ready to enable Cube.js API access. In order to do that, go to Data Sources and Edit Athena data source. In the Cube.js API tab enable Cube.js API access and copy the Global Token:
We’re now ready to create a React application to visualize our analytics.
Building a React analytics dashboard
Let’s use create-react-app scaffolding to create directory structure for our app:
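Assuming the app is called analytics-dashboard (any name works):

```shell
npx create-react-app analytics-dashboard
```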
cd into the created directory and add the required dependencies:
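One way to do that, with @cubejs-client/core assumed here alongside the modules named below (the exact original dependency list may have differed):

```shell
cd analytics-dashboard
yarn add @cubejs-client/core @cubejs-client/react bizcharts whatwg-fetch
```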
The @cubejs-client/react module is used to access the Cube.js API in a convenient manner, bizcharts is used to create layouts and visualize results, and whatwg-fetch is used to implement a track page function, which collects event data about users.
Let’s start with the tracking function. Create a
track.js file in the
analytics-dashboard directory and paste this in:
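A sketch of such a tracking function, using only browser APIs plus the whatwg-fetch polyfill (the cookie name, the random-ID scheme, and the endpoint URL are illustrative assumptions; a UUID library could be used instead):

```javascript
// track.js — a sketch of the client-side tracking function
import 'whatwg-fetch';

const COOKIE_NAME = 'aws_web_uid';

// Read a cookie value from document.cookie, or undefined if not set
const getCookie = (name) => {
  const match = document.cookie.match(new RegExp('(^| )' + name + '=([^;]+)'));
  return match ? match[2] : undefined;
};

// Return a persistent anonymous id, assigning one on first visit
const anonymousId = () => {
  let id = getCookie(COOKIE_NAME);
  if (!id) {
    // Simple random identifier; a UUID library could be used instead
    id = Math.random().toString(36).slice(2) + Date.now().toString(36);
    document.cookie = `${COOKIE_NAME}=${id};max-age=${60 * 60 * 24 * 365};path=/`;
  }
  return id;
};

export const track = () => fetch(
  // Replace with your own collect endpoint from `serverless deploy`
  'https://<your-api-id>.execute-api.us-east-1.amazonaws.com/dev/collect',
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      anonymousId: anonymousId(),
      url: window.location.href,
      referrer: document.referrer,
      eventType: 'pageView'
    })
  }
);
```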
Please replace the URL with your own collect function endpoint. This is all the code we need to track user page views on the client side. This code should be called when a page is loaded.
Let’s create the main App page with two simple charts. In order to do that, replace the App.js contents with this snippet:
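A sketch of what App.js might contain, assuming the PageViews cube defined earlier and the track function from track.js (the chart types and queries are illustrative):

```javascript
import React from 'react';
import cubejs from '@cubejs-client/core';
import { QueryRenderer } from '@cubejs-client/react';
import { Chart, Axis, Tooltip, Geom } from 'bizcharts';
import { track } from './track';

// Record a page view for every load of the dashboard itself
track();

// Replace with your own Cube.js Global Token
const API_KEY = 'YOUR-CUBEJS-GLOBAL-TOKEN';
const cubejsApi = cubejs(API_KEY);

// Line chart of a Cube.js result set pivoted for charting
const renderLine = (resultSet) => (
  <Chart height={300} data={resultSet.chartPivot()} forceFit>
    <Axis name="x" />
    <Axis name="PageViews.count" />
    <Tooltip />
    <Geom type="line" position="x*PageViews.count" />
  </Chart>
);

// Bar chart for categorical breakdowns
const renderBar = (resultSet) => (
  <Chart height={300} data={resultSet.chartPivot()} forceFit>
    <Axis name="x" />
    <Axis name="PageViews.count" />
    <Tooltip />
    <Geom type="interval" position="x*PageViews.count" />
  </Chart>
);

const chart = (query, render) => (
  <QueryRenderer
    query={query}
    cubejsApi={cubejsApi}
    render={({ resultSet }) => (resultSet ? render(resultSet) : 'Loading…')}
  />
);

const App = () => (
  <div>
    {chart({
      measures: ['PageViews.count'],
      timeDimensions: [{
        dimension: 'PageViews.createdAt',
        granularity: 'day',
        dateRange: 'This week'
      }]
    }, renderLine)}
    {chart({
      measures: ['PageViews.count'],
      dimensions: ['PageViews.url']
    }, renderBar)}
  </div>
);

export default App;
```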
Make sure to replace the
API_KEY constant with your own Cube.js Global Token. You should be able to see the dashboard with two charts:
Once again, here’s the deployed version of the dashboard, if you want to check it out.
To deploy your own, create a public S3 bucket with static site serving enabled, build the app, and sync it to the bucket:
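With a placeholder bucket name, that could look like:

```shell
# Build the production bundle and sync it to your public S3 bucket
yarn build
aws s3 sync build/ s3://<your-bucket-name> --acl public-read
```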
You also can use a service like Netlify to host your website. They make it incredibly simple to deploy and host a website.
This tutorial has shown you how to build a proof of concept of the Google Analytics feature set. The architecture is scalable enough to handle thousands of events per second and can analyze trillions of data points without breaking a sweat. Cube.js is suitable for implementing all the metrics you expect to see in GA, such as bounce rate, session time spent, etc. You can read more about it here. Large-scale analytics can’t be built without pre-aggregation of data. GA does this a lot, and Cube.js has a built-in solution for it.