ShopBack is the largest shopping rewards and discovery platform in Asia-Pacific, with a presence in Australia, Indonesia, South Korea, the Philippines, Singapore, Malaysia, Taiwan, Thailand, and Vietnam.
From a small team of six in 2014, we now house over six hundred employees across the region, having scaled to nine countries to provide a more rewarding shopping experience for over 28 million users. Our platform enables users to make better purchase decisions while delivering performance-based marketing with high and measurable return on investment to merchants.
ShopBack continues to innovate, evolve, and create value for our users and merchant partners. In 2020 alone, we powered close to US$3 billion in sales for over 5,000 merchant partners across the Asia-Pacific. To date, we have also awarded over US$200 million of cashback to our users.
Why ShopBack started using Cube
For all transactions on our ShopBack sites, we need to analyze all transactions, including purchase value, sales volume, etc. In January 2020, we started working on a new project that required a lot of dashboard reporting of aggregated data for both internal and external (e.g. merchant partners) application users. One of the options considered was building an in-house application and storing the data in a graph database. However, in consultation with Yann AïtBachir (Head of Data at ShopBack), our engineers decided to go with Cube as it was a better fit for our use case as our data was very much relational and will be pre-aggregated into OLAP cubes for analytics.
Since we first started using Cube, there has been an ongoing effort to add new Cube schemas, measures, etc., as our data continued to grow while improving the performance of our queries at the same time. To address performance concerns, we implemented pre-aggregations in early 2021. Before implementing pre-aggregations, our p95 query loading time could take as long as ~50 seconds, which is an insanely bad user experience. Once we started implementing pre-aggregations, we were able to go as low as ~20 seconds p95. After we started further optimizing our pre-aggregations with indexes, modifying our approach to pre-aggregations for unique queries, and a couple of optimizations outside of Cube, we were able to get to below 5 seconds p95.
We still have a few more optimizations planned to reach our target of <1 second p95, but this goes to show that proper application of pre-aggregations can drastically improve the performance of your Cube queries. The performance improvement with pre-aggregations is further demonstrated in the two figures below:
Implementing Cube at ShopBack
Since we started working with Cube in 2020, we have had nine people working on this project full-time with six engineers, 2 product managers, and 1 QA engineer. In addition, we had several other engineers who helped over the past 1½ year as we implemented Cube at ShopBack.
What you see on the figure above is an architectural diagram of our Cube implementation.
- Main Application. The Node.js application in AWS uses an integrated Cube API instance to serve requests to our React frontend application. It fetches 95% of our data from our pre-aggregations, and the remaining 5% have consistently small datasets, so we have not seen the need to pre-aggregate it. Cube responses are also cached within Redis whenever there’s a successful first fetch. Given that our application allows users to use multiple different filters, this significantly reduces the load times for filters that have already been used for the day.
- Data Pipeline. Our application acts as a downstream service that receives/pulls in data from our other services. All of this data goes through our ETL process (built in-house), which then stores the processed data into our Postgres database.
- Caching. Our architecture uses pre-aggregations that are being stored in the same Postgres database and a non-clustered instance of Redis for storing fetched data and tracking the latest refreshed pre-aggregations for a multi-tenant application setup. The refreshing of our pre-aggregations is done by a separate instance of our application that acts as the refresh worker; this refresh-worker tracks the refresh keys stored in our database (as seen below) to keep track of which pre-aggregations need to be refreshed. Making the refresh keys manageable like this allows us to fully control whenever we want to trigger a refresh in the middle of the day should we want to.
While working with Cube, here are some of the challenges/learnings that we came across:
- Making sure we are hitting pre-aggregations. With the amount of data that we have, we had to make sure that we pre-aggregate our different use-cases correctly. For example, you cannot make a pre-aggregation with a granularity set as a month if there are queries with date ranges that can start/end in the middle of the month. We learned that if we didn’t pay attention to these details, we may be surprised by not “hitting” our pre-aggregations database.
- Performance. While it was important to make sure that pre-aggregations are being hit, it was also important to remember that these pre-aggregations are still database tables that are queried on (and in our case, they’re stored in the same source database). In other words, if you end up with a pre-aggregation with a lot of data and a lot of cardinalities, then we also need to index those pre-aggregations. We learned that indexing our pre-aggregation tables has dramatically improved the performance of our queries. Below you can see a sample of our pre-aggregations that can easily have millions of rows in the pre-aggregation table. As mentioned above, adding indexes (in this case to columns
outletId) helped contribute significantly to speeding up our queries.
- Going beyond raw SQL queries. In general, it’s fairly easy to create raw SQL for your queries with their own joins/nested statements. However, when we started using pre-aggregations, we realized that the original SQL statements weren’t flexible enough, and you couldn’t simply “port SQL to Cube.” Now, rather than starting with SQL, we consider how we will implement Cube from the outset.
- Extracting refresh-related information. We are currently trying to get more logging data from refreshes for the ShopBack team and our clients (e.g., "chart data last refreshed at hh:mm:ss"). We extract refresh-related information by looking for select keywords in refresh logs, but if there are easier ways to do this, we’d love to learn more!
The next major Cube project for us is implementing Cube Store to achieve lower latency and higher concurrency. We are looking forward to implementing Cube Store after our region’s shopping campaigns/festivals and once our team has enough bandwidth (you can check out what else our tech team is working on here). From our investigation into Cube Store, we know that we will need to explore some workarounds:
- For security reasons, we do not set our AWS keys in our instance environment variables. As such, we do not store values in
CUBESTORE_AWS_SECRET_ACCESS_KEY, and these are causing unexpected behaviors with our network calls.
- With the Cube Dev team, we’re investigating a low-level error message regarding the Parquet library when we have a lot of data (e.g., 4M rows).
Working with the Cube community
Another key benefit of working with Cube is the community, as we believe it is a great example of an active community. I worked with other open source communities before joining ShopBack, and I typically saw the number of questions/issues pile up to a pretty high number. On the other hand, with the Cube community, whenever we have questions, they typically get answered within 24 hours of posting them.
Also, the community members are quite collaborative in terms of helping each other out. I have had a number of direct messages on Slack with other community members to discuss how they did their implementations/deployments. In return, I also try to reply to threads that I feel I can help with.