I’m a software engineer at Ternary, a company that provides a FinOps platform to serve Google Cloud (GCP) customers' unique needs. Ternary’s platform empowers cloud engineers, IT finance, and business teams to collaborate more effectively on public cloud cost optimization. The company is in high-growth mode on all fronts, and its platform now supports many of the largest GCP customers. These customers bring the challenge of providing a SaaS platform at scale.
Below, I show how Ternary has solved the challenge of managing the large volume of cost-related data these customers produce and of using that data to forecast and optimize spend. I explain the 3-query approach we use to analyze large multidimensional datasets and surface the most interesting data for our users, and I provide real-world code snippets in the second half of this article.
Multidimensional analysis of cost datasets
Ternary’s platform features the ability to break down costs by projects and other dimensions across a time series. For users who have many values in a given dimension (e.g., a few thousand GCP projects), or who want to analyze unique combinations of dimensions, we would have to load and analyze substantial result sets on the front end, and we would also frequently run into Cube’s response limit of 50,000 rows. For example, say I want to look at total spend for every project across the last 30 days while breaking that spend down by one or more additional dimensions. Chances are the result set would contain thousands of rows, and if it exceeded the 50K-row limit, I would get an incomplete dataset and would be unable to calculate my total costs.
Ternary is a cost-management platform, so it’s critical that we present complete, accurate data. To enable our users to perform multidimensional analysis of vast volumes of cost data, we asked ourselves what data our users most want to see.
Ternary displays most of its data in line or bar graphs, which are useful only up to a certain number of data points. A line graph with 50 increments on the y-axis can be hard to read, especially if some y-axis values are much greater than others. At Ternary, we hypothesized that many of our key users would want to focus on what we informally call their “most interesting” data points, which we defined as the groupings with the highest spend over a 30-day period, along with the general shape of their expenses during that time. What counts as “most interesting” is debatable in every case, but for us, it is usually the groupings that account for the most spend.
Having made this assumption about what our users are really interested in, we imposed limits on our queries so that the user sees only that data. This focus makes our charts more readable and removes the concern about hitting the Cube response limit.
In limiting the data, we inevitably lose some portion of the entire picture. The top 10 “most interesting” dimension values will likely not account for all of a user’s costs. If there were as many as 100 dimension values, our method would expose only a fraction of them.
So, to account for any missing data, we roll all the “less interesting” data points into an “other” column. By including this “other” column, we fulfill our above-mentioned mandate to provide complete and accurate data, reflecting total spend and ensuring the accuracy of any subsequent calculations.
The 3-query approach to querying top N rows
In order to expose the “most important” data and generate the “other” column, we need to fire off at least three queries. Let’s go through the high-level query breakdown here.
Scenario: As a Ternary user, I want to see how much my team has spent day by day, broken down by project ID and category, over the last 30 days.
Query 1: Request all of the cost totals day by day, over the last 30 days, with desired query filters. At this point, we will not include dimensions. Note that the “cost” measure in this example is of type “sum.”
Query 2: Request the top N number of the “most interesting” combinations of projects and categories. In this case, we will find out which project-category pairs had the highest overall spend in the last month. For this scenario, let’s assume N is 3.
Now that we have a list of the top three combinations, we’ll generate query filters. Note that the new filters must use Boolean logical operators to accurately filter for the specific project ID and category groupings.
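As a sketch of what that filter generation might look like, here is a helper that turns the top-N pairs from query 2 into a Cube filter using the Boolean logical operators. The member names (`TestCube.projectId`, `TestCube.category`) and the sample values are assumptions for illustration, not Ternary’s actual schema:

```javascript
// Build a Cube filter from the top-N (projectId, category) pairs.
// Each pair must match on BOTH dimensions, so the two conditions for a
// pair are AND-ed together, and the pairs are OR-ed with one another:
// (p1 AND c1) OR (p2 AND c2) OR (p3 AND c3)
function buildTopNFilter(pairs) {
  return {
    or: pairs.map(({ projectId, category }) => ({
      and: [
        { member: 'TestCube.projectId', operator: 'equals', values: [projectId] },
        { member: 'TestCube.category', operator: 'equals', values: [category] },
      ],
    })),
  };
}

// Hypothetical output of query 2 (top 3 pairs by spend).
const topPairs = [
  { projectId: 'proj-a', category: 'Compute' },
  { projectId: 'proj-b', category: 'Storage' },
  { projectId: 'proj-a', category: 'Networking' },
];

const filter = buildTopNFilter(topPairs);
```

The resulting object can be passed in the `filters` array of query 3 alongside any user-selected filters.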
Query 3: Request cost day by day over the last 30 days, broken down by project-category pairs, using the filters generated from query 2 in addition to any user-selected filters. (Note that we will include additional dimensions now.)
This query would return 3 * 30 = 90 data points. However, there is a discrepancy between the total spend of those three project-category pairs and the total overall cost on any given day. To account for this difference, we take the totals from query 1 and subtract the values from query 3 to generate an “other” column, which we also display for the user.
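A minimal sketch of that subtraction, using simplified stand-ins for the Cube result sets (the field names and figures are illustrative):

```javascript
// Derive the "other" series: overall daily totals (query 1) minus the
// spend captured by the top-N breakdown rows (query 3).
function computeOtherSeries(totalsByDay, breakdownRows) {
  // Sum the top-N spend for each day.
  const topNByDay = {};
  for (const row of breakdownRows) {
    topNByDay[row.day] = (topNByDay[row.day] || 0) + row.cost;
  }
  // "other" = total for the day minus whatever the top N groupings covered.
  return Object.fromEntries(
    Object.entries(totalsByDay).map(([day, total]) => [
      day,
      total - (topNByDay[day] || 0),
    ])
  );
}

// Hypothetical query 1 result: total cost per day.
const totals = { '2023-01-01': 100, '2023-01-02': 120 };

// Hypothetical query 3 result: cost per day for the top pairs.
const breakdown = [
  { day: '2023-01-01', projectId: 'proj-a', category: 'Compute', cost: 40 },
  { day: '2023-01-01', projectId: 'proj-b', category: 'Storage', cost: 35 },
  { day: '2023-01-02', projectId: 'proj-a', category: 'Compute', cost: 50 },
];

const other = computeOtherSeries(totals, breakdown);
// other['2023-01-01'] === 25, other['2023-01-02'] === 70
```

Because the “cost” measure is a sum, the top-N values plus “other” always add back up to the query 1 totals, which is what keeps the chart complete.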
By this method, we generate a complete overview, identifying the most expensive projects while preserving the total cost using the “other” column. While the overview does not display every project and some specificity is lost, all the data is reflected in one form or another.
Keep in mind that N can be much larger than three; it should change according to the user’s goals. Lastly, it’s important to acknowledge that while this solution is a “three-step process,” the actual number of queries being run will be 1 + 2m, where m is the number of measures that the user wants to see. This is because we want to gather a separate set of filters for every measure requested.
For example, say we wanted to see separately the three projects responsible for the most spend and the three projects responsible for the most usage. While we can use query 1 to request all of the selected measure totals across the time series, we will need to fire off queries 2 and 3 for both cost and usage separately. Otherwise, we will muddle the results with an inconsistent number of data points for cost and usage.
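The query count from the formula above can be sanity-checked with a one-liner:

```javascript
// One shared totals query, plus a top-N query and a filtered breakdown
// query for each requested measure.
const queryCount = (measureCount) => 1 + 2 * measureCount;

// Cost only: 3 queries. Cost and usage, as in the example: 5 queries.
```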
To try this out yourself, create the following Cube schema:
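A minimal sketch along these lines, assuming a cube named `TestCube` backed by a billing table; the table, measure, and dimension names here are illustrative placeholders:

```javascript
cube(`TestCube`, {
  sql: `SELECT * FROM billing.costs`, // placeholder source table

  measures: {
    cost: {
      sql: `cost`,
      type: `sum`, // a sum-type measure, so top-N + "other" adds up to the total
    },
    usage: {
      sql: `usage_amount`,
      type: `sum`,
    },
  },

  dimensions: {
    projectId: {
      sql: `project_id`,
      type: `string`,
    },
    category: {
      sql: `category`,
      type: `string`,
    },
    timestamp: {
      sql: `usage_start_time`,
      type: `time`,
    },
  },
});
```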
The three queries you’ll need to run are shown below, in JSON query format:
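Sketches of the three queries, assuming a cube named `TestCube` with a sum-type `cost` measure, `projectId` and `category` dimensions, and a `timestamp` time dimension; the filter values in query 3 are placeholders standing in for whatever pairs query 2 actually returns:

```json
{
  "query1_totals": {
    "measures": ["TestCube.cost"],
    "timeDimensions": [{
      "dimension": "TestCube.timestamp",
      "granularity": "day",
      "dateRange": "last 30 days"
    }]
  },
  "query2_topN": {
    "measures": ["TestCube.cost"],
    "dimensions": ["TestCube.projectId", "TestCube.category"],
    "timeDimensions": [{
      "dimension": "TestCube.timestamp",
      "dateRange": "last 30 days"
    }],
    "order": { "TestCube.cost": "desc" },
    "limit": 3
  },
  "query3_breakdown": {
    "measures": ["TestCube.cost"],
    "dimensions": ["TestCube.projectId", "TestCube.category"],
    "timeDimensions": [{
      "dimension": "TestCube.timestamp",
      "granularity": "day",
      "dateRange": "last 30 days"
    }],
    "filters": [{
      "or": [
        { "and": [
          { "member": "TestCube.projectId", "operator": "equals", "values": ["proj-a"] },
          { "member": "TestCube.category", "operator": "equals", "values": ["Compute"] }
        ]},
        { "and": [
          { "member": "TestCube.projectId", "operator": "equals", "values": ["proj-b"] },
          { "member": "TestCube.category", "operator": "equals", "values": ["Storage"] }
        ]},
        { "and": [
          { "member": "TestCube.projectId", "operator": "equals", "values": ["proj-a"] },
          { "member": "TestCube.category", "operator": "equals", "values": ["Networking"] }
        ]}
      ]
    }]
  }
}
```

Note that query 1 omits dimensions entirely, query 2 omits the `granularity` so it aggregates the whole 30 days, and query 3 re-adds the daily granularity plus the generated Boolean filter.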
Because including a large number of dimensions complicates the Boolean logical filters applied, and because including a large number of measures increases the number of queries to which we must apply that complicated logic, it’s crucial to consider actual use cases and their resulting user impact. To test Ternary’s new query format, we used the Ternary Reporting Engine: a feature of our platform that allows users to query their data flexibly, by selecting different dimensions, measures, and filters. Here, we ran a number of queries with multiple measures, dimensions, and filters, to get a sense of how complex our queries could be without significantly compromising the app’s performance. We found that, using this 3-query approach, the duration of a request where N = 20 was the same as that of using one query to gather all the data, with the added benefit that our results stayed within the 50K-row limit.
Given that Ternary is a cost-management platform, we recognize the importance of considering costs when making changes such as this one. As stated before, the number of Cube queries needed to fulfill a single request on behalf of any given customer would increase, from 1 to 1 + 2m. It follows that the cost to implement a feature in this way could also increase by a similar factor if, for example, you use BigQuery’s on-demand pricing as we do, where the amount you are charged is proportional to the amount of data you read. Even if we were to increase the amount we spend serving this data by a factor of, on average, three, five, or even ten, we determined that it would be a reasonable price to pay for the benefits, including showing customer data that conveys the full picture instead of truncating it as we previously had.
Each use case is different and should be considered individually. If all queries are accelerated with pre-aggregations or if Cube points toward a Postgres instance, and if the query volume is relatively low, cost considerations may not be as relevant.