Calculating averages and percentiles
Use case
We want to understand the distribution of values for a certain numeric property
within a dataset. We're used to average values and intuitively understand how to
calculate them. However, we also know that average values can be misleading for
skewed (opens in a new tab) distributions which are common
in the real world: for example, 2.5 is the average value for both (1, 2, 3, 4)
and (0, 0, 0, 10).
So, it's usually better to use
percentiles (opens in a new tab). Parameterized by a
fractional number n = 0..1, where the n-th percentile is equal to a value that
exceeds a specified ratio of values in the distribution. The
median (opens in a new tab) is a special case: it's defined
as the 50th percentile (n = 0.5), and it can be casually thought of as "the
middle" value. 2.5 and 0 are the medians of (1, 2, 3, 4) and (0, 0, 0, 10),
respectively.
Data modeling
Let's explore the data in the users cube that contains various demographic
information about users, including their age:
[
{
"users.name": "Abbott, Breanne",
"users.age": 52
},
{
"users.name": "Abbott, Dallas",
"users.age": 43
},
{
"users.name": "Abbott, Gia",
"users.age": 36
},
{
"users.name": "Abbott, Tom",
"users.age": 39
},
{
"users.name": "Abbott, Ward",
"users.age": 67
}
]Calculating the average age is as simple as defining a measure with the built-in
avg type.
Calculating the percentiles would require using database-specific functions.
However, almost every database has them under names of PERCENTILE_CONT and
PERCENTILE_DISC,
Postgres (opens in a new tab) and
Snowflake (opens in a new tab)
included. For BigQuery (opens in a new tab),
you'd need to use the APPROX_QUANTILES function.
cubes:
- name: users
# ...
measures:
- name: avg_age
type: avg
sql: age
- name: median_age
type: number
sql: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY age)
- name: p95_age
type: number
sql: PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY age)Result
Using the measures defined above, we can explore statistics about the age of our users.
[
{
"users.avg_age": "52.3100000000000000",
"users.median_age": 53,
"users.p95_age": 82
}
]For this particular dataset, the average age closely matches the median age, and 95% of all users are younger than 82 years.
Source code
Please feel free to check out the
full source code (opens in a new tab)
or run it with the docker-compose up command. You'll see the result, including
queried data, in the console.