Dec 3, 2021

Cohort analysis using Data Studio and BigQuery

A cohort analysis table is used to visually display how many of the new users are still active after a period of time, like 1 day, 7 days or 1 month. Google Data Studio does not have a cohort chart by default. By using a Pivot Table with a heat map and BigQuery as data source, it is possible to obtain the same result: namely, to visually display the new users retention.

TL;DR

The live version of the sample report can be accessed from here: User retention visualization

The BigQuery source

BigQuery is a fully managed enterprise data warehouse, a scalable and distributed analysis engine capable to query terabytes in seconds and petabytes in minutes. All the BigQuery features can be found here.

For this tutorial a date range of 1 week will be used.

DECLARE DS_START_DATE STRING;
DECLARE DS_END_DATE STRING;
SET DS_START_DATE='20211127';
SET DS_END_DATE='20211203';

The first step is to identify two columns from the raw data (the timestamp and the user identifier) and to decide what is the reference period of time. The chosen period is one day is in this case:

-- select raw data
SELECT 
    DISTINCT DATE(timestamp) as active_date,
    userid
FROM `codelab1-244421.demo_bq_dataset.users_log`
WHERE timestamp < TIMESTAMP_ADD(PARSE_TIMESTAMP("%Y%m%d",DS_END_DATE), INTERVAL 1 DAY)

The next step is to calculate the day when the user was first seen:

WITH raw AS (
-- select raw data
SELECT 
    DISTINCT DATE(timestamp) as activity_date,
    userid
FROM `codelab1-244421.demo_bq_dataset.users_log`
WHERE timestamp < TIMESTAMP_ADD(PARSE_TIMESTAMP("%Y%m%d",DS_END_DATE), INTERVAL 1 DAY)
)
-- calculate user's first day
SELECT 
    activity_date, 
    FIRST_VALUE(activity_date) OVER (PARTITION BY userid ORDER BY activity_date) as first_seen,
    userid
FROM raw

followed by the calculation of the number of days between the date when the user was first seen and the current date:

WITH raw AS (
-- select raw data
SELECT 
    DISTINCT DATE(timestamp) as activity_date, 
    userid
FROM `codelab1-244421.demo_bq_dataset.users_log`
WHERE timestamp < TIMESTAMP_ADD(PARSE_TIMESTAMP("%Y%m%d",DS_END_DATE), INTERVAL 1 DAY)
), t_first_seen AS (
-- calculate user's first day
SELECT
    activity_date, 
    FIRST_VALUE(activity_date) OVER (PARTITION BY userid ORDER BY activity_date) as first_seen,
    userid
FROM raw
)
-- calculate the number of days after the first day
SELECT 
    first_seen,
    DATE_DIFF(activity_date, first_seen, DAY) AS day_order,
    userid
FROM t_first_seen
WHERE first_seen BETWEEN  PARSE_DATE("%Y%m%d",DS_START_DATE) AND PARSE_DATE("%Y%m%d",DS_END_DATE)

followed by counting the number of unique users for each pair of first_seen and day_order:

WITH raw AS (
-- select raw data
SELECT 
    DISTINCT DATE(timestamp) as activity_date, 
    userid
FROM `codelab1-244421.demo_bq_dataset.users_log`
WHERE timestamp < TIMESTAMP_ADD(PARSE_TIMESTAMP("%Y%m%d",DS_END_DATE), INTERVAL 1 DAY)
), t_first_seen AS (
-- calculate user's first day
SELECT 
    activity_date, 
    FIRST_VALUE(activity_date) OVER (PARTITION BY userid ORDER BY activity_date) as first_seen,
    userid
FROM raw
), t_day_order AS (
-- calculate the number of days after the first day
SELECT 
    first_seen,
    DATE_DIFF(activity_date, first_seen, DAY) AS day_order,
    userid
FROM t_first_seen
WHERE first_seen BETWEEN  PARSE_DATE("%Y%m%d",DS_START_DATE) AND PARSE_DATE("%Y%m%d",DS_END_DATE)
)
-- counting the number of unique users for each day and day_order
SELECT 
    first_seen,
    day_order,
    COUNT(DISTINCT userid) as new_users
FROM t_day_order
GROUP BY 1,2

followed by the calculation of the cohort users:

WITH raw AS (
-- select raw data
SELECT 
    DISTINCT DATE(timestamp) as activity_date, 
    userid
FROM `codelab1-244421.demo_bq_dataset.users_log`
WHERE timestamp < TIMESTAMP_ADD(PARSE_TIMESTAMP("%Y%m%d",DS_END_DATE), INTERVAL 1 DAY)
), t_first_seen AS (
-- calculate user's first day
SELECT 
    activity_date, 
    FIRST_VALUE(activity_date) OVER (PARTITION BY userid ORDER BY activity_date) as first_seen,
    userid
FROM raw
), t_day_order AS (
-- calculate the number of days after the first day
SELECT 
    first_seen,
    DATE_DIFF(activity_date, first_seen, DAY) AS day_order,
    userid
FROM t_first_seen
WHERE first_seen BETWEEN  PARSE_DATE("%Y%m%d",DS_START_DATE) AND PARSE_DATE("%Y%m%d",DS_END_DATE)
), t_new_users AS (
-- counting the number of unique users for each day and day_order
SELECT 
    first_seen,
    day_order,
    COUNT(DISTINCT userid) as new_users
FROM t_day_order
GROUP BY 1,2
)
-- calculate the cohort users
SELECT 
    first_seen,
    day_order,
    new_users,
    FIRST_VALUE(new_users) OVER (PARTITION BY first_seen ORDER BY day_order) AS cohort_users
FROM t_new_users

and finally, followed by the calculation of the percentage of new users versus cohort users:

WITH raw AS (
-- select raw data
SELECT 
    DISTINCT DATE(timestamp) as activity_date, 
    userid
FROM `codelab1-244421.demo_bq_dataset.users_log`
WHERE timestamp < TIMESTAMP_ADD(PARSE_TIMESTAMP("%Y%m%d",DS_END_DATE), INTERVAL 1 DAY)
), t_first_seen AS (
-- calculate user's first day
SELECT 
    activity_date, 
    FIRST_VALUE(activity_date) OVER (PARTITION BY userid ORDER BY activity_date) as first_seen,
    userid
FROM raw
), t_day_order AS (
-- calculate the number of days after the first day
SELECT 
    first_seen,
    DATE_DIFF(activity_date, first_seen, DAY) AS day_order,
    userid
FROM t_first_seen
WHERE first_seen BETWEEN  PARSE_DATE("%Y%m%d",DS_START_DATE) AND PARSE_DATE("%Y%m%d",DS_END_DATE)
), t_new_users AS (
-- counting the number of unique users for each day and day_order
SELECT 
    first_seen,
    day_order,
    COUNT(DISTINCT userid) as new_users
FROM t_day_order
GROUP BY 1,2
), t_cohort_users AS (
-- calculate the cohort users
SELECT 
    first_seen,
    day_order,
    new_users,
    FIRST_VALUE(new_users) OVER (PARTITION BY first_seen ORDER BY day_order) AS cohort_users
FROM t_new_users
)
-- calculate the cohort users percentage
SELECT 
    first_seen,
    day_order,
    new_users,
    cohort_users,
    SAFE_DIVIDE(new_users, cohort_users) AS cohort_users_per
FROM t_cohort_users
ORDER BY first_seen, day_order

The result of this query will be the source of the Data Studio report.

The Data Studio report

Google Data Studio [https://datastudio.google.com/?requirelogin=1] is the main visualization tool from Google Cloud Platform. It has a rich set of connectors, many visualization widgets and beautiful sharing options.

Next are the steps to follow in order to create the report:

Login using a google account.
Click on “Blank Report” to create a new report.

Create report

Create a new data source using BigQuery connector.

Add BigQuery

Follow the instructions to authorize Data Studio to access the BigQuery dataset.
Choose “CUSTOM QUERY”, select project and paste the query from the previous step. Make sure to check the “Enable date range parameters” option and update the parameters DS_START_DATE and DS_END_DATE with the query parameters @DS_START_DATE and @DS_END_DATE.

Query

Navigate to menu item **Resource -> Manage added data sources", click on “Edit” and update the fields names as follows:
- “cohort_users” becomes “Cohort users”
- “cohort_users_per” becomes “% Cohort users”
- “day_order” becomes “Day”
- “first_seen” becomes “First seen”
- “new_users” becomes “New users”
Add a text box for the title of the report, a date range control and a pivot table with heat map.
For the pivot table set the following options:
- Row dimension: First seen
- Column dimension: Day
- Metric: % Cohort users
- Sorting:
  - Row #1: First seen ascending
  - Column #1: Day ascending
- Style:
  - Missing data: Show '' (blank)
Duplicate the pivot table and replace the metric with New users.

The final report should look like this:

Report

The live version of this report can be found here: User retention visualization

Conclusion

Data Studio is a great tool and in combination with BigQuery anyone can turn data into compelling stories of data visualization art.