The ML.CORRELATION function

This document describes the ML.CORRELATION function, which calculates statistical correlations between a target column and one or more metric columns. ML.CORRELATION offers the following features:

Multi-column matrix: correlates one target variable against multiple metrics simultaneously.
Dimensional slicing: automatically computes correlations for all combinations of specified dimensions, similar to a GROUP BY CUBE operation.
Flexible methods: supports Pearson, Spearman, and Kendall correlation methods.

Syntax

ML.CORRELATION(
  { TABLE TABLE_NAME | (QUERY_STATEMENT) },
  target_col => TARGET_COL,
  target_correlation_cols => TARGET_CORRELATION_COLS
  [, dimension_cols => DIMENSION_COLS ]
  [, method => METHOD ]
);

Arguments

The ML.CORRELATION function takes the following arguments:

TABLE_NAME: the name of a BigQuery table that contains the data to analyze.
QUERY_STATEMENT: a SQL query whose results contain the data to analyze.
TARGET_COL: a STRING that contains the name of the primary numerical column to analyze.
TARGET_CORRELATION_COLS: a STRING or ARRAY<STRING> value that contains the names of one or more numerical columns to correlate against the TARGET_COL column.
DIMENSION_COLS: a STRING or ARRAY<STRING> value that contains the names of columns to slice the data by. The function calculates correlations for every combination of these dimensions. You can specify a maximum of 12 columns. Each column must be a groupable type.
METHOD: a STRING that specifies the statistical method to use for correlation. Supported values are PEARSON, SPEARMAN, and KENDALL. The default value is PEARSON. The KENDALL method is more computationally expensive than the other methods and can be slow on large datasets. For large tables, we recommend that you use PEARSON or SPEARMAN.

Output

The ML.CORRELATION function returns a table where each row represents the correlation for a specific pair of columns within a specific data segment. The results are sorted by segment_size in descending order, and then by corr_col in ascending order. The output table contains the following columns:

segment: an ARRAY<STRUCT<dimension_col STRING, dimension_value JSON>> value that contains the key-value pair for each dimension. The JSON value for dimension_value is generated using the TO_JSON function.
dimension_col: a column from dimension_cols, if you specified any dimension columns. The output includes one column for each dimension specified. A NULL value in one of these columns indicates either a genuine NULL value in the input data or a placeholder NULL value that means the dimension was rolled up (aggregated over) for that row. This is conceptually similar to the NULL placeholder values that are generated when you use grouping sets. To tell the two cases apart, check the segment column of the output: if the segment array contains an entry for the dimension and its dimension_value is a JSON null, the NULL value comes from the input data. If the segment array has no entry for the dimension, the NULL value is a rollup placeholder.
target_col: a STRING value that contains the name of the input target column.
corr_col: a STRING value that contains the name of the metric column being correlated against the target column.
correlation: a FLOAT64 value that contains the correlation coefficient in the range of -1.0 to 1.0.
segment_size: an INT64 value that contains the number of rows used to calculate the correlation for this segment.
segment_proportion: a FLOAT64 value that contains the fraction of total rows in the input table (segment_size/total rows) that belong to this segment.
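
Because segment is an array of structs, it can be convenient to flatten it into one row per dimension entry before reporting. The following query is a minimal sketch of that idea, assuming the my_dataset.marketing_sample table from the Examples section. The aliases c and s are illustrative names, and the use of JSON_VALUE to read dimension_value as a string is one possible choice rather than a required pattern:

```sql
-- Sketch: flatten the segment array into one row per dimension entry.
SELECT
  c.corr_col,
  c.correlation,
  c.segment_size,
  s.dimension_col,
  -- JSON_VALUE extracts a scalar JSON value as a STRING.
  JSON_VALUE(s.dimension_value) AS dimension_value
FROM ML.CORRELATION(
  TABLE my_dataset.marketing_sample,
  target_col => 'revenue',
  target_correlation_cols => 'ad_spend',
  dimension_cols => ['country', 'city']
) AS c
-- LEFT JOIN keeps rows whose segment array is empty (the overall aggregate).
LEFT JOIN UNNEST(c.segment) AS s;
```

A plain comma join with UNNEST would drop the rows whose segment array is empty, which is why the sketch uses LEFT JOIN.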

Examples

The following examples show how to use the ML.CORRELATION function with the my_dataset.marketing_sample table:

CREATE OR REPLACE TABLE my_dataset.marketing_sample AS (
  -- New York data
  SELECT 'USA' AS country, 'New York' AS city, 'Electronics' AS product_category, 100 AS ad_spend, 150 AS budget, 1000 AS revenue UNION ALL
  SELECT 'USA', 'New York', 'Electronics', 150, 200, 1500 UNION ALL
  SELECT 'USA', 'New York', 'Apparel',     200, 250, 1800 UNION ALL

  -- Seattle data
  SELECT 'USA', 'Seattle',  'Apparel',     220, 240, 2400 UNION ALL
  SELECT 'USA', 'Seattle',  'Apparel',     300, 320, 2700 UNION ALL

  -- London data (Genuine NULL country)
  SELECT NULL,  'London',   'Electronics', 100, 120, 500  UNION ALL
  SELECT NULL,  'London',   'Electronics', 200, 220, 900  UNION ALL

  -- Missing city data (Genuine NULL city)
  SELECT NULL,  NULL,       'Apparel',     200, 200, 1000 UNION ALL
  SELECT NULL,  NULL,       'Apparel',     250, 250, 1200
);


/*---------+----------+------------------+----------+--------+---------+
 | country | city     | product_category | ad_spend | budget | revenue |
 +---------+----------+------------------+----------+--------+---------+
 | USA     | New York | Electronics      | 100      | 150    | 1000    |
 | USA     | New York | Electronics      | 150      | 200    | 1500    |
 | USA     | New York | Apparel          | 200      | 250    | 1800    |
 | USA     | Seattle  | Apparel          | 220      | 240    | 2400    |
 | USA     | Seattle  | Apparel          | 300      | 320    | 2700    |
 | NULL    | London   | Electronics      | 100      | 120    | 500     |
 | NULL    | London   | Electronics      | 200      | 220    | 900     |
 | NULL    | NULL     | Apparel          | 200      | 200    | 1000    |
 | NULL    | NULL     | Apparel          | 250      | 250    | 1200    |
 +---------+----------+------------------+----------+--------+---------*/

Calculate Pearson correlation

The following example calculates the Pearson correlation between revenue and ad_spend from the table my_dataset.marketing_sample and uses country as a dimension column:

SELECT country, segment, correlation, segment_size
FROM ML.CORRELATION(
  TABLE my_dataset.marketing_sample,
  target_col => 'revenue',
  target_correlation_cols => 'ad_spend',
  dimension_cols => ['country']
);


/*---------+--------------------------------------------------------+-------------+--------------+
 | country | segment                                                | correlation | segment_size |
 +---------+--------------------------------------------------------+-------------+--------------+
 | NULL    | []                                                     | 0.698       | 9            |
 | USA     | [{dimension_col: 'country', dimension_value: '"USA"'}] | 0.968       | 5            |
 | NULL    | [{dimension_col: 'country', dimension_value: 'null'}]  | 0.990       | 4            |
 +---------+--------------------------------------------------------+-------------+--------------*/

The first row of the result contains NULL for country because it corresponds to aggregation over all countries, so its segment array is empty. The third row of the result corresponds to a genuine NULL value for country in the input data: its segment array contains an entry for country whose dimension_value is a JSON null.
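
The method argument is passed the same way as the other named arguments. As a sketch, the following variation of the preceding query requests Spearman correlation instead of the default Pearson method. The result is not shown here:

```sql
-- Same query as the Pearson example, with the method set explicitly.
SELECT country, segment, correlation, segment_size
FROM ML.CORRELATION(
  TABLE my_dataset.marketing_sample,
  target_col => 'revenue',
  target_correlation_cols => 'ad_spend',
  dimension_cols => ['country'],
  method => 'SPEARMAN'
);
```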

Calculate correlation for multiple columns

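The following example calculates the correlation of revenue with both ad_spend and budget, sliced by city and product_category. The input is a query that restricts the table my_dataset.marketing_sample to rows where country is USA, and the LIMIT clause keeps only the first five rows of the result: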

SELECT *
FROM ML.CORRELATION(
  (SELECT * FROM my_dataset.marketing_sample WHERE country = 'USA'),
  target_col => 'revenue',
  target_correlation_cols => ['ad_spend', 'budget'],
  dimension_cols => ['city', 'product_category']
)
ORDER BY segment_size DESC, corr_col
LIMIT 5;


/*---------------------------------------------------------+----------+------------------+------------+----------+-------------+--------------+--------------------+
 | segment                                                 | city     | product_category | target_col | corr_col | correlation | segment_size | segment_proportion |
 +---------------------------------------------------------+----------+------------------+------------+----------+-------------+--------------+--------------------+
 | []                                                      | NULL     | NULL             | revenue    | ad_spend | 0.968       | 5            | 1.0                |
 | []                                                      | NULL     | NULL             | revenue    | budget   | 0.924       | 5            | 1.0                |
 | [{dimension_col: 'city', dimension_value: '"New York"'}]| New York | NULL             | revenue    | ad_spend | 0.990       | 3            | 0.6                |
 | [{dimension_col: 'product_category', ...: '"Apparel"'}] | NULL     | Apparel          | revenue    | ad_spend | 0.866       | 3            | 0.6                |
 | [{dimension_col: 'city', dimension_value: '"New York"'}]| New York | NULL             | revenue    | budget   | 0.990       | 3            | 0.6                |
 +---------------------------------------------------------+----------+------------------+------------+----------+-------------+--------------+--------------------*/
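
For intuition about what dimensional slicing computes, the Pearson case can be compared with a standard aggregation that uses CORR and GROUP BY CUBE. The following query is only a rough analogue and not a description of how ML.CORRELATION is implemented. It covers a single metric column, and its handling of very small segments might differ from the function's output. The column aliases city_rolled_up and product_category_rolled_up are illustrative names:

```sql
-- Sketch only: approximates the Pearson case with standard aggregate functions.
SELECT
  city,
  product_category,
  -- GROUPING returns 1 when the column was rolled up for this row, 0 otherwise.
  GROUPING(city) AS city_rolled_up,
  GROUPING(product_category) AS product_category_rolled_up,
  CORR(revenue, ad_spend) AS correlation,
  COUNT(*) AS segment_size
FROM my_dataset.marketing_sample
WHERE country = 'USA'
GROUP BY CUBE (city, product_category)
ORDER BY segment_size DESC;
```

In this analogue, the GROUPING columns play the role that the segment array plays in the ML.CORRELATION output: they separate rollup placeholders from genuine NULL values.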

Distinguish between NULLs caused by global aggregates and NULLs caused by missing data

The following example shows how to use the segment column to label your rows clearly for reporting. If a city is NULL and isn't in the segment array, it's a global aggregate. If a city is NULL and is in the segment array, it's missing data.

SELECT
  -- Create a clean label for reporting
  CASE
    -- If 'city' is NULL and not in the segment array, it's a global rollup
    WHEN city IS NULL AND NOT EXISTS(SELECT 1 FROM UNNEST(segment) s WHERE s.dimension_col = 'city')
      THEN 'ALL CITIES (Global)'
    -- If 'city' is NULL and in the segment array, it's missing data
    WHEN city IS NULL
      THEN 'UNKNOWN CITY'
    ELSE city
  END AS city_label,
  correlation,
  segment_size
FROM ML.CORRELATION(
  TABLE my_dataset.marketing_sample,
  target_col => 'revenue',
  target_correlation_cols => 'ad_spend',
  dimension_cols => ['city']
);


/*---------------------+-------------+--------------+
 | city_label          | correlation | segment_size |
 +---------------------+-------------+--------------+
 | ALL CITIES (Global) | 0.698       | 9            |
 | New York            | 0.990       | 3            |
 | UNKNOWN CITY        | 1.0         | 2            |
 | London              | 1.0         | 2            |
 | Seattle             | 1.0         | 2            |
 +---------------------+-------------+--------------*/