The AI.KEY_DRIVERS function
This document describes the AI.KEY_DRIVERS function, which you can use to
identify segments of data that cause statistically significant changes to
a summable metric. For example, if you launch a new product, then the
following query identifies the locations where new product sales were most
unexpected:
SELECT *
FROM AI.KEY_DRIVERS(
TABLE `mydataset.sales_table`,
metric_col => 'sales',
dimension_cols => ['state', 'city'],
interest_label_col => 'is_new_product'
);
Calling the AI.KEY_DRIVERS function is similar to first
creating a contribution analysis model
and then calling the
ML.GET_INSIGHTS function
on that model. For most applications, we recommend using the AI.KEY_DRIVERS
function because it offers a simplified syntax, faster results, and automatic
pruning. The following table summarizes the differences:
AI.KEY_DRIVERS |
Contribution analysis model with ML.GET_INSIGHTS |
|---|---|
| Faster for one-off analysis | Preferred for repeated analysis |
| Supports a maximum of 12 dimensions | Supports more than 12 dimensions |
| Supports only a summable metric | Supports summable, summable by ratio, and summable by category metrics |
| Prunes redundant insights by default | Returns all insights by default |
The segment column in the output is called drivers |
The segment column in the output is called contributors |
| No model management required | Requires you to create and manage a model |
Input data
Your input table or query result to the AI.KEY_DRIVERS function must
include the following:
- A numerical metric column that contains a summable metric.
- A boolean column that indicates whether each row belongs to the interest or reference set.
- Structured data with dimension columns.
Syntax
AI.KEY_DRIVERS(
{ TABLE TABLE_NAME | (QUERY_STATEMENT) },
metric_col => 'METRIC_COL',
dimension_cols => DIMENSION_COLS,
interest_label_col => 'INTEREST_LABEL_COL',
[, min_apriori_support => MIN_APRIORI_SUPPORT]
[, top_k => TOP_K]
[, enable_pruning => ENABLE_PRUNING]
)
Arguments
The AI.KEY_DRIVERS function takes the following arguments:
TABLE_NAME: the name of the table that contains the interest and reference data to analyze.QUERY_STATEMENT: a GoogleSQL query that produces the interest and reference data to analyze.METRIC_COL: aSTRINGexpression to use to calculate a summable metric. The expression must be in the formSUM(metric_column_name)ormetric_column_name, wheremetric_column_nameis the name of a column of a numeric data type. Both expressions are treated as equivalent. The expression is case insensitive.You can't use any additional numerical computations in the contribution metric expression. For example, neither
SUM(AVG(metric_col))norAVG(SUM(round(metric_col_numerator))/(SUM(metric_col_denominator))is valid. You can perform additional computations in your query_statement if necessary.DIMENSION_COLS: anARRAY<STRING>that lists the names of the columns to use as dimensions when summarizing the metric specified in theMETRIC_COLoption. The dimension columns that you specify must have anINT64,BOOL, orSTRINGdata type. You must provide between 1 and 12 columns. You can't use the columns from theMETRIC_COLorINTEREST_LABEL_COLarguments as dimensions.INTEREST_LABEL_COL: aSTRINGvalue that contains the name of the column to use to determine whether a given row is interest group or reference group. The column that you specify must have aBOOLdata type. For more information, see choose interest and reference data.MIN_APRIORI_SUPPORT: aFLOAT64value in the range[0,1]that contains the minimum apriori support threshold for including segments in the output. All segments whoseapriori_supportvalue is less than theMIN_APRIORI_SUPPORTvalue that you specify are excluded from the output. You can't use theMIN_APRIORI_SUPPORToption withTOP_K. The default value is0.1.TOP_K: anINT64value between 1 and 1,000,000 that contains the maximum number of insights to output with the highest apriori support value. The model automatically retrieves the insights with the highest apriori support values and prunes insights with low apriori support, decreasing query runtime compared to using no apriori support thresholding. TheTOP_Koption can't be used together withMIN_APRIORI_SUPPORT.An
INT64value between 1 and 1,000,000. There is no default value. If noTOP_Kis specified, aMIN_APRIORI_SUPPORTthreshold of 0.1 is applied.ENABLE_PRUNING: aBOOLvalue that determines whether the model filters the insights generated by the model according to the following:FALSE: All insights are returned except those filtered by apriori support thresholds specified byMIN_APRIORI_SUPPORTorTOP_K.TRUE: Redundant insights are omitted from the result. Two rows are considered to contain redundant insights if the following conditions are met:- Their metric values are equal.
- The dimensions and corresponding values of one row are a subset of the dimensions and corresponding values of the other. In this case, the row that has more dimensions (the more descriptive row) is kept.
The
allinsight row is never pruned.
The default value is
TRUE.
Choose interest and reference data
Key driver analysis requires test (interest) and control (reference) data in a single table as input. The control set is used as a baseline to compare against the test set, which contains the data points of interest. The following are examples of test and control sets:
- Periods of time: Compare two different months of revenue data.
- Geographic regions: Compare retention rate across different countries.
- Product types: Compare sales per user across different product types.
- Campaign or promotion: Compare website engagement across different ad campaigns.
To set up the input data, you can create tables of test and control data separately and take the union of the two tables. For best results, we recommend having roughly equal numbers of test and control rows when using the summable metric to avoid biased results.
Output
AI.KEY_DRIVERS returns the following output columns
in addition to the dimension columns:
drivers: anARRAY<STRING>value that contains the dimension values for a given segment. The other output metrics that are returned in the same row apply to the segment described by these dimensions.metric_interest: a numeric value that contains the sum of the value of the metric column in the interest dataset for the given segment.metric_reference: a numeric value that contains the sum of the value of the metric column in the reference dataset for the given segment.difference: a numeric value that contains the difference between themetric_interestandmetric_referencevalues:metric_interest - metric_referencerelative_difference: a numeric value that contains the relative change in the segment value between the interest and reference datasets:difference / metric_referenceunexpected_difference: a numeric value that contains the unexpected difference between the segment's actualmetric_interestvalue and the segment's expectedmetric_interestvalue, which is determined by comparing the ratio of change for this segment against the complement ratio of change. Theunexpected_differencevalue is calculated as follows:Determine the
metric_interestvalue for all segments except the given segment, referred to here ascomplement_interest_change:complement_interest_change = sum(metric_interest for the population) - metric_interestDetermine the
metric_referencevalue for all segments except the given segment, referred to here ascomplement_reference_change:complement_reference_change = sum(metric_reference for the population) - metric_referenceDetermine the ratio between the
complement_interest_changeandcomplement_reference_changevalues, referred to here ascomplement_change_ratio:complement_change_ratio = complement_interest_change / complement_reference_changeDetermine the expected
metric_interestvalue for the given segment, referred to here asexpected_metric_interest:expected_metric_interest = metric_reference * complement_change_ratioDetermine the
unexpected_differencevalue:unexpected_difference = metric_interest - expected_metric_interest
relative_unexpected_difference: a numeric value that contains the ratio between theunexpected_differencevalue and theexpected_metric_interestvalue:unexpected_difference / expected_metric_interestYou can use the
relative_unexpected_differencevalue to determine if the change to this segment is smaller than expected compared to the change in all of the other segments.apriori_support: a numeric value that contains the apriori support value for the segment. The apriori support value is either the ratio between themetric_interestvalue for the segment and themetric_interestvalue for the population, or the ratio between themetric_referencevalue for the segment and themetric_referencevalue for the population, whichever is greater. The calculation is expressed as the following:GREATEST( metric_interest / SUM(metric_interest for the population), metric_reference / SUM(metric_reference for the population) )If the
apriori_supportvalue is less than the apriori support threshold value specified in the model, then the segment is considered too small to be of interest and is excluded by the model.contribution: a numeric value that contains the absolute value of thedifferencevalue:ABS(difference).
Example
The following query shows how to use the AI.KEY_DRIVERS function to analyze
publicly available Iowa liquor sale data from 2024. The reference data consists
of sales that happened in the first half of the year
2024 and the interest data consists of sales that happened in the second
half of the year 2024. The min_apriori_support argument is set to 0 so
that all segments are included in the output.
WITH InputData AS (
SELECT
CAST(sale_dollars AS BIGNUMERIC) AS sale_dollars,
city,
category_name,
vendor_name,
(date > '2024-07-01') AS IS_H2
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE EXTRACT(YEAR FROM DATE) = 2024)
SELECT * EXCEPT(city, vendor_name, category_name)
FROM AI.KEY_DRIVERS(
TABLE InputData,
metric_col => 'sale_dollars',
dimension_cols => ['city', 'vendor_name', 'category_name'],
interest_label_col => 'IS_H2',
min_apriori_support => 0
);
The first several rows of the output look similar to the following. The values are truncated to improve readability:
| drivers | metric_interest | metric_reference | difference | relative_difference | unexpected_difference | relative_unexpected_difference | apriori_support | contribution |
|---|---|---|---|---|---|---|---|---|
| [all] | 230372400 | 216802833 | 13569567 | 0.063 | 13569567 | 0.063 | 1 | 13569567 |
| [category_name=DIAGEO AMERICAS] | 50788320 | 43857330 | 6930990 | 0.158 | 5247510 | 0.115 | 0.220 | 6930990 |
| [vendor_name=TEMPORARY & SPECIALTY PACKAGES] | 11068763 | 5297807 | 5770957 | 1 | 5575616 | 1 | 0.048 | 5770957 |
| [vendor_name=TEMPORARY & SPECIALTY PACKAGES,category_name=DIAGEO AMERICAS] | 6196380 | 2771982 | 3424398 | 1 | 3293005 | 1 | 0.027 | 3424398 |
| [vendor_name=STRAIGHT BOURBON WHISKIES] | 21255149 | 19015184 | 2239965 | 0.118 | 1150744 | 0.057 | 0.092 | 2239965 |
| [city=URBANDALE] | 3661058 | 2111396 | 1549663 | 0.734 | 1431452 | 0.642 | 0.016 | 1549663 |
| [category_name=SAZERAC COMPANY INC] | 35528047 | 34155473 | 1372574 | 0.040 | -908292 | -0.025 | 0.158 | 1372574 |
| [category_name=BROWN FORMAN CORP.] | 10863109 | 9572464 | 1290644 | 0.135 | 723452 | 0.071 | 0.047 | 1290644 |
| [city=DES MOINES] | 25824531 | 27001845 | -1177314 | -0.044 | -3275264 | -0.113 | 0.125 | 1177314 |
You can infer insights such as the following from the results:
Total sales: Overall sales volume increased by 6.3% from the first half (H1) to the second half (H2) of the year, growing from a reference metric value of $216.8M to an interest metric value of $230.4M. This represents a $13.6M absolute difference.
Vendor Contribution: Diageo Americas was a significant contributor to the growth, realizing a $6.9M increase (Interest: $50.8M, Reference: $43.9M), which translates to a 15.8% relative growth. In contrast, while Sazerac's sales also grew by $1.4M (Interest: $35.5M, Reference: $34.2M), this 4% growth rate was smaller than expected.
Geographic Decline: The city of Des Moines experienced a sales decline of $1.2M (Interest: $25.8M, Reference: $27.0M), resulting in a relative difference of -4.4%. Critically, this segment of the data accounts for 12.5% (its
apriori_supportvalue) of the overall population.
What's next
- Learn more about contribution analysis.
- Learn how to create a contribution analysis model.