The AI.DETECT_ANOMALIES function
This document describes the AI.DETECT_ANOMALIES function, which lets you
detect anomalies in time series data by using
BigQuery ML's built-in TimesFM model.
For example, imagine you have historical and current data about sales of a product. You could run a query similar to the following to detect anomalous spikes or drops in sales:
SELECT *
FROM AI.DETECT_ANOMALIES(
  TABLE `mydataset.history_table`,
  TABLE `mydataset.target_table`,
  data_col => 'units_sold',
  timestamp_col => 'sales_date');
Syntax
SELECT
  *
FROM
  AI.DETECT_ANOMALIES(
    { TABLE HISTORY_TABLE | (HISTORY_QUERY_STATEMENT) },
    { TABLE TARGET_TABLE | (TARGET_QUERY_STATEMENT) },
    data_col => 'DATA_COL',
    timestamp_col => 'TIMESTAMP_COL'
    [, model => 'MODEL']
    [, id_cols => ID_COLS]
    [, anomaly_prob_threshold => ANOMALY_PROB_THRESHOLD]
)
Description
The TimesFM model forecasts data for the DATA_COL value, based on the
historical data provided in the HISTORY_TABLE or HISTORY_QUERY_STATEMENT
argument, and using the fields contained in the SELECT statement as variables.
The data provided by the TARGET_TABLE or TARGET_QUERY_STATEMENT argument
is then compared to this forecasted data in order to detect anomalies. The
tables or queries that provide the historical and target data must use the
same column names for the DATA_COL and TIMESTAMP_COL arguments, and for the
ID_COLS argument if it is used.
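For example, if the historical and target data live in tables whose column names differ, you can alias the columns in the query statements so that both sides expose the same names. The following sketch uses hypothetical tables and columns:
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    # Historical data, aliased to match the target column names
    (SELECT order_date AS sales_date, qty AS units_sold
     FROM `mydataset.orders_history`),
    # Target data
    (SELECT sales_date, units_sold
     FROM `mydataset.orders_current`),
    data_col => 'units_sold',
    timestamp_col => 'sales_date');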
Arguments
AI.DETECT_ANOMALIES takes the following arguments:
- HISTORY_TABLE: the name of the table that contains the historical time point data. For example, `mydataset.mytable`. If the table is in a different project, then you must prepend the project ID to the table name in the following format, including backticks: `[PROJECT_ID].[DATASET].[TABLE]`. For example, `myproject.mydataset.mytable` (see the sketch after this list).
- HISTORY_QUERY_STATEMENT: the query that generates the historical data.
- TARGET_TABLE: the name of the table that contains the data in which you want to detect anomalies. The table's schema must match the schema of the historical data. If the table is in a different project, then you must prepend the project ID to the table name in the following format, including backticks: `[PROJECT_ID].[DATASET].[TABLE]`. For example, `myproject.mydataset.mytable`.
- TARGET_QUERY_STATEMENT: the query that generates the data in which you want to detect anomalies. The schema of the query result must match the schema of the historical data.
- DATA_COL: a STRING value that specifies the name of the data column. The data column contains the data to evaluate, and must use one of the following data types: INT64, NUMERIC, BIGNUMERIC, or FLOAT64.
- TIMESTAMP_COL: a STRING value that specifies the name of the timestamp column. The timestamp column must use one of the following data types: TIMESTAMP, DATE, or DATETIME.
- MODEL: a STRING value that specifies the name of the model to use. Supported models include TimesFM 2.0 and TimesFM 2.5. The default value is TimesFM 2.0.
- ID_COLS: an ARRAY<STRING> value that specifies the names of one or more ID columns. Each ID identifies a unique time series to evaluate. Specify one or more values for this argument in order to evaluate multiple time series using a single query. The columns that you specify must use one of the following data types: STRING, INT64, ARRAY<STRING>, or ARRAY<INT64>.
- ANOMALY_PROB_THRESHOLD: a FLOAT64 value that specifies the custom threshold to use for anomaly detection. The value must be in the range [0, 1). The default value is 0.95. The value of the anomaly probability at each timestamp is calculated using the target time series data value, the historical time series data values, and the variance from the model training. The target time series data value at a specific timestamp is identified as anomalous if the anomaly probability exceeds the ANOMALY_PROB_THRESHOLD value. The ANOMALY_PROB_THRESHOLD value also determines the lower and upper bounds, where a larger threshold value results in a larger interval size.
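For example, the following sketch reads the historical data from a table in a different project by using the fully qualified name format described above; the project, dataset, and table names are hypothetical:
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    # Historical data from a table in another project
    TABLE `otherproject.sales_dataset.history_table`,
    # Target data from a table in the default project
    TABLE `mydataset.target_table`,
    data_col => 'units_sold',
    timestamp_col => 'sales_date');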
Output
AI.DETECT_ANOMALIES returns the following columns:
- All time series ID columns, as specified in the ID_COLS argument.
- time_series_timestamp: a STRING value that contains the timestamp column for a time series. This value is inherited from the TIMESTAMP_COL argument.
- time_series_data: a STRING value that contains the data column for a time series. This value is inherited from the DATA_COL argument.
- is_anomaly: a BOOL value that indicates whether the value associated with a given time point is an anomaly.
- lower_bound: a FLOAT64 value that contains the lower bound of the prediction result.
- upper_bound: a FLOAT64 value that contains the upper bound of the prediction result.
- anomaly_probability: a FLOAT64 value that contains the probability that the value associated with a given time point is an anomaly.
- ai_detect_anomalies_status: a STRING value that contains the anomaly detection operation status. The value is empty if the operation was successful. If the operation wasn't successful, the value is the error string. A common error is "The time series data is too short." This error indicates that there wasn't enough historical data in the time series to evaluate. A minimum of 3 data points is required.
If the input is invalid, such as if the time series length is too short, then
the function returns NULL values in the is_anomaly, upper_bound,
lower_bound, and anomaly_probability columns.
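For example, the following sketch, which reuses the hypothetical sales tables from the introduction, keeps only the time points that the function evaluated successfully. It treats either a NULL or an empty status as success, because the sample output later in this document reports a null status for successful rows:
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    TABLE `mydataset.history_table`,
    TABLE `mydataset.target_table`,
    data_col => 'units_sold',
    timestamp_col => 'sales_date')
# Drop time points that couldn't be evaluated
WHERE (ai_detect_anomalies_status IS NULL OR ai_detect_anomalies_status = '')
  AND is_anomaly IS NOT NULL;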
Examples
The following examples detect anomalies in the number of bike trips recorded in the New York Citibike trips table for two months following July 1, 2016.
This query shows how to detect anomalies in the number of trips that happened each day:
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date
)
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    # Historical data from a query
    (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
    # Target data from a query
    (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
    data_col => 'num_trips',
    timestamp_col => 'date');
The first few rows of output look similar to the following:
+-------------------------+------------------+------------+--------------------+--------------------+---------------------+----------------------------+
| time_series_timestamp   | time_series_data | is_anomaly | lower_bound        | upper_bound        | anomaly_probability | ai_detect_anomalies_status |
+-------------------------+------------------+------------+--------------------+--------------------+---------------------+----------------------------+
| 2016-07-02 00:00:00 UTC | 35837.0          | false      | 31180.300309540704 | 53286.037518730387 | 0.77394848118403459 | null                       |
| 2016-07-03 00:00:00 UTC | 35348.0          | false      | 29203.3735854947   | 52593.933445778675 | 0.69142246991005529 | null                       |
| 2016-07-04 00:00:00 UTC | 34860.0          | true       | 37186.313388224225 | 58915.902423241641 | 0.97986699682799683 | null                       |
+-------------------------+------------------+------------+--------------------+--------------------+---------------------+----------------------------+
Anomalies occur when the time_series_data is below the lower_bound
or above the upper_bound.
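To list only the anomalous days, you can wrap the same call and filter on is_anomaly. A minimal sketch based on the previous query:
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date
)
SELECT time_series_timestamp, time_series_data, anomaly_probability
FROM
  AI.DETECT_ANOMALIES(
    (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
    (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
    data_col => 'num_trips',
    timestamp_col => 'date')
# Keep only the time points flagged as anomalies
WHERE is_anomaly
ORDER BY anomaly_probability DESC;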

You can specify usertype in the id_cols argument to detect anomalies
broken down by the type of user, which can be Subscriber or Customer:
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, usertype, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date, usertype
)
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    # Historical data from a query
    (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
    # Target data from a query
    (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
    data_col => 'num_trips',
    timestamp_col => 'date',
    id_cols => ['usertype']);
This query breaks down the anomalies by the dimensions usertype and gender,
specifies that BigQuery should use the TimesFM 2.5 model, and
sets anomaly_prob_threshold to 0.8:
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, usertype, gender, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date, usertype, gender
)
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    # Historical data from a query
    (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
    # Target data from a query
    (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
    data_col => 'num_trips',
    timestamp_col => 'date',
    id_cols => ['usertype', 'gender'],
    model => 'TimesFM 2.5',
    anomaly_prob_threshold => 0.8);
Limitations
Only the most recent 1,024 time points are evaluated for anomalies. If you need to evaluate more data points, reach out to bqml-feedback@google.com.
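If you're unsure whether a time series is longer than that, you can count the time points per series before calling the function. A sketch for the Citibike example, broken down by usertype:
SELECT
  usertype,
  COUNT(DISTINCT EXTRACT(DATE FROM starttime)) AS num_time_points
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY usertype
# Flag series with more than 1,024 time points
HAVING num_time_points > 1024;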
Locations
AI.DETECT_ANOMALIES and the TimesFM model are available in all
supported BigQuery ML locations.
Pricing
AI.DETECT_ANOMALIES usage is billed at the evaluation, inspection, and
prediction rate documented in the BigQuery ML on-demand pricing
section of the BigQuery ML pricing page.
What's next
- For information about anomaly detection in BigQuery ML, see Anomaly detection overview.