The ML.TFDV_VALIDATE function
This document describes the ML.TFDV_VALIDATE function, which you can use to
compare the statistics for training and serving data, or two sets of
serving data, in order to identify anomalous differences between the two data
sets. Calling this function provides the same behavior as calling the
TensorFlow
validate_statistics API.
You can use the data output by this function for
model monitoring.
Syntax
ML.TFDV_VALIDATE( base_statistics, study_statistics [, detection_type] [, categorical_default_threshold] [, categorical_metric_type] [, numerical_default_threshold] [, numerical_metric_type] [, thresholds] )
Arguments
ML.TFDV_VALIDATE takes the following arguments:
- base_statistics: the statistics of the training or serving data that you want to use as the baseline for comparison. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- study_statistics: the statistics of the training or serving data that you want to compare to the baseline. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- detection_type: a STRING value that specifies the type of comparison that you want to make. Valid values are as follows:
  - SKEW: returns the data skew, which represents the statistical variation between training and serving data.
  - DRIFT: returns the data drift, which represents the statistical variation between two different sets of serving data.
- categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
- categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
  - L_INFTY: use L-infinity distance. This value is the default.
  - JENSEN_SHANNON_DIVERGENCE: use Jensen–Shannon divergence.
- numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The value must be in the range [0, 1). The default value is 0.3.
- numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
- thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].
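To give a feel for the two metric types, the following query is a conceptual sketch that computes both distances by hand for a single categorical feature. It is not the function's internal implementation. The frequencies are made up, the base-2 logarithm in the Jensen–Shannon term is an assumption, and the strict comparison against the threshold is also an assumption.

```sql
-- Hypothetical normalized value frequencies for one categorical feature
-- in the baseline (base_p) and study (study_p) data.
WITH freqs AS (
  SELECT 'a' AS value, 0.5 AS base_p, 0.25 AS study_p UNION ALL
  SELECT 'b', 0.25, 0.25 UNION ALL
  SELECT 'c', 0.25, 0.5
),
mix AS (
  -- Midpoint distribution used by the Jensen–Shannon divergence.
  SELECT *, (base_p + study_p) / 2 AS mid_p
  FROM freqs
)
SELECT
  -- L-infinity distance: the largest absolute difference in frequency.
  MAX(ABS(base_p - study_p)) AS l_infinity,
  -- Jensen–Shannon divergence: the average of the two KL divergences
  -- to the midpoint distribution (base-2 logarithm assumed here).
  0.5 * SUM(IF(base_p > 0, base_p * LOG(base_p / mid_p, 2), 0))
    + 0.5 * SUM(IF(study_p > 0, study_p * LOG(study_p / mid_p, 2), 0))
    AS js_divergence,
  -- Compare against the default threshold of 0.3 (strict comparison assumed).
  MAX(ABS(base_p - study_p)) > 0.3 AS exceeds_default_threshold
FROM mix;
```

With these frequencies the L-infinity distance is 0.25, which is below the default threshold of 0.3.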
ML.TFDV_VALIDATE uses positional arguments, so if you specify an
optional argument, you must also specify all arguments prior to that argument.
For more information on argument types, see
Named arguments.
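As a sketch of what the positional rule means in practice, the following script changes only categorical_metric_type. Because that argument is fifth in the list, detection_type and categorical_default_threshold must be supplied too, even though the threshold just restates its default. The table names are placeholders.

```sql
DECLARE stats1 JSON;
DECLARE stats2 JSON;

-- Placeholder tables; substitute your own training and serving data.
SET stats1 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)
);
SET stats2 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)
);

SELECT ML.TFDV_VALIDATE(
  stats1,                       -- base_statistics
  stats2,                       -- study_statistics
  'SKEW',                       -- detection_type (required to reach later arguments)
  0.3,                          -- categorical_default_threshold (default, restated)
  'JENSEN_SHANNON_DIVERGENCE'   -- categorical_metric_type (the value being changed)
);
```

The trailing arguments (numerical_default_threshold, numerical_metric_type, and thresholds) are omitted.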
Output
ML.TFDV_VALIDATE returns a TensorFlow
Anomalies protocol buffer
in JSON format.
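One way to keep the output for later monitoring is to store it alongside a timestamp. The following script is a minimal sketch that assumes the return value can be held in a JSON variable and column. The `anomaly_log` table and the variable name are hypothetical. No field names inside the Anomalies message are assumed, so the final query only pretty-prints the stored JSON.

```sql
DECLARE validation_result JSON;

-- Placeholder tables; substitute your own training and serving data.
SET validation_result = (
  SELECT ML.TFDV_VALIDATE(
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)),
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
    'SKEW'
  )
);

-- Hypothetical log table with one row per validation run.
CREATE TABLE IF NOT EXISTS `myproject.mydataset.anomaly_log` (
  t TIMESTAMP,
  anomalies JSON
);

INSERT `myproject.mydataset.anomaly_log` (t, anomalies)
SELECT CURRENT_TIMESTAMP() AS t, validation_result;

-- Inspect the most recent result as pretty-printed JSON.
SELECT t, TO_JSON_STRING(anomalies, true) AS anomalies_json
FROM `myproject.mydataset.anomaly_log`
ORDER BY t DESC
LIMIT 1;
```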
Examples
The following example returns the skew between training and serving data and also sets custom anomaly detection thresholds for two of the feature columns:
DECLARE stats1 JSON;
DECLARE stats2 JSON;

SET stats1 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)
);
SET stats2 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)
);

SELECT ML.TFDV_VALIDATE(
  stats1,
  stats2,
  'SKEW',
  .3,
  'L_INFTY',
  .3,
  'JENSEN_SHANNON_DIVERGENCE',
  [('feature1', 0.2), ('feature2', 0.5)]
);

INSERT `myproject.mydataset.serve_stats` (t, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS t, stats1;
The following example returns the drift between two sets of serving data:
SELECT ML.TFDV_VALIDATE(
  (SELECT dataset_feature_statistics_list
   FROM `myproject.mydataset.servingJan24`),
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
  'DRIFT'
);
Limitations
The ML.TFDV_VALIDATE function doesn't conduct schema validation.
ML.TFDV_VALIDATE handles type mismatch as follows:
- If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
- If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.