The ML.DESCRIBE_DATA function
This document describes the ML.DESCRIBE_DATA function, which you can use to
generate descriptive statistics for the columns in a table or subquery. For
example, you might want to know statistics for a table of training or serving
data that you plan to use with a machine learning (ML) model. You can use the
data output by this function for such purposes as
feature preprocessing or model
monitoring.
Syntax
ML.DESCRIBE_DATA( { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) }, STRUCT( [NUM_QUANTILES AS num_quantiles] [, NUM_ARRAY_LENGTH_QUANTILES AS num_array_length_quantiles] [, TOP_K AS top_k]) )
Arguments
ML.DESCRIBE_DATA takes the following arguments:
- PROJECT_ID: your project ID.
- DATASET: the BigQuery dataset that contains the table.
- TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.
- QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the- QUERY_STATEMENTclause, see GoogleSQL query syntax.
- NUM_QUANTILES: an- INT64value that specifies the number of quantiles to return for numerical,- ARRAY<numerical>, and- ARRAY<STRUCT<INT64, numerical>>columns. This affects the number of results shown in the- quantilesoutput column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The- NUM_QUANTILESvalue must be in the range- [1, 100,000]. The default value is- 2.
- NUM_ARRAY_LENGTH_QUANTILES: an- INT64value that specifies the number of quantiles to return for- ARRAYcolumns. This affects the number of results shown in the- array_length_quantilesoutput column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The- NUM_ARRAY_LENGTH_QUANTILESvalue must be in the range- [1, 100,000]. The default value is- 10.
- TOP_K: an- INT64value that specifies the number of top values to return for categorical and- ARRAY<categorical>columns. This affects the number of results shown in the- top_valuesoutput column. The top values are the values that are shown most frequently in the column. The- TOP_Kvalue must be in the range- [1, 10,000]. The default value is- 1.
Details
ML.DESCRIBE_DATA handles input columns as follows:
- ARRAYcolumns are unnested before statistics are computed on them.
- ARRAY<STRUCT<INT64, numerical>>. The- INT64value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type as- ARRAY<numerical>based on the value. The value of the dimension column in the output is- MAX(index) + 1.
- STRUCTfields are expanded, and then categorical columns are cast to- STRINGand numerical columns are cast to- FLOAT64.
- Columns of the following data types are
cast to
STRINGand return the same statistics asSTRINGcolumns:- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMPColumns of the following data types are cast to- FLOAT64and return the same statistics as- FLOAT64columns:
- INT64
- NUMERIC
- BIGNUMERIC
 
Output
ML.DESCRIBE_DATA returns one row for each column in the input data.
ML.DESCRIBE_DATA output contains the following columns:
- name: a- STRINGcolumn that contains the name of the input column.
- num_rows: an- INT64column that contains the total number of rows for the input column.
- num_nulls: an- INT64column that returns the number of- NULLvalues found in the column.
- num_zeros: an- INT64column that contains one of the following:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the number of0values found in the column.
- For categorical or ARRAY<categorical>input columns, returnsNULL.
 
- For numerical, 
- min: a- STRINGcolumn that contains the- MINvalue for the column.
- max: a- STRINGcolumn that contains the- MAXvalue for the column.
- mean: a- FLOAT64column that contains one of the following:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the mean value calculated for the column.
- For categorical or ARRAY<categorical>input columns, returnsNULL.
 
- For numerical, 
- stdev: a- FLOAT64column that contains one of the following:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the standard deviation value calculated for the column.
- For categorical or ARRAY<categorical>input columns, returnsNULL.
 
- For numerical, 
- median: a- FLOAT64column that contains one of the following:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the median value calculated for the column.
- For categorical or ARRAY<categorical>input columns, returnsNULL.
 
- For numerical, 
- quantiles: an- ARRAY<FLOAT64>column that contains information about the quantiles in an input column, as computed by the- APPROX_QUANTILESfunction. The- quantilescolumn contains one of the following values:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the quantiles computed for the column.
- For categorical or ARRAY<categorical>input columns, returnsNULL.
 
- For numerical, 
- unique: an- INT64column that contains information about the number of unique values in an input column, as computed by the- APPROX_COUNT_DISTINCTfunction. The- uniquecolumn contains one of the following values:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL.
- For categorical or ARRAY<categorical>input columns, returns the number of unique values in the input column.
 
- For numerical, 
- avg_string_length: a- FLOAT64column that contains one of the following:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL.
- For categorical or ARRAY<categorical>input columns, returns the average length of the values in the column.
 
- For numerical, 
- num_values: an- INT64column that contains the number of array elements for- ARRAYcolumns, and the number of values in the column for other types of columns.
- top_values: a- ARRAY<STRUCT<STRING, INT64>>column that contains information about the top values and number of occurrences in an input column, as computed by the- APPROX_TOP_COUNTfunction. The- top_valuescolumn contains the following fields:- top_values.value: a- STRINGfield that contains one of the following values:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL.
- For categorical or ARRAY<categorical>input columns, returns one of the top values in the input column.
 
- For numerical, 
- top_values.count: an- INT64field that contains one of the following values:- For numerical, ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL.
- For categorical or ARRAY<categorical>input columns, returns the number of times the related top value appears.
 
- For numerical, 
 
- min_array_length: an- INT64column that contains one of the following values:- For ARRAYinput columns, returns the minimum length of an array in the column.
- For other types of input columns, returns NULL.
 
- For 
- max_array_length: an- INT64column that contains one of the following values:- For ARRAYinput columns, returns the maximum length of an array in the column.
- For other types of input columns, returns NULL.
 
- For 
- avg_array_length: a- FLOAT64column that contains one of the following values:- For ARRAYinput columns, returns the average length of an array in the column.
- For other types of input columns, returns NULL.
 
- For 
- total_array_length: an- INT64column that contains one of the following values:- For ARRAYinput columns, returns the sum of the size of the arrays in the column.
- For other types of input columns, returns NULL.
 
- For 
- array_length_quantiles: an- ARRAY<INT64>column that contains the information about the quantiles for the array length in an input column, as computed by the- APPROX_QUANTILESfunction. The- array_length_quantilescolumn contains one of the following values:- For ARRAYinput columns, returns the quantiles for the array length computed for the column.
- For other types of input columns, returns 0.
 
- For 
- dimension: an- INT64column that contains one of the following:- For ARRAY<STRUCT<INT64, numerical>>input columns, returns the dimension computed for the column, which isMAX(index) + 1for sparse input.
- For other types of input columns, returns NULL.
 
- For 
Example
The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:
SELECT * FROM ML.DESCRIBE_DATA( TABLE `myproject.mydataset.mytable`, STRUCT(5 AS num_quantiles, 3 AS top_k) );
Limitations
Input data for the ML.DESCRIBE_DATA function can only contain columns of the
following data types:
- Numeric types
- STRING
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP
- ARRAY<STRUCT<INT64, FLOAT64>>(a sparse tensor)
- STRUCTcolumns that contain any of the following types:- Numeric types
- STRING
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP
 
- ARRAYcolumns that contain any of the following types:- Numeric types
- STRING
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP
 
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.