The ML.DESCRIBE_DATA function
This document describes the ML.DESCRIBE_DATA function, which you can use to
generate descriptive statistics for the columns in a table or subquery. For
example, you might want to know statistics for a table of training or serving
data that you plan to use with a machine learning (ML) model. You can use the
data output by this function for such purposes as
feature preprocessing or model
monitoring.
Syntax
ML.DESCRIBE_DATA( { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) }, STRUCT( [NUM_QUANTILES AS num_quantiles] [, NUM_ARRAY_LENGTH_QUANTILES AS num_array_length_quantiles] [, TOP_K AS top_k]) )
Arguments
ML.DESCRIBE_DATA takes the following arguments:
PROJECT_ID: your project ID.DATASET: the BigQuery dataset that contains the table.TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of theQUERY_STATEMENTclause, see GoogleSQL query syntax.NUM_QUANTILES: anINT64value that specifies the number of quantiles to return for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>columns. This affects the number of results shown in thequantilesoutput column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. TheNUM_QUANTILESvalue must be in the range[1, 100,000]. The default value is2.NUM_ARRAY_LENGTH_QUANTILES: anINT64value that specifies the number of quantiles to return forARRAYcolumns. This affects the number of results shown in thearray_length_quantilesoutput column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. TheNUM_ARRAY_LENGTH_QUANTILESvalue must be in the range[1, 100,000]. The default value is10.TOP_K: anINT64value that specifies the number of top values to return for categorical andARRAY<categorical>columns. This affects the number of results shown in thetop_valuesoutput column. The top values are the values that are shown most frequently in the column. TheTOP_Kvalue must be in the range[1, 10,000]. The default value is1.
Details
ML.DESCRIBE_DATA handles input columns as follows:
ARRAYcolumns are unnested before statistics are computed on them.ARRAY<STRUCT<INT64, numerical>>. TheINT64value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type asARRAY<numerical>based on the value. The value of the dimension column in the output isMAX(index) + 1.STRUCTfields are expanded, and then categorical columns are cast toSTRINGand numerical columns are cast toFLOAT64.- Columns of the following data types are
cast to
STRINGand return the same statistics asSTRINGcolumns:BOOLBYTEDATEDATETIMETIMETIMESTAMPColumns of the following data types are cast toFLOAT64and return the same statistics asFLOAT64columns:INT64NUMERICBIGNUMERIC
Output
ML.DESCRIBE_DATA returns one row for each column in the input data.
ML.DESCRIBE_DATA output contains the following columns:
name: aSTRINGcolumn that contains the name of the input column.num_rows: anINT64column that contains the total number of rows for the input column.num_nulls: anINT64column that returns the number ofNULLvalues found in the column.num_zeros: anINT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the number of0values found in the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
min: aSTRINGcolumn that contains theMINvalue for the column.max: aSTRINGcolumn that contains theMAXvalue for the column.mean: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the mean value calculated for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
stdev: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the standard deviation value calculated for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
median: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the median value calculated for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
quantiles: anARRAY<FLOAT64>column that contains information about the quantiles in an input column, as computed by theAPPROX_QUANTILESfunction. Thequantilescolumn contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the quantiles computed for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
unique: anINT64column that contains information about the number of unique values in an input column, as computed by theAPPROX_COUNT_DISTINCTfunction. Theuniquecolumn contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns the number of unique values in the input column.
- For numerical,
avg_string_length: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns the average length of the values in the column.
- For numerical,
num_values: anINT64column that contains the number of array elements forARRAYcolumns, and the number of values in the column for other types of columns.top_values: aARRAY<STRUCT<STRING, INT64>>column that contains information about the top values and number of occurrences in an input column, as computed by theAPPROX_TOP_COUNTfunction. Thetop_valuescolumn contains the following fields:top_values.value: aSTRINGfield that contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns one of the top values in the input column.
- For numerical,
top_values.count: anINT64field that contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns the number of times the related top value appears.
- For numerical,
min_array_length: anINT64column that contains one of the following values:- For
ARRAYinput columns, returns the minimum length of an array in the column. - For other types of input columns, returns
NULL.
- For
max_array_length: anINT64column that contains one of the following values:- For
ARRAYinput columns, returns the maximum length of an array in the column. - For other types of input columns, returns
NULL.
- For
avg_array_length: aFLOAT64column that contains one of the following values:- For
ARRAYinput columns, returns the average length of an array in the column. - For other types of input columns, returns
NULL.
- For
total_array_length: anINT64column that contains one of the following values:- For
ARRAYinput columns, returns the sum of the size of the arrays in the column. - For other types of input columns, returns
NULL.
- For
array_length_quantiles: anARRAY<INT64>column that contains the information about the quantiles for the array length in an input column, as computed by theAPPROX_QUANTILESfunction. Thearray_length_quantilescolumn contains one of the following values:- For
ARRAYinput columns, returns the quantiles for the array length computed for the column. - For other types of input columns, returns
0.
- For
dimension: anINT64column that contains one of the following:- For
ARRAY<STRUCT<INT64, numerical>>input columns, returns the dimension computed for the column, which isMAX(index) + 1for sparse input. - For other types of input columns, returns
NULL.
- For
Example
The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:
SELECT * FROM ML.DESCRIBE_DATA( TABLE `myproject.mydataset.mytable`, STRUCT(5 AS num_quantiles, 3 AS top_k) );
Limitations
Input data for the ML.DESCRIBE_DATA function can only contain columns of the
following data types:
- Numeric types
STRINGBOOLBYTEDATEDATETIMETIMETIMESTAMPARRAY<STRUCT<INT64, FLOAT64>>(a sparse tensor)STRUCTcolumns that contain any of the following types:- Numeric types
STRINGBOOLBYTEDATEDATETIMETIMETIMESTAMP
ARRAYcolumns that contain any of the following types:- Numeric types
STRINGBOOLBYTEDATEDATETIMETIMETIMESTAMP
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.