The ML.LABEL_ENCODER function
This document describes the ML.LABEL_ENCODER function, which you can use to
encode a string expression to an INT64 value in [0, <number of categories>].
The encoding vocabulary is sorted alphabetically. NULL values and categories
that aren't in the vocabulary are encoded to 0.
When used in the
TRANSFORM clause,
the vocabulary values calculated during training, along
with the top k and frequency threshold values that you specified, are
automatically used in prediction.
You can use this function with models that support manual feature preprocessing. For more information, see the following documents:
Syntax
ML.LABEL_ENCODER(string_expression [, top_k] [, frequency_threshold]) OVER()
ML.LABEL_ENCODER takes the following arguments:
string_expression: theSTRINGexpression to encode.top_k: anINT64value that specifies the number of categories included in the encoding vocabulary. The function selects thetop_kmost frequent categories in the data and uses those; categories below this threshold are encoded to0. This value must be less than1,000,000to avoid problems due to high dimensionality. The default value is32,000.frequency_threshold: anINT64value that limits the categories included in the encoding vocabulary based on category frequency. The function uses categories whose frequency is greater than or equal tofrequency_threshold; categories below this threshold are encoded to0. The default value is5.
Output
ML.LABEL_ENCODER returns an INT64 value that represents the encoded
string expression.
Example
The following example performs label encoding on a set of string expressions. It limits the encoding vocabulary to the two categories that occur the most frequently in the data and that also occur two or more times.
SELECT f, ML.LABEL_ENCODER(f, 2, 2) OVER () AS output FROM UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f ORDER BY f;
The output looks similar to the following:
+------+--------+ | f | output | +------+--------+ | NULL | 0 | | a | 0 | | b | 1 | | b | 1 | | c | 2 | | c | 2 | | c | 2 | | d | 0 | | d | 0 | +------+--------+
What's next
- For information about feature preprocessing, see Feature preprocessing overview.
- For information about the supported SQL statements and functions for each