The ML.BUCKETIZE function
This document describes the ML.BUCKETIZE function, which lets you split
a numerical expression into buckets.
You can use this function with models that support manual feature preprocessing. For more information, see the following documents:
Syntax
ML.BUCKETIZE(numerical_expression, array_split_points [, exclude_boundaries] [, output_format])
Arguments
ML.BUCKETIZE takes the following arguments:
numerical_expression: the numerical expression to bucketize.array_split_points: an array of numerical values that provide the points at which to split thenumerical_expressionvalue. The numerical values in the array must be finite, so not-inf,inf, orNaN. Provide the numerical values in order, lowest to highest. The range of possible buckets is determined by the upper and lower boundaries of the array. For example, if thearray_split_pointsvalue is[1, 2, 3, 4], then there are five potential buckets that thenumerical_expressionvalue can be bucketized into.exclude_boundaries: aBOOLvalue that determines whether the upper and lower boundaries fromarray_split_pointsare used. IfTRUE, then the boundary values aren't used to create buckets. For example, if thearray_split_pointsvalue is[1, 2, 3, 4]andexclude_boundariesisTRUE, then there are three potential buckets that thenumerical_expressionvalue can be bucketized into. The default value isFALSE.output_format: aSTRINGvalue that specifies the output format of the bucket. Valid output formats are as follows:bucket_names: returns aSTRINGvalue in the formatbin_<bucket_index>. For example,bin_3. Thebucket_indexvalue starts at 1. This is the default bucket format.bucket_ranges: returns aSTRINGvalue in the format[lower_bound, upper_bound)in interval notation. For example,(-inf, 2.5),[2.5, 4.6),[4.6, +inf).bucket_ranges_json: returns a JSON-formattedSTRINGvalue in the format{"start": "lower_bound", "end": "upper_bound"}. For example,{"start": "-Infinity", "end": "2.5"},{"start": "2.5", "end": "4.6"},{"start": "4.6", "end": "Infinity"}. The inclusivity and exclusivity of the lower and upper bound follow the same pattern as thebucket_rangesoption.
Output
ML.BUCKETIZE returns a STRING value that contains the name of the bucket, in the format specified by the output_format argument.
Example
The following example bucketizes a numerical expression both with and without boundaries:
SELECT ML.BUCKETIZE(2.5, [1, 2, 3]) AS bucket, ML.BUCKETIZE(2.5, [1, 2, 3], TRUE) AS bucket_without_boundaries, ML.BUCKETIZE(2.5, [1, 2, 3], FALSE, "bucket_ranges") AS bucket_ranges, ML.BUCKETIZE(2.5, [1, 2, 3], FALSE, "bucket_ranges_json") AS bucket_ranges_json;
The output looks similar to the following:
+--------+---------------------------+---------------+----------------------------+
| bucket | bucket_without_boundaries | bucket_ranges | bucket_ranges_json |
|--------|---------------------------|---------------|----------------------------|
| bin_3 | bin_2 | [2, 3) | {"start": "2", "end": "3"} |
+--------+---------------------------+---------------+----------------------------+
What's next
- For information about feature preprocessing, see Feature preprocessing overview.