Module model_selection (2.30.0)

Functions for test/train split and model tuning. This module is styled after scikit-learn's model_selection module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection.

Classes

KFold

KFold(n_splits: int = 5, *, random_state: typing.Optional[int] = None)

K-Fold cross-validator.

Splits the dataset into k consecutive folds, each yielding one train/test split.

Each fold is used once as the validation set while the remaining k - 1 folds form the training set.

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import KFold
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> kf = KFold(n_splits=3, random_state=42)
>>> for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f"  X_train: {X_train}")
...     print(f"  X_test: {X_test}")
...     print(f"  y_train: {y_train}")
...     print(f"  y_test: {y_test}")
...
Fold 0:
  X_train:    feat0  feat1
1      3      4
2      5      6
<BLANKLINE>
[2 rows x 2 columns]
  X_test:    feat0  feat1
0      1      2
<BLANKLINE>
[1 rows x 2 columns]
  y_train:    label
1      2
2      3
<BLANKLINE>
[2 rows x 1 columns]
  y_test:    label
0      1
<BLANKLINE>
[1 rows x 1 columns]
Fold 1:
  X_train:    feat0  feat1
0      1      2
2      5      6
<BLANKLINE>
[2 rows x 2 columns]
  X_test:    feat0  feat1
1      3      4
<BLANKLINE>
[1 rows x 2 columns]
  y_train:    label
0      1
2      3
<BLANKLINE>
[2 rows x 1 columns]
  y_test:    label
1      2
<BLANKLINE>
[1 rows x 1 columns]
Fold 2:
  X_train:    feat0  feat1
0      1      2
1      3      4
<BLANKLINE>
[2 rows x 2 columns]
  X_test:    feat0  feat1
2      5      6
<BLANKLINE>
[1 rows x 2 columns]
  y_train:    label
0      1
1      2
<BLANKLINE>
[2 rows x 1 columns]
  y_test:    label
2      3
<BLANKLINE>
[1 rows x 1 columns]
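
The partitioning described above can be sketched in plain Python, without BigQuery DataFrames. This is an illustration only, not the library's implementation: `consecutive_folds` is a hypothetical helper, it works on row positions rather than DataFrames, it ignores any shuffling that `random_state` may control, and it assumes the scikit-learn convention of giving the first `n_rows % n_splits` folds one extra row.

```python
from typing import Iterator, List, Tuple


def consecutive_folds(
    n_rows: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_positions, test_positions) for k consecutive folds.

    Hypothetical helper for illustration; not part of bigframes.
    """
    if n_splits < 2 or n_splits > n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    # Assumption: fold sizes differ by at most one row.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        # Every row outside the current fold goes to the training set.
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, test
        start = stop


folds = list(consecutive_folds(3, n_splits=3))
for i, (train, test) in enumerate(folds):
    print(f"Fold {i}: train={train} test={test}")
```

With three rows and three splits, each row is held out exactly once, which is the same pattern as the example output above.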

chain

chain(*iterables) --> chain object

Return a chain object whose `__next__()` method returns elements from the first iterable until it is exhausted, then elements from the next iterable, until all of the iterables are exhausted.
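
The signature and description match Python's standard `itertools.chain`, so the short sketch below assumes that is the object surfaced here. It shows the iterables being consumed one after another.

```python
from itertools import chain

# Elements come from the list first, then the tuple, then the string.
merged = list(chain([1, 2], (3,), "ab"))
print(merged)
```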

Modules Functions

cast

cast(typ, val)

Cast a value to a type.

This returns the value unchanged. To the type checker this signals that the return value has the designated type, but at runtime we intentionally don't check anything (we want this to be as fast as possible).
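
This description matches `typing.cast` from the Python standard library. Assuming that is the function listed here, the sketch below shows that the call affects only static type checking: the same object comes back, and a mismatched type is not detected at runtime.

```python
from typing import List, cast

raw: object = [1, 2, 3]

# The type checker now treats `numbers` as List[int]; nothing is converted.
numbers = cast(List[int], raw)
print(numbers is raw)

# No runtime check happens, so this "cast" still yields a str.
still_str = cast(int, "not an int")
print(type(still_str).__name__)
```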

cross_validate

cross_validate(estimator, X, y=None, *, cv=None)

Evaluate metric(s) by cross-validation and also record fit/score times.

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import cross_validate, KFold
>>> from bigframes.ml.linear_model import LinearRegression
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> model = LinearRegression()
>>> scores = cross_validate(model, X, y, cv=3) # doctest: +SKIP
>>> for score in scores["test_score"]: # doctest: +SKIP
...   print(score["mean_squared_error"][0])
...
5.218167286047954e-19
2.726229944928669e-18
1.6197635612324266e-17

Parameters
Name	Description
`X`	`bigframes.dataframe.DataFrame or bigframes.series.Series` The data to fit.
`y`	`bigframes.dataframe.DataFrame, bigframes.series.Series or None` The target variable to try to predict in the case of supervised learning. Defaults to None.
`cv`	`int, bigframes.ml.model_selection.KFold or None` Determines the cross-validation splitting strategy. Possible inputs for cv are: None, to use the default 5-fold cross-validation; an int, to specify the number of folds in a `KFold`; or a bigframes.ml.model_selection.KFold instance.
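
The three accepted forms of `cv` can be pictured as a small normalization step. The sketch below is an assumption about how such a step might look, not the library's code: `KFoldStandIn` is a hypothetical stand-in for `bigframes.ml.model_selection.KFold`, and `resolve_cv` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class KFoldStandIn:
    """Hypothetical stand-in for KFold so the sketch runs without bigframes."""

    n_splits: int = 5
    random_state: Optional[int] = None


def resolve_cv(cv: Union[int, KFoldStandIn, None] = None) -> KFoldStandIn:
    """Turn the documented cv inputs into a splitter object (illustrative)."""
    if cv is None:
        # Documented default: 5-fold cross-validation.
        return KFoldStandIn(n_splits=5)
    if isinstance(cv, KFoldStandIn):
        # An existing splitter is used as-is.
        return cv
    if isinstance(cv, int) and not isinstance(cv, bool):
        # An int gives the number of folds.
        return KFoldStandIn(n_splits=cv)
    raise TypeError("cv must be None, an int, or a KFold instance")


print(resolve_cv().n_splits)
print(resolve_cv(3).n_splits)
```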

Returns
Type	Description
`Dict[str, List]`	A dict of arrays containing the score/time arrays for each scorer. The keys of this `dict` are: `test_score`, the score array for test scores on each cv split; `fit_time`, the time for fitting the estimator on the train set for each cv split; `score_time`, the time for scoring the estimator on the test set for each cv split.
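
As a rough illustration of working with the `test_score` entry, the sketch below summarizes per-fold errors. It does not call `cross_validate`: the `scores` dict is a plain-Python stand-in whose values are copied from the example output above, and the nested dict-of-lists shape is an assumption chosen so that `score["mean_squared_error"][0]` indexes the same way as in that example. The `fit_time` and `score_time` keys are omitted because the source shows no values for them.

```python
from statistics import mean

# Stand-in for the returned dict (assumed shape, values from the example above).
scores = {
    "test_score": [
        {"mean_squared_error": [5.218167286047954e-19]},
        {"mean_squared_error": [2.726229944928669e-18]},
        {"mean_squared_error": [1.6197635612324266e-17]},
    ]
}

# One mean squared error per cv split.
per_fold_mse = [score["mean_squared_error"][0] for score in scores["test_score"]]
mean_mse = mean(per_fold_mse)

# Position of the split with the largest error.
worst_fold = max(range(len(per_fold_mse)), key=per_fold_mse.__getitem__)
print(f"mean MSE: {mean_mse:.3e}, worst fold: {worst_fold}")
```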

train_test_split

train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    stratify=None,
    shuffle=True
)

Splits dataframes or series into random train and test subsets.

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import train_test_split
>>> X = bpd.DataFrame({"feat0": [0, 2, 4, 6, 8], "feat1": [1, 3, 5, 7, 9]})
>>> y = bpd.DataFrame({"label": [0, 1, 2, 3, 4]})
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
    feat0  feat1
0      0      1
1      2      3
4      8      9
<BLANKLINE>
[3 rows x 2 columns]
>>> y_train
    label
0      0
1      1
4      4
<BLANKLINE>
[3 rows x 1 columns]
>>> X_test
    feat0  feat1
2      4      5
3      6      7
<BLANKLINE>
[2 rows x 2 columns]
>>> y_test
    label
2      2
3      3
<BLANKLINE>
[2 rows x 1 columns]

Parameters
Name	Description
`*arrays`	`bigframes.dataframe.DataFrame or bigframes.series.Series` A sequence of BigQuery DataFrames or Series that can be joined on their indexes.
`test_size`	`default None` The proportion of the dataset to include in the test split. If None, this defaults to the complement of train_size. If both are None, it is set to 0.25.
`train_size`	`default None` The proportion of the dataset to include in the train split. If None, this will default to the complement of test_size.
`random_state`	`default None` A seed to use for randomly choosing the rows of the split. If not set, a random split will be generated each time.
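
The defaulting rules for `test_size` and `train_size` described above can be sketched as follows. `resolve_split_sizes` is a hypothetical helper written for illustration, not a bigframes function, and the range checks at the end are an added assumption rather than documented behavior.

```python
from typing import Optional, Tuple


def resolve_split_sizes(
    test_size: Optional[float] = None, train_size: Optional[float] = None
) -> Tuple[float, float]:
    """Return (train_size, test_size) after applying the documented defaults."""
    if test_size is None and train_size is None:
        # Both unset: the test split takes 0.25 of the data.
        test_size = 0.25
    if test_size is None:
        # Only train_size given: test split is its complement.
        test_size = 1.0 - train_size
    if train_size is None:
        # Only test_size given (or defaulted): train split is its complement.
        train_size = 1.0 - test_size
    # Assumption: proportions must be valid and must not overlap.
    for name, value in (("train_size", train_size), ("test_size", test_size)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size and test_size must sum to at most 1")
    return train_size, test_size


print(resolve_split_sizes())
print(resolve_split_sizes(test_size=0.33))
```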

Returns
Type	Description
`List[Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]`	A list of BigQuery DataFrames or Series.
