Functions for test/train split and model tuning. This module is styled after scikit-learn's model_selection module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection.
Classes
KFold
KFold(n_splits: int = 5, *, random_state: typing.Optional[int] = None)
K-Fold cross-validator.
Splits a dataset into k consecutive folds and provides the train/test sets for each split.
Each fold is used once as the validation set while the remaining k - 1 folds form the training set.
Examples:
>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import KFold
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> kf = KFold(n_splits=3, random_state=42)
>>> for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f" X_train: {X_train}")
...     print(f" X_test: {X_test}")
...     print(f" y_train: {y_train}")
...     print(f" y_test: {y_test}")
...
Fold 0:
X_train: feat0 feat1
1 3 4
2 5 6
<BLANKLINE>
[2 rows x 2 columns]
X_test: feat0 feat1
0 1 2
<BLANKLINE>
[1 rows x 2 columns]
y_train: label
1 2
2 3
<BLANKLINE>
[2 rows x 1 columns]
y_test: label
0 1
<BLANKLINE>
[1 rows x 1 columns]
Fold 1:
X_train: feat0 feat1
0 1 2
2 5 6
<BLANKLINE>
[2 rows x 2 columns]
X_test: feat0 feat1
1 3 4
<BLANKLINE>
[1 rows x 2 columns]
y_train: label
0 1
2 3
<BLANKLINE>
[2 rows x 1 columns]
y_test: label
1 2
<BLANKLINE>
[1 rows x 1 columns]
Fold 2:
X_train: feat0 feat1
0 1 2
1 3 4
<BLANKLINE>
[2 rows x 2 columns]
X_test: feat0 feat1
2 5 6
<BLANKLINE>
[1 rows x 2 columns]
y_train: label
0 1
1 2
<BLANKLINE>
[2 rows x 1 columns]
y_test: label
2 3
<BLANKLINE>
[1 rows x 1 columns]
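The fold assignment shown in the output above can be sketched in plain Python on row positions. This is only an illustration of consecutive k-fold splitting, not the bigframes implementation; the helper name `consecutive_kfold_indices` is hypothetical, and the rule for distributing leftover rows (earlier folds get one extra row) is an assumption.

```python
from typing import Iterator, List, Tuple


def consecutive_kfold_indices(
    n_rows: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_positions, test_positions) for k consecutive folds.

    Hypothetical helper for illustration only; it works on row positions,
    not on BigQuery DataFrames.
    """
    if n_splits < 2 or n_splits > n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    # Assumption: spread rows as evenly as possible, giving the first
    # (n_rows % n_splits) folds one extra row each.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        # Everything outside the current fold is training data.
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, test
        start = stop


folds = list(consecutive_kfold_indices(n_rows=3, n_splits=3))
for i, (train, test) in enumerate(folds):
    print(f"Fold {i}: train={train} test={test}")
```

With three rows and three splits, each row serves as the test set exactly once, which matches the row labels in the KFold example.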
chain
chain(*iterables) --> chain object
Return a chain object whose .__next__() method returns elements from the first iterable until it is exhausted, then elements from the next iterable, until all of the iterables are exhausted.
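A short illustration of this behavior, using the standard library's `itertools.chain` (the variable name `combined` is only for the example):

```python
from itertools import chain

# Elements come from each iterable in turn, left to right.
combined = list(chain([1, 2], (3,), "ab"))
print(combined)
```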
Modules Functions
cast
cast(typ, val)
Cast a value to a type.
This returns the value unchanged. To the type checker this signals that the return value has the designated type, but at runtime we intentionally don't check anything (we want this to be as fast as possible).
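A minimal illustration with the standard library's `typing.cast`, showing that no conversion or check happens at runtime (the variable names are only for the example):

```python
from typing import cast

raw = "42"
# The type checker now treats as_int as an int, but the object is untouched.
as_int = cast(int, raw)
print(as_int is raw, type(as_int).__name__)
```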
cross_validate
cross_validate(estimator, X, y=None, *, cv=None)
Evaluate metric(s) by cross-validation and also record fit/score times.
Examples:
>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import cross_validate, KFold
>>> from bigframes.ml.linear_model import LinearRegression
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> model = LinearRegression()
>>> scores = cross_validate(model, X, y, cv=3) # doctest: +SKIP
>>> for score in scores["test_score"]: # doctest: +SKIP
...     print(score["mean_squared_error"][0])
...
5.218167286047954e-19
2.726229944928669e-18
1.6197635612324266e-17
| Parameters | |
|---|---|
| Name | Description |
| X | bigframes.dataframe.DataFrame or bigframes.series.Series. The data to fit. |
| y | bigframes.dataframe.DataFrame, bigframes.series.Series or None. The target variable to try to predict in the case of supervised learning. Defaults to None. |
| cv | int, bigframes.ml.model_selection.KFold or None. Determines the cross-validation splitting strategy. Possible inputs for cv are: - None, to use the default 5-fold cross validation, - int, to specify the number of folds in a |
| Returns | |
|---|---|
| Type | Description |
| Dict[str, List] | A dict of arrays containing the score/time arrays for each scorer is returned. The keys for this dict are: test_score: the score array for test scores on each cv split; fit_time: the time for fitting the estimator on the train set for each cv split; score_time: the time for scoring the estimator on the test set for each cv split. |
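The shape of the returned dict can be sketched with a plain-Python loop. This is an illustration under stated assumptions, not the bigframes implementation: `MeanRegressor` and `cross_validate_sketch` are hypothetical names, the data are Python lists rather than BigQuery DataFrames, folds are taken consecutively, and each test score is a single float (in the example above, each bigframes test score is indexed by metric name instead).

```python
import time
from typing import Dict, List, Sequence


class MeanRegressor:
    """Hypothetical toy estimator: predicts the mean of the training targets."""

    def fit(self, X: Sequence, y: Sequence[float]) -> "MeanRegressor":
        self.mean_ = sum(y) / len(y)
        return self

    def score(self, X: Sequence, y: Sequence[float]) -> float:
        # Mean squared error of the constant prediction.
        return sum((target - self.mean_) ** 2 for target in y) / len(y)


def cross_validate_sketch(
    estimator, X: Sequence, y: Sequence[float], cv: int = 5
) -> Dict[str, List[float]]:
    """Fit and score once per consecutive fold, recording timings."""
    n = len(X)
    if cv < 2 or cv > n:
        raise ValueError("cv must be between 2 and the number of rows")
    results: Dict[str, List[float]] = {
        "test_score": [],
        "fit_time": [],
        "score_time": [],
    }
    # Fold boundaries for consecutive, near-equal folds.
    bounds = [n * k // cv for k in range(cv + 1)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        X_train, y_train = list(X[:lo]) + list(X[hi:]), list(y[:lo]) + list(y[hi:])
        X_test, y_test = list(X[lo:hi]), list(y[lo:hi])

        t0 = time.perf_counter()
        estimator.fit(X_train, y_train)
        t1 = time.perf_counter()
        score = estimator.score(X_test, y_test)
        t2 = time.perf_counter()

        results["fit_time"].append(t1 - t0)
        results["score_time"].append(t2 - t1)
        results["test_score"].append(score)
    return results


scores = cross_validate_sketch(
    MeanRegressor(), [[1], [2], [3], [4], [5], [6]], [1, 2, 3, 4, 5, 6], cv=3
)
print(sorted(scores))
```

Each of the three lists has one entry per cv split, in split order.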
train_test_split
train_test_split(
*arrays,
test_size=None,
train_size=None,
random_state=None,
stratify=None,
shuffle=True
)
Splits dataframes or series into random train and test subsets.
Examples:
>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import train_test_split
>>> X = bpd.DataFrame({"feat0": [0, 2, 4, 6, 8], "feat1": [1, 3, 5, 7, 9]})
>>> y = bpd.DataFrame({"label": [0, 1, 2, 3, 4]})
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
feat0 feat1
0 0 1
1 2 3
4 8 9
<BLANKLINE>
[3 rows x 2 columns]
>>> y_train
label
0 0
1 1
4 4
<BLANKLINE>
[3 rows x 1 columns]
>>> X_test
feat0 feat1
2 4 5
3 6 7
<BLANKLINE>
[2 rows x 2 columns]
>>> y_test
label
2 2
3 3
<BLANKLINE>
[2 rows x 1 columns]
| Parameters | |
|---|---|
| Name | Description |
| \*arrays | bigframes.dataframe.DataFrame or bigframes.series.Series. A sequence of BigQuery DataFrames or Series that can be joined on their indexes. |
| test_size | default None. The proportion of the dataset to include in the test split. If None, this will default to the complement of train_size. If both are None, it will be set to 0.25. |
| train_size | default None. The proportion of the dataset to include in the train split. If None, this will default to the complement of test_size. |
| random_state | default None. A seed to use for randomly choosing the rows of the split. If not set, a random split will be generated each time. |
| Returns | |
|---|---|
| Type | Description |
| List[Union[bigframes.dataframe.DataFrame, bigframes.series.Series]] | A list of BigQuery DataFrames or Series. |
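The defaulting rules for test_size and train_size described in the parameters table can be sketched as a small helper. This is an illustration only: `resolve_split_sizes` is a hypothetical name, and the validation step (rejecting proportions outside (0, 1) or sums above 1) is an assumption rather than documented bigframes behavior.

```python
from typing import Optional, Tuple


def resolve_split_sizes(
    test_size: Optional[float] = None, train_size: Optional[float] = None
) -> Tuple[float, float]:
    """Return (train_size, test_size) after applying the documented defaults."""
    if test_size is None and train_size is None:
        # Both unset: the test split defaults to 0.25.
        test_size = 0.25
    if test_size is None:
        # Complement of train_size.
        test_size = 1.0 - train_size
    if train_size is None:
        # Complement of test_size.
        train_size = 1.0 - test_size
    # Assumption: proportions must be in (0, 1) and must not exceed 1 in total.
    if not (0.0 < test_size < 1.0 and 0.0 < train_size < 1.0):
        raise ValueError("test_size and train_size must be between 0 and 1")
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size + test_size must not exceed 1")
    return train_size, test_size


default_sizes = resolve_split_sizes()
from_test = resolve_split_sizes(test_size=0.2)
from_train = resolve_split_sizes(train_size=0.75)
print(default_sizes, from_test, from_train)
```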