Functions for test/train split and model tuning. This module is styled after scikit-learn's model_selection module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection .
Classes
KFold
KFold(n_splits: int = 5, *, random_state: typing.Optional[int] = None)
K-Fold cross-validator.
Splits the dataset into k consecutive folds of train/test sets.
Each fold is used once as the validation set while the remaining k - 1 folds form the training set.
Examples:
>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import KFold
>>> bpd.options.display.progress_bar = None
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> kf = KFold(n_splits=3, random_state=42)
>>> for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f" X_train: {X_train}")
...     print(f" X_test: {X_test}")
...     print(f" y_train: {y_train}")
...     print(f" y_test: {y_test}")
...
Fold 0:
X_train: feat0 feat1
1 3 4
2 5 6
<BLANKLINE>
[2 rows x 2 columns]
X_test: feat0 feat1
0 1 2
<BLANKLINE>
[1 rows x 2 columns]
y_train: label
1 2
2 3
<BLANKLINE>
[2 rows x 1 columns]
y_test: label
0 1
<BLANKLINE>
[1 rows x 1 columns]
Fold 1:
X_train: feat0 feat1
0 1 2
2 5 6
<BLANKLINE>
[2 rows x 2 columns]
X_test: feat0 feat1
1 3 4
<BLANKLINE>
[1 rows x 2 columns]
y_train: label
0 1
2 3
<BLANKLINE>
[2 rows x 1 columns]
y_test: label
1 2
<BLANKLINE>
[1 rows x 1 columns]
Fold 2:
X_train: feat0 feat1
0 1 2
1 3 4
<BLANKLINE>
[2 rows x 2 columns]
X_test: feat0 feat1
2 5 6
<BLANKLINE>
[1 rows x 2 columns]
y_train: label
0 1
1 2
<BLANKLINE>
[2 rows x 1 columns]
y_test: label
2 3
<BLANKLINE>
[1 rows x 1 columns]
n_splits
int
Number of folds. Must be at least 2. Defaults to 5.
random_state
Optional[int]
A seed to use for randomly choosing the rows of the split. If not set, a random split will be generated each time. Defaults to None.
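The folding scheme described above can be sketched in plain Python. This illustrates k-fold partitioning in general and is not the bigframes implementation: `kfold_indices` is a hypothetical helper, and shuffling the rows with a seeded generator before folding is an assumption about how `random_state` might be applied.

```python
import random
from typing import Iterator, List, Optional, Tuple


def kfold_indices(
    n_rows: int, n_splits: int = 5, random_state: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_rows:
        raise ValueError("n_splits cannot exceed the number of rows")

    indices = list(range(n_rows))
    # Assumption: rows are shuffled with a seeded generator before folding.
    random.Random(random_state).shuffle(indices)

    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(n_rows=7, n_splits=3, random_state=42))
for i, (train, test) in enumerate(folds):
    print(f"Fold {i}: train={sorted(train)} test={sorted(test)}")
```

Every row index lands in exactly one test fold, and the training set of each fold is the union of the other k - 1 folds.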
Modules Functions
cross_validate
cross_validate(
    estimator,
    X: typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series, pandas.core.frame.DataFrame, pandas.core.series.Series],
    y: typing.Optional[typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series, pandas.core.frame.DataFrame, pandas.core.series.Series]] = None,
    *,
    cv: typing.Optional[typing.Union[int, bigframes.ml.model_selection.KFold]] = None
) -> dict[str, list]
Evaluate metric(s) by cross-validation and also record fit/score times.
Examples:
>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import cross_validate, KFold
>>> from bigframes.ml.linear_model import LinearRegression
>>> bpd.options.display.progress_bar = None
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> model = LinearRegression()
>>> scores = cross_validate(model, X, y, cv=3) # doctest: +SKIP
>>> for score in scores["test_score"]: # doctest: +SKIP
...     print(score["mean_squared_error"][0])
...
5.218167286047954e-19
2.726229944928669e-18
1.6197635612324266e-17
X
y
bigframes.dataframe.DataFrame, bigframes.series.Series or None
The target variable to try to predict in the case of supervised learning. Defaults to None.
cv
int, bigframes.ml.model_selection.KFold or None
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross-validation,
- int, to specify the number of folds in a KFold,
- a bigframes.ml.model_selection.KFold instance.
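The three accepted forms of cv can be read as a small dispatch. The sketch below uses a hypothetical `FoldSpec` dataclass as a stand-in for a KFold-like object and a hypothetical `resolve_cv` helper. It illustrates the documented rules only and is not how bigframes resolves the argument internally.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class FoldSpec:
    """Hypothetical stand-in for a KFold-like splitter (not the bigframes class)."""

    n_splits: int = 5
    random_state: Optional[int] = None

    def __post_init__(self) -> None:
        if self.n_splits < 2:
            raise ValueError("n_splits must be at least 2")


def resolve_cv(cv: Optional[Union[int, FoldSpec]] = None) -> FoldSpec:
    """Map the cv argument onto a concrete splitter, following the documented rules."""
    if cv is None:
        # None: fall back to the default 5-fold split.
        return FoldSpec()
    if isinstance(cv, FoldSpec):
        # An existing splitter instance is used unchanged.
        return cv
    if isinstance(cv, int):
        # int: interpreted as the number of folds.
        return FoldSpec(n_splits=cv)
    raise TypeError(f"unsupported cv value: {cv!r}")


custom = FoldSpec(n_splits=4, random_state=42)
print(resolve_cv())        # default splitter with 5 folds
print(resolve_cv(3))       # splitter with 3 folds
print(resolve_cv(custom))  # the same instance that was passed in
```

Passing a splitter instance is the only one of the three forms that lets the caller fix `random_state`, which matters when folds must be reproducible across runs.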
Dict[str, List]
The keys of the returned dict are:
test_score
The score array for test scores on each cv split.
fit_time
The time for fitting the estimator on the train set for each cv split.
score_time
The time for scoring the estimator on the test set for each cv split.
train_test_split
train_test_split(
    *arrays: typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series, pandas.core.frame.DataFrame, pandas.core.series.Series],
    test_size: typing.Optional[float] = None,
    train_size: typing.Optional[float] = None,
    random_state: typing.Optional[int] = None,
    stratify: typing.Optional[bigframes.series.Series] = None
) -> typing.List[typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]
Splits dataframes or series into random train and test subsets.
Examples:
>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import train_test_split
>>> bpd.options.display.progress_bar = None
>>> X = bpd.DataFrame({"feat0": [0, 2, 4, 6, 8], "feat1": [1, 3, 5, 7, 9]})
>>> y = bpd.DataFrame({"label": [0, 1, 2, 3, 4]})
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
feat0 feat1
0 0 1
1 2 3
4 8 9
<BLANKLINE>
[3 rows x 2 columns]
>>> y_train
label
0 0
1 1
4 4
<BLANKLINE>
[3 rows x 1 columns]
>>> X_test
feat0 feat1
2 4 5
3 6 7
<BLANKLINE>
[2 rows x 2 columns]
>>> y_test
label
2 2
3 3
<BLANKLINE>
[2 rows x 1 columns]
*arrays
bigframes.dataframe.DataFrame or bigframes.series.Series
A sequence of BigQuery DataFrames or Series that can be joined on their indexes.
test_size
default None
The proportion of the dataset to include in the test split. If None, this defaults to the complement of train_size. If both are None, it is set to 0.25.
train_size
default None
The proportion of the dataset to include in the train split. If None, this will default to the complement of test_size.
random_state
default None
A seed to use for randomly choosing the rows of the split. If not set, a random split will be generated each time.
List[Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]
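The defaulting rules for test_size and train_size can be written out as a short function. This is a minimal sketch of the rules as documented above, not the library's code: `resolve_split_sizes` is a hypothetical helper, and the check that the two proportions sum to at most 1 is an added assumption.

```python
from typing import Optional, Tuple


def resolve_split_sizes(
    test_size: Optional[float] = None, train_size: Optional[float] = None
) -> Tuple[float, float]:
    """Return (train_size, test_size) after applying the documented defaults."""
    if test_size is None and train_size is None:
        # Neither given: the test split defaults to 0.25.
        test_size = 0.25
    if test_size is None:
        # Only train_size given: test takes the complement.
        test_size = 1.0 - train_size
    if train_size is None:
        # Only test_size given (or defaulted): train takes the complement.
        train_size = 1.0 - test_size
    # Assumption: proportions that sum to more than 1 are rejected.
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size and test_size must sum to at most 1")
    return train_size, test_size


print(resolve_split_sizes())                # (0.75, 0.25)
print(resolve_split_sizes(test_size=0.33))  # train is roughly 0.67
print(resolve_split_sizes(train_size=0.8))  # test is roughly 0.2
```

How a proportion is rounded to a whole number of rows is not specified here, so the sketch stops at proportions.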