Module model_selection (2.11.0)

Functions for train/test splitting and model tuning. This module is styled after scikit-learn's model_selection module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

Classes

KFold

KFold(n_splits: int = 5, *, random_state: typing.Optional[int] = None)

K-Fold cross-validator.

Provides train/test splits of a dataset by dividing it into k consecutive folds.

Each fold is then used once as the validation set while the remaining k - 1 folds form the training set.

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import KFold
>>> bpd.options.display.progress_bar = None
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> kf = KFold(n_splits=3, random_state=42)
>>> for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f"  X_train: {X_train}")
...     print(f"  X_test: {X_test}")
...     print(f"  y_train: {y_train}")
...     print(f"  y_test: {y_test}")
...
Fold 0:
  X_train:    feat0  feat1
1      3      4
2      5      6
<BLANKLINE>
[2 rows x 2 columns]
  X_test:    feat0  feat1
0      1      2
<BLANKLINE>
[1 rows x 2 columns]
  y_train:    label
1      2
2      3
<BLANKLINE>
[2 rows x 1 columns]
  y_test:    label
0      1
<BLANKLINE>
[1 rows x 1 columns]
Fold 1:
  X_train:    feat0  feat1
0      1      2
2      5      6
<BLANKLINE>
[2 rows x 2 columns]
  X_test:    feat0  feat1
1      3      4
<BLANKLINE>
[1 rows x 2 columns]
  y_train:    label
0      1
2      3
<BLANKLINE>
[2 rows x 1 columns]
  y_test:    label
1      2
<BLANKLINE>
[1 rows x 1 columns]
Fold 2:
  X_train:    feat0  feat1
0      1      2
1      3      4
<BLANKLINE>
[2 rows x 2 columns]
  X_test:    feat0  feat1
2      5      6
<BLANKLINE>
[1 rows x 2 columns]
  y_train:    label
0      1
1      2
<BLANKLINE>
[2 rows x 1 columns]
  y_test:    label
2      3
<BLANKLINE>
[1 rows x 1 columns]

Parameters

n_splits (int)
  Number of folds. Must be at least 2. Defaults to 5.

random_state (Optional[int])
  A seed to use for randomly choosing the rows of the split. If not set, a different random split is generated each time. Defaults to None.
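
The fold logic of KFold can be sketched in plain Python. This is a minimal illustration, not the BigQuery DataFrames implementation: the helper name `kfold_indices` is hypothetical, it works on row positions rather than DataFrames, and it assumes that rows are shuffled with the seed before being cut into k consecutive folds.

```python
import random
from typing import Iterator, List, Optional, Tuple


def kfold_indices(
    n_rows: int, n_splits: int = 5, random_state: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once per fold (hypothetical helper)."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_rows:
        raise ValueError("n_splits cannot exceed the number of rows")

    indices = list(range(n_rows))
    # Assumption: the seed controls a shuffle of the rows. With
    # random_state=None, random.Random seeds itself from system entropy,
    # so each call produces a different split.
    random.Random(random_state).shuffle(indices)

    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]                # this fold validates
        train = indices[:start] + indices[stop:]  # the other k - 1 folds train
        yield train, test
        start = stop


folds = list(kfold_indices(7, n_splits=3, random_state=42))
for i, (train, test) in enumerate(folds):
    print(f"Fold {i}: train={sorted(train)} test={sorted(test)}")
```

With 7 rows and n_splits=3, the test folds in this sketch have sizes 3, 2 and 2, and every row lands in exactly one test fold.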

Functions

cross_validate

cross_validate(
    estimator,
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    y: typing.Optional[
        typing.Union[
            bigframes.dataframe.DataFrame,
            bigframes.series.Series,
            pandas.core.frame.DataFrame,
            pandas.core.series.Series,
        ]
    ] = None,
    *,
    cv: typing.Optional[typing.Union[int, bigframes.ml.model_selection.KFold]] = None
) -> dict[str, list]

Evaluate metric(s) by cross-validation and also record fit/score times.

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import cross_validate, KFold
>>> from bigframes.ml.linear_model import LinearRegression
>>> bpd.options.display.progress_bar = None
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> model = LinearRegression()
>>> scores = cross_validate(model, X, y, cv=3) # doctest: +SKIP
>>> for score in scores["test_score"]: # doctest: +SKIP
...   print(score["mean_squared_error"][0])
...
5.218167286047954e-19
2.726229944928669e-18
1.6197635612324266e-17

Parameters

y (bigframes.dataframe.DataFrame, bigframes.series.Series or None)
  The target variable to try to predict in the case of supervised learning. Defaults to None.

cv (int, bigframes.ml.model_selection.KFold or None)
  Determines the cross-validation splitting strategy. Possible inputs for cv are:
  - None, to use the default 5-fold cross-validation,
  - int, to specify the number of folds in a KFold,
  - a bigframes.ml.model_selection.KFold instance.

Returns

Dict[str, List]
  A dict of arrays containing the score/time arrays for each scorer is returned. The keys for this dict are:
  - test_score: The score array for test scores on each cv split.
  - fit_time: The time for fitting the estimator on the train set for each cv split.
  - score_time: The time for scoring the estimator on the test set for each cv split.
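
To make the shape of the returned dict concrete, the loop below sketches cross-validation with timing in plain Python. It is an illustration under stated assumptions, not the library's code: `cross_validate_sketch`, `MeanRegressor` and `mean_squared_error` are hypothetical stand-ins, only None and int values of cv are handled, folds are taken in row order without shuffling, and each test_score entry is a plain float (in the example above, each entry is indexed by metric name instead).

```python
import time
from typing import Dict, List, Optional


def mean_squared_error(y_true: List[float], y_pred: List[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


class MeanRegressor:
    """Hypothetical toy estimator: always predicts the training-label mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def cross_validate_sketch(estimator, X, y, cv: Optional[int] = None) -> Dict[str, List[float]]:
    n_splits = 5 if cv is None else cv  # None falls back to 5 folds
    n_rows = len(X)
    if not 2 <= n_splits <= n_rows:
        raise ValueError("cv must be between 2 and the number of rows")

    results: Dict[str, List[float]] = {"test_score": [], "fit_time": [], "score_time": []}
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]

        # Time the fit on the k - 1 training folds.
        t0 = time.perf_counter()
        estimator.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        results["fit_time"].append(time.perf_counter() - t0)

        # Time prediction plus scoring on the held-out fold.
        t0 = time.perf_counter()
        y_pred = estimator.predict([X[i] for i in test_idx])
        score = mean_squared_error([y[i] for i in test_idx], y_pred)
        results["score_time"].append(time.perf_counter() - t0)
        results["test_score"].append(score)
        start = stop
    return results


X = [[1, 2], [3, 4], [5, 6]]
y = [1, 2, 3]
scores = cross_validate_sketch(MeanRegressor(), X, y, cv=3)
print(scores["test_score"])  # one score per fold
```

Each of the three lists has one entry per fold, so the per-fold values can be averaged or inspected individually.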

train_test_split

train_test_split(
    *arrays: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    test_size: typing.Optional[float] = None,
    train_size: typing.Optional[float] = None,
    random_state: typing.Optional[int] = None,
    stratify: typing.Optional[bigframes.series.Series] = None
) -> typing.List[typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]

Splits dataframes or series into random train and test subsets.

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import train_test_split
>>> bpd.options.display.progress_bar = None
>>> X = bpd.DataFrame({"feat0": [0, 2, 4, 6, 8], "feat1": [1, 3, 5, 7, 9]})
>>> y = bpd.DataFrame({"label": [0, 1, 2, 3, 4]})
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
    feat0  feat1
0      0      1
1      2      3
4      8      9
<BLANKLINE>
[3 rows x 2 columns]
>>> y_train
    label
0      0
1      1
4      4
<BLANKLINE>
[3 rows x 1 columns]
>>> X_test
    feat0  feat1
2      4      5
3      6      7
<BLANKLINE>
[2 rows x 2 columns]
>>> y_test
    label
2      2
3      3
<BLANKLINE>
[2 rows x 1 columns]

Parameters

*arrays (bigframes.dataframe.DataFrame or bigframes.series.Series)
  A sequence of BigQuery DataFrames or Series that can be joined on their indexes.

test_size (default None)
  The proportion of the dataset to include in the test split. If None, this defaults to the complement of train_size. If both are None, it is set to 0.25.

train_size (default None)
  The proportion of the dataset to include in the train split. If None, this defaults to the complement of test_size.

random_state (default None)
  A seed to use for randomly choosing the rows of the split. If not set, a different random split is generated each time.
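
The defaulting rules for test_size and train_size can be written out as a small function. This is a sketch of the documented behaviour only: the name `resolve_split_sizes` is hypothetical, and the range check at the end is an assumption, since the text above does not say how out-of-range proportions are handled.

```python
from typing import Optional, Tuple


def resolve_split_sizes(
    test_size: Optional[float] = None, train_size: Optional[float] = None
) -> Tuple[float, float]:
    """Return (train_size, test_size) after applying the documented defaults."""
    if test_size is None:
        # Complement of train_size, or 0.25 when neither is given.
        test_size = 0.25 if train_size is None else 1.0 - train_size
    if train_size is None:
        train_size = 1.0 - test_size

    # Assumption (not stated in the source): reject proportions outside [0, 1].
    for name, value in (("train_size", train_size), ("test_size", test_size)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return train_size, test_size


print(resolve_split_sizes())                 # neither given
print(resolve_split_sizes(train_size=0.5))   # test_size is the complement
print(resolve_split_sizes(test_size=0.33))   # train_size is the complement
```

The sketch only resolves proportions. How a proportion is turned into a concrete number of rows is left to the library.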

Returns

A list of BigQuery DataFrames or Series.