Class KMeans (1.34.0)

KMeans(
    n_clusters: int = 8,
    *,
    init: typing.Literal["kmeans++", "random", "custom"] = "kmeans++",
    init_col: typing.Optional[str] = None,
    distance_type: typing.Literal["euclidean", "cosine"] = "euclidean",
    max_iter: int = 20,
    tol: float = 0.01,
    warm_start: bool = False
)

K-Means clustering.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> from bigframes.ml.cluster import KMeans

>>> X = bpd.DataFrame({"feat0": [1, 1, 1, 10, 10, 10], "feat1": [2, 4, 0, 2, 4, 0]})
>>> kmeans = KMeans(n_clusters=2).fit(X)
>>> kmeans.predict(bpd.DataFrame({"feat0": [0, 12], "feat1": [0, 3]}))["CENTROID_ID"] # doctest:+SKIP
0    1
1    2
Name: CENTROID_ID, dtype: Int64

>>> kmeans.cluster_centers_ # doctest:+SKIP
   centroid_id feature  numerical_value categorical_value
0            1   feat0              5.5                []
1            1   feat1              1.0                []
2            2   feat0              5.5                []
3            2   feat1              4.0                []

[4 rows x 4 columns]

Parameters

Name

Description

n_clusters

int, default 8

The number of clusters to form, which is also the number of centroids to generate. Defaults to 8.

init

"kmeans++", "random" or "custom", default "kmeans++"

The method used to initialize the clusters. Defaults to "kmeans++".
kmeans++: Initializes a number of centroids equal to the n_clusters value by using the k-means++ algorithm. This approach usually trains a better model than random cluster initialization.
random: Initializes the centroids by randomly selecting a number of data points equal to the n_clusters value from the input data.
custom: Initializes the centroids using a provided column of type bool. Rows with a value of True are used as the initial centroids. Specify the column with the init_col option.

init_col

str or None, default None

The name of the column to use to initialize the centroids. This column must be of type bool. If the column contains True for a given row, that row is used as an initial centroid. The number of True rows in this column must equal the value specified for the n_clusters option. Only works with the init method "custom". Defaults to None.
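The three initialization strategies can be illustrated outside BigQuery with a small pure-Python sketch. It is not the service's implementation: the function names (`kmeans_pp_init`, `random_init`, `custom_init`) and the `seed` argument are hypothetical, and the k-means++ step follows the textbook seeding rule, which is assumed here rather than confirmed by this page.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, n_clusters, seed=0):
    """Textbook k-means++ seeding (assumed, not BigQuery ML's exact code)."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < n_clusters:
        # Weight each point by its squared distance to the nearest chosen
        # centroid, so far-away points are more likely to be picked next.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def random_init(points, n_clusters, seed=0):
    """Pick n_clusters distinct rows uniformly at random."""
    return random.Random(seed).sample(points, n_clusters)


def custom_init(points, init_flags, n_clusters):
    """Use the rows flagged True (the role init_col plays) as centroids."""
    centroids = [p for p, flag in zip(points, init_flags) if flag]
    if len(centroids) != n_clusters:
        raise ValueError("number of True rows must equal n_clusters")
    return centroids


points = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
flags = [True, False, False, True, False, False]

pp_centroids = kmeans_pp_init(points, n_clusters=2)
rand_centroids = random_init(points, n_clusters=2)
custom_centroids = custom_init(points, flags, n_clusters=2)
```

With the estimator itself, the custom case would correspond to passing init="custom" together with init_col set to the name of the bool column in the training DataFrame.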

distance_type

"euclidean" or "cosine", default "euclidean"

The type of metric to use to compute the distance between two points. Defaults to "euclidean".
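To make the difference between the two metrics concrete, here is a short sketch using their usual definitions. It assumes cosine distance means one minus cosine similarity; the helper names are illustrative and not part of bigframes.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two points on the same ray from the origin: far apart in Euclidean
# terms, but pointing in the same direction.
p, q = (1.0, 1.0), (2.0, 2.0)
d_euclid = euclidean_distance(p, q)
d_cosine = cosine_distance(p, q)
```

Here the Euclidean distance is positive while the cosine distance is essentially zero, which is why cosine suits data where only the direction of a feature vector matters.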

max_iter

int, default 20

The maximum number of training iterations, where one iteration is a single pass over the entire training data. Defaults to 20.

tol

float, default 0.01

The minimum relative loss improvement that is necessary to continue training. For example, a value of 0.01 specifies that each iteration must reduce the loss by at least 1% for training to continue. Defaults to 0.01.
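The way max_iter and tol jointly bound training can be sketched as a stopping rule over a sequence of per-iteration losses. This is an illustration under stated assumptions: the loss values are made up, and `relative_improvement` and `iterations_run` are hypothetical helpers, not bigframes functions.

```python
def relative_improvement(prev_loss, curr_loss):
    """Fraction by which the loss dropped relative to the previous iteration."""
    if prev_loss == 0:
        return 0.0
    return (prev_loss - curr_loss) / prev_loss


def iterations_run(losses, max_iter=20, tol=0.01):
    """Count iterations executed, given the loss observed after each one."""
    done = 0
    prev = None
    for loss in losses[:max_iter]:  # never exceed max_iter passes
        done += 1
        if prev is not None and relative_improvement(prev, loss) < tol:
            break  # improvement fell below tol, so training stops
        prev = loss
    return done


# Made-up losses: a large drop, then improvements below 1%.
losses = [100.0, 50.0, 49.8, 49.7]
stopped_by_tol = iterations_run(losses)
stopped_by_max_iter = iterations_run(losses, max_iter=2)
```

The third iteration improves the loss by only 0.4%, so the default tol of 0.01 ends training there; lowering max_iter to 2 ends it earlier regardless of the improvement.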

warm_start

bool, default False

Determines whether to train a model with new training data, new model options, or both. Unless you explicitly override them, the options used to train the model initially are reused for the warm start run. Defaults to False.

Properties

cluster_centers_

Information about the cluster centers.

Returns

Type

Description

bigframes.dataframe.DataFrame

DataFrame of cluster centers, containing the following columns:
centroid_id: An integer that identifies the centroid.
feature: The column name that contains the feature.
numerical_value: If feature is numeric, the value of feature for the centroid that centroid_id identifies. If feature is not numeric, the value is NULL.
categorical_value: A list of mappings containing information about categorical features. Each mapping contains the following fields:
categorical_value.category: The name of each category.
categorical_value.value: The value of categorical_value.category for the centroid that centroid_id identifies.
The output contains one row per feature per centroid.
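Because the result holds one row per feature per centroid, it is often easier to read after grouping by centroid. The sketch below does that with plain dictionaries standing in for DataFrame rows; the values are copied from the example output near the top of this page, and `centers_by_id` is a hypothetical helper.

```python
from collections import defaultdict

# Rows in the long format described above (numeric features only).
rows = [
    {"centroid_id": 1, "feature": "feat0", "numerical_value": 5.5},
    {"centroid_id": 1, "feature": "feat1", "numerical_value": 1.0},
    {"centroid_id": 2, "feature": "feat0", "numerical_value": 5.5},
    {"centroid_id": 2, "feature": "feat1", "numerical_value": 4.0},
]


def centers_by_id(rows):
    """Group long-format rows into {centroid_id: {feature: value}}."""
    centers = defaultdict(dict)
    for row in rows:
        centers[row["centroid_id"]][row["feature"]] = row["numerical_value"]
    return dict(centers)


centers = centers_by_id(rows)
```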

Methods

__repr__

__repr__()

Print the estimator's constructor with all non-default parameter values.

detect_anomalies

detect_anomalies(
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    *,
    contamination: float = 0.1
) -> bigframes.dataframe.DataFrame

Detect the anomaly data points of the input.

Parameters

Name

Description

X

bigframes.dataframe.DataFrame or bigframes.series.Series

Series or DataFrame in which to detect anomalies.

contamination

float, default 0.1

Identifies the proportion of anomalies in the training dataset that are used to create the model. The value must be in the range [0, 0.5].

Returns

Type

Description

 bigframes.dataframe.DataFrame

DataFrame with the anomaly detection results.
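One way to read the contamination parameter is as the fraction of training points that end up above the anomaly threshold. The sketch below assumes points are ranked by distance to their nearest centroid and the most distant fraction is flagged; this page does not spell out the scoring, so treat the ranking rule and the helper names as assumptions.

```python
def anomaly_threshold(training_distances, contamination=0.1):
    """Distance at or above which a point counts as an anomaly."""
    if not 0 <= contamination <= 0.5:
        raise ValueError("contamination must be in the range [0, 0.5]")
    ordered = sorted(training_distances, reverse=True)
    n_anomalies = int(len(ordered) * contamination)
    if n_anomalies == 0:
        return float("inf")  # nothing is flagged
    return ordered[n_anomalies - 1]


def flag_anomalies(distances, threshold):
    """Mark each distance that reaches the threshold."""
    return [d >= threshold for d in distances]


# Made-up distances from each training point to its nearest centroid.
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
threshold = anomaly_threshold(training, contamination=0.1)
flags = flag_anomalies(training, threshold)
```

With ten training points and the default contamination of 0.1, only the single most distant point is flagged; raising contamination lowers the threshold and flags more points.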

fit

fit(
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    y: typing.Optional[
        typing.Union[
            bigframes.dataframe.DataFrame,
            bigframes.series.Series,
            pandas.core.frame.DataFrame,
            pandas.core.series.Series,
        ]
    ] = None,
) -> bigframes.ml.base._T

Compute k-means clustering.

Parameters

Name

Description

X

bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series

DataFrame of shape (n_samples, n_features). Training data.

y

default None

Not used, present here for API consistency by convention.

Returns

Type

Description

KMeans

Fitted estimator.
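For readers who want a picture of what computing a k-means clustering involves, here is a minimal pure-Python sketch of the classic alternating assign/update loop (Lloyd's algorithm). It is a generic illustration, not what fit runs in BigQuery; the data, starting centroids, and function names are all made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid unchanged
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def lloyd(points, centroids, max_iter=20):
    """Alternate assignment and update until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
```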

get_params

get_params(deep: bool = True) -> typing.Dict[str, typing.Any]

Get parameters for this estimator.

Parameter

Name

Description

deep

bool, default True

Defaults to True. If True, returns the parameters for this estimator and for contained subobjects that are estimators.

Returns

Type

Description

Dictionary

A dictionary of parameter names mapped to their values.
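The general pattern behind get_params, and behind a __repr__ that prints only non-default values, can be sketched with a toy class. `ToyEstimator` is hypothetical and exists only for this illustration; it does not reflect how bigframes implements these methods, and its `deep` flag is accepted but unused because the toy has no nested estimators.

```python
import inspect


class ToyEstimator:
    """Hypothetical estimator used only to illustrate get_params."""

    def __init__(self, n_clusters=8, tol=0.01):
        self.n_clusters = n_clusters
        self.tol = tol

    def get_params(self, deep=True):
        # Read the constructor's parameter names and look up the stored values.
        signature = inspect.signature(type(self).__init__)
        names = [name for name in signature.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}

    def __repr__(self):
        # Show only parameters whose values differ from the constructor defaults.
        defaults = inspect.signature(type(self).__init__).parameters
        changed = {
            name: value
            for name, value in self.get_params().items()
            if value != defaults[name].default
        }
        args = ", ".join(f"{name}={value!r}" for name, value in changed.items())
        return f"{type(self).__name__}({args})"


est = ToyEstimator(n_clusters=3)
params = est.get_params()
text = repr(est)
```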

predict

predict(
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ]
) -> bigframes.dataframe.DataFrame

Predict the closest cluster each sample in X belongs to.

Parameter

Name

Description

X

bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series

DataFrame of shape (n_samples, n_features). New data to predict.

Returns

Type

Description

 bigframes.dataframe.DataFrame

DataFrame of shape (n_samples, n_input_columns + n_prediction_columns). Returns predicted labels.
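Conceptually, prediction assigns each new row to its nearest centroid and returns the input columns alongside the prediction. The sketch below shows that idea with plain tuples and a Euclidean metric; the centroid coordinates are hypothetical, `predict_rows` is not a bigframes function, and only the CENTROID_ID column name is taken from the example on this page.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_rows(rows, centers):
    """Attach the id of the nearest centroid to each input row."""
    results = []
    for row in rows:
        centroid_id = min(centers, key=lambda cid: squared_distance(row, centers[cid]))
        # Keep the input values next to the prediction, mirroring the
        # "input columns + prediction columns" shape described above.
        results.append({"CENTROID_ID": centroid_id, "feat0": row[0], "feat1": row[1]})
    return results


# Hypothetical centroids keyed by centroid id.
centers = {1: (1.0, 2.0), 2: (10.0, 2.0)}
predictions = predict_rows([(0, 0), (12, 3)], centers)
```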

register

register(vertex_ai_model_id: typing.Optional[str] = None) -> bigframes.ml.base._T

Register the model to Vertex AI. After registering, go to the Google Cloud console ( https://console.cloud.google.com/vertex-ai/models ) to manage the model registries. Refer to https://cloud.google.com/vertex-ai/docs/model-registry/introduction for more options.

Parameter

Name

Description

vertex_ai_model_id

Optional[str], default None

Optional string to use as the model ID in Vertex AI. If not set, defaults to 'bigframes_{bq_model_id}'. The Vertex AI model ID is truncated to 63 characters because of a length limit.
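The defaulting and truncation rule for the model ID can be written out as a small helper. This is a sketch of the documented behavior, not the library's code: `resolve_vertex_model_id` is a hypothetical name, and keeping the leading 63 characters (rather than the trailing ones) is an assumption.

```python
MAX_VERTEX_ID_LENGTH = 63  # limit stated in the parameter description


def resolve_vertex_model_id(bq_model_id, vertex_ai_model_id=None):
    """Apply the default name, then truncate to the allowed length."""
    if vertex_ai_model_id is None:
        vertex_ai_model_id = f"bigframes_{bq_model_id}"
    # Assumption: truncation keeps the leading characters.
    return vertex_ai_model_id[:MAX_VERTEX_ID_LENGTH]


default_id = resolve_vertex_model_id("my_model")
explicit_id = resolve_vertex_model_id("my_model", "clusters_v1")
long_id = resolve_vertex_model_id("x" * 100)
```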

score

score(
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    y=None,
) -> bigframes.dataframe.DataFrame

Calculate evaluation metrics of the model.

Parameters

Name

Description

X

bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series

DataFrame of shape (n_samples, n_features). New data.

y

default None

Not used, present here for API consistency by convention.

Returns

Type

Description

 bigframes.dataframe.DataFrame

DataFrame of the metrics.
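This page does not list which metrics the returned DataFrame contains. As an illustration of one metric commonly reported for k-means models, the sketch below computes the mean squared distance from each point to its nearest centroid. The data and the `mean_squared_distance` helper are made up for the example, and the choice of metric is an assumption.

```python
def mean_squared_distance(points, centroids):
    """Average squared Euclidean distance to the nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - y) ** 2 for x, y in zip(p, c)) for c in centroids
        )
    return total / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.5), (10.0, 10.5)]
msd = mean_squared_distance(points, centroids)
```

Lower values indicate tighter clusters, so the metric is mainly useful for comparing models trained with different n_clusters values on the same data.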

to_gbq

to_gbq(model_name: str, replace: bool = False) -> bigframes.ml.cluster.KMeans

Save the model to BigQuery.

Parameters

Name

Description

model_name

str

The name of the model.

replace

bool, default False

Whether to replace the model if it already exists. Defaults to False.

Returns

Type

Description

KMeans

Saved model.
