Use ML and AI with BigQuery DataFrames

BigQuery DataFrames provides ML and AI capabilities through the bigframes.ml library.

You can preprocess data, create estimators to train models in BigQuery DataFrames, create ML pipelines, and split training and testing datasets.

Required roles

To get the permissions that you need to complete the tasks in this document, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

ML locations

The bigframes.ml library supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.

Preprocess data

Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations:

  • To bin continuous data into intervals, use the KBinsDiscretizer class in the bigframes.ml.preprocessing module.

  • To normalize the target labels as integer values, use the LabelEncoder class in the bigframes.ml.preprocessing module.

  • To scale each feature to the range [-1, 1] by its maximum absolute value, use the MaxAbsScaler class in the bigframes.ml.preprocessing module.

  • To standardize features by scaling each feature to the range [0, 1], use the MinMaxScaler class in the bigframes.ml.preprocessing module.

  • To standardize features by removing the mean and scaling to unit variance, use the StandardScaler class in the bigframes.ml.preprocessing module.

  • To transform categorical values into numeric format, use the OneHotEncoder class in the bigframes.ml.preprocessing module.

  • To apply transformers to DataFrame columns, use the ColumnTransformer class in the bigframes.ml.compose module.
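The arithmetic behind the three scalers listed above can be sketched in plain Python. This is only an illustration of the formulas, not the bigframes.ml.preprocessing implementation: the function names and sample values are hypothetical, and the use of the population standard deviation is an assumption (the library may use a different convention).

```python
from statistics import mean, pstdev


def max_abs_scale(values):
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]


def min_max_scale(values):
    """Shift and rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Remove the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, for illustration only.
flipper_lengths = [180.0, 190.0, 200.0, 210.0, 220.0]

min_max = min_max_scale(flipper_lengths)
max_abs = max_abs_scale([-4.0, 2.0, 1.0])
standardized = standard_scale(flipper_lengths)
```

The sketch assumes each input has at least two distinct values; a constant column would cause a division by zero in the last two functions.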

Train models

You can create estimators to train models in BigQuery DataFrames.

Clustering models

You can create estimators for clustering models by using the bigframes.ml.cluster module. To create K-means clustering models, use the KMeans class. Use these models for data segmentation, such as identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or a split of the data into training and evaluation sets.

The following code sample shows how to use the KMeans class in the bigframes.ml.cluster module to create a k-means clustering model for data segmentation:

    from bigframes.ml.cluster import KMeans
    import bigframes.pandas as bpd

    # Load data from BigQuery
    query_or_table = "bigquery-public-data.ml_datasets.penguins"
    bq_df = bpd.read_gbq(query_or_table)

    # Create the KMeans model
    cluster_model = KMeans(n_clusters=10)
    cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

    # Predict using the model
    result = cluster_model.predict(bq_df)

    # Score the model
    score = cluster_model.score(bq_df)

Decomposition models

You can create estimators for decomposition models by using the bigframes.ml.decomposition module. To create principal component analysis (PCA) models, use the PCA class. Use these models to compute principal components and to perform a change of basis on the data with them. The PCA class provides dimensionality reduction by projecting each data point onto only the first few principal components, which yields lower-dimensional data while preserving as much of the data's variation as possible.
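To make the projection step concrete, here is a minimal plain-Python sketch that finds the first principal component of two-dimensional points by power iteration and projects each point onto it. It is an illustration of the idea only, not how the PCA class is implemented; the function name and sample points are hypothetical.

```python
import math


def first_principal_component(points, iterations=100):
    """Return the leading unit eigenvector of the 2x2 covariance matrix
    and the one-dimensional projection of each centered point onto it."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the symmetric covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n

    # Power iteration: repeatedly multiply by the matrix and renormalize.
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        nx, ny = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(nx, ny)
        vx, vy = nx / norm, ny / norm

    projected = [x * vx + y * vy for x, y in centered]
    return (vx, vy), projected


# Hypothetical points lying on the line y = x.
component, projected = first_principal_component(
    [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
)
```

Because the sample points lie on a single line, one component captures all of their variation, so the two-dimensional data reduces to one dimension without loss. The sketch assumes the data varies along the x axis; otherwise the starting vector would need to change.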

Ensemble models

You can create estimators for ensemble models by using the bigframes.ml.ensemble module.

  • To create random forest classifier models, use the RandomForestClassifier class. These models construct multiple decision trees as their learning method for classification.

  • To create random forest regression models, use the RandomForestRegressor class. These models construct multiple decision trees as their learning method for regression.

  • To create gradient boosted tree classifier models, use the XGBClassifier class. These models construct multiple decision trees additively for classification.

  • To create gradient boosted tree regression models, use the XGBRegressor class. These models construct multiple decision trees additively for regression.

Forecasting models

You can create estimators for forecasting models by using the bigframes.ml.forecasting module. To create time series forecasting models, use the ARIMAPlus class.

Imported models

You can create estimators for imported models by using the bigframes.ml.imported module.

Linear models

You can create estimators for linear models by using the bigframes.ml.linear_model module.

  • To create linear regression models, use the LinearRegression class. Use these models for forecasting, such as forecasting the sales of an item on a given day.

  • To create logistic regression models, use the LogisticRegression class. Use these models to classify inputs into two or more possible values, such as whether an input is low-value, medium-value, or high-value.

The following code sample shows how to use bigframes.ml to load data from BigQuery, prepare training data, and then train, score, and predict with a linear regression model:

    from bigframes.ml.linear_model import LinearRegression
    import bigframes.pandas as bpd

    # Load data from BigQuery
    query_or_table = "bigquery-public-data.ml_datasets.penguins"
    bq_df = bpd.read_gbq(query_or_table)

    # Filter the data down to the Adelie Penguin species
    adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

    # Drop the species column
    adelie_data = adelie_data.drop(columns=["species"])

    # Drop rows with nulls to get training data
    training_data = adelie_data.dropna()

    # Specify your feature (or input) columns and the label (or output) column:
    feature_columns = training_data[
        ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
    ]
    label_columns = training_data[["body_mass_g"]]

    test_data = adelie_data[adelie_data.body_mass_g.isnull()]

    # Create the linear model
    model = LinearRegression()
    model.fit(feature_columns, label_columns)

    # Score the model
    score = model.score(feature_columns, label_columns)

    # Predict using the model
    result = model.predict(test_data)

Large language models

You can create estimators for LLMs by using the bigframes.ml.llm module.

The following code sample shows how to use the GeminiTextGenerator class in the bigframes.ml.llm module to create a Gemini model for code generation:

    from bigframes.ml.llm import GeminiTextGenerator
    import bigframes.pandas as bpd

    # Create the Gemini LLM model
    session = bpd.get_global_session()
    connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
    model = GeminiTextGenerator(
        session=session, connection_name=connection, model_name="gemini-2.0-flash-001"
    )

    df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

    # Prepare the prompts and send them to the LLM model for prediction
    df_prompt_prefix = "Generate Pandas sample code for DataFrame."
    df_prompt = df_prompt_prefix + df_api["API"]

    # Predict using the model
    df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)

Remote models

To use BigQuery DataFrames ML remote models (bigframes.ml.remote or bigframes.ml.llm), you must enable the following APIs:

When you use BigQuery DataFrames ML remote models, you need the Project IAM Admin role (roles/resourcemanager.projectIamAdmin) if you use a default BigQuery connection, or the Browser role (roles/browser) if you use a pre-configured connection. You can avoid this requirement by setting the bigframes.pandas.options.bigquery.skip_bq_connection_check option to True, in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you use the pre-configured connection and skip the connection check, verify the following:

  • The connection is created in the right location.
  • If you use BigQuery DataFrames ML remote models, the service account has the Vertex AI User role (roles/aiplatform.user) on the project.

Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection named bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User role (roles/aiplatform.user) on the project.

Create pipelines

You can create ML pipelines by using the bigframes.ml.pipeline module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and lets you deploy data preprocessing steps and an estimator together.

To create a pipeline of transforms with a final estimator, use the Pipeline class.
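The pattern of transforms followed by a final estimator can be sketched with a toy stand-in written in plain Python. Every class here (MinMaxStep, ThresholdEstimator, ToyPipeline) is hypothetical and exists only to show how fitting and prediction flow through the steps; none of it is the bigframes.ml.pipeline API.

```python
class MinMaxStep:
    """Toy transformer: learns min and max in fit, rescales to [0, 1] in transform."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.lo) / (self.hi - self.lo) for v in values]


class ThresholdEstimator:
    """Toy final estimator: predicts 1 when a value exceeds the training mean."""

    def fit(self, values, labels=None):
        self.cutoff = sum(values) / len(values)
        return self

    def predict(self, values):
        return [1 if v > self.cutoff else 0 for v in values]


class ToyPipeline:
    """Chains transformers and a final estimator given as (name, object) pairs."""

    def __init__(self, steps):
        objects = [step for _, step in steps]
        self.transformers = objects[:-1]
        self.estimator = objects[-1]

    def fit(self, values, labels=None):
        # Each transformer is fitted on the output of the previous one.
        for transformer in self.transformers:
            values = transformer.fit(values).transform(values)
        self.estimator.fit(values, labels)
        return self

    def predict(self, values):
        # New data passes through the already fitted transformers unchanged in order.
        for transformer in self.transformers:
            values = transformer.transform(values)
        return self.estimator.predict(values)


pipeline = ToyPipeline([("scale", MinMaxStep()), ("model", ThresholdEstimator())])
pipeline.fit([10.0, 20.0, 30.0, 40.0])
predictions = pipeline.predict([15.0, 35.0])
```

Bundling the steps this way means the scaling parameters learned during training are reused at prediction time, which is the main reason to deploy preprocessing and the estimator together.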

Select models

To split your training and testing datasets and select the best models, use the bigframes.ml.model_selection module:

  • To split the data into training and testing (evaluation) sets, as shown in the following code sample, use the train_test_split function:

      from bigframes.ml.model_selection import train_test_split

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
  • To create multi-fold training and testing sets to train and evaluate models, as shown in the following code sample, use the KFold class and the KFold.split method. This feature is valuable for small datasets.

      from bigframes.ml.model_selection import KFold

      kf = KFold(n_splits=5)
      for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
          # Train and evaluate models with training and testing sets
          pass
    
  • To automatically create multi-fold training and testing sets, train and evaluate the model, and get the result of each fold, as shown in the following code sample, use the cross_validate function:

      from bigframes.ml.model_selection import cross_validate

      scores = cross_validate(model, X, y, cv=5)
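To show what a multi-fold split means, the following plain-Python sketch partitions row positions into folds so that each row is used for testing exactly once. It is a conceptual illustration under the assumption of contiguous, unshuffled folds; the helper name k_fold_indices is hypothetical, and the way the bigframes.ml.model_selection module assigns rows to folds may differ.

```python
def k_fold_indices(n_rows, n_splits):
    """Yield (train_indices, test_indices) pairs with near-equal test folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_indices = list(range(start, start + size))
        train_indices = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_indices, test_indices
        start += size


# Ten rows split into five folds: each fold tests on two rows and trains on eight.
folds = list(k_fold_indices(10, 5))
```

With five folds, each model trains on 80% of the rows and is evaluated on the remaining 20%, the same proportions as a single split with test_size=0.2, but repeated so that every row contributes to evaluation.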
     
    

What's next
