Apache Beam RunInference for scikit-learn

This notebook demonstrates the use of the RunInference transform for scikit-learn , also called sklearn. Apache Beam RunInference has implementations of the ModelHandler class prebuilt for scikit-learn. For more information about using RunInference, see Get started with AI/ML pipelines in the Apache Beam documentation.

You can choose the appropriate model handler based on your input data type:

With RunInference, these model handlers manage batching, vectorization, and prediction optimization for your scikit-learn pipeline or model.

This notebook demonstrates the following common RunInference patterns:

Generate predictions.
Postprocess results after RunInference.
Run inference with multiple models in the same pipeline.

The linear regression models used in these samples are trained on data that correspondes to the 5 and 10 times tables; that is, y = 5x and y = 10x respectively.

Before you begin

Complete the following setup steps:

Install dependencies for Apache Beam.
Authenticate with Google Cloud.
Specify your project and bucket. You use the project and bucket to save and load models.

 pip  
install  
google-api-core  
--quiet 
 pip  
install  
google-cloud-pubsub  
google-cloud-bigquery-storage  
--quiet 
 pip  
install  
apache-beam [ 
gcp,dataframe ] 
  
--quiet

About scikit-learn versions

scikit-learn is a build-dependency of Apache Beam. If you need to install a different version of sklearn , use %pip install scikit-learn==<version>

  from 
  
 google.colab 
  
 import 
 auth 
 auth 
 . 
 authenticate_user 
 ()

  import 
  
 pickle 
 from 
  
 sklearn 
  
 import 
 linear_model 
 from 
  
 typing 
  
 import 
 Tuple 
 import 
  
 numpy 
  
 as 
  
 np 
 import 
  
 apache_beam 
  
 as 
  
 beam 
 from 
  
 apache_beam.ml.inference.sklearn_inference 
  
 import 
 ModelFileType 
 from 
  
 apache_beam.ml.inference.sklearn_inference 
  
 import 
 SklearnModelHandlerNumpy 
 from 
  
 apache_beam.ml.inference.base 
  
 import 
 KeyedModelHandler 
 from 
  
 apache_beam.ml.inference.base 
  
 import 
 PredictionResult 
 from 
  
 apache_beam.ml.inference.base 
  
 import 
 RunInference 
 from 
  
 apache_beam.options.pipeline_options 
  
 import 
 PipelineOptions 
 # NOTE: If an error occurs, restart your runtime.

  import 
  
 os 
 # Constants 
 project 
 = 
 "<PROJECT_ID>" 
 # @param {type:'string'} 
 bucket 
 = 
 "<BUCKET_NAME>" 
 # @param {type:'string'} 
 # To avoid warnings, set the project. 
 os 
 . 
 environ 
 [ 
 'GOOGLE_CLOUD_PROJECT' 
 ] 
 = 
 project

Create the data and the scikit-learn model

This section demonstrates the following steps:

Create the data to train the scikit-learn linear regression model.
Train the linear regression model.
Save the scikit-learn model using pickle .

In this example, you create two models, one with the 5 times model and a second with the 10 times model.

  # Input data to train the sklearn model for the 5 times table. 
 x 
 = 
 np 
 . 
 arange 
 ( 
 0 
 , 
 100 
 , 
 dtype 
 = 
 np 
 . 
 float32 
 ) 
 . 
 reshape 
 ( 
 - 
 1 
 , 
 1 
 ) 
 y 
 = 
 ( 
 x 
 * 
 5 
 ) 
 . 
 reshape 
 ( 
 - 
 1 
 , 
 1 
 ) 
 def 
  
 train_and_save_model 
 ( 
 x 
 , 
 y 
 , 
 model_file_name 
 ): 
 regression 
 = 
 linear_model 
 . 
 LinearRegression 
 () 
 regression 
 . 
 fit 
 ( 
 x 
 , 
 y 
 ) 
 with 
 open 
 ( 
 model_file_name 
 , 
 'wb' 
 ) 
 as 
 f 
 : 
 pickle 
 . 
 dump 
 ( 
 regression 
 , 
 f 
 ) 
 five_times_model_filename 
 = 
 'sklearn_5x_model.pkl' 
 train_and_save_model 
 ( 
 x 
 , 
 y 
 , 
 five_times_model_filename 
 ) 
 # Change y to be 10 times, and output a 10 times table. 
 ten_times_model_filename 
 = 
 'sklearn_10x_model.pkl' 
 train_and_save_model 
 ( 
 x 
 , 
 y 
 , 
 ten_times_model_filename 
 ) 
 y 
 = 
 ( 
 x 
 * 
 10 
 ) 
 . 
 reshape 
 ( 
 - 
 1 
 , 
 1 
 ) 
 train_and_save_model 
 ( 
 x 
 , 
 y 
 , 
 'sklearn_10x_model.pkl' 
 )

Create a scikit-learn RunInference pipeline

This section demonstrates how to do the following:

Define a scikit-learn model handler that accepts an array_like object as input.
Read the data from BigQuery.
Use the scikit-learn trained model and the scikit-learn RunInference transform on unkeyed data.

  % 
 pip 
 install 
 -- 
 upgrade 
 google 
 - 
 cloud 
 - 
 bigquery 
 -- 
 quiet

 gcloud  
config  
 set 
  
project  
 $project

Updated property [core/project].

  # Populated BigQuery table 
 from 
  
 google.cloud 
  
 import 
  bigquery 
 
 client 
 = 
  bigquery 
 
 . 
  Client 
 
 ( 
 project 
 = 
 project 
 ) 
 # Make sure the dataset_id is unique in your project. 
 dataset_id 
 = 
 ' 
 {project} 
 .maths' 
 . 
 format 
 ( 
 project 
 = 
 project 
 ) 
 dataset 
 = 
  bigquery 
 
 . 
  Dataset 
 
 ( 
 dataset_id 
 ) 
 # Modify the location based on your project configuration. 
 dataset 
 . 
 location 
 = 
 'US' 
 dataset 
 = 
 client 
 . 
  create_dataset 
 
 ( 
 dataset 
 , 
 exists_ok 
 = 
 True 
 ) 
 # Table name in the BigQuery dataset. 
 table_name 
 = 
 'maths_problems_1' 
 query 
 = 
 """ 
 CREATE OR REPLACE TABLE 
  
 {project} 
 .maths. 
 {table} 
 ( key STRING OPTIONS(description="A unique key for the maths problem"), 
 value FLOAT64 OPTIONS(description="Our maths problem" ) ); 
 INSERT INTO maths. 
 {table} 
 VALUES 
 ("first_example", 105.00), 
 ("second_example", 108.00), 
 ("third_example", 1000.00), 
 ("fourth_example", 1013.00) 
 """ 
 . 
 format 
 ( 
 project 
 = 
 project 
 , 
 table 
 = 
 table_name 
 ) 
 create_job 
 = 
 client 
 . 
  query 
 
 ( 
 query 
 ) 
  create_job 
 
 . 
 result 
 ()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f97abb4e850>

  sklearn_model_handler 
 = 
 SklearnModelHandlerNumpy 
 ( 
 model_uri 
 = 
 five_times_model_filename 
 ) 
 pipeline_options 
 = 
 PipelineOptions 
 () 
 . 
 from_dictionary 
 ( 
 { 
 'temp_location' 
 : 
 f 
 'gs:// 
 { 
 bucket 
 } 
 /tmp' 
 }) 
 # Define the BigQuery table specification. 
 table_name 
 = 
 'maths_problems_1' 
 table_spec 
 = 
 f 
 ' 
 { 
 project 
 } 
 :maths. 
 { 
 table_name 
 } 
 ' 
 with 
 beam 
 . 
 Pipeline 
 ( 
 options 
 = 
 pipeline_options 
 ) 
 as 
 p 
 : 
 ( 
 p 
 | 
 "ReadFromBQ" 
>> beam 
 . 
 io 
 . 
 ReadFromBigQuery 
 ( 
 table 
 = 
 table_spec 
 ) 
 | 
 "ExtractInputs" 
>> beam 
 . 
 Map 
 ( 
 lambda 
 x 
 : 
 [ 
 x 
 [ 
 'value' 
 ]]) 
 | 
 "RunInferenceSklearn" 
>> RunInference 
 ( 
 model_handler 
 = 
 sklearn_model_handler 
 ) 
 | 
 beam 
 . 
 Map 
 ( 
 print 
 ) 
 )

PredictionResult(example=[1000.0], inference=array([5000.]))
PredictionResult(example=[1013.0], inference=array([5065.]))
PredictionResult(example=[108.0], inference=array([540.]))
PredictionResult(example=[105.0], inference=array([525.]))

Use sklearn RunInference on keyed inputs