Deploy a model on Vertex AI to get inferences

After you train a model on a Ray cluster on Vertex AI, deploy the model for online inference requests by using the process in the following sections.

Before you begin, make sure to read the Ray on Vertex AI overview and set up all the prerequisite tools you need.

The steps in this section assume that you use the Ray on Vertex AI SDK in an interactive Python environment.

Vertex AI online inference and Ray inference compared

Feature | Vertex AI online inference (Recommended) | Ray Inference (Ray Serve)
Scalability | Autoscaling based on traffic (highly scalable even for LLM models) | Highly scalable with distributed backends and custom resource management
Infrastructure management | Fully managed by Google Cloud, with less operational overhead | Requires more manual setup and management on your infrastructure or Kubernetes cluster
API / supported features | REST and gRPC APIs; online and batch inferences; explainability features; batching, caching, streaming | REST and gRPC APIs; real-time and batch inference; model composition; batching, caching, streaming
Model format | Supports frameworks such as TensorFlow, PyTorch, scikit-learn, and XGBoost using prebuilt containers or any custom container | Supports frameworks such as TensorFlow, PyTorch, and scikit-learn
Ease of use | Easier to set up and manage; integrated with other Vertex AI features | More flexible and customizable, but requires deeper knowledge of Ray
Cost | Depends on machine types, accelerators, and number of replicas | Depends on your infrastructure choices
Specialized features | Model monitoring, A/B testing, traffic splitting, Vertex AI Model Registry and Vertex AI Pipelines integration | Advanced model composition, ensemble models, custom inference logic, integration with the Ray ecosystem

Import and initialize Ray on Vertex AI client

If you're already connected to your Ray cluster on Vertex AI, restart your kernel and run the following code. The runtime_env variable is necessary at connection time to run online inference commands.

    import ray
    import vertexai

    # The CLUSTER_RESOURCE_NAME is the one returned from vertex_ray.create_ray_cluster.
    address = 'vertex_ray://{}'.format(CLUSTER_RESOURCE_NAME)

    # Initialize Vertex AI to retrieve projects for downstream operations.
    vertexai.init(staging_bucket=BUCKET_URI)

    # Shutdown cluster and reconnect with required dependencies in the runtime_env.
    ray.shutdown()

Where:

  • CLUSTER_RESOURCE_NAME: The full resource name of the Ray on Vertex AI cluster, which must be unique across your project.

  • BUCKET_URI: The Cloud Storage bucket used to store the model artifacts.
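
After the shutdown, reconnect to the cluster and pass the dependencies you need through runtime_env. The following is a minimal sketch, assuming the same address variable and a pip list like the ones used in the framework examples that follow:

    # Reconnect to the cluster; runtime_env installs the dependencies that the
    # online inference code needs on the cluster.
    runtime_env = {
        "pip": [
            "ray==2.47.1",  # pin the Ray version to prevent it from being overwritten
            "IPython",
            "numpy",
        ],
    }
    ray.init(address=address, runtime_env=runtime_env)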

Train and export the model to Vertex AI Model Registry

Export the model from the Ray checkpoint and upload it to the Vertex AI Model Registry.

TensorFlow

    import numpy as np
    from ray.air import session, CheckpointConfig, ScalingConfig
    from ray.air.config import RunConfig
    from ray.train import SyncConfig
    from ray.train.tensorflow import TensorflowCheckpoint, TensorflowTrainer
    from ray import train
    import tensorflow as tf

    from vertex_ray.predict import tensorflow

    # Required dependencies at runtime
    runtime_env = {
        "pip": [
            "ray==2.47.1",  # pin the Ray version to prevent it from being overwritten
            "tensorflow",
            "IPython",
            "numpy",
        ],
    }

    # Initialize Ray on Vertex AI client for remote cluster connection
    ray.init(address=address, runtime_env=runtime_env)

    # Define a TensorFlow model.
    def create_model():
        model = tf.keras.Sequential(
            [tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))])
        model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"])
        return model

    def train_func(config):
        n = 100
        # Create a fake dataset
        # data   : X - dim = (n, 4)
        # target : Y - dim = (n, 1)
        X = np.random.normal(0, 1, size=(n, 4))
        Y = np.random.uniform(0, 1, size=(n, 1))

        strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = create_model()
            print(model)

        for epoch in range(config["num_epochs"]):
            model.fit(X, Y, batch_size=20)

        tf.saved_model.save(model, "temp/my_model")
        checkpoint = TensorflowCheckpoint.from_saved_model("temp/my_model")
        train.report({}, checkpoint=checkpoint)

    trainer = TensorflowTrainer(
        train_func,
        train_loop_config={"num_epochs": 5},
        scaling_config=ScalingConfig(num_workers=1),
        run_config=RunConfig(
            storage_path=f'{BUCKET_URI}/ray_results/tensorflow',
            checkpoint_config=CheckpointConfig(
                num_to_keep=1  # Keep only the latest checkpoint.
            ),
            sync_config=SyncConfig(
                sync_artifacts=True,
            ),
        ),
    )

    # Train the model.
    result = trainer.fit()

    # Register the trained model to Vertex AI Model Registry.
    vertex_model = tensorflow.register_tensorflow(
        result.checkpoint,
    )

sklearn

    from vertex_ray.predict import sklearn
    from ray.train.sklearn import SklearnCheckpoint

    # result is the output of a prior SklearnTrainer fit() call (not shown here).
    vertex_model = sklearn.register_sklearn(
        result.checkpoint,
    )

XGBoost

    import pandas as pd
    from ray.air import CheckpointConfig, ScalingConfig
    from ray.air.config import RunConfig
    from ray.train import SyncConfig
    from ray.train.xgboost import XGBoostTrainer

    from vertex_ray.predict import xgboost

    # Initialize Ray on Vertex AI client for remote cluster connection.
    # runtime_env is defined as in the TensorFlow example; include the packages
    # your trainer needs (for example, "xgboost") in its pip list.
    ray.init(address=address, runtime_env=runtime_env)

    # Define an XGBoost model.
    train_dataset = ray.data.from_pandas(
        pd.DataFrame([{"x": x, "y": x + 1} for x in range(32)]))

    run_config = RunConfig(
        storage_path=f'{BUCKET_URI}/ray_results/xgboost',
        checkpoint_config=CheckpointConfig(
            num_to_keep=1  # Keep only the latest checkpoint.
        ),
        sync_config=SyncConfig(sync_artifacts=True),
    )

    trainer = XGBoostTrainer(
        label_column="y",
        params={"objective": "reg:squarederror"},
        scaling_config=ScalingConfig(num_workers=3),
        datasets={"train": train_dataset},
        run_config=run_config,
    )

    # Train the model.
    result = trainer.fit()

    # Register the trained model to Vertex AI Model Registry.
    vertex_model = xgboost.register_xgboost(
        result.checkpoint,
    )

PyTorch

  • Convert the Ray checkpoint to a PyTorch model.

  • Build model.mar.

  • Create a LocalModel using model.mar.

  • Upload the model to Vertex AI Model Registry, as sketched below.
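
The following is a minimal sketch of that flow, not a definitive implementation: the checkpoint file name, handler, model and bucket paths, and serving container image are illustrative assumptions, and result is assumed to be the output of a TorchTrainer fit() call. It also assumes the torch-model-archiver CLI and the Vertex AI SDK are installed.

    import os
    import subprocess

    import torch
    from google.cloud import aiplatform
    from google.cloud.aiplatform.prediction import LocalModel

    # 1. Convert the Ray checkpoint to a local PyTorch model file.
    #    The file name inside the checkpoint ("model.pt") depends on how the
    #    checkpoint was reported in train_func; it is an assumption here.
    with result.checkpoint.as_directory() as checkpoint_dir:
        model = torch.load(os.path.join(checkpoint_dir, "model.pt"))
    torch.save(model, "model.pt")

    # 2. Build model.mar with TorchServe's model archiver (illustrative names).
    subprocess.run(
        [
            "torch-model-archiver",
            "--model-name", "my_model",
            "--version", "1.0",
            "--serialized-file", "model.pt",
            "--handler", "base_handler",  # replace with your custom handler if needed
            "--export-path", "model_artifacts",
        ],
        check=True,
    )

    # 3. Create a LocalModel that points to a TorchServe-based serving container.
    #    The image URI is an assumption; use the prebuilt PyTorch prediction
    #    container that matches your framework version.
    local_model = LocalModel(
        serving_container_image_uri=(
            "us-docker.pkg.dev/vertex-ai/prediction/pytorch-cpu.2-0:latest"
        ),
    )

    # 4. Upload to Vertex AI Model Registry. artifact_uri must be a Cloud Storage
    #    path that contains model.mar, so copy the archive there first.
    vertex_model = aiplatform.Model.upload(
        local_model=local_model,
        display_name="my-pytorch-model",
        artifact_uri=f"{BUCKET_URI}/model_artifacts",
    )

From here, the deployment steps in the next section apply unchanged, because vertex_model is a registered Vertex AI model.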

Deploy the model for online inferences

Deploy the model to an online endpoint. For more information, see Deploy the model to an endpoint.

    DEPLOYED_NAME = vertex_model.display_name + "-endpoint"
    TRAFFIC_SPLIT = {"0": 100}
    MACHINE_TYPE = "n1-standard-4"

    endpoint = vertex_model.deploy(
        deployed_model_display_name=DEPLOYED_NAME,
        traffic_split=TRAFFIC_SPLIT,
        machine_type=MACHINE_TYPE,
    )

Where:

  • (Optional) DEPLOYED_NAME: The display name of the deployed model. If you don't provide a name upon creation, the system uses the model's display_name.

  • (Optional) TRAFFIC_SPLIT: A map from a deployed model's ID to the percentage of this endpoint's traffic that should be forwarded to that deployed model. If a deployed model's ID isn't listed in this map, then it receives no traffic. The traffic percentage values must add up to 100, or the map must be empty if the endpoint doesn't accept any traffic at the moment. The key for the model being deployed is "0". For example, {"0": 100}.

  • (Optional) MACHINE_TYPE: Specify the compute resources.
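
To confirm the deployment and the traffic split that the endpoint is using, you can inspect the endpoint with the Vertex AI SDK. This is a minimal sketch; the fields shown are standard SDK attributes rather than anything specific to Ray on Vertex AI:

    # List the models deployed to the endpoint and the current traffic split.
    for deployed_model in endpoint.list_models():
        print(deployed_model.id, deployed_model.display_name)

    print(endpoint.traffic_split)  # for example: {'3829557218101952512': 100}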

Make an inference request

Send an inference request to the endpoint. For more information, see Get online inferences from a custom trained model.

    pred_request = [
        [1.7076793, 0.23412449, 0.95170785, -0.10901471],
        [-0.81881499, 0.43874669, -0.25108584, 1.75536031],
    ]

    endpoint.predict(pred_request)

You should get output like the following:

    Prediction(predictions=[0.7891440987586975, 0.5843208432197571],
     deployed_model_id='3829557218101952512',
     model_version_id='1',
     model_resource_name='projects/123456789/locations/us-central1/models/123456789101112',
     explanations=None)

What's next
