Resource: Endpoint
Models are deployed into an Endpoint, which is then called to obtain predictions and explanations.
name 
 
  string 
 
Output only. The resource name of the Endpoint.
displayName 
 
  string 
 
Required. The display name of the Endpoint. The name can be up to 128 characters long and can consist of any UTF-8 characters.
description 
 
  string 
 
The description of the Endpoint.
deployedModels[] 
 
  object (DeployedModel) 
 
Output only. The models deployed in this Endpoint. To add or remove DeployedModels, use EndpointService.DeployModel and EndpointService.UndeployModel respectively.
trafficSplit 
 
  map (key: string, value: integer) 
 
A map from a DeployedModel's id to the percentage of this Endpoint's traffic that should be forwarded to that DeployedModel.
If a DeployedModel's id is not listed in this map, then it receives no traffic.
The traffic percentage values must add up to 100, or the map must be empty if the Endpoint is not to accept any traffic at the moment.
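The trafficSplit rules above can be checked client-side before sending an update. The following is a minimal sketch, not part of the API: the function names `traffic_split_errors` and `traffic_share` are hypothetical, and the per-value range check (0 to 100) is an assumption that goes beyond what the field description states.

```python
from typing import Dict, List


def traffic_split_errors(traffic_split: Dict[str, int]) -> List[str]:
    """Return a list of problems found in a trafficSplit map (empty if none)."""
    errors: List[str] = []
    if not traffic_split:
        # An empty map is allowed: the Endpoint then accepts no traffic.
        return errors
    for model_id, percent in traffic_split.items():
        # Assumption: each individual percentage lies between 0 and 100.
        if not 0 <= percent <= 100:
            errors.append(f"DeployedModel {model_id}: {percent} is out of range")
    total = sum(traffic_split.values())
    if total != 100:
        errors.append(f"percentages add up to {total}, expected 100")
    return errors


def traffic_share(traffic_split: Dict[str, int], deployed_model_id: str) -> int:
    """A DeployedModel id that is not listed in the map receives no traffic."""
    return traffic_split.get(deployed_model_id, 0)
```

For example, `traffic_split_errors({"1001": 70, "1002": 30})` returns an empty list, while a map whose values add up to 70 yields one error.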
etag 
 
  string 
 
Used to perform consistent read-modify-write updates. If not set, a blind "overwrite" update happens.
labels 
 
  map (key: string, value: string) 
 
Labels with user-defined metadata to organize your Endpoints.
Label keys and values can be no longer than 64 characters (Unicode codepoints) and can contain only lowercase letters, numeric characters, underscores, and dashes. International characters are allowed.
See https://goo.gl/xmQnxf for more information and examples of labels.
createTime 
 
  string (Timestamp format) 
 
Output only. Timestamp when this Endpoint was created.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
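Timestamp fields such as createTime can carry up to nine fractional digits, which is more precision than Python's `datetime` stores. The sketch below shows one way to parse the three example forms given above. The helper name `parse_timestamp` is hypothetical, and truncating sub-microsecond digits is a simplification chosen for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# Date, time, optional fraction (1 to 9 digits), then "Z" or a numeric offset.
_RFC3339 = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})[Tt](\d{2}):(\d{2}):(\d{2})"
    r"(?:\.(\d{1,9}))?"
    r"([Zz]|[+-]\d{2}:\d{2})"
)


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp and return it normalized to UTC."""
    match = _RFC3339.fullmatch(value)
    if match is None:
        raise ValueError(f"not an RFC 3339 timestamp: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    # Pad the fraction to nanoseconds, then keep only the microsecond part.
    fraction = (match.group(7) or "").ljust(9, "0")
    microsecond = int(fraction[:6])
    offset = match.group(8)
    if offset in ("Z", "z"):
        tz = timezone.utc
    else:
        sign = 1 if offset[0] == "+" else -1
        delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6]))
        tz = timezone(sign * delta)
    parsed = datetime(year, month, day, hour, minute, second, microsecond, tzinfo=tz)
    return parsed.astimezone(timezone.utc)
```

If nanosecond precision matters, keep the raw string alongside the parsed value rather than relying on the truncated `datetime`.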
updateTime 
 
  string (Timestamp format) 
 
Output only. Timestamp when this Endpoint was last updated.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
encryptionSpec 
 
  object (EncryptionSpec) 
 
Customer-managed encryption key spec for an Endpoint. If set, this Endpoint and all sub-resources of this Endpoint will be secured by this key.
network 
 
  string 
 
Optional. The full name of the Google Compute Engine network to which the Endpoint should be peered.
Private services access must already be configured for the network. If left unspecified, the Endpoint is not peered with any network.
Only one of the fields, network or enablePrivateServiceConnect, can be set.
Format: projects/{project}/global/networks/{network}, where {project} is a project number, as in 12345, and {network} is a network name.
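The network name format and the mutual-exclusion rule described above lend themselves to a quick local check. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the Endpoint is modeled as a plain dict keyed by the JSON field names, and the pattern for {network} (any run of characters without a slash) is looser than the real naming rules for Compute Engine networks.

```python
import re
from typing import Dict, Optional

# {project} must be a project number, so only digits are accepted there.
_NETWORK_NAME = re.compile(
    r"projects/(?P<project>\d+)/global/networks/(?P<network>[^/]+)"
)


def parse_network_name(name: str) -> Optional[Dict[str, str]]:
    """Split a full network name into its parts, or return None if malformed."""
    match = _NETWORK_NAME.fullmatch(name)
    return match.groupdict() if match else None


def network_fields_conflict(endpoint: dict) -> bool:
    """True if both network and enablePrivateServiceConnect are set."""
    return bool(endpoint.get("network")) and bool(
        endpoint.get("enablePrivateServiceConnect")
    )
```

A name that uses a project ID instead of a project number, such as `projects/my-project/global/networks/my-vpc`, is rejected by this check.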
enablePrivateServiceConnect (deprecated) 
 
  boolean 
 
Deprecated: If true, expose the Endpoint via Private Service Connect.
Only one of the fields, network or enablePrivateServiceConnect, can be set.
privateServiceConnectConfig 
 
  object (PrivateServiceConnectConfig) 
 
Optional. Configuration for Private Service Connect.
network and privateServiceConnectConfig are mutually exclusive.
modelDeploymentMonitoringJob 
 
  string 
 
Output only. Resource name of the Model Monitoring job associated with this Endpoint if monitoring is enabled by JobService.CreateModelDeploymentMonitoringJob. Format: projects/{project}/locations/{location}/modelDeploymentMonitoringJobs/{modelDeploymentMonitoringJob} 
predictRequestResponseLoggingConfig 
 
  object (PredictRequestResponseLoggingConfig) 
 
Configures the request-response logging for online prediction.
dedicatedEndpointEnabled 
 
  boolean 
 
If true, the Endpoint is exposed through a dedicated DNS name [Endpoint.dedicated_endpoint_dns]. Requests to the dedicated DNS are isolated from other users' traffic and have better performance and reliability. Note: Once you enable the dedicated endpoint, you can no longer send requests to the shared DNS {region}-aiplatform.googleapis.com. This limitation will be removed soon.
dedicatedEndpointDns 
 
  string 
 
Output only. DNS of the dedicated endpoint. Will only be populated if dedicatedEndpointEnabled is true. Depending on the features enabled, uid might be a random number or a string. For example, if fast_tryout is enabled, uid will be fasttryout. Format: https://{endpointId}.{region}-{uid}.prediction.vertexai.goog.
clientConnectionConfig 
 
  object (ClientConnectionConfig) 
 
Configurations that are applied to the endpoint for online prediction.
satisfiesPzs 
 
  boolean 
 
Output only. Reserved for future use.
satisfiesPzi 
 
  boolean 
 
Output only. Reserved for future use.
genAiAdvancedFeaturesConfig 
 
  object (GenAiAdvancedFeaturesConfig) 
 
Optional. Configuration for GenAiAdvancedFeatures. If the endpoint is serving GenAI models, advanced features like native RAG integration can be configured. Currently, only Model Garden models are supported.
| JSON representation | 
|---|
| { "name" : string , "displayName" : string , "description" : string , "deployedModels" : [ { object ( | 
DeployedModel
A deployment of a Model. Endpoints contain one or more DeployedModels.
id 
 
  string 
 
Immutable. The id of the DeployedModel. If not provided upon deployment, Vertex AI will generate a value for this id.
This value should be 1-10 characters, and valid characters are /[0-9]/.
model 
 
  string 
 
The resource name of the Model that this is the deployment of. Note that the Model may be in a different location than the DeployedModel's Endpoint.
The resource name may contain a version id or version alias to specify the version. Example: projects/{project}/locations/{location}/models/{model}@2 or projects/{project}/locations/{location}/models/{model}@golden. If no version is specified, the default version will be deployed.
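To illustrate the versioned resource name described above, the sketch below separates the base Model name from an optional version id or alias after the `@`. The function name `split_model_version` is hypothetical, and the character classes in the pattern are assumptions made for this example rather than the service's actual validation rules.

```python
import re
from typing import Optional, Tuple

_MODEL_NAME = re.compile(
    r"(?P<base>projects/[^/]+/locations/[^/]+/models/[^/@]+)"
    r"(?:@(?P<version>[^/@]+))?"
)


def split_model_version(name: str) -> Tuple[str, Optional[str]]:
    """Return (base resource name, version id or alias or None)."""
    match = _MODEL_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a Model resource name: {name!r}")
    # A missing version means the default version would be deployed.
    return match.group("base"), match.group("version")
```

With this helper, a name ending in `@2` yields the version `"2"`, a name ending in `@golden` yields the alias `"golden"`, and a name without `@` yields `None`.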
gdcConnectedModel 
 
  string 
 
GDC pretrained / Gemini model name. The model name is a plain model name, e.g. gemini-1.5-flash-002.
modelVersionId 
 
  string 
 
Output only. The version id of the model that is deployed.
displayName 
 
  string 
 
The display name of the DeployedModel. If not provided upon creation, the Model's displayName is used.
createTime 
 
  string (Timestamp format) 
 
Output only. Timestamp when the DeployedModel was created.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
explanationSpec 
 
  object (ExplanationSpec) 
 
Explanation configuration for this DeployedModel.
When deploying a Model using EndpointService.DeployModel, this value overrides the value of Model.explanation_spec. All fields of explanationSpec are optional in the request. If a field of explanationSpec is not populated, the value of the same field of Model.explanation_spec is inherited. If the corresponding Model.explanation_spec is not populated, all fields of the explanationSpec will be used for the explanation configuration.
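The override-and-inherit behavior described above amounts to a field-by-field merge. The sketch below models it on plain dicts. It is only an illustration: the function name is hypothetical, "not populated" is assumed to mean a missing key or a `None` value, and the merge is applied to top-level fields only, which may not match how the service treats nested messages.

```python
from typing import Optional


def effective_explanation_spec(
    model_spec: Optional[dict], deployed_spec: Optional[dict]
) -> dict:
    """Combine Model.explanation_spec with a DeployedModel's explanationSpec."""
    # Start from the Model's spec so unpopulated fields are inherited.
    merged = dict(model_spec or {})
    for field, value in (deployed_spec or {}).items():
        if value is not None:
            # A populated field on the DeployedModel overrides the Model's.
            merged[field] = value
    return merged
```

For example, if the Model's spec populates both `parameters` and `metadata` and the request populates only `parameters`, the result takes `parameters` from the request and `metadata` from the Model.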
disableExplanations 
 
  boolean 
 
If true, deploy the model without the explanation feature, regardless of whether Model.explanation_spec or explanationSpec is set.
serviceAccount 
 
  string 
 
The service account that the DeployedModel's container runs as. Specify the email address of the service account. If this service account is not specified, the container runs as a service account that doesn't have access to the resource project.
Users deploying the Model must have the iam.serviceAccounts.actAs permission on this service account.
disableContainerLogging 
 
  boolean 
 
For custom-trained Models and AutoML Tabular Models, the container of the DeployedModel instances sends stderr and stdout streams to Cloud Logging by default. Note that these logs incur costs, which are subject to Cloud Logging pricing.
Set this flag to true to disable container logging.
enableAccessLogging 
 
  boolean 
 
If true, online prediction access logs are sent to Cloud Logging. These logs are like standard server access logs, containing information like timestamp and latency for each prediction request.
Note that logs may incur a cost, especially if your project receives prediction requests at a high queries-per-second (QPS) rate. Estimate your costs before enabling this option.
privateEndpoints 
 
  object (PrivateEndpoints) 
 
Output only. Provides paths for users to send predict/explain/health requests directly to the deployed model services running on Cloud via private services access. This field is populated if network is configured.
fasterDeploymentConfig 
 
  object (FasterDeploymentConfig) 
 
Configuration for faster model deployment.
status 
 
  object (Status) 
 
Output only. Runtime status of the deployed model.
systemLabels 
 
  map (key: string, value: string) 
 
System labels to apply to Model Garden deployments. System labels are managed by Google for internal use only.
checkpointId 
 
  string 
 
The checkpoint id of the model.
speculativeDecodingSpec 
 
  object (SpeculativeDecodingSpec) 
 
Optional. Spec for configuring speculative decoding.
prediction_resources 
 
  Union type 
 
  See Model.supported_deployment_resources_types. Required except for Large Model Deploy use cases. prediction_resources can be only one of the following:
dedicatedResources 
 
  object (DedicatedResources) 
 
A description of resources that are dedicated to the DeployedModel, and that need a higher degree of manual configuration.
automaticResources 
 
  object (AutomaticResources) 
 
A description of resources that are to a large degree decided by Vertex AI and require only modest additional configuration.
| JSON representation | 
|---|
| { "id" : string , "model" : string , "gdcConnectedModel" : string , "modelVersionId" : string , "displayName" : string , "createTime" : string , "explanationSpec" : { object ( | 
PrivateEndpoints
The PrivateEndpoints proto is used to provide paths for users to send requests privately. To send requests via private services access, use predictHttpUri, explainHttpUri or healthHttpUri. To send requests via Private Service Connect, use serviceAttachment.
predictHttpUri 
 
  string 
 
Output only. Http(s) path to send prediction requests.
explainHttpUri 
 
  string 
 
Output only. Http(s) path to send explain requests.
healthHttpUri 
 
  string 
 
Output only. Http(s) path to send health check requests.
| JSON representation | 
|---|
| { "predictHttpUri" : string , "explainHttpUri" : string , "healthHttpUri" : string , "serviceAttachment" : string } | 
FasterDeploymentConfig
Configuration for faster model deployment.
fastTryoutEnabled 
 
  boolean 
 
If true, enable fast tryout feature for this deployed model.
| JSON representation | 
|---|
| { "fastTryoutEnabled" : boolean } | 
Status
Runtime status of the deployed model.
lastUpdateTime 
 
  string (Timestamp format) 
 
Output only. The time at which the status was last updated.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
availableReplicaCount 
 
  integer 
 
Output only. The number of available replicas of the deployed model.
| JSON representation | 
|---|
| { "message" : string , "lastUpdateTime" : string , "availableReplicaCount" : integer } | 
SpeculativeDecodingSpec
Configuration for Speculative Decoding.
speculativeTokenCount 
 
  integer 
 
The number of speculative tokens to generate at each step.
speculation 
 
  Union type 
 
speculation can be only one of the following:
draftModelSpeculation 
 
  object (DraftModelSpeculation) 
 
Draft model speculation.
ngramSpeculation 
 
  object (NgramSpeculation) 
 
N-Gram speculation.
| JSON representation | 
|---|
| { "speculativeTokenCount" : integer , // speculation "draftModelSpeculation" : { object ( | 
DraftModelSpeculation
Draft model speculation works by using a smaller model to generate candidate tokens for speculative decoding.
draftModel 
 
  string 
 
Required. The resource name of the draft model.
| JSON representation | 
|---|
| { "draftModel" : string } | 
NgramSpeculation
N-Gram speculation works by trying to find matching tokens in the previous prompt sequence and using those as speculation for generating new tokens.
ngramSize 
 
  integer 
 
The number of trailing input tokens (the last N) used as the n-gram to match against the previous prompt sequence. This is the N in N-Gram. The default value is 3 if not specified.
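The idea behind N-Gram speculation can be sketched in a few lines. This is a toy illustration of the general technique described above, not the serving implementation, whose details are not documented here. The function name is hypothetical, tokens are modeled as integers, the search prefers the most recent earlier match, and the default of 4 speculative tokens is an arbitrary choice for the example.

```python
from typing import List


def ngram_speculate(
    tokens: List[int], ngram_size: int = 3, speculative_token_count: int = 4
) -> List[int]:
    """Propose candidate tokens by matching the trailing n-gram earlier on."""
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Walk backwards so the most recent earlier occurrence wins; the start
    # index excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            follow = start + ngram_size
            # The tokens that followed the earlier match become the guess.
            return tokens[follow:follow + speculative_token_count]
    return []
```

In real speculative decoding, the proposed tokens are then verified by the target model, and only the accepted prefix is kept.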
| JSON representation | 
|---|
| { "ngramSize" : integer } | 
PredictRequestResponseLoggingConfig
Configuration for logging request-response to a BigQuery table.
enabled 
 
  boolean 
 
Whether logging is enabled.
samplingRate 
 
  number 
 
Percentage of requests to be logged, expressed as a fraction in the range (0, 1].
bigqueryDestination 
 
  object (BigQueryDestination) 
 
BigQuery table for logging. If only given a project, a new dataset will be created with name logging_<endpoint-display-name>_<endpoint-id> 
where request_response_logging 
| JSON representation | 
|---|
| { "enabled" : boolean , "samplingRate" : number , "bigqueryDestination" : { object ( | 
ClientConnectionConfig
Configurations (e.g. inference timeout) that are applied on your endpoints.
inferenceTimeout 
 
  string (Duration format) 
 
Customizable online prediction request timeout.
A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".
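The Duration string format described above (seconds, up to nine fractional digits, trailing "s") can be parsed without losing precision by using `Decimal`. This is a minimal sketch: the helper name `parse_duration_seconds` is hypothetical, and accepting a leading minus sign is an assumption that is unlikely to be meaningful for a timeout.

```python
import re
from decimal import Decimal

# Whole seconds, an optional fraction of 1 to 9 digits, then a literal "s".
_DURATION = re.compile(r"-?\d+(?:\.\d{1,9})?s")


def parse_duration_seconds(value: str) -> Decimal:
    """Convert a Duration string such as "3.5s" to a number of seconds."""
    if _DURATION.fullmatch(value) is None:
        raise ValueError(f"not a Duration string: {value!r}")
    # Drop the trailing "s" and keep the exact decimal value.
    return Decimal(value[:-1])
```

Using `Decimal` rather than `float` keeps values like "0.000000001s" exact.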
| JSON representation | 
|---|
| { "inferenceTimeout" : string } | 
GenAiAdvancedFeaturesConfig
RagConfig
Configuration for Retrieval Augmented Generation feature.
enableRag 
 
  boolean 
 
If true, enable Retrieval Augmented Generation in ChatCompletion requests. Once enabled, the endpoint will be identified as a GenAI endpoint and the Arthedain router will be used.
| JSON representation | 
|---|
| { "enableRag" : boolean } | 
| Methods | |
|---|---|
|   | Creates an Endpoint. | 
|   | Deletes an Endpoint. | 
|   | Deploys a Model into this Endpoint, creating a DeployedModel within it. | 
|   | Performs a unary online prediction request to a gRPC model server for Vertex first-party products and frameworks. | 
|   | Performs a unary online prediction request to a gRPC model server for custom containers. | 
|   | Performs an online explanation. | 
|   | Gets an Endpoint. | 
|   | Lists Endpoints in a Location. | 
|   | Updates an existing deployed model. | 
|   | Updates an Endpoint. | 
|   | Performs an online prediction. | 
|   | |
|   | Performs an online prediction with an arbitrary HTTP payload. | 
|   | Performs a server-side streaming online prediction request for Vertex LLM streaming. | 
|   | Performs a streaming online prediction with an arbitrary HTTP payload. | 
|   | Undeploys a Model from an Endpoint, removing a DeployedModel from it, and freeing all resources it's using. | 
|   | Updates an Endpoint with a long running operation. | 

