To deploy a model for online inference, you need an endpoint. Endpoints can be divided into the following types:
- Public endpoints can be accessed over the public internet. They are easier to use because no private network infrastructure is required. There are two types of public endpoints: dedicated and shared. A dedicated public endpoint is faster than a shared public endpoint and provides production isolation, support for larger payload sizes, and longer request timeouts. In addition, inference requests sent to a dedicated public endpoint are isolated from other users' traffic. For these reasons, dedicated public endpoints are recommended as a best practice.
- Dedicated private endpoints using Private Service Connect provide a secure connection for private communication between on-premises networks and Google Cloud. You can use them to control Google API traffic through Private Service Connect APIs. They are also recommended as a best practice.
- Private endpoints also provide a secure connection to your model and can likewise be used for private communication between on-premises networks and Google Cloud. They use private services access over a VPC Network Peering connection; a minimal creation sketch follows this list.
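To make the private option concrete, the following is a minimal sketch, using the Vertex AI SDK for Python, of creating a private endpoint that uses private services access over VPC Network Peering. The project, region, and network path are placeholder assumptions, and the VPC must already have a private services access connection to Google services for the call to succeed.

```python
from google.cloud import aiplatform

# Placeholder project and region (assumptions for illustration).
aiplatform.init(project="my-project", location="us-central1")

# Create a private endpoint reachable only over the peered VPC.
# The network path below is a placeholder; the VPC it names must
# already have a private services access (VPC Network Peering) connection.
private_endpoint = aiplatform.PrivateEndpoint.create(
    display_name="my-private-endpoint",
    network="projects/PROJECT_NUMBER/global/networks/my-vpc",
)
```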
For more information about deploying a model to an endpoint, see Deploy a model to an endpoint.
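As a brief illustration, here is a minimal sketch, using the Vertex AI SDK for Python, of creating a dedicated public endpoint and deploying a registered model to it. The project, model ID, and machine type are placeholder assumptions; omitting the `dedicated_endpoint_enabled` flag creates a shared public endpoint instead.

```python
from google.cloud import aiplatform

# Placeholder project and region (assumptions for illustration).
aiplatform.init(project="my-project", location="us-central1")

# Create a dedicated public endpoint. Without dedicated_endpoint_enabled=True,
# Endpoint.create provisions a shared public endpoint.
endpoint = aiplatform.Endpoint.create(
    display_name="my-dedicated-endpoint",
    dedicated_endpoint_enabled=True,
)

# Deploy a model from the Vertex AI Model Registry to the endpoint.
# MODEL_ID and the machine type are placeholders.
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/MODEL_ID"
)
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
)
```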
The following table compares the supported endpoint types for serving Vertex AI online inferences.
| | Dedicated public endpoint (recommended) | Shared public endpoint | Dedicated private endpoint using Private Service Connect (recommended) | Private endpoint |
|---|---|---|---|---|
| Purpose | Default networking experience. Enables submitting requests from the public internet. | Default networking experience. Enables submitting requests from the public internet. | Recommended for production enterprise applications. Improves network latency and security by making sure that requests and responses are routed privately. | Recommended for production enterprise applications. Improves network latency and security by making sure that requests and responses are routed privately. |
| Networking access | Public internet using a dedicated networking plane | Public internet using a shared networking plane | Private networking using a Private Service Connect endpoint | Private networking using private services access (VPC Network Peering) |
| VPC Service Controls | Not supported. Use a dedicated private endpoint instead. | Supported | Supported | Supported |
| Cost | Vertex AI Inference | Vertex AI Inference | Vertex AI Inference + Private Service Connect endpoint | Vertex AI Inference + private services access (see "Using a Private Service Connect endpoint (forwarding rule) to access a published service") |
| Network latency | Optimized | Unoptimized | Optimized | Optimized |
| Encryption in transit | TLS with CA-signed certificate | TLS with CA-signed certificate | Optional TLS with self-signed certificate | None |
| Inference timeout | Configurable up to 1 hour | 60 seconds | Configurable up to 1 hour | 60 seconds |
| Payload size limit | 10 MB | 1.5 MB | 10 MB | 10 MB |
| QPM quota | Unlimited | 30,000 | Unlimited | Unlimited |
| Protocol support | HTTP or gRPC | HTTP | HTTP or gRPC | HTTP |
| Streaming support | Yes (SSE) | No | Yes (SSE) | No |
| Traffic split | Yes | Yes | Yes | No |
| Request and response logging | Yes | Yes | Yes | No |
| Access logging | Yes | Yes | Yes | No |
| Tuned Gemini model deployment | No | Yes | No | No |
| AutoML models and explainability | No | Yes | No | No |
| Client libraries supported | Vertex AI SDK for Python | Vertex AI client libraries, Vertex AI SDK for Python | Vertex AI SDK for Python | Vertex AI SDK for Python |
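After a model is deployed, you send online inference requests through the endpoint, subject to the protocol, payload size, and timeout limits in the table above. The following is a minimal request sketch with the Vertex AI SDK for Python, assuming a placeholder endpoint ID and placeholder feature values for a tabular model:

```python
from google.cloud import aiplatform

# Placeholder project and region (assumptions for illustration).
aiplatform.init(project="my-project", location="us-central1")

# Look up the endpoint by its resource name (ENDPOINT_ID is a placeholder)
# and send one instance for online inference.
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/ENDPOINT_ID"
)
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": 2.0}])
print(response.predictions)
```

As the table notes, only dedicated public and dedicated private endpoints support streamed responses, which are delivered as server-sent events (SSE).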
What's next
- Learn more about deploying a model to an endpoint.