gcloud beta ai model-garden models deploy

NAME: gcloud beta ai model-garden models deploy - deploy a model in Model Garden to a Vertex AI endpoint
SYNOPSIS: gcloud beta ai model-garden models deploy --model=MODEL [--accelerator-count=ACCELERATOR_COUNT] [--accelerator-type=ACCELERATOR_TYPE] [--accept-eula] [--asynchronous] [--container-args=[ARG,…]] [--container-command=[COMMAND,…]] [--container-deployment-timeout-seconds=CONTAINER_DEPLOYMENT_TIMEOUT_SECONDS] [--container-env-vars=[KEY=VALUE,…]] [--container-grpc-ports=[PORT,…]] [--container-health-probe-exec=[HEALTH_PROBE_EXEC,…]] [--container-health-probe-period-seconds=CONTAINER_HEALTH_PROBE_PERIOD_SECONDS] [--container-health-probe-timeout-seconds=CONTAINER_HEALTH_PROBE_TIMEOUT_SECONDS] [--container-health-route=CONTAINER_HEALTH_ROUTE] [--container-image-uri=CONTAINER_IMAGE_URI] [--container-ports=[PORT,…]] [--container-predict-route=CONTAINER_PREDICT_ROUTE] [--container-shared-memory-size-mb=CONTAINER_SHARED_MEMORY_SIZE_MB] [--container-startup-probe-exec=[STARTUP_PROBE_EXEC,…]] [--container-startup-probe-period-seconds=CONTAINER_STARTUP_PROBE_PERIOD_SECONDS] [--container-startup-probe-timeout-seconds=CONTAINER_STARTUP_PROBE_TIMEOUT_SECONDS] [--disable-dedicated-endpoint] [--enable-fast-tryout] [--endpoint-display-name=ENDPOINT_DISPLAY_NAME] [--hugging-face-access-token=HUGGING_FACE_ACCESS_TOKEN] [--machine-type=MACHINE_TYPE] [--region=REGION] [--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]] [--spot] [--system-labels=[KEY=VALUE,…]] [--use-dedicated-endpoint] [GCLOUD_WIDE_FLAG …]
EXAMPLES: To deploy the Model Garden model google/gemma2@gemma-2-9b under project example in region us-central1, run:
gcloud ai model-garden models deploy --model=google/gemma2@gemma-2-9b --project=example --region=us-central1

To deploy a Hugging Face model meta-llama/Meta-Llama-3-8B under project example in region us-central1, run:

gcloud ai model-garden models deploy --model=meta-llama/Meta-Llama-3-8B --hugging-face-access-token={hf_token} --project=example --region=us-central1
REQUIRED FLAGS: --model=MODEL

The model to be deployed. If it is a Model Garden model, it should be in the format {publisher_name}/{model_name}@{model_version_name}, e.g. google/gemma2@gemma-2-2b. If it is a Hugging Face model, it should follow the Hugging Face naming convention, e.g. meta-llama/Meta-Llama-3-8B. If it is a Custom Weights model, it should be in the format gs://{gcs_bucket_uri}, e.g. gs://-model-garden-public-us/llama3.1/Meta-Llama-3.1-8B-Instruct.
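
The three accepted formats can be told apart by their shape alone. The following is a minimal sketch of such a check, not part of gcloud; the function name classify_model and the category labels it prints are made up for illustration, and the gs://my-bucket path is a hypothetical placeholder.

```shell
# Classify a --model value by the three formats described above.
# The labels printed here are illustrative names, not gcloud terms.
classify_model() {
  case "$1" in
    gs://*) echo "custom-weights" ;;  # Cloud Storage URI
    */*@*)  echo "model-garden" ;;    # publisher/model@version
    */*)    echo "hugging-face" ;;    # organization/model
    *)      echo "unknown" ;;
  esac
}

classify_model "google/gemma2@gemma-2-2b"
classify_model "meta-llama/Meta-Llama-3-8B"
classify_model "gs://my-bucket/my-model"   # hypothetical bucket
```

The gs:// pattern is tested first so that a bucket path containing an @ character is not mistaken for a Model Garden identifier.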
OPTIONAL FLAGS: --accelerator-count=ACCELERATOR_COUNT

The accelerator count to serve the model. Accelerator count should be non-negative.

--accelerator-type=ACCELERATOR_TYPE

The accelerator type to serve the model. It should be a supported accelerator type from the verified deployment configurations of the model. Use gcloud ai model-garden models list-deployment-config to check the supported accelerator types.

--accept-eula

When set, the user accepts the End User License Agreement (EULA) of the model.

--asynchronous

When set, the command returns immediately instead of polling the operation status until the deployment completes.
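
As an illustration of combining --accept-eula with --asynchronous, the sketch below assembles a deploy command and prints it rather than running it, so it can be reviewed first. The function name async_deploy_cmd is hypothetical, and the model, project, and region are the placeholder values from the EXAMPLES section.

```shell
# Print (do not run) a deploy command that accepts the model's EULA and
# returns without waiting for the deployment to finish.
# Arguments: model, project, region.
async_deploy_cmd() {
  echo gcloud beta ai model-garden models deploy \
    "--model=$1" "--project=$2" "--region=$3" \
    --accept-eula --asynchronous
}

async_deploy_cmd "google/gemma2@gemma-2-9b" example us-central1
```

Piping the printed line to sh would execute it.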

--container-args=[ARG,…]

Comma-separated arguments passed to the command run by the container image. If not specified and no --container-command is provided, the container image's default command is used.

--container-command=[COMMAND,…]

Entrypoint for the container image. If not specified, the container image's default entrypoint is run.

--container-deployment-timeout-seconds=CONTAINER_DEPLOYMENT_TIMEOUT_SECONDS

Deployment timeout in seconds.

--container-env-vars=[KEY=VALUE,…]

List of key-value pairs to set as environment variables.

--container-grpc-ports=[PORT,…]

Container ports to receive gRPC requests at. Each port must be a number between 1 and 65535, inclusive.

--container-health-probe-exec=[HEALTH_PROBE_EXEC,…]

The command that the health probe executes. An example of this argument would be ["cat", "/tmp/healthy"].

--container-health-probe-period-seconds=CONTAINER_HEALTH_PROBE_PERIOD_SECONDS

How often (in seconds) to perform the health probe. Defaults to 10 seconds. Minimum value is 1.

--container-health-probe-timeout-seconds=CONTAINER_HEALTH_PROBE_TIMEOUT_SECONDS

Number of seconds after which the health probe times out. Defaults to 1 second. Minimum value is 1.

--container-health-route=CONTAINER_HEALTH_ROUTE

HTTP path to send health checks to inside the container.

--container-image-uri=CONTAINER_IMAGE_URI

URI of the Model serving container file in the Container Registry (e.g. gcr.io/myproject/server:latest).

--container-ports=[PORT,…]

Container ports to receive HTTP requests at. Each port must be a number between 1 and 65535, inclusive.
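
The port range that applies to both --container-ports and --container-grpc-ports can be checked locally before a deployment is attempted. The following is a minimal sketch of such a pre-check; the function name is_valid_port is hypothetical and not part of gcloud.

```shell
# Succeed only if the argument is an integer between 1 and 65535, inclusive.
is_valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty or contains a non-digit
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

for port in 8080 0 65536 http; do
  if is_valid_port "$port"; then
    echo "$port: ok"
  else
    echo "$port: rejected"
  fi
done
```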

--container-predict-route=CONTAINER_PREDICT_ROUTE

HTTP path to send prediction requests to inside the container.

--container-shared-memory-size-mb=CONTAINER_SHARED_MEMORY_SIZE_MB

The amount of the VM memory to reserve as the shared memory for the model in megabytes.

--container-startup-probe-exec=[STARTUP_PROBE_EXEC,…]

The command that the startup probe executes. An example of this argument would be ["cat", "/tmp/healthy"].

--container-startup-probe-period-seconds=CONTAINER_STARTUP_PROBE_PERIOD_SECONDS

How often (in seconds) to perform the startup probe. Defaults to 10 seconds. Minimum value is 1.

--container-startup-probe-timeout-seconds=CONTAINER_STARTUP_PROBE_TIMEOUT_SECONDS

Number of seconds after which the startup probe times out. Defaults to 1 second. Minimum value is 1.
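
To show how the container flags fit together, the sketch below assembles a deploy command with a custom serving container and a startup probe, and prints it instead of running it. Everything specific here is an assumption made for illustration: the function name container_deploy_cmd, the /health and /predict routes, the probe timings, and the comma-separated form of the probe command (inferred from the [STARTUP_PROBE_EXEC,…] list syntax).

```shell
# Print (do not run) a deploy command that uses a custom serving container.
# Arguments: model, container image URI, HTTP port.
# Routes, probe command, and timings below are hypothetical placeholders.
container_deploy_cmd() {
  echo gcloud beta ai model-garden models deploy \
    "--model=$1" \
    "--container-image-uri=$2" \
    "--container-ports=$3" \
    --container-health-route=/health \
    --container-predict-route=/predict \
    --container-startup-probe-exec=cat,/tmp/healthy \
    --container-startup-probe-period-seconds=10 \
    --container-startup-probe-timeout-seconds=5
}

container_deploy_cmd "meta-llama/Meta-Llama-3-8B" \
  "gcr.io/myproject/server:latest" 8080
```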

--disable-dedicated-endpoint

If true, the dedicated endpoint will be disabled and the deployed model will be exposed through the shared DNS.

--enable-fast-tryout

If set, the model is deployed using a faster deployment path. This is useful for quick experiments but not for production workloads, and it is only available for the most popular models with certain machine types.

--endpoint-display-name=ENDPOINT_DISPLAY_NAME

Display name of the endpoint with the deployed model.

--hugging-face-access-token=HUGGING_FACE_ACCESS_TOKEN

The access token from Hugging Face needed to read the model artifacts of gated models. It is only needed when the Hugging Face model to deploy is gated.

--machine-type=MACHINE_TYPE

The machine type to deploy the model to. It should be a supported machine type from the deployment configurations of the model. Use gcloud ai model-garden models list-deployment-config to check the supported machine types.
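
A typical workflow is to list the verified configurations first and then pass a matching machine type and accelerator to the deploy command. The sketch below prints both commands without running them. The helper names are hypothetical, the --model flag on list-deployment-config is an assumption, and g2-standard-12 with NVIDIA_L4 are placeholder values that have not been checked against this model's verified configurations.

```shell
MODEL="google/gemma2@gemma-2-9b"

# Step 1: print the command that lists the verified deployment configurations.
# The --model flag on this command is an assumption.
list_config_cmd() {
  echo gcloud ai model-garden models list-deployment-config "--model=$1"
}

# Step 2: print a deploy command using values picked from that list.
# Arguments: model, machine type, accelerator type, accelerator count.
deploy_with_hardware_cmd() {
  echo gcloud beta ai model-garden models deploy \
    "--model=$1" "--machine-type=$2" \
    "--accelerator-type=$3" "--accelerator-count=$4"
}

list_config_cmd "$MODEL"
# Placeholder hardware values; substitute ones reported by step 1.
deploy_with_hardware_cmd "$MODEL" g2-standard-12 NVIDIA_L4 1
```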

Region resource - Cloud region to deploy the model. This represents a Cloud resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.
To set the project attribute:

provide the argument --region on the command line with a fully specified name;

set the property ai/region with a fully specified name;

choose one from the prompted list of available regions with a fully specified name;

provide the argument --project on the command line;

set the property core/project.

--region=REGION

ID of the region or fully qualified identifier for the region.
To set the region attribute:

provide the argument --region on the command line;

set the property ai/region;

choose one from the prompted list of available regions.
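
Setting the core/project and ai/region properties once avoids repeating --project and --region on every invocation. The sketch below prints the two gcloud config set commands rather than running them; the function name set_defaults_cmds is hypothetical, and example and us-central1 are the placeholder values from the EXAMPLES section.

```shell
# Print the commands that store a default project and a default AI region.
# Arguments: project ID, region.
set_defaults_cmds() {
  echo gcloud config set core/project "$1"
  echo gcloud config set ai/region "$2"
}

set_defaults_cmds example us-central1
```

With both properties set, the deploy command can be run with only --model.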

--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]

A ReservationAffinity can be used to configure a Vertex AI resource (e.g., a DeployedModel) to draw its Compute Engine resources from a Shared Reservation, or exclusively from on-demand capacity.

--spot

If true, schedules the deployment workload on Spot VMs.

--system-labels=[KEY=VALUE,…]

System labels for Model Garden deployments.

--use-dedicated-endpoint

If true, the endpoint will be exposed through a dedicated DNS. Your request to the dedicated DNS will be isolated from other users' traffic and will have better performance and reliability.
GCLOUD WIDE FLAGS: These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.
Run $ gcloud help for details.
NOTES: This command is currently in beta and might change without notice. These variants are also available:
gcloud ai model-garden models deploy

gcloud alpha ai model-garden models deploy

gcloud beta ai model-garden models deploy