Configure an extension to call a Google service

Service Extensions lets you configure extensions for supported Application Load Balancers by using callouts to Google services. This page shows you how to configure such extensions.

For an overview, see Integration with Google services.

Configure a traffic extension to call the Model Armor service

You can configure a traffic extension to call Model Armor to uniformly enforce security policies on generative AI inference traffic on Application Load Balancers, including GKE Inference Gateway.

A traffic extension groups related extension services into one or more chains. You can configure both plugins and callouts in the same extension chain. Each extension chain selects the traffic to act on by using Common Expression Language (CEL) match conditions. The load balancer evaluates a request against each chain's match condition sequentially. When a request matches the conditions defined by a chain, all extensions in the chain act on the request. Only one chain matches a given request.
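
For example, a chain's match condition can select requests by path or host. The following expressions are illustrative; see the CEL matcher language reference for the supported attributes and functions:

    // Match OpenAI-style chat completion requests.
    request.path == "/v1/chat/completions"

    // Match any v1 API call for a specific host (example.com is a placeholder).
    request.path.startsWith("/v1/") && request.host == "example.com"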

The extension references the load balancer forwarding rule that it attaches to. After you configure the resource, the load balancer starts sending matching requests to the Model Armor service.

Before you begin

  1. Identify a suitable project where you have either a project owner or editor role or the Compute Engine IAM roles required for this task.

  2. Enable the required APIs.

    Console

    1. In the Google Cloud console, go to the Enable access to APIs page.

      Go to Enable access to APIs

    2. Follow the instructions to enable the required APIs, which include the Compute Engine API, the Model Armor API, and the Network Services API.

    gcloud

    Use the gcloud services enable command:

    gcloud services enable compute.googleapis.com modelarmor.googleapis.com networkservices.googleapis.com
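
    Optionally, verify that the APIs are enabled:

    gcloud services list --enabled | grep -E 'compute|modelarmor|networkservices'
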
  3. Create the required Model Armor templates.
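
    If you create the templates with the gcloud CLI, a minimal sketch looks like the following. The template ID and filter settings are illustrative; adjust them to the screening behavior that you need, and check your gcloud version for the available flags.

    # Create a template with a responsible AI filter and basic
    # sensitive data protection enabled (illustrative settings).
    gcloud model-armor templates create ma-template-basic \
        --location=us-central1 \
        --rai-settings-filters='[{"filterType": "HATE_SPEECH", "confidenceLevel": "MEDIUM_AND_ABOVE"}]' \
        --basic-config-filter-enforcement=enabled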

  4. If you don't already have a Hugging Face account, create one.

  5. Enable access to the Gemma model through a Hugging Face access token. If you don't have such a token, create one. Specify the role as at least Read.

    Make a note of the access token for use later.
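
    The commands in this guide assume that the token is available in the HF_TOKEN environment variable, for example:

    export HF_TOKEN=HUGGING_FACE_ACCESS_TOKEN

    Replace HUGGING_FACE_ACCESS_TOKEN with the token value.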

Set up your GKE infrastructure

Setting up your GKE infrastructure involves deploying an LLM inference endpoint.

Subject to a few limitations, the following OpenAI API endpoints are supported: Assistants, Chat Completions, Completions (legacy), Embeddings, Messages, and Threads.

This example uses an OpenAI-compatible vLLM server to serve the google/gemma-2b model and expose a service within a GKE cluster through a GKE Gateway.

  1. If your project doesn't have the default VPC network, create it:

    gcloud compute networks create default \
        --subnet-mode=auto
  2. Create a proxy-only subnet named us-central1-subnet:

    gcloud compute networks subnets create us-central1-subnet \
        --purpose=REGIONAL_MANAGED_PROXY \
        --role=ACTIVE \
        --region=us-central1 \
        --network=default \
        --range=10.0.0.0/23 \
        --project=PROJECT_ID

    Replace PROJECT_ID with the project ID.

  3. Configure a cluster.

    1. Create a GKE cluster named vllm-cluster-gc with the following configuration:

      gcloud container clusters create vllm-cluster-gc \
          --gateway-api=standard \
          --zone=us-central1-a \
          --release-channel=rapid \
          --num-nodes=1 \
          --project=PROJECT_ID

    2. To ensure that the workload runs on machines equipped with the necessary GPUs, create a GPU node pool:

      gcloud container node-pools create gpu-pool \
          --cluster=vllm-cluster-gc \
          --zone=us-central1-a \
          --machine-type=g2-standard-24 \
          --accelerator=type=nvidia-l4,count=2 \
          --num-nodes=2 \
          --enable-autoscaling \
          --min-nodes=1 \
          --max-nodes=3 \
          --scopes=cloud-platform \
          --project=PROJECT_ID

      Then, delete the default node pool that was automatically created for the cluster:

      gcloud container node-pools delete default-pool \
          --cluster=vllm-cluster-gc \
          --zone=us-central1-a
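
      Optionally, fetch cluster credentials so that kubectl can reach the cluster, and confirm that the GPU nodes are ready:

      gcloud container clusters get-credentials vllm-cluster-gc \
          --zone=us-central1-a \
          --project=PROJECT_ID

      kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4
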
  4. Create and deploy GKE API resources. For detailed information, see Deploy a regional internal GKE Gateway.

    1. Create a Kubernetes namespace:

      kubectl create namespace completions-gemma

    2. Create a Kubernetes Secret that contains the Hugging Face access token that you created earlier:

      kubectl create secret generic hf-secret \
          --namespace completions-gemma \
          --from-literal=hf_api_token=${HF_TOKEN} \
          --dry-run=client -o yaml | kubectl apply -f -

      The command reads the token from the HF_TOKEN environment variable that you set earlier.

    3. In a manifest file named vllm-cluster.yaml, save the following configuration for these resources: Deployment, Service, Gateway, HTTPRoute, and HealthCheckPolicy.

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: completions-gemma-deployment
        namespace: completions-gemma
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: completions-gemma
        template:
          metadata:
            labels:
              app: completions-gemma
              ai.gke.io/model: gemma-2b
              ai.gke.io/inference-server: vllm
              examples.ai.gke.io/source: user-guide
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-l4
            containers:
            - name: vllm-server
              image: vllm/vllm-openai:v0.5.4
              imagePullPolicy: IfNotPresent
              resources:
                requests:
                  cpu: "2"
                  memory: "14Gi"
                  ephemeral-storage: "20Gi"
                  nvidia.com/gpu: 1
                limits:
                  cpu: "2"
                  memory: "14Gi"
                  ephemeral-storage: "20Gi"
                  nvidia.com/gpu: 1
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - --model=$(MODEL_ID)
              - --tensor-parallel-size=1
              - --gpu-memory-utilization=0.95
              - --disable-log-requests
              - --trust-remote-code
              - --port=8000
              ports:
              - containerPort: 8000
                name: http
              env:
              - name: MODEL_ID
                value: google/gemma-2b
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              volumeMounts:
              - mountPath: /dev/shm
                name: dshm
            volumes:
            - name: dshm
              emptyDir:
                medium: Memory
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: completions-gemma-service
        namespace: completions-gemma
        annotations:
          cloud.google.com/neg: '{"ingress":true}'
        labels:
          app: completions-gemma
      spec:
        type: NodePort
        ports:
        - port: 8000
          targetPort: http
          name: http
        selector:
          app: completions-gemma
      ---
      apiVersion: gateway.networking.k8s.io/v1beta1
      kind: Gateway
      metadata:
        name: completions-gemma-gateway
        namespace: completions-gemma
      spec:
        gatewayClassName: gke-l7-rilb # Equivalent to gce-internal
        listeners:
        - name: http
          protocol: HTTP
          port: 80
      ---
      apiVersion: gateway.networking.k8s.io/v1beta1
      kind: HTTPRoute
      metadata:
        name: completions-gemma-route
        namespace: completions-gemma
      spec:
        parentRefs:
        - kind: Gateway
          name: completions-gemma-gateway
          namespace: completions-gemma
        rules:
        - matches:
          - path:
              type: PathPrefix
              value: /v1/chat/completions
          backendRefs:
          - name: completions-gemma-service
            namespace: completions-gemma
            port: 8000
      ---
      apiVersion: networking.gke.io/v1
      kind: HealthCheckPolicy
      metadata:
        name: gke-custom-hcp
        namespace: completions-gemma
      spec:
        default:
          checkIntervalSec: 5
          config:
            httpHealthCheck:
              port: 8000
              portName: health
              requestPath: /health
            type: TCP
          healthyThreshold: 5
          logConfig:
            enabled: false
          timeoutSec: 2
          unhealthyThreshold: 2
        targetRef:
          group: ""
          kind: Service
          name: completions-gemma-service
          namespace: completions-gemma

    4. Apply the manifest:

      kubectl apply -f vllm-cluster.yaml
  5. Set up a way to send test requests to your service, for example, by running curl.

Limitations when configuring an OpenAI API endpoint

When configuring an OpenAI API endpoint for your GKE infrastructure, consider the following limitations pertaining to sanitizing prompts and responses:

  • Streaming API responses aren't supported for any API. If you use a mix of streaming and non-streaming APIs, set failOpen to true when you configure the traffic extension. Model Armor sanitizes the non-streaming responses and ignores the streaming responses.

  • When sanitizing prompts and responses, only the following operations are supported:

    • Assistants API: Create, List, and Retrieve
    • Chat Completions API: Create
    • Completions (legacy) API: Create
    • Embeddings API: Create
    • Messages API: Create, List, and Retrieve
    • Threads API: Create and Retrieve
  • For API calls that return multiple choices in the response (such as POST https://api.openai.com/v1/chat/completions), only the first item in the list of choices is sanitized.
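
    For example, in a request that asks for multiple choices by setting the OpenAI n parameter (an illustrative payload; ${IP} is the load balancer address obtained later in this guide), only choices[0] in the response is sanitized:

    curl http://${IP}/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "google/gemma-2b",
        "messages": [{"role": "user", "content": "Hello"}],
        "n": 2}'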

Configure the traffic extension

  1. Check the behavior before the extension is configured.

    1. Establish an SSH connection to the client VM.

      Console

      1. In the Google Cloud console, go to the VM instances page.

        Go to VM instances

      2. In the list of virtual machine instances, click SSH in the row of the instance that you want to connect to.

      gcloud

      Use the gcloud compute ssh command.

      gcloud compute ssh CLIENT_VM \
          --zone=ZONE

      Replace the following:

      • CLIENT_VM: the name of the client VM
      • ZONE: the zone of the VM
    2. Get the load balancer's exposed IP address:

      IP=$(kubectl get gateway/completions-gemma-gateway -n completions-gemma -o jsonpath='{.status.addresses[0].value}')

    3. Send the following curl request to the load balancer by using its exposed IP address (or a hostname, such as test.example.com, that resolves to it):

      curl -v http://${IP}/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{"model": "google/gemma-2b",
          "prompt": "Can you remember my ITIN: 123-45-6789",
          "max_tokens": 250,
          "temperature": 0.1}'

      The request generates an HTTP 200 OK status code even though sensitive data has been shared.

  2. Configure a traffic extension for Model Armor.

    Console

    1. In the Google Cloud console, go to the Service Extensions page.

      Go to Service Extensions

    2. Click Create extension. A wizard opens to guide you through some initial steps.

    3. For the product, select Load Balancing. Then, click Continue. A list of supported Application Load Balancers appears.

    4. Select a load balancer type.

    5. Specify the region as us-central1. Click Continue.

    6. For the extension type, select Traffic extensions, and then click Continue.

    7. To open the Create extension form, click Continue. In the Create extension form, notice that the preceding selections aren't editable.

    8. In the Basics section, do the following:

      1. Specify a unique name for the extension.

        The name must start with a lowercase letter followed by up to 62 lowercase letters, numbers, or hyphens and must not end with a hyphen.

      2. Optional: Enter a brief description of the extension, using up to 1,024 characters.

      3. Optional: In the Labels section, click Add label. Then, in the row that appears, do the following:

        • For Key, enter a key name.
        • For Value, enter a value for the key.

        To add more key-value pairs, click Add label. You can add a maximum of 64 key-value pairs.

        For more information about labels, see Create and update labels for projects .

    9. For Forwarding rules, select one or more forwarding rules to associate with the extension, for example, l7-ilb-forwarding-rule. Forwarding rules that are already associated with another extension can't be selected and appear unavailable.

    10. For Extension chains, add one or more extension chains to execute for a matching request.

      To add an extension chain, do the following, and then click Done:

      • For New extension chain name, specify a unique name.

        The name must conform with RFC-1034, use only lowercase letters, numbers, and hyphens, and have a maximum length of 63 characters. Additionally, the first character must be a letter and the last character must be a letter or a number.

      • To match requests for which the extension chain is executed, for Match condition, specify a Common Expression Language (CEL) expression, for example, request.path == "/v1/completions".

        For more information about CEL expressions, click Get syntax help or see the CEL matcher language reference.

      • Add one or more extensions to execute for a matching request.

        For each extension, under Extensions, do the following, and then click Done:

        • For Extension name, specify a unique name.

          The name must conform with RFC-1034, use only lowercase letters, numbers, and hyphens, and have a maximum length of 63 characters. Additionally, the first character must be a letter and the last character must be a letter or a number.

        • For Programmability type, select Google services, and then select a Model Armor service endpoint, for example, modelarmor.us-central1.rep.googleapis.com.

        • For Timeout, specify a value between 10 and 1000 milliseconds after which a message on the stream times out. Consider that Model Armor has a latency of approximately 250 milliseconds.

        • For Events, select all HTTP event types.

        • For Forward headers, click Add header, and then add HTTP headers to forward to the extension (from the client or the backend). If a header isn't specified, all headers are sent.

        • For Fail open, select Enabled. If the call to the extension fails or times out, request or response processing continues without error. Any subsequent extensions in the extension chain are also run.

          By default, the Fail open field isn't selected. In this case, if response headers haven't been delivered to the downstream client, a generic 500 status code is returned to the client. If response headers have been delivered, the HTTP stream to the downstream client is reset.

        • For Metadata, click Add metadata to specify the Model Armor templates to be used to screen prompts and responses corresponding to specific models.

          For Key, specify model_armor_settings. For Value, specify the templates as a JSON string, such as the following:

          [{
            "model": "MODEL_NAME",
            "model_response_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/RESPONSE_TEMPLATE",
            "user_prompt_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/PROMPT_TEMPLATE"
          }]

          Replace the following:

          • MODEL_NAME: the name of the model as configured in the manifest file, for example, google/gemma-2b
          • PROJECT_ID: the project ID
          • LOCATION: the location of the Model Armor template, for example, us-central1
          • RESPONSE_TEMPLATE: the response template for the model to use
          • PROMPT_TEMPLATE: the prompt template for the model to use

          You can also specify a default template to use when a request doesn't exactly match a model. To configure a default template, specify MODEL_NAME as default.

          If you don't want to screen prompt or response traffic, create and include an empty filter template.

          The total size of metadata must be less than 1 KiB. The total number of keys in the metadata must be less than 20. The length of each key must be less than 64 characters. The length of each value must be less than 1,024 characters. All values must be strings.

    11. Click Create extension.

    gcloud

    1. Define the callout in a YAML file and associate it with the forwarding rule. Use the sample values provided.

      cat >traffic_callout_service.yaml <<EOF
      name: traffic-ext
      forwardingRules:
      - https://www.googleapis.com/compute/v1/projects/PROJECT_ID/regions/us-central1/forwardingRules/l7-ilb-forwarding-rule
      loadBalancingScheme: INTERNAL_MANAGED
      extensionChains:
      - name: "chain1-model-armor"
        matchCondition:
          celExpression: 'request.path == "/v1/completions"'
        extensions:
        - name: extension-chain-1
          metadata:
            model_armor_settings: '[
              {
                "model": "MODEL_NAME",
                "model_response_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/RESPONSE_TEMPLATE",
                "user_prompt_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/PROMPT_TEMPLATE"
              }
            ]'
          service: modelarmor.us-central1.rep.googleapis.com
          failOpen: true
          supportedEvents:
          - REQUEST_HEADERS
          - REQUEST_BODY
          - RESPONSE_BODY
          - RESPONSE_HEADERS
          - REQUEST_TRAILERS
          - RESPONSE_TRAILERS
          timeout: 1s
      EOF

      Replace the following:

      • MODEL_NAME: the name of the model as configured in the manifest file, for example, google/gemma-2b
      • PROJECT_ID: the project ID
      • LOCATION: the location of the Model Armor template, for example, us-central1
      • RESPONSE_TEMPLATE: the response template for the model to use
      • PROMPT_TEMPLATE: the prompt template for the model to use

      In the metadata field, specify the Model Armor settings and templates to be used while screening prompts and responses corresponding to specific models.

      You can also specify a default template to use when a request doesn't exactly match a model. To configure a default template, specify MODEL_NAME as default.

      If you don't want to screen prompt or response traffic, create and include an empty filter template.

      The total size of metadata must be less than 1 KiB. The total number of keys in the metadata must be less than 16. The length of each key must be less than 64 characters. The length of each value must be less than 1,024 characters. All values must be strings.

    2. Import the traffic extension. Use the gcloud service-extensions lb-traffic-extensions import command with the following sample values.

      gcloud service-extensions lb-traffic-extensions import traffic-ext \
          --source=traffic_callout_service.yaml \
          --location=us-central1
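
      Optionally, verify that the extension was created:

      gcloud service-extensions lb-traffic-extensions describe traffic-ext \
          --location=us-central1
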
  3. Grant the required roles to the Service Extensions service account. Use the gcloud projects add-iam-policy-binding command:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/container.admin
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/modelarmor.calloutUser
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/serviceusage.serviceUsageConsumer
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/modelarmor.user

    Replace the following:

    • PROJECT_ID: the ID of the project
    • PROJECT_NUMBER: the project number

    These values are listed in the Project info panel in the Google Cloud console for your project.
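
    You can also look up the project number with the gcloud CLI:

    gcloud projects describe PROJECT_ID \
        --format="value(projectNumber)"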

  4. To verify that the traffic extension works as expected, run the same curl command:

    curl -v http://${IP}/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "google/gemma-2b",
        "prompt": "Can you remember my ITIN: 123-45-6789",
        "max_tokens": 250,
        "temperature": 0.1}'

    With the service extension configured, a request with sensitive data generates an HTTP 403 Forbidden status code, logs an error message as configured in the template, and closes the connection. When the request is safe, it generates an HTTP 200 OK status code and returns the LLM response to the user.
    
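    To confirm the allow path, send a request without sensitive data, for example:

    curl -v http://${IP}/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "google/gemma-2b",
        "prompt": "Write a short poem about the ocean",
        "max_tokens": 250,
        "temperature": 0.1}'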

What's next
