Context Caching for Fine-tuned Gemini Models

You can use context caching for your fine-tuned Gemini models to improve performance and reduce costs for prompts that include large amounts of context. By caching frequently used context, you avoid re-sending large amounts of data with each request to your fine-tuned model.

The management operations (read, update, delete) for context caches work the same way for tuned Gemini models as for base models. Only cached content creation and inference require adjustments, which are described in the following sections.

Prerequisites

Fine-tune a Gemini model: You need a deployed fine-tuned Gemini model based on a supported base model (see Context caching overview). For details on how to fine-tune a Gemini model, see Fine-tune a Gemini model. To get the endpoint for your deployed tuned model, see Deploy a tuned model.

Make sure that you have the following information:

  • The ID and the version of the tuned Gemini model
  • The endpoint resource name for the deployed fine-tuned model

Create a context cache for a fine-tuned model

The procedure for creating a context cache for a fine-tuned model largely follows the steps outlined in Create a context cache. Consult the linked documentation for the general process; this guide focuses on what differs when you create a context cache for a fine-tuned Gemini model.

Instead of using a base model in the form projects/{PROJECT}/locations/{LOCATION}/publishers/google/models/{MODEL}, you must use your fine-tuned model in the form projects/{PROJECT}/locations/{LOCATION}/models/{MODEL}@{VERSION}.
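
The difference is easiest to see side by side. The following Python sketch builds both resource-name forms; the helper names are illustrative, not part of any SDK:

```python
def base_model_resource(project: str, location: str, model: str) -> str:
    # Resource name form for a Google-published base model.
    return (f"projects/{project}/locations/{location}"
            f"/publishers/google/models/{model}")


def tuned_model_resource(project: str, location: str,
                         model_id: str, version: int) -> str:
    # Resource name form for a fine-tuned model; note the @version suffix.
    return (f"projects/{project}/locations/{location}"
            f"/models/{model_id}@{version}")


print(tuned_model_resource("test-project", "us-central1", "model-id", 1))
# projects/test-project/locations/us-central1/models/model-id@1
```
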

The following examples show how to create a context cache with a tuned Gemini model.

REST

You can use REST to create a context cache by using the Vertex AI API to send a POST request to the cachedContents endpoint. The following example shows how to create a context cache using a file stored in a Cloud Storage bucket.

Before using any of the request data, make the following replacements:

  • PROJECT_ID : Your project ID .
  • LOCATION : The region to process the request and where the cached content is stored. For a list of supported regions, see Available regions .
  • MODEL_ID : The fine-tuned Gemini model ID.
  • MODEL_VERSION : The fine-tuned Gemini model version.
  • CACHE_DISPLAY_NAME : A meaningful display name to help you identify each context cache.
  • MIME_TYPE : The MIME type of the content to cache.
  • CONTENT_TO_CACHE_URI : The Cloud Storage URI of the content to cache.
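
Before sending the request, you can assemble the body programmatically. This Python sketch mirrors the JSON structure shown below; the helper function is hypothetical, not an official SDK call:

```python
def build_cache_request(project: str, location: str, model_id: str,
                        version: int, display_name: str,
                        mime_type: str, content_uri: str) -> dict:
    # Build the JSON body for a cachedContents create request
    # against a fine-tuned model (note the models/{id}@{version} form).
    return {
        "model": (f"projects/{project}/locations/{location}"
                  f"/models/{model_id}@{version}"),
        "displayName": display_name,
        "contents": [{
            "role": "user",
            "parts": [{
                "fileData": {
                    "mimeType": mime_type,
                    "fileUri": content_uri,
                }
            }],
        }],
    }


body = build_cache_request("test-project", "us-central1", "model-id", 1,
                           "my-cache", "video/mp4",
                           "gs://path-to-bucket/video-file-name.mp4")
```
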

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents

Request JSON body:

{
  "model": "projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID@MODEL_VERSION",
  "displayName": "CACHE_DISPLAY_NAME",
  "contents": [{
    "role": "user",
    "parts": [{
      "fileData": {
        "mimeType": "MIME_TYPE",
        "fileUri": "CONTENT_TO_CACHE_URI"
      }
    }]
  },
  {
    "role": "model",
    "parts": [{
      "text": "This is sample text to demonstrate explicit caching."
    }]
  }]
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json , and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents"

PowerShell

Save the request body in a file named request.json , and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

Example curl command

LOCATION="us-central1"
MODEL_ID="model-id"
PROJECT_ID="test-project"
MODEL_VERSION=1
MIME_TYPE="video/mp4"
CACHED_CONTENT_URI="gs://path-to-bucket/video-file-name.mp4"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents" \
  -d "{
    \"model\": \"projects/${PROJECT_ID}/locations/${LOCATION}/models/${MODEL_ID}@${MODEL_VERSION}\",
    \"contents\": [
      {
        \"role\": \"user\",
        \"parts\": [
          {
            \"fileData\": {
              \"mimeType\": \"${MIME_TYPE}\",
              \"fileUri\": \"${CACHED_CONTENT_URI}\"
            }
          }
        ]
      }
    ]
  }"
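
The create call's JSON response identifies the new cache by a name field of the form projects/{PROJECT_NUMBER}/locations/{LOCATION}/cachedContents/{CACHE_ID}; you pass that full resource name as cachedContent in later inference requests. A small Python helper (illustrative, not part of any SDK) can pull out the cache ID when you only need that component:

```python
def cache_id_from_name(resource_name: str) -> str:
    # Extract the trailing cache ID from a CachedContent resource name:
    # projects/{project}/locations/{location}/cachedContents/{cache_id}
    prefix = "cachedContents/"
    idx = resource_name.rfind(prefix)
    if idx == -1:
        raise ValueError(f"unexpected resource name: {resource_name}")
    return resource_name[idx + len(prefix):]


print(cache_id_from_name(
    "projects/test-project/locations/us-central1/cachedContents/1234"))
# 1234
```
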
 

Use a context cache for a fine-tuned model

The procedure for using a context cache for a fine-tuned model largely follows the steps outlined in Use a context cache. Consult the linked documentation for the general process; this guide focuses on what differs when you use a context cache with a fine-tuned Gemini model.

Instead of sending the request to the base model endpoint in the form projects/{PROJECT}/locations/{LOCATION}/publishers/google/models/{MODEL}, you must send it to the endpoint of your deployed fine-tuned model in the form projects/{PROJECT}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}.
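
As a sketch, the full generateContent URL for a deployed endpoint can be built like this (the helper name is illustrative, not part of any SDK):

```python
def generate_content_url(project: str, location: str,
                         endpoint_id: str) -> str:
    # URL for a generateContent call against a deployed fine-tuned model;
    # note the endpoints/{ENDPOINT_ID} path instead of a publisher model.
    return (f"https://{location}-aiplatform.googleapis.com/v1/"
            f"projects/{project}/locations/{location}/"
            f"endpoints/{endpoint_id}:generateContent")


print(generate_content_url("test-project", "us-central1", "987654321"))
# https://us-central1-aiplatform.googleapis.com/v1/projects/test-project/locations/us-central1/endpoints/987654321:generateContent
```
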

The following code example shows how to use a context cache with a tuned Gemini model.

When you use a context cache, you can't specify the following properties:

  • GenerativeModel.system_instructions
  • GenerativeModel.tool_config
  • GenerativeModel.tools
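
In the REST request body, these SDK properties correspond to the systemInstruction, toolConfig, and tools fields (an assumption based on the standard generateContent schema). If you assemble request bodies by hand, a small guard like the following purely illustrative sketch can catch the conflict before sending:

```python
# Fields assumed to be incompatible with cachedContent in a
# generateContent request body (not an official constant).
DISALLOWED_WITH_CACHE = ("systemInstruction", "toolConfig", "tools")


def check_cached_request(body: dict) -> None:
    # Raise if the request references a context cache and also sets a
    # field that can't be combined with cachedContent.
    if "cachedContent" not in body:
        return
    bad = [field for field in DISALLOWED_WITH_CACHE if field in body]
    if bad:
        raise ValueError(f"fields not allowed with cachedContent: {bad}")


# A body with only cachedContent and contents passes the check.
check_cached_request({"cachedContent": "projects/p/locations/l/cachedContents/1",
                      "contents": []})
```
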

REST

You can use REST to specify a context cache with a prompt by using the Vertex AI API to send a POST request to the endpoint of your deployed fine-tuned model.

Before using any of the request data, make the following replacements:

  • PROJECT_ID : Your project ID .
  • LOCATION : The region where the request to create the context cache was processed.
  • ENDPOINT_ID : The endpoint where the fine-tuned model is deployed.
  • PROJECT_NUMBER : Your project number.
  • CACHE_ID : The ID of the context cache, returned in the response when the cache was created.
  • PROMPT_TEXT : The text prompt to submit to the model.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent

Request JSON body:

{
  "cachedContent": "projects/PROJECT_NUMBER/locations/LOCATION/cachedContents/CACHE_ID",
  "contents": [
      {"role": "user", "parts": [{"text": "PROMPT_TEXT"}]}
  ],
  "generationConfig": {
      "maxOutputTokens": 8192,
      "temperature": 1,
      "topP": 0.95
  },
  "safetySettings": [
      {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_HARASSMENT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      }
  ]
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json , and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent"

PowerShell

Save the request body in a file named request.json , and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Example curl command

LOCATION="us-central1"
PROJECT_ID="test-project"
ENDPOINT_ID=987654321

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:generateContent" \
  -d "{
    \"cachedContent\": \"projects/${PROJECT_NUMBER}/locations/${LOCATION}/cachedContents/${CACHE_ID}\",
    \"contents\": [
      {\"role\": \"user\", \"parts\": [{\"text\": \"What are the benefits of exercise?\"}]}
    ],
    \"generationConfig\": {
      \"maxOutputTokens\": 8192,
      \"temperature\": 1,
      \"topP\": 0.95
    },
    \"safetySettings\": [
      {\"category\": \"HARM_CATEGORY_HATE_SPEECH\", \"threshold\": \"BLOCK_MEDIUM_AND_ABOVE\"},
      {\"category\": \"HARM_CATEGORY_DANGEROUS_CONTENT\", \"threshold\": \"BLOCK_MEDIUM_AND_ABOVE\"},
      {\"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\", \"threshold\": \"BLOCK_MEDIUM_AND_ABOVE\"},
      {\"category\": \"HARM_CATEGORY_HARASSMENT\", \"threshold\": \"BLOCK_MEDIUM_AND_ABOVE\"}
    ]
  }"
 