You can use context caching for your fine-tuned Gemini models to improve performance and reduce costs for prompts that include large amounts of context. By caching frequently used context, you avoid re-sending large amounts of data with each request to your fine-tuned model.
The management operations (Read, Update, Delete) for a context cache on a tuned Gemini model are the same as for base models. Only cached content creation and inference require tuned-model-specific adjustments, which are described in the following sections.
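As an illustration, the following minimal sketch shows the read, update, and delete operations with the Vertex AI SDK for Python; it is not an official sample, the project, location, and cache ID are placeholder values, and the preview caching module is an assumption about the SDK surface:

# Minimal sketch: cache management with the Vertex AI SDK for Python.
# These operations are identical whether the cache was created for a
# tuned model or a base model. Project, location, and cache ID below
# are placeholder values.
import datetime

import vertexai
from vertexai.preview import caching

vertexai.init(project="your-project-id", location="us-central1")

# Read: fetch an existing cache by its resource name.
cache = caching.CachedContent(
    cached_content_name="projects/your-project-id/locations/us-central1/cachedContents/your-cache-id"
)
print(cache.expire_time)

# Update: extend the cache's time to live.
cache.update(ttl=datetime.timedelta(hours=2))

# Delete: remove the cache when it's no longer needed.
cache.delete()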
Prerequisites
Fine-tuning a Gemini model: You need a deployed fine-tuned Gemini model based on a supported base model (see Context caching overview). For details on how to fine-tune a Gemini model, see Fine-tune a Gemini model. To get the endpoint for your deployed tuned model, see Deploy a tuned model.
Make sure that you have the following information:
- The ID and the version of the tuned Gemini model
- The endpoint resource name for the deployed fine-tuned model
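If you need to look up the endpoint resource name, one option is to list your endpoints with the Vertex AI SDK for Python. This is a minimal sketch, not an official sample; the project and location values are placeholders:

# Minimal sketch: list deployed endpoints to find the resource name of
# the endpoint serving your tuned model. Project and location are
# placeholder values.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

for endpoint in aiplatform.Endpoint.list():
    # resource_name has the form:
    # projects/{PROJECT}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}
    print(endpoint.display_name, endpoint.resource_name)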
Create a context cache for a fine-tuned model
The procedure for creating a context cache for a fine-tuned model largely follows the steps outlined in Create a context cache. Consult the linked documentation for the general process; this guide focuses on what's different when you create a context cache for a fine-tuned Gemini model.
Instead of using the base model in the form projects/{PROJECT}/locations/{LOCATION}/publishers/google/models/{MODEL}, you must use your fine-tuned model in the form projects/{PROJECT}/locations/{LOCATION}/models/{MODEL}@{VERSION}.
The following examples show how to create a context cache with a tuned Gemini model.
REST
You can use REST to create a context cache by using the Vertex AI API to send a POST request to the cachedContents endpoint. The following example shows how to create a context cache using a file stored in a Cloud Storage bucket.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- LOCATION: The region to process the request and where the cached content is stored. For a list of supported regions, see Available regions.
- MODEL_ID: The fine-tuned Gemini model ID.
- MODEL_VERSION: The fine-tuned Gemini model version.
- CACHE_DISPLAY_NAME: A meaningful display name to describe and to help you identify each context cache.
- MIME_TYPE: The MIME type of the content to cache.
- CONTENT_TO_CACHE_URI: The Cloud Storage URI of the content to cache.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents
Request JSON body:
{ "model": "projects/ PROJECT_ID /locations/ LOCATION /models/ MODEL_ID @ MODEL_VERSION ", "displayName": " CACHE_DISPLAY_NAME ", "contents": [{ "role": "user", "parts": [{ "fileData": { "mimeType": " MIME_TYPE ", "fileUri": " CONTENT_TO_CACHE_URI " } }] }, { "role": "model", "parts": [{ "text": "This is sample text to demonstrate explicit caching." }] }] }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https:// LOCATION -aiplatform.googleapis.com/v1/projects/ PROJECT_ID /locations/ LOCATION /cachedContents"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https:// LOCATION -aiplatform.googleapis.com/v1/projects/ PROJECT_ID /locations/ LOCATION /cachedContents" | Select-Object -Expand Content
If the request succeeds, you receive a JSON response with the details of the newly created context cache, including its resource name.
Example curl command
LOCATION="us-central1"
MODEL_ID="model-id"
PROJECT_ID="test-project"
MODEL_VERSION=1
MIME_TYPE="video/mp4"
CACHED_CONTENT_URI="gs://path-to-bucket/video-file-name.mp4"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents" \
  -d @- <<EOF
{
  "model": "projects/${PROJECT_ID}/locations/${LOCATION}/models/${MODEL_ID}@${MODEL_VERSION}",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "${MIME_TYPE}",
            "fileUri": "${CACHED_CONTENT_URI}"
          }
        }
      ]
    }
  ]
}
EOF
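The same cache can also be created with the Vertex AI SDK for Python. The following is a minimal sketch rather than an official sample; it reuses the placeholder values from the curl example, the display name is a made-up value, and the preview caching module is an assumption about the SDK surface:

# Minimal sketch: create a context cache for a tuned model with the
# Vertex AI SDK for Python. The tuned-model-specific detail is the
# model_name format (models/{MODEL_ID}@{VERSION} instead of a
# publishers/google/models/... name). Values mirror the curl example.
import datetime

import vertexai
from vertexai.generative_models import Content, Part
from vertexai.preview import caching

vertexai.init(project="test-project", location="us-central1")

cache = caching.CachedContent.create(
    model_name="projects/test-project/locations/us-central1/models/model-id@1",
    contents=[
        Content(
            role="user",
            parts=[
                Part.from_uri(
                    "gs://path-to-bucket/video-file-name.mp4",
                    mime_type="video/mp4",
                )
            ],
        )
    ],
    display_name="example-cache",  # hypothetical display name
    ttl=datetime.timedelta(hours=1),
)
print(cache.name)  # projects/.../locations/.../cachedContents/{CACHE_ID}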
Use a context cache for a fine-tuned model
The procedure for using a context cache for a fine-tuned model largely follows the steps outlined in Use a context cache. Consult the linked documentation for the general process; this guide focuses on what's different when you use a context cache with a fine-tuned Gemini model.
Instead of sending the request to the base model endpoint in the form projects/{PROJECT}/locations/{LOCATION}/publishers/google/models/{MODEL}, you must send it to the endpoint of your deployed fine-tuned model in the form projects/{PROJECT}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}.
The following code example shows how to use a context cache with a tuned Gemini model.
When you use a context cache, you can't specify the following properties:
- GenerativeModel.system_instructions
- GenerativeModel.tool_config
- GenerativeModel.tools
REST
You can use REST to specify a context cache with a prompt by using the Vertex AI API to send a POST request to the endpoint where your fine-tuned model is deployed.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- PROJECT_NUMBER: Your project number.
- LOCATION: The region where the request to create the context cache was processed.
- ENDPOINT_ID: The endpoint where the fine-tuned model is deployed.
- CACHE_ID: The ID of the context cache, returned when the cache was created.
- PROMPT_TEXT: The text prompt to submit to the model.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent
Request JSON body:
{ "cachedContent": "projects/ PROJECT_NUMBER /locations/ LOCATION /cachedContents/ CACHE_ID ", "contents": [ {"role":"user","parts":[{"text":" PROMPT_TEXT "}]} ], "generationConfig": { "maxOutputTokens": 8192, "temperature": 1, "topP": 0.95, }, "safetySettings": [ { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE" }, { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE" }, { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE" }, { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE" } ], }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https:// LOCATION -aiplatform.googleapis.com/v1/projects/ PROJECT_ID /locations/ LOCATION /endpoints/ ENDPOINT_ID :generateContent"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https:// LOCATION -aiplatform.googleapis.com/v1/projects/ PROJECT_ID /locations/ LOCATION /endpoints/ ENDPOINT_ID :generateContent" | Select-Object -Expand Content
If the request succeeds, you receive a JSON response containing the model's generated content.
Example curl command
LOCATION="us-central1"
PROJECT_ID="test-project"
PROJECT_NUMBER="project-number"  # replace with your project number
ENDPOINT_ID=987654321
CACHE_ID="cache-id"  # replace with the ID of your context cache

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:generateContent" \
  -d @- <<EOF
{
  "cachedContent": "projects/${PROJECT_NUMBER}/locations/${LOCATION}/cachedContents/${CACHE_ID}",
  "contents": [
    {"role": "user", "parts": [{"text": "What are the benefits of exercise?"}]}
  ],
  "generationConfig": {
    "maxOutputTokens": 8192,
    "temperature": 1,
    "topP": 0.95
  },
  "safetySettings": [
    {
      "category": "HARM_CATEGORY_HATE_SPEECH",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_HARASSMENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    }
  ]
}
EOF
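For completeness, here is the same request issued from Python with the requests library and Application Default Credentials, mirroring the curl example above. This is a minimal sketch; the project number and cache ID are placeholders you must replace with your own values:

# Minimal sketch: call generateContent on the deployed tuned model's
# endpoint with a context cache, mirroring the curl example above.
import google.auth
import google.auth.transport.requests
import requests

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

location = "us-central1"
project_id = "test-project"
project_number = "project-number"  # placeholder
endpoint_id = "987654321"
cache_id = "cache-id"  # placeholder

url = (
    f"https://{location}-aiplatform.googleapis.com/v1/"
    f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}:generateContent"
)
body = {
    # The cache is referenced by project number, not project ID.
    "cachedContent": f"projects/{project_number}/locations/{location}/cachedContents/{cache_id}",
    "contents": [
        {"role": "user", "parts": [{"text": "What are the benefits of exercise?"}]}
    ],
    "generationConfig": {"maxOutputTokens": 8192, "temperature": 1, "topP": 0.95},
}
response = requests.post(
    url, headers={"Authorization": f"Bearer {credentials.token}"}, json=body
)
print(response.json())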

