Context caching in Firebase AI Logic

For your AI feature, you might pass the same input tokens (content) to a model over and over. For these use cases, you can instead cache this content: you pass the content to the model once, store it, and then reference the stored content in subsequent requests.

Context caching can significantly reduce latency and cost for repetitive tasks involving a large amount of content, like a long text document, an audio file, or a video file. Common use cases for cached content include detailed persona documents, codebases, and manuals.

Gemini models offer two different caching mechanisms:

  • Implicit caching: automatically enabled on most models; no guaranteed cost savings

  • Explicit caching: can be optionally and manually enabled on most models; usually results in cost savings

Explicit caching is useful when you want a stronger guarantee of cost savings, in exchange for some added developer work.

For both implicit and explicit caching, the cachedContentTokenCount field in your response's metadata indicates the number of tokens in the cached part of your input. For explicit caching, make sure to review pricing information at the bottom of this page.

Supported models

Caching is supported when using the following models:

  • gemini-3.1-pro-preview
  • gemini-3-flash-preview
  • gemini-3.1-flash-lite-preview
  • gemini-2.5-pro
  • gemini-2.5-flash
  • gemini-2.5-flash-lite

Media-generating models (for example, the Nano Banana models like gemini-3.1-flash-image-preview) do not support context caching.

Cached content size limits

Each model has a minimum token count requirement for cached content. The maximum is dictated by the model's context window.

  • Gemini Pro models: 4096 tokens minimum
  • Gemini Flash models: 1024 tokens minimum

Additionally, the maximum size of content you can cache using a blob or text is 10 MB.



Implicit caching

Implicit caching is enabled by default and available for most Gemini models.

Google automatically passes on cost savings if your request hits the cached content. Here are some ways to increase the chance that your request uses implicit caching:

  • Try putting large and common content at the beginning of your prompt.
  • Try to send requests with a similar prefix in a short amount of time.

The number of tokens in the cached part of your input is provided in the cachedContentTokenCount field in the metadata of a response.
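As a sketch of how you might check this, the snippet below uses a stand-in response object; the assumption (worth verifying for your platform's SDK) is that usage metadata is exposed with the same field names as the REST response.

```javascript
// Sketch only: `response` is a stand-in for a generateContent result.
// Assumption: the SDK surfaces usage metadata with the REST field names.
const response = {
  usageMetadata: { promptTokenCount: 5210, cachedContentTokenCount: 4096 },
};

const cached = response.usageMetadata?.cachedContentTokenCount ?? 0;
const total = response.usageMetadata?.promptTokenCount ?? 0;
console.log(`${cached} of ${total} input tokens were served from cache`);
```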



Explicit caching

Explicit caching is not enabled by default, and it's an optional capability of the Gemini models.

Here's how you can set up and work with explicit content caches:

Note that explicit content caches interact with implicit caching, potentially leading to additional caching beyond the explicitly cached content. You can prevent cache data retention by disabling implicit caching and not creating explicit caches. For more information, see Enable and disable caching.



Create and use an explicit cache

Creating and using an explicit content cache requires the following:

  1. Create an explicit cache.

  2. Reference the cache in a server prompt template.

  3. Reference the server prompt template in a prompt request from your app.

Important information about creating and using an explicit cache

Your cache must be aligned with your app's prompt requests and your server prompt template:

  • The cache is specific to a Gemini API provider. Your app's prompt request must use the same provider.
    For Firebase AI Logic, we strongly recommend using explicit content caches only with the Vertex AI Gemini API. All the information and examples on this page are specific to that Gemini API provider.

  • The cache is specific to a Gemini model. Your app's prompt request must use the same model.

  • The cache is specific to a location when using the Vertex AI Gemini API.
    The location for the explicit cache must match the location of the server prompt template and the location where you access the model in your app's prompt request.

Also, be aware of the following limitations and requirements for explicit caching:

  • Once an explicit cache is created, you can't change anything about the cache except its TTL or expiration time.

  • You can cache any supported input file MIME type, or just text, provided within the cache creation request.

  • If you want to include a file in the cache, you must provide the file as a Cloud Storage URI. It can't be a browser URL or a YouTube URL.

    Additionally, access restrictions on the file are checked at cache-creation time, and are not checked again at user-request time. For this reason, make sure that any data included in the explicit cache is suitable for any user making a request that includes that cache.

  • If you want to use system instructions or tools (like code execution, URL context, or grounding with Google Search), then the cache itself must contain their configurations. They can't be configured in the server prompt template or in your app's prompt request. Note that server prompt templates don't yet support function calling (or chat). For details about how to configure system instructions and tools in your cache, see the REST API of the Vertex AI Gemini API.
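As a rough sketch of that requirement, a cache-creation request body that bundles system instructions and a tool might look like the fragment below. Check the exact field shape against the Vertex AI cachedContents REST reference before relying on it.

```json
{
  "model": "projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL_ID",
  "contents": [ ... ],
  "systemInstruction": {
    "role": "system",
    "parts": [{ "text": "You are an expert on the attached manual." }]
  },
  "tools": [
    { "googleSearch": {} }
  ],
  "ttl": "3600s"
}
```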

Step 1: Create the cache

Create the cache by directly using the REST API of the Vertex AI Gemini API.

The following example creates an explicit cache with a PDF file as its content.

Syntax:

PROJECT_ID="PROJECT_ID"
MODEL_ID="GEMINI_MODEL" # for example, gemini-3-flash-preview
LOCATION="LOCATION" # location for both the cache and the model
MIME_TYPE="MIME_TYPE"
CACHED_CONTENT_URI="CLOUD_STORAGE_FILE_URI" # must be a Cloud Storage URI
CACHE_DISPLAY_NAME="CACHE_DISPLAY_NAME" # optional
TTL="CACHE_TIME_TO_LIVE" # optional (if not specified, defaults to 3600s)

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents \
  -d @- <<EOF
{
  "model": "projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "${MIME_TYPE}",
            "fileUri": "${CACHED_CONTENT_URI}"
          }
        }
      ]
    }
  ],
  "displayName": "${CACHE_DISPLAY_NAME}",
  "ttl": "${TTL}"
}
EOF

Example request:

PROJECT_ID="my-amazing-app"
MODEL_ID="gemini-3-flash-preview"
LOCATION="global"
MIME_TYPE="application/pdf"
CACHED_CONTENT_URI="gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf"
CACHE_DISPLAY_NAME="Gemini - A Family of Highly Capable Multimodal Model (PDF)"
TTL="7200s"

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents \
  -d @- <<EOF
{
  "model": "projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "${MIME_TYPE}",
            "fileUri": "${CACHED_CONTENT_URI}"
          }
        }
      ]
    }
  ],
  "displayName": "${CACHE_DISPLAY_NAME}",
  "ttl": "${TTL}"
}
EOF

Example response:

The response includes a fully-qualified resource name that is globally unique to the cache (note that the last segment is the cache ID). You'll use this entire name value in the next step of the workflow.

{
  "name": "projects/861083271981/locations/global/cachedContents/4545031458888089601",
  "model": "projects/my-amazing-app/locations/global/publishers/google/models/gemini-3-flash-preview",
  "createTime": "2024-06-04T01:11:50.808236Z",
  "updateTime": "2024-06-04T01:11:50.808236Z",
  "expireTime": "2024-06-04T02:11:50.794542Z"
}

Step 2: Reference the cache in a server prompt template

After creating the cache, reference it by name within the cachedContent property of a server prompt template.

Make sure you follow these requirements when creating your server prompt template:

  • Use the fully-qualified resource name from the response returned when you created the cache. This is not the optional display name that you specified in the request.

  • The location for the server prompt template must match the location of the cache.

  • To use system instructions or tools, they must be configured as part of the cache and not as part of the server prompt template.

Syntax:

 {{cachedContent name="YOUR_CACHE_RESOURCE_NAME"}}

{{role "user"}}
{{userPrompt}} 

Example:

 {{cachedContent name="projects/861083271981/locations/global/cachedContents/4545031458888089601"}}

{{role "user"}}
{{userPrompt}} 

Alternatively, the value of the name parameter in the server prompt template can be a dynamic input variable. For example, {{cachedContent name=someVariable}} lets you include the name of the cache as an input for the request from your app.

Step 3: Reference the server prompt template in the request from your app

Be very careful about the following when writing your request:

  • Use the Vertex AI Gemini API since the cache was created with that Gemini API provider.

  • The location where you access the model in your app's prompt request must match the location of the server prompt template and the cache.

Swift

// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
let model = FirebaseAI.firebaseAI(backend: .vertexAI(location: "LOCATION"))
  .templateGenerativeModel()

do {
  let response = try await model.generateContent(
    // Specify your template ID
    templateID: "TEMPLATE_ID"
  )
  if let text = response.text {
    print("Response Text: \(text)")
  }
} catch {
  print("An error occurred: \(error)")
}
print("\n")

Kotlin

// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
val model = Firebase.ai(backend = GenerativeBackend.vertexAI(location = "LOCATION"))
  .templateGenerativeModel()

val response = model.generateContent(
  // Specify your template ID
  "TEMPLATE_ID",
)
val text = response.text
println(text)

Java

// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
TemplateGenerativeModel generativeModel = FirebaseAI.getInstance().templateGenerativeModel();
TemplateGenerativeModelFutures model = TemplateGenerativeModelFutures.from(generativeModel);

Future<GenerateContentResponse> response = model.generateContent(
    // Specify your template ID
    "TEMPLATE_ID"
);

addCallback(response, new FutureCallback<GenerateContentResponse>() {
  public void onSuccess(GenerateContentResponse result) {
    System.out.println(result.getText());
  }

  public void onFailure(Throwable t) {
    reportError(t);
  }
}, executor);

Web

// ...
// Initialize the Vertex AI Gemini API backend service
// Make sure to specify the same location as the server prompt template and the cache
const ai = getAI(app, { backend: new VertexAIBackend('LOCATION') });

// Create a `TemplateGenerativeModel` instance
const model = getTemplateGenerativeModel(ai);

const result = await model.generateContent(
  // Specify your template ID
  'TEMPLATE_ID'
);
const response = result.response;
const text = response.text();

Dart

// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
var _model = FirebaseAI.vertexAI(location: 'LOCATION').templateGenerativeModel();

var response = await _model.generateContent(
  // Specify your template ID
  'TEMPLATE_ID',
);
var text = response?.text;
print(text);

Unity

// ...
// Initialize the Vertex AI Gemini API backend service
// Make sure to specify the same location as the server prompt template and the cache
var firebaseAI = FirebaseAI.GetInstance(FirebaseAI.Backend.VertexAI(location: "LOCATION"));

// Create a `TemplateGenerativeModel` instance
var model = firebaseAI.GetTemplateGenerativeModel();

try {
  var response = await model.GenerateContentAsync(
    // Specify your template ID
    "TEMPLATE_ID"
  );
  Debug.Log($"Response Text: {response.Text}");
} catch (Exception e) {
  Debug.LogError($"An error occurred: {e.Message}");
}



Manage explicit caches

This section describes managing explicit content caches, including how to list all caches, get metadata about a cache, update the TTL or expiration time of a cache, and delete a cache.

You manage explicit caches using the REST API of the Vertex AI Gemini API.

Once an explicit content cache is created, you can't change anything about the cache except the TTL or expiration time.

List all caches

You can list all the explicit caches available for your project. Note that this command returns only the caches in the specified location.

PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"

curl \
  -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents

Get metadata for a cache

It's not possible to retrieve or view the actual cached content. However, you can retrieve metadata about an explicit cache, including name, model, display_name, usage_metadata, create_time, update_time, and expire_time.

You need to provide the CACHE_ID, which is the final segment in the fully-qualified resource name of the cache.
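The CACHE_ID is simply the last path segment of the cache's fully-qualified name. As an illustrative helper (not part of any Firebase SDK), extracting it looks like this:

```javascript
// Illustrative helper: pull the cache ID out of the fully-qualified
// resource name returned when the cache was created.
function cacheIdFromResourceName(name) {
  // The cache ID is the final "/"-separated segment.
  return name.split("/").pop();
}

const cacheId = cacheIdFromResourceName(
  "projects/861083271981/locations/global/cachedContents/4545031458888089601"
);
console.log(cacheId); // → "4545031458888089601"
```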

PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache

curl \
  -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}

Update the TTL or expiration time for a cache

When you create an explicit cache, you can optionally set the ttl or the expire_time.

  • ttl: The TTL (time-to-live) for the cache: the number of seconds (and optionally nanoseconds) that the cache lives after it's created, or after the ttl is last updated, before it expires. When you set the ttl, the expireTime of the cache is automatically updated.

  • expire_time: A Timestamp (like 2024-06-30T09:00:00.000000Z) that specifies the absolute date and time when the cache expires.

If you don't set either of these values, the default TTL is 1 hour. There are no minimum or maximum bounds on the TTL.
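To make the relationship between ttl and expireTime concrete, the sketch below mirrors the arithmetic the service performs server-side (illustrative only; not an SDK API):

```javascript
// Illustrative only: setting a ttl like "7200s" results in
// expireTime = creation (or update) time + ttl.
function expireTimeFromTtl(startTimeIso, ttl) {
  // ttl values are expressed as a number of seconds with an "s" suffix.
  const seconds = Number(ttl.replace(/s$/, ""));
  return new Date(Date.parse(startTimeIso) + seconds * 1000).toISOString();
}

console.log(expireTimeFromTtl("2024-06-04T01:11:50.000Z", "7200s"));
// → "2024-06-04T03:11:50.000Z"
```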

For existing explicit caches, you can add or update the ttl or expire_time. You need to provide the CACHE_ID, which is the final segment in the fully-qualified resource name of the cache.

Update ttl

PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
TTL="CACHE_TIME_TO_LIVE"

curl \
  -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID} \
  -d '{
    "ttl": "'$TTL'"
  }'

Update expire_time

PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
EXPIRE_TIME="ABSOLUTE_TIME_CACHE_EXPIRES"

curl \
  -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID} \
  -d '{
    "expire_time": "'$EXPIRE_TIME'"
  }'

Delete a cache

When an explicit cache is no longer needed, you can delete it.

You need to provide the CACHE_ID, which is the final segment in the fully-qualified resource name of the cache.

PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache

curl \
  -X DELETE \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}
 



Pricing for explicit caching

Explicit caching is a paid feature designed to reduce cost. Pricing is based on the following factors:

  • Input tokens for cache creation: For both implicit and explicit caching, you're billed at the standard input token price for the input tokens used to create the cache.

  • Storage of cache: For explicit caching, there are also storage costs based on how long caches are stored. There are no storage costs for implicit caching. For more information, see the pricing for the Vertex AI Gemini API.

  • Usage of cached content: Explicit caching guarantees a discount on input tokens that reference an existing cache. For Gemini 2.5 and later models, this discount is 90%.

The number of tokens in the cached part of your input is provided in the cachedContentTokenCount field in the metadata of a response.
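To see how these factors combine, here's an illustrative back-of-the-envelope calculation. The per-token price below is a placeholder, not a real rate; see the Vertex AI Gemini API pricing page for actual prices, and note that cache storage costs are billed separately.

```javascript
// Illustrative arithmetic only. PRICE_PER_INPUT_TOKEN is a placeholder.
const PRICE_PER_INPUT_TOKEN = 1e-6;
const CACHED_TOKEN_DISCOUNT = 0.9; // 90% discount for Gemini 2.5 and later

function inputTokenCost(promptTokenCount, cachedContentTokenCount) {
  // Uncached input tokens are billed at the standard rate; cached
  // input tokens are billed at (1 - discount) of the standard rate.
  const uncached = promptTokenCount - cachedContentTokenCount;
  const cached = cachedContentTokenCount * (1 - CACHED_TOKEN_DISCOUNT);
  return (uncached + cached) * PRICE_PER_INPUT_TOKEN;
}

// 10,000 input tokens, 8,000 of them served from cache: the cached
// portion is billed at 10% of the standard rate.
console.log(inputTokenCost(10000, 8000));
```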
