Use generateContent or streamGenerateContent to generate content with Gemini.
The Gemini model family includes models that work with multimodal prompt requests. The term multimodal indicates that you can use more than one modality, or type of input, in a prompt. Models that aren't multimodal accept prompts only with text. Modalities can include text, audio, video, and more.
Create a Google Cloud account to get started
To start using the Gemini API in Vertex AI, create a Google Cloud account.
After creating your account, use this document to review the Gemini model request body, model parameters, response body, and some sample requests.
When you're ready, see the Gemini API in Vertex AI quickstart to learn how to send a request to the Gemini API in Vertex AI using a programming language SDK or the REST API.
Supported models
All Gemini models support content generation.
Parameter list
See examples for implementation details.
Request body
{ "cachedContent" : s tr i n g , "contents" : [ { "role" : s tr i n g , "parts" : [ { // Union field data can be only one of the following: "text" : s tr i n g , "inlineData" : { "mimeType" : s tr i n g , "data" : s tr i n g }, "fileData" : { "mimeType" : s tr i n g , "fileUri" : s tr i n g }, // End of list of possible types for union field data. "videoMetadata" : { "startOffset" : { "seconds" : i nte ger , "nanos" : i nte ger }, "endOffset" : { "seconds" : i nte ger , "nanos" : i nte ger }, "fps" : double } } ] } ], "systemInstruction" : { "role" : s tr i n g , "parts" : [ { "text" : s tr i n g } ] }, "tools" : [ { "functionDeclarations" : [ { "name" : s tr i n g , "description" : s tr i n g , "parameters" : { objec t ( Ope n API Objec t Schema) } } ] } ], "safetySettings" : [ { "category" : e nu m (HarmCa te gory) , "threshold" : e nu m (HarmBlockThreshold) } ], "generationConfig" : { "temperature" : nu mber , "topP" : nu mber , "topK" : nu mber , "candidateCount" : i nte ger , "maxOutputTokens" : i nte ger , "presencePenalty" : fl oa t , "frequencyPenalty" : fl oa t , "stopSequences" : [ s tr i n g ], "responseMimeType" : s tr i n g , "responseSchema" : schema , "seed" : i nte ger , "responseLogprobs" : boolea n , "logprobs" : i nte ger , "audioTimestamp" : boolea n , "thinkingConfig" : { "thinkingBudget" : i nte ger } }, "labels" : { s tr i n g : s tr i n g } }
The request body contains data with the following parameters:
 cachedContent 
Optional: string 
The name of the cached content used as context to
    serve the prediction. Format: projects/{project}/locations/{location}/cachedContents/{cachedContent} 
 contents 
Required: Content 
The content of the current conversation with the model.
For single-turn queries, this is a single instance. For multi-turn queries, this is a repeated field that contains conversation history and the latest request.
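For example, a multi-turn request alternates user and model turns and ends with the latest user message. The following is a minimal sketch of the contents field (the prompt text is illustrative):

```python
# Sketch of a multi-turn "contents" field: conversation history alternates
# "user" and "model" roles, ending with the latest user turn.
contents = [
    {"role": "user", "parts": [{"text": "What is the capital of France?"}]},
    {"role": "model", "parts": [{"text": "The capital of France is Paris."}]},
    {"role": "user", "parts": [{"text": "What is its population?"}]},
]
```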
 systemInstruction 
Optional: Content 
Available for gemini-2.0-flash and gemini-2.0-flash-lite.
Instructions for the model to steer it toward better performance. For example, "Answer as concisely as possible" or "Don't use technical terms in your response".
The text strings count toward the token limit.
The role field of systemInstruction is ignored and doesn't affect the performance of the model.
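A minimal sketch of a systemInstruction value (the instruction text is illustrative):

```python
# Sketch of a "systemInstruction" field. The "role" value is ignored.
system_instruction = {
    "role": "system",
    "parts": [{"text": "Answer as concisely as possible."}],
}
```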
 tools 
Optional. A piece of code that enables the system to interact with external systems to perform an action, or set of actions, outside of the knowledge and scope of the model. See Function calling.
 toolConfig 
Optional. See Function calling .
 safetySettings 
Optional: SafetySetting 
Per-request settings for blocking unsafe content.
Enforced on GenerateContentResponse.candidates.
 generationConfig 
Optional: GenerationConfig 
Generation configuration settings.
 labels 
Optional: string 
Metadata that you can add to the API call in the format of key-value pairs.
 contents 
 
 The base structured data type containing multi-part content of a message.
This class consists of two main properties: role 
and parts 
. The role 
property denotes the individual producing the content, while the parts 
property contains multiple elements, each representing a segment of data within
a message.
 role 
 string 
The identity of the entity that creates the message. The following values are supported:
-  user: This indicates that the message is sent by a real person, typically a user-generated message.
-  model: This indicates that the message is generated by the model.
The model 
value is used to insert messages from the model into the conversation during multi-turn conversations.
 parts 
 Part 
A list of ordered parts that make up a single message. Different parts may have different IANA MIME types .
For limits on the inputs, such as the maximum number of tokens or the number of images, see the model specifications on the Google models page.
To compute the number of tokens in your request, see Get token count .
 parts 
 
 A data type containing media that is part of a multi-part Content 
message.
 text 
Optional: string 
A text prompt or code snippet.
 inlineData 
Optional: Blob 
Inline data in raw bytes.
For gemini-2.0-flash-lite and gemini-2.0-flash, you can specify up to 3000 images by using inlineData.
 fileData 
Optional: fileData 
Data stored in a file.
 functionCall 
Optional: FunctionCall
It contains a string representing the FunctionDeclaration.name 
field and a structured JSON object containing any parameters for the function call predicted by the model.
See Function calling .
 functionResponse 
Optional: FunctionResponse
The result output of a FunctionCall 
that contains a string representing the FunctionDeclaration.name 
field and a structured JSON object containing any output from the function call. It is used as context to the model.
See Function calling .
 videoMetadata 
Optional: VideoMetadata 
For video input, the start and end offset of the video in Duration format, and the frame rate of the video. For example, to specify a 10-second clip starting at 1:00 with a frame rate of 10 frames per second, set the following:
-  "startOffset": { "seconds": 60 }
-  "endOffset": { "seconds": 70 }
-  "fps": 10.0
The metadata should only be specified while the video data is presented in inlineData or fileData.
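Putting this together, a sketch of a video part for the 10-second clip described above (the Cloud Storage URI is a placeholder):

```python
# Sketch of a video part with metadata: a 10-second clip starting at 1:00,
# sampled at 10 frames per second. The fileUri is a placeholder.
video_part = {
    "fileData": {
        "mimeType": "video/mp4",
        "fileUri": "gs://your-bucket/your-video.mp4",  # placeholder
    },
    "videoMetadata": {
        "startOffset": {"seconds": 60},
        "endOffset": {"seconds": 70},
        "fps": 10.0,
    },
}
```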
 blob 
 
Content blob. If possible, send as text rather than raw bytes.
 mimeType 
 string 
The media type of the file specified in the data or fileUri fields. Acceptable values include the following:
-  application/pdf
-  audio/mpeg
-  audio/mp3
-  audio/wav
-  image/png
-  image/jpeg
-  image/webp
-  text/plain
-  video/mov
-  video/mpeg
-  video/mp4
-  video/mpg
-  video/avi
-  video/wmv
-  video/mpegps
-  video/flv
For gemini-2.0-flash-lite and gemini-2.0-flash, the maximum length of an audio file is 8.4 hours and the maximum length of a video file (without audio) is one hour. For more information, see Gemini audio and video requirements.
Text files must be UTF-8 encoded. The contents of the text file count toward the token limit.
There is no limit on image resolution.
 data 
 bytes 
The base64 encoding of the image, PDF, or video to include inline in the prompt. When including media inline, you must also specify the media type (mimeType) of the data.
Size limit: 20MB
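For example, a sketch of building an inlineData part for an image in Python (the file path is a placeholder):

```python
import base64

# Sketch of an "inlineData" part: read an image and base64-encode its bytes.
# The file path is a placeholder.
with open("image.jpeg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

image_part = {
    "inlineData": {
        "mimeType": "image/jpeg",
        "data": encoded,
    }
}
```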
 fileData 
 
 URI or web-URL data.
 mimeType 
 string 
IANA MIME type of the data.
 fileUri 
 string 
The URI or URL of the file to include in the prompt. Acceptable values include the following:
-  Cloud Storage bucket URI: The object must either be publicly readable or reside in the same Google Cloud project that's sending the request. For gemini-2.0-flash and gemini-2.0-flash-lite, the size limit is 2 GB.
- HTTP URL: The file URL must be publicly readable. You can specify one video file, one audio file, and up to 10 image files per request. Audio files, video files, and documents can't exceed 15 MB.
- YouTube video URL: The YouTube video must be either owned by the account that you used to sign in to the Google Cloud console or public. Only one YouTube video URL is supported per request.
When specifying a fileUri, you must also specify the media type (mimeType) of the file. If VPC Service Controls is enabled, specifying a media file URL for fileUri is not supported.
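A sketch of a fileData part that references a Cloud Storage object (the bucket and object names are placeholders):

```python
# Sketch of a "fileData" part referencing a Cloud Storage object.
pdf_part = {
    "fileData": {
        "mimeType": "application/pdf",
        "fileUri": "gs://your-bucket/your-document.pdf",  # placeholder
    }
}
```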
 functionCall 
 
 A predicted functionCall 
returned from the model that contains a string
representing the functionDeclaration.name 
and a structured JSON object
containing the parameters and their values.
 name 
 string 
The name of the function to call.
 args 
 Struct 
The function parameters and values in JSON object format.
See Function calling for parameter details.
 functionResponse 
 
 The resulting output from a FunctionCall 
that contains a string representing the FunctionDeclaration.name 
. Also contains a structured JSON object with the
output from the function (and uses it as context for the model). This should contain the
result of a FunctionCall 
made based on model prediction.
 name 
 string 
The name of the function to call.
 response 
 Struct 
The function response in JSON object format.
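As a sketch of the round trip (the function name, arguments, and response values are illustrative): the model returns a functionCall part, and after you run the function you send its output back as a functionResponse part:

```python
# Sketch of a function-calling round trip. Names and values are illustrative.
# 1. The model's candidate contains a functionCall part:
model_turn_part = {
    "functionCall": {
        "name": "get_current_weather",
        "args": {"location": "Boston, MA"},
    }
}

# 2. After executing the function, return its output as a functionResponse
#    part in the next "user" turn so the model can use it as context:
function_response_part = {
    "functionResponse": {
        "name": "get_current_weather",
        "response": {"temperature": 21, "unit": "celsius"},
    }
}
```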
 videoMetadata 
 
 Metadata describing the input video content.
 startOffset 
Optional: google.protobuf.Duration 
The start offset of the video.
 endOffset 
Optional: google.protobuf.Duration 
The end offset of the video.
 fps 
Optional: double 
The frame rate of the video sent to the model. Defaults to 1.0 if not specified. The value must be greater than 0.0 and at most 24.0.
 safetySetting 
 
 Safety settings.
 category 
Optional: HarmCategory 
The safety category to configure a threshold for. Acceptable values include the following:
-  HARM_CATEGORY_SEXUALLY_EXPLICIT
-  HARM_CATEGORY_HATE_SPEECH
-  HARM_CATEGORY_HARASSMENT
-  HARM_CATEGORY_DANGEROUS_CONTENT
 threshold 
Optional: HarmBlockThreshold 
The threshold for blocking responses that could belong to the specified safety category based on probability.
-  OFF
-  BLOCK_NONE
-  BLOCK_LOW_AND_ABOVE
-  BLOCK_MEDIUM_AND_ABOVE
-  BLOCK_ONLY_HIGH
 method 
Optional: HarmBlockMethod 
Specifies whether the threshold applies to the probability score or the severity score. If not specified, the threshold applies to the probability score.
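A sketch of a safetySettings value that blocks responses rated medium probability or above for hate speech:

```python
# Sketch of a "safetySettings" array with a single entry.
safety_settings = [
    {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE",
    }
]
```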
 harmCategory 
 
 Harm categories that block content.
 HARM_CATEGORY_UNSPECIFIED 
The harm category is unspecified.
 HARM_CATEGORY_HATE_SPEECH 
The harm category is hate speech.
 HARM_CATEGORY_DANGEROUS_CONTENT 
The harm category is dangerous content.
 HARM_CATEGORY_HARASSMENT 
The harm category is harassment.
 HARM_CATEGORY_SEXUALLY_EXPLICIT 
The harm category is sexually explicit content.
 harmBlockThreshold 
 
 Probability threshold levels used to block a response.
 HARM_BLOCK_THRESHOLD_UNSPECIFIED 
Unspecified harm block threshold.
 BLOCK_LOW_AND_ABOVE 
Block low threshold and higher (i.e. block more).
 BLOCK_MEDIUM_AND_ABOVE 
Block medium threshold and higher.
 BLOCK_ONLY_HIGH 
Block only high threshold (i.e. block less).
 BLOCK_NONE 
Block none.
 OFF 
Switches off safety if all categories are turned OFF.
 harmBlockMethod 
 
 A method that blocks a response based on the probability score alone, or on a combination of probability and severity.
 HARM_BLOCK_METHOD_UNSPECIFIED 
The harm block method is unspecified.
 SEVERITY 
The harm block method uses both probability and severity scores.
 PROBABILITY 
The harm block method uses the probability score.
 generationConfig 
 
 Configuration settings used when generating the response.
 temperature 
Optional: float 
The temperature is used for sampling during response generation, which occurs when topP 
and topK 
are applied. Temperature controls the degree of randomness in token selection.
Lower temperatures are good for prompts that require a less open-ended or creative response, while
higher temperatures can lead to more diverse or creative results. A temperature of 0 
means that the highest probability tokens are always selected. In this case, responses for a given
prompt are mostly deterministic, but a small amount of variation is still possible.
If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature. If the model enters infinite generation, increasing the temperature to at least 0.1 may lead to improved results.
1.0 is the recommended starting value for temperature.
- Range for gemini-2.0-flash-lite: 0.0 - 2.0 (default: 1.0)
- Range for gemini-2.0-flash: 0.0 - 2.0 (default: 1.0)
For more information, see Content generation parameters.
 topP 
Optional: float 
If specified, nucleus sampling is used.
Top-P changes how the model selects tokens for output. Tokens are selected from the most (see top-K) to least probable until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the top-P value is 0.5, then the model will select either A or B as the next token by using temperature and excludes C as a candidate.
Specify a lower value for less random responses and a higher value for more random responses.
- Range: 0.0 - 1.0
- Default for gemini-2.0-flash-lite: 0.95
- Default for gemini-2.0-flash: 0.95
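As a sketch, a generationConfig that combines these sampling parameters for fairly deterministic output (the values are illustrative, within the documented ranges):

```python
# Sketch of a "generationConfig" combining the sampling parameters above.
generation_config = {
    "temperature": 0.2,          # low randomness
    "topP": 0.8,                 # nucleus sampling cutoff
    "maxOutputTokens": 1024,
    "stopSequences": ["\n\n"],
}
```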
 candidateCount 
Optional: int 
The number of response variations to return. For each request, you're charged for the output tokens of all candidates, but are only charged once for the input tokens.
Specifying multiple candidates is a Preview feature that works with generateContent 
( streamGenerateContent 
is not supported). The following models are supported:
-  gemini-2.0-flash-lite: 1-8, default: 1
-  gemini-2.0-flash: 1-8, default: 1
 maxOutputTokens 
Optional: int
Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
For more information, see Content generation parameters .
 stopSequences 
Optional: List[string] 
Specifies a list of strings that tells the model to stop generating text if one
of the strings is encountered in the response. If a string appears multiple
times in the response, then the response truncates where it's first encountered.
The strings are case-sensitive.
For example, if the following is the returned response when stopSequences isn't specified:
public static string reverse(string myString)
Then the returned response with stopSequences set to ["Str", "reverse"] is:
public static string
Maximum 5 items in the list.
For more information, see Content generation parameters .
 presencePenalty 
Optional: float 
Positive values penalize tokens that already appear in the generated text, increasing the probability of generating more diverse content.
The maximum value for presencePenalty is up to, but not including, 2.0. Its minimum value is -2.0.
 frequencyPenalty 
Optional: float 
Positive values penalize tokens that repeatedly appear in the generated text, decreasing the probability of repeating content.
The maximum value for frequencyPenalty is up to, but not including, 2.0. Its minimum value is -2.0.
 responseMimeType 
Optional: string (enum) 
The output response MIME type of the generated candidate text.
The following MIME types are supported:
-  application/json: JSON response in the candidates.
-  text/plain (default): Plain text output.
-  text/x.enum: For classification tasks, output an enum value as defined in the response schema.
Specify the appropriate response type to avoid unintended behaviors. For example, if you require a JSON-formatted response, specify application/json and not text/plain.
text/plain isn't supported for use with responseSchema.
 responseSchema 
Optional: schema
The schema that generated candidate text must follow. For more information, see Control generated output.
To use this parameter, you must specify a supported MIME type other than text/plain for the responseMimeType parameter.
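For example, a sketch of a generationConfig that requests structured JSON output (the schema is illustrative):

```python
# Sketch: request structured JSON output. The schema is illustrative.
generation_config = {
    "responseMimeType": "application/json",
    "responseSchema": {
        "type": "object",
        "properties": {
            "recipe_name": {"type": "string"},
            "prep_time_minutes": {"type": "integer"},
        },
        "required": ["recipe_name"],
    },
}
```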
 seed 
Optional: int 
When seed is fixed to a specific value, the model makes a best effort to provide the same response for repeated requests. Deterministic output isn't guaranteed. Also, changing the model or parameter settings, such as the temperature, can cause variations in the response even when you use the same seed value. By default, a random seed value is used.
 responseLogprobs 
Optional: boolean 
If true, returns the log probabilities of the tokens that were chosen by the model at each step. By default, this parameter is set to false.
 logprobs 
Optional: int 
Returns the log probabilities of the top candidate tokens at each generation step. The model's chosen token might not be the same as the top candidate token at each step. Specify the number of candidates to return by using an integer value in the range of 1 - 20.
You must enable responseLogprobs to use this parameter.
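A sketch of a generationConfig that requests the top five candidate tokens at each step:

```python
# Sketch: return log probabilities for the chosen token and the
# top 5 candidate tokens at each generation step.
generation_config = {
    "responseLogprobs": True,   # required for "logprobs" to take effect
    "logprobs": 5,              # integer in the range 1-20
}
```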
 audioTimestamp 
Optional: boolean 
Available for the following models:
- Gemini 2.0 Flash-Lite
- Gemini 2.0 Flash
Enables timestamp understanding for audio-only files.
This is a preview feature.
 thinkingConfig 
Optional: object 
Configuration for the model's thinking process for Gemini 2.5 models.
The thinkingConfig object contains the following field:
-  thinkingBudget: integer. By default, the model automatically controls how much it thinks, up to a maximum of 8,192 tokens.
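A sketch of a generationConfig that caps the thinking budget for a Gemini 2.5 model (the budget value is illustrative):

```python
# Sketch: cap the thinking budget. The budget value is illustrative.
generation_config = {
    "thinkingConfig": {
        "thinkingBudget": 1024,
    }
}
```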
Response body
{ "candidates" : [ { "content" : { "parts" : [ { "text" : string } ] }, "finishReason" : enum ( FinishReason ), "safetyRatings" : [ { "category" : enum ( HarmCategory ), "probability" : enum ( HarmProbability ), "blocked" : boolean } ], "citationMetadata" : { "citations" : [ { "startIndex" : integer , "endIndex" : integer , "uri" : string , "title" : string , "license" : string , "publicationDate" : { "year" : integer , "month" : integer , "day" : integer } } ] }, "avgLogprobs" : double , "logprobsResult" : { "topCandidates" : [ { "candidates" : [ { "token" : string , "logProbability" : float } ] } ], "chosenCandidates" : [ { "token" : string , "logProbability" : float } ] } } ], "usageMetadata" : { "promptTokenCount" : integer , "candidatesTokenCount" : integer , "totalTokenCount" : integer }, "modelVersion" : string }
The response body contains data with the following fields:
 modelVersion 
The model and version used to generate the response. For example: gemini-2.0-flash-lite-001.
 text 
The generated text.
 finishReason 
The reason why the model stopped generating tokens. If empty, the model hasn't stopped generating tokens. Acceptable values include the following:
-  FINISH_REASON_STOP: Natural stop point of the model or provided stop sequence.
-  FINISH_REASON_MAX_TOKENS: The maximum number of tokens as specified in the request was reached.
-  FINISH_REASON_SAFETY: Token generation was stopped because the response was flagged for safety reasons. Note that Candidate.content is empty if content filters block the output.
-  FINISH_REASON_RECITATION: The token generation was stopped because the response was flagged for unauthorized citations.
-  FINISH_REASON_BLOCKLIST: Token generation was stopped because the response includes blocked terms.
-  FINISH_REASON_PROHIBITED_CONTENT: Token generation was stopped because the response was flagged for prohibited content, such as child sexual abuse material (CSAM).
-  FINISH_REASON_SPII: Token generation was stopped because the response was flagged for sensitive personally identifiable information (SPII).
-  FINISH_REASON_MALFORMED_FUNCTION_CALL: Candidates were blocked because of a malformed and unparsable function call.
-  FINISH_REASON_OTHER: All other reasons that stopped token generation.
-  FINISH_REASON_UNSPECIFIED: The finish reason is unspecified.
 category 
The safety category of the rating. Acceptable values include the following:
-  HARM_CATEGORY_SEXUALLY_EXPLICIT
-  HARM_CATEGORY_HATE_SPEECH
-  HARM_CATEGORY_HARASSMENT
-  HARM_CATEGORY_DANGEROUS_CONTENT
 probability 
The harm probability level of the content. Acceptable values include the following:
-  HARM_PROBABILITY_UNSPECIFIED
-  NEGLIGIBLE
-  LOW
-  MEDIUM
-  HIGH
 blocked 
A boolean flag that indicates whether the content was blocked for the associated safety category.
 startIndex 
An integer that specifies where a citation starts in the content. The startIndex is in bytes and calculated from the response encoded in UTF-8.
 endIndex 
An integer that specifies where a citation ends in the content. The endIndex is in bytes and calculated from the response encoded in UTF-8.
 uri 
The URI of a citation source. Examples of a URI source might be a news website or a GitHub repository.
 title 
The title of a citation source. Examples of source titles might be that of a news article or a book.
 license 
The license associated with a citation.
 publicationDate 
The date a citation was published. Its valid formats are YYYY, YYYY-MM, and YYYY-MM-DD.
 avgLogprobs 
Average log probability of the candidate.
 logprobsResult 
Returns the top candidate tokens (topCandidates) and the actual chosen tokens (chosenCandidates) at each step.
 token 
Generative AI models break down text data into tokens for processing, which can be characters, words, or phrases.
 logProbability 
A log probability value that indicates the model's confidence for a particular token.
 promptTokenCount 
Number of tokens in the request.
 candidatesTokenCount 
Number of tokens in the response(s).
 totalTokenCount 
Number of tokens in the request and response(s).
Examples
Text generation
Generate a text response from a text input.
Gen AI SDK for Python
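A minimal sketch using the Google Gen AI SDK (google-genai); the project ID and location are placeholders:

```python
from google import genai

# Placeholder project ID and location; the client targets the Vertex AI backend.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Why is the sky blue?",
)
print(response.text)
```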
Python (OpenAI)
You can call the Inference API by using the OpenAI library. For more information, see Call Vertex AI models by using the OpenAI library .
Using a multimodal prompt
Generate a text response from a multimodal input, such as text and an image.
Gen AI SDK for Python
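A minimal sketch using the Google Gen AI SDK (google-genai); the project ID and Cloud Storage URI are placeholders:

```python
from google import genai
from google.genai import types

# Placeholder project ID and location.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_uri(
            file_uri="gs://your-bucket/your-image.jpeg",  # placeholder
            mime_type="image/jpeg",
        ),
        "Describe this image.",
    ],
)
print(response.text)
```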
Python (OpenAI)
You can call the Inference API by using the OpenAI library. For more information, see Call Vertex AI models by using the OpenAI library .
Streaming text response
Generate a streaming model response from a text input.
Gen AI SDK for Python
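A minimal sketch using the Google Gen AI SDK (google-genai); the project ID and prompt are placeholders:

```python
from google import genai

# Placeholder project ID and location.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# generate_content_stream yields partial responses as they are produced.
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Write a short story about a maglev train.",
):
    print(chunk.text, end="")
```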
Python (OpenAI)
You can call the Inference API by using the OpenAI library. For more information, see Call Vertex AI models by using the OpenAI library .
Model versions
To use the auto-updated version, specify the model name without the trailing version number, for example gemini-2.0-flash instead of gemini-2.0-flash-001.
For more information, see Gemini model versions and lifecycle .
What's next
- Learn more about the Gemini API in Vertex AI.
- Learn more about Function calling.
- Learn more about Grounding responses for Gemini models.

