You can ask a Gemini model to analyze audio files that you provide
either inline (base64-encoded) or via URL. When you use Firebase AI Logic,
you can make this request directly from your app.
With this capability, you can do things like:
Describe, summarize, or answer questions about audio content
Transcribe audio content
Analyze specific segments of audio using timestamps (see the example after this list)
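For example, to target a specific segment, you can reference timestamps directly in your text prompt. The following is a minimal Swift sketch, assuming a model instance and an audio inline-data part set up as in the Swift examples later on this page; timestamps are written in the MM:SS form typically used in audio prompts.

// Minimal sketch: ask about a specific segment of the audio by referencing
// MM:SS timestamps in the prompt. Assumes `model` and `audio` are created
// as in the Swift example later on this page.
let segmentPrompt = "Summarize what's said between 01:00 and 02:30 of this recording."
let segmentResponse = try await model.generateContent(audio, segmentPrompt)
print(segmentResponse.text ?? "No text in response.")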
Click your Gemini API provider to view provider-specific content
and code on this page.
If you haven't already, complete the getting started guide, which describes how to
set up your Firebase project, connect your app to Firebase, add the SDK,
initialize the backend service for your chosen Gemini API provider, and
create a GenerativeModel instance.
For testing and iterating on your prompts, and even
getting a generated code snippet, we recommend using Google AI Studio.
Need a sample audio file?
You can use this publicly available file with a MIME type of audio/mp3: https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/pixel.mp3
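If you want to try the Swift examples below with this sample file, one option is to download it at runtime and use the local copy as the audioURL that those examples read from. This is a minimal sketch, not part of the official samples; it assumes network access and an OS version that supports the async URLSession download API.

// Minimal sketch: download the sample audio file and use the local copy as
// the `audioURL` that the Swift examples below read from.
import Foundation

let sampleURL = URL(string:
  "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/pixel.mp3")!
let (audioURL, _) = try await URLSession.shared.download(from: sampleURL)
// `audioURL` now points to a temporary local file containing the sample audio.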
Generate text from audio files (base64-encoded)
Before trying this sample, complete the Before you begin section of this guide
to set up your project and app. In that section, you'll also click a button for your chosen Gemini API provider so that you see provider-specific content
on this page.
You can ask a Gemini model to
generate text by prompting with text and audio, providing the
input file's mimeType and the file itself. Find requirements and recommendations for input files later on this page.
Swift
You can call generateContent() to generate text from multimodal input of text and a single audio file.
import FirebaseAI

// Initialize the Gemini Developer API backend service
let ai = FirebaseAI.firebaseAI(backend: .googleAI())

// Create a `GenerativeModel` instance with a model that supports your use case
let model = ai.generativeModel(modelName: "gemini-2.5-flash")

// Provide the audio as `Data`
guard let audioData = try? Data(contentsOf: audioURL) else {
  print("Error loading audio data.")
  return  // Or handle the error appropriately
}

// Specify the appropriate audio MIME type
let audio = InlineDataPart(data: audioData, mimeType: "audio/mpeg")

// Provide a text prompt to include with the audio
let prompt = "Transcribe what's said in this audio recording."

// To generate text output, call `generateContent` with the audio and text prompt
let response = try await model.generateContent(audio, prompt)

// Print the generated text, handling the case where it might be nil
print(response.text ?? "No text in response.")
Kotlin
You can call generateContent() to generate text from multimodal input of text and a single audio file.
For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.
// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
val model = Firebase.ai(backend = GenerativeBackend.googleAI())
                        .generativeModel("gemini-2.5-flash")

val contentResolver = applicationContext.contentResolver

val inputStream = contentResolver.openInputStream(audioUri)

if (inputStream != null) {  // Check if the audio loaded successfully
    inputStream.use { stream ->
        val bytes = stream.readBytes()

        // Provide a prompt that includes the audio specified above and text
        val prompt = content {
            inlineData(bytes, "audio/mpeg")  // Specify the appropriate audio MIME type
            text("Transcribe what's said in this audio recording.")
        }

        // To generate text output, call `generateContent` with the prompt
        val response = model.generateContent(prompt)

        // Log the generated text, handling the case where it might be null
        Log.d(TAG, response.text ?: "")
    }
} else {
    Log.e(TAG, "Error getting input stream for audio.")
    // Handle the error appropriately
}
Java
You can call generateContent() to generate text from multimodal input of text and a single audio file.
// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
GenerativeModel ai = FirebaseAI.getInstance(GenerativeBackend.googleAI())
        .generativeModel("gemini-2.5-flash");

// Use the GenerativeModelFutures Java compatibility layer which offers
// support for ListenableFuture and Publisher APIs
GenerativeModelFutures model = GenerativeModelFutures.from(ai);

ContentResolver resolver = getApplicationContext().getContentResolver();

try (InputStream stream = resolver.openInputStream(audioUri)) {
    File audioFile = new File(new URI(audioUri.toString()));
    int audioSize = (int) audioFile.length();
    byte[] audioBytes = new byte[audioSize];
    if (stream != null) {
        stream.read(audioBytes, 0, audioBytes.length);
        stream.close();

        // Provide a prompt that includes the audio specified above and text
        Content prompt = new Content.Builder()
              .addInlineData(audioBytes, "audio/mpeg")  // Specify the appropriate audio MIME type
              .addText("Transcribe what's said in this audio recording.")
              .build();

        // To generate text output, call `generateContent` with the prompt
        ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
        Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
            @Override
            public void onSuccess(GenerateContentResponse result) {
                String text = result.getText();
                Log.d(TAG, (text == null) ? "" : text);
            }

            @Override
            public void onFailure(Throwable t) {
                Log.e(TAG, "Failed to generate a response", t);
            }
        }, executor);
    } else {
        Log.e(TAG, "Error getting input stream for file.");
        // Handle the error appropriately
    }
} catch (IOException e) {
    Log.e(TAG, "Failed to read the audio file", e);
} catch (URISyntaxException e) {
    Log.e(TAG, "Invalid audio file", e);
}
Web
You can call generateContent() to generate text from multimodal input of text and a single audio file.
import { initializeApp } from "firebase/app";
import { getAI, getGenerativeModel, GoogleAIBackend } from "firebase/ai";

// TODO(developer) Replace the following with your app's Firebase configuration
// See: https://firebase.google.com/docs/web/learn-more#config-object
const firebaseConfig = {
  // ...
};

// Initialize FirebaseApp
const firebaseApp = initializeApp(firebaseConfig);

// Initialize the Gemini Developer API backend service
const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });

// Create a `GenerativeModel` instance with a model that supports your use case
const model = getGenerativeModel(ai, { model: "gemini-2.5-flash" });

// Converts a File object to a Part object.
async function fileToGenerativePart(file) {
  const base64EncodedDataPromise = new Promise((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result.split(',')[1]);
    reader.readAsDataURL(file);
  });
  return {
    inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
  };
}

async function run() {
  // Provide a text prompt to include with the audio
  const prompt = "Transcribe what's said in this audio recording.";

  // Prepare audio for input
  const fileInputEl = document.querySelector("input[type=file]");
  const audioPart = await fileToGenerativePart(fileInputEl.files[0]);

  // To generate text output, call `generateContent` with the text and audio
  const result = await model.generateContent([prompt, audioPart]);

  // Log the generated text, handling the case where it might be undefined
  console.log(result.response.text() ?? "No text in response.");
}

run();
Dart
You can call generateContent() to generate text from multimodal input of text and a single audio file.
import 'dart:io';

import 'package:firebase_ai/firebase_ai.dart';
import 'package:firebase_core/firebase_core.dart';
import 'firebase_options.dart';

// Initialize FirebaseApp
await Firebase.initializeApp(
  options: DefaultFirebaseOptions.currentPlatform,
);

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
final model = FirebaseAI.googleAI().generativeModel(model: 'gemini-2.5-flash');

// Provide a text prompt to include with the audio
final prompt = TextPart("Transcribe what's said in this audio recording.");

// Prepare audio for input
final audio = await File('audio0.mp3').readAsBytes();

// Provide the audio as `Data` with the appropriate audio MIME type
final audioPart = InlineDataPart('audio/mpeg', audio);

// To generate text output, call `generateContent` with the text and audio
final response = await model.generateContent([
  Content.multi([prompt, audioPart])
]);

// Print the generated text
print(response.text);
Unity
You can call GenerateContentAsync() to generate text from multimodal input of text and a single audio file.
using Firebase;
using Firebase.AI;

// Initialize the Gemini Developer API backend service
var ai = FirebaseAI.GetInstance(FirebaseAI.Backend.GoogleAI());

// Create a `GenerativeModel` instance with a model that supports your use case
var model = ai.GetGenerativeModel(modelName: "gemini-2.5-flash");

// Provide a text prompt to include with the audio
var prompt = ModelContent.Text("Transcribe what's said in this audio recording.");

// Provide the audio as `data` with the appropriate audio MIME type
var audio = ModelContent.InlineData("audio/mpeg",
      System.IO.File.ReadAllBytes(System.IO.Path.Combine(
          UnityEngine.Application.streamingAssetsPath, "audio0.mp3")));

// To generate text output, call `GenerateContentAsync` with the text and audio
var response = await model.GenerateContentAsync(new [] { prompt, audio });

// Print the generated text
UnityEngine.Debug.Log(response.Text ?? "No text in response.");
Learn how to choose a model appropriate for your use case and app.
Stream the response
Before trying this sample, complete the Before you begin section of this guide
to set up your project and app. In that section, you'll also click a button for your chosen Gemini API provider so that you see provider-specific content
on this page.
You can achieve faster interactions by not waiting for the entire result from
model generation, and instead using streaming to handle partial results.
To stream the response, call generateContentStream.
The following examples show how to stream generated text from audio files.
Swift
You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.
import FirebaseAI

// Initialize the Gemini Developer API backend service
let ai = FirebaseAI.firebaseAI(backend: .googleAI())

// Create a `GenerativeModel` instance with a model that supports your use case
let model = ai.generativeModel(modelName: "gemini-2.5-flash")

// Provide the audio as `Data`
guard let audioData = try? Data(contentsOf: audioURL) else {
  print("Error loading audio data.")
  return  // Or handle the error appropriately
}

// Specify the appropriate audio MIME type
let audio = InlineDataPart(data: audioData, mimeType: "audio/mpeg")

// Provide a text prompt to include with the audio
let prompt = "Transcribe what's said in this audio recording."

// To stream generated text output, call `generateContentStream` with the audio and text prompt
let contentStream = try model.generateContentStream(audio, prompt)

// Print the generated text, handling the case where it might be nil
for try await chunk in contentStream {
  if let text = chunk.text {
    print(text)
  }
}
Kotlin
You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.
For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.
// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
val model = Firebase.ai(backend = GenerativeBackend.googleAI())
                        .generativeModel("gemini-2.5-flash")

val contentResolver = applicationContext.contentResolver

val inputStream = contentResolver.openInputStream(audioUri)

if (inputStream != null) {  // Check if the audio loaded successfully
    inputStream.use { stream ->
        val bytes = stream.readBytes()

        // Provide a prompt that includes the audio specified above and text
        val prompt = content {
            inlineData(bytes, "audio/mpeg")  // Specify the appropriate audio MIME type
            text("Transcribe what's said in this audio recording.")
        }

        // To stream generated text output, call `generateContentStream` with the prompt
        var fullResponse = ""
        model.generateContentStream(prompt).collect { chunk ->
            // Log the generated text, handling the case where it might be null
            Log.d(TAG, chunk.text ?: "")
            fullResponse += chunk.text ?: ""
        }
    }
} else {
    Log.e(TAG, "Error getting input stream for audio.")
    // Handle the error appropriately
}
Java
You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.
For Java, the streaming methods in this SDK return a Publisher type from the Reactive Streams library.
// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
GenerativeModel ai = FirebaseAI.getInstance(GenerativeBackend.googleAI())
        .generativeModel("gemini-2.5-flash");

// Use the GenerativeModelFutures Java compatibility layer which offers
// support for ListenableFuture and Publisher APIs
GenerativeModelFutures model = GenerativeModelFutures.from(ai);

ContentResolver resolver = getApplicationContext().getContentResolver();

try (InputStream stream = resolver.openInputStream(audioUri)) {
    File audioFile = new File(new URI(audioUri.toString()));
    int audioSize = (int) audioFile.length();
    byte[] audioBytes = new byte[audioSize];
    if (stream != null) {
        stream.read(audioBytes, 0, audioBytes.length);
        stream.close();

        // Provide a prompt that includes the audio specified above and text
        Content prompt = new Content.Builder()
              .addInlineData(audioBytes, "audio/mpeg")  // Specify the appropriate audio MIME type
              .addText("Transcribe what's said in this audio recording.")
              .build();

        // To stream generated text output, call `generateContentStream` with the prompt
        Publisher<GenerateContentResponse> streamingResponse =
              model.generateContentStream(prompt);

        StringBuilder fullResponse = new StringBuilder();

        streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
            @Override
            public void onNext(GenerateContentResponse generateContentResponse) {
                String chunk = generateContentResponse.getText();
                String text = (chunk == null) ? "" : chunk;
                Log.d(TAG, text);
                fullResponse.append(text);
            }

            @Override
            public void onComplete() {
                Log.d(TAG, fullResponse.toString());
            }

            @Override
            public void onError(Throwable t) {
                Log.e(TAG, "Failed to generate a response", t);
            }

            @Override
            public void onSubscribe(Subscription s) {
            }
        });
    } else {
        Log.e(TAG, "Error getting input stream for file.");
        // Handle the error appropriately
    }
} catch (IOException e) {
    Log.e(TAG, "Failed to read the audio file", e);
} catch (URISyntaxException e) {
    Log.e(TAG, "Invalid audio file", e);
}
Web
You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.
import { initializeApp } from "firebase/app";
import { getAI, getGenerativeModel, GoogleAIBackend } from "firebase/ai";

// TODO(developer) Replace the following with your app's Firebase configuration
// See: https://firebase.google.com/docs/web/learn-more#config-object
const firebaseConfig = {
  // ...
};

// Initialize FirebaseApp
const firebaseApp = initializeApp(firebaseConfig);

// Initialize the Gemini Developer API backend service
const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });

// Create a `GenerativeModel` instance with a model that supports your use case
const model = getGenerativeModel(ai, { model: "gemini-2.5-flash" });

// Converts a File object to a Part object.
async function fileToGenerativePart(file) {
  const base64EncodedDataPromise = new Promise((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result.split(',')[1]);
    reader.readAsDataURL(file);
  });
  return {
    inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
  };
}

async function run() {
  // Provide a text prompt to include with the audio
  const prompt = "Transcribe what's said in this audio recording.";

  // Prepare audio for input
  const fileInputEl = document.querySelector("input[type=file]");
  const audioPart = await fileToGenerativePart(fileInputEl.files[0]);

  // To stream generated text output, call `generateContentStream` with the text and audio
  const result = await model.generateContentStream([prompt, audioPart]);

  // Log the generated text
  for await (const chunk of result.stream) {
    const chunkText = chunk.text();
    console.log(chunkText);
  }
}

run();
Dart
You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.
import 'dart:io';

import 'package:firebase_ai/firebase_ai.dart';
import 'package:firebase_core/firebase_core.dart';
import 'firebase_options.dart';

// Initialize FirebaseApp
await Firebase.initializeApp(
  options: DefaultFirebaseOptions.currentPlatform,
);

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
final model = FirebaseAI.googleAI().generativeModel(model: 'gemini-2.5-flash');

// Provide a text prompt to include with the audio
final prompt = TextPart("Transcribe what's said in this audio recording.");

// Prepare audio for input
final audio = await File('audio0.mp3').readAsBytes();

// Provide the audio as `Data` with the appropriate audio MIME type
final audioPart = InlineDataPart('audio/mpeg', audio);

// To stream generated text output, call `generateContentStream` with the text and audio
final response = model.generateContentStream([
  Content.multi([prompt, audioPart])
]);

// Print the generated text
await for (final chunk in response) {
  print(chunk.text);
}
Unity
You can call GenerateContentStreamAsync() to stream generated text from multimodal input of text and a single audio file.
using Firebase;
using Firebase.AI;

// Initialize the Gemini Developer API backend service
var ai = FirebaseAI.GetInstance(FirebaseAI.Backend.GoogleAI());

// Create a `GenerativeModel` instance with a model that supports your use case
var model = ai.GetGenerativeModel(modelName: "gemini-2.5-flash");

// Provide a text prompt to include with the audio
var prompt = ModelContent.Text("Transcribe what's said in this audio recording.");

// Provide the audio as `data` with the appropriate audio MIME type
var audio = ModelContent.InlineData("audio/mpeg",
      System.IO.File.ReadAllBytes(System.IO.Path.Combine(
          UnityEngine.Application.streamingAssetsPath, "audio0.mp3")));

// To stream generated text output, call `GenerateContentStreamAsync` with the text and audio
var responseStream = model.GenerateContentStreamAsync(new [] { prompt, audio });

// Print the generated text
await foreach (var response in responseStream) {
  if (!string.IsNullOrWhiteSpace(response.Text)) {
    UnityEngine.Debug.Log(response.Text);
  }
}
Learn how to choose a model appropriate for your use case and app.
Requirements and recommendations for input audio files
Note that a file provided as inline data is encoded to base64 in transit, which
increases the size of the request. You get an HTTP 413 error if a request is
too large.
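Because base64 encoding inflates the payload by roughly a third, it can help to estimate the encoded size before sending a request. The following Swift sketch assumes the audioData value from the Swift example above; the 20 MB ceiling used here is an assumed example value, so check the current request-size limits for your chosen Gemini API provider.

// Minimal sketch: estimate the base64-encoded size of audio data before
// sending it inline. The 20 MB limit below is an assumed example value;
// check the current request-size limits for your chosen Gemini API provider.
let assumedMaxRequestBytes = 20 * 1024 * 1024
let base64EncodedSize = ((audioData.count + 2) / 3) * 4  // base64 inflates data by ~4/3

if base64EncodedSize >= assumedMaxRequestBytes {
  print("Audio is likely too large to send inline; consider Cloud Storage for Firebase.")
}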
See "Supported input files and requirements" page to learn detailed information
about the following:
Gemini multimodal models support the following audio MIME types (a helper sketch for choosing a MIME type follows this list):
AAC - audio/aac
FLAC - audio/flac
MP3 - audio/mp3
MPA - audio/m4a
MPEG - audio/mpeg
MPGA - audio/mpga
MP4 - audio/mp4
OPUS - audio/opus
PCM - audio/pcm
WAV - audio/wav
WEBM - audio/webm
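The code examples on this page hardcode audio/mpeg, but in an app you often need to pick the MIME type from the file the user selected. This is a minimal, hypothetical Swift helper based on the list above; the extension-to-type mapping is an assumption covering common cases, not an exhaustive or official mapping.

// Hypothetical helper: map a file extension to one of the audio MIME types
// listed above. Covers common cases only; adjust as needed.
func audioMimeType(forExtension ext: String) -> String? {
  switch ext.lowercased() {
  case "aac":  return "audio/aac"
  case "flac": return "audio/flac"
  case "mp3":  return "audio/mp3"
  case "m4a":  return "audio/m4a"
  case "mpga": return "audio/mpga"
  case "mp4":  return "audio/mp4"
  case "opus": return "audio/opus"
  case "wav":  return "audio/wav"
  case "webm": return "audio/webm"
  default:     return nil
  }
}

// Example: pick the MIME type from the selected file's URL before building
// the inline-data part.
// let mimeType = audioMimeType(forExtension: audioURL.pathExtension) ?? "audio/mpeg"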
Limits per request
Maximum files per request: 1 audio file
What else can you do?
Learn how to count tokens before sending long prompts to the model (see the sketch after this list).
Set up Cloud Storage for Firebase so that you can include large files in your multimodal requests and have a
more managed solution for providing files in prompts.
Files can include images, PDFs, video, and audio.
Start thinking about preparing for production (see the production checklist).
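As a rough illustration of counting tokens before sending a request, here is a minimal Swift sketch. It assumes the model, audio, and prompt values from the Swift example earlier on this page; check the countTokens reference for your SDK version for the exact API.

// Minimal sketch: count the tokens that the audio and text prompt will use
// before calling `generateContent`. Assumes `model`, `audio`, and `prompt`
// from the Swift example earlier on this page.
let tokenCount = try await model.countTokens(audio, prompt)
print("Total tokens in request: \(tokenCount.totalTokens)")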
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-05 UTC."],[],[],null,[]]