Analyze audio files using the Gemini API

You can ask a Gemini model to analyze audio files that you provide either inline (base64-encoded) or via URL. When you use Firebase AI Logic, you can make this request directly from your app.

With this capability, you can do things like:

  • Describe, summarize, or answer questions about audio content
  • Transcribe audio content
  • Analyze specific segments of audio using timestamps
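As an illustration of the timestamp use case above, here is a minimal sketch that builds a text prompt targeting one segment of an audio file. It assumes the model understands MM:SS timestamps written into the prompt text; the helper names `toTimestamp` and `buildSegmentPrompt` are hypothetical and not part of any SDK.

```javascript
// Format a number of seconds as MM:SS (for example, 75 becomes "01:15").
function toTimestamp(totalSeconds) {
  if (!Number.isInteger(totalSeconds) || totalSeconds < 0) {
    throw new RangeError("totalSeconds must be a non-negative integer");
  }
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${String(minutes).padStart(2, "0")}:${String(seconds).padStart(2, "0")}`;
}

// Build a text prompt that asks about a single segment of the audio.
// Assumption: the model interprets MM:SS timestamps in the prompt text.
function buildSegmentPrompt(startSeconds, endSeconds) {
  if (endSeconds <= startSeconds) {
    throw new RangeError("endSeconds must be greater than startSeconds");
  }
  return `Transcribe the audio from ${toTimestamp(startSeconds)} to ${toTimestamp(endSeconds)}.`;
}

console.log(buildSegmentPrompt(150, 209));
// Transcribe the audio from 02:30 to 03:29.
```

You would pass the resulting string as the text prompt alongside the audio part in any of the samples on this page.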



See other guides for additional options for working with audio: Generate structured output, Multi-turn chat, and Bidirectional streaming.

Before you begin


If you haven't already, complete the getting started guide, which describes how to set up your Firebase project, connect your app to Firebase, add the SDK, initialize the backend service for your chosen Gemini API provider, and create a GenerativeModel instance.

For testing and iterating on your prompts, and even getting a generated code snippet, we recommend using Google AI Studio.

Need a sample audio file?

You can use this publicly available file, which has a MIME type of audio/mp3: https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/pixel.mp3

Generate text from audio files (base64-encoded)

Before trying this sample, complete the Before you begin section of this guide to set up your project and app.

You can ask a Gemini model to generate text by prompting with text and audio. For the audio, you provide the input file's mimeType and the file itself. Find requirements and recommendations for input files later on this page.
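To make the shape of the audio input concrete, the following sketch builds an inline-data part (base64 data plus `mimeType`) from raw bytes. It assumes a Node.js runtime, where `Buffer` is available, and mirrors the `{ inlineData: { data, mimeType } }` object used in the Web sample on this page. The function name `toInlineDataPart` is hypothetical.

```javascript
// Build an inline-data part from raw bytes (a Node.js Buffer or Uint8Array).
// Assumption: runs in Node.js, where the global Buffer class is available.
function toInlineDataPart(bytes, mimeType) {
  if (!mimeType.startsWith("audio/")) {
    throw new TypeError(`Expected an audio MIME type, got "${mimeType}"`);
  }
  return {
    inlineData: {
      // The file content travels as a base64-encoded string.
      data: Buffer.from(bytes).toString("base64"),
      mimeType,
    },
  };
}

// Three placeholder bytes standing in for real audio data.
const part = toInlineDataPart(Uint8Array.of(0x49, 0x44, 0x33), "audio/mpeg");
console.log(part.inlineData.mimeType, part.inlineData.data);
```

In a browser, you would typically read the file with `FileReader` instead, as the Web sample does.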

Swift

You can call generateContent() to generate text from multimodal input of text and a single audio file.

    import FirebaseAI

    // Initialize the Gemini Developer API backend service
    let ai = FirebaseAI.firebaseAI(backend: .googleAI())

    // Create a `GenerativeModel` instance with a model that supports your use case
    let model = ai.generativeModel(modelName: "gemini-2.5-flash")

    // Provide the audio as `Data`
    guard let audioData = try? Data(contentsOf: audioURL) else {
      print("Error loading audio data.")
      return // Or handle the error appropriately
    }

    // Specify the appropriate audio MIME type
    let audio = InlineDataPart(data: audioData, mimeType: "audio/mpeg")

    // Provide a text prompt to include with the audio
    let prompt = "Transcribe what's said in this audio recording."

    // To generate text output, call `generateContent` with the audio and text prompt
    let response = try await model.generateContent(audio, prompt)

    // Print the generated text, handling the case where it might be nil
    print(response.text ?? "No text in response.")
 

Kotlin

You can call generateContent() to generate text from multimodal input of text and a single audio file.

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
    // Initialize the Gemini Developer API backend service
    // Create a `GenerativeModel` instance with a model that supports your use case
    val model = Firebase.ai(backend = GenerativeBackend.googleAI())
        .generativeModel("gemini-2.5-flash")

    val contentResolver = applicationContext.contentResolver

    val inputStream = contentResolver.openInputStream(audioUri)

    if (inputStream != null) { // Check if the audio loaded successfully
        inputStream.use { stream ->
            val bytes = stream.readBytes()

            // Provide a prompt that includes the audio specified above and text
            val prompt = content {
                inlineData(bytes, "audio/mpeg") // Specify the appropriate audio MIME type
                text("Transcribe what's said in this audio recording.")
            }

            // To generate text output, call `generateContent` with the prompt
            val response = model.generateContent(prompt)

            // Log the generated text, handling the case where it might be null
            Log.d(TAG, response.text ?: "")
        }
    } else {
        Log.e(TAG, "Error getting input stream for audio.")
        // Handle the error appropriately
    }
 

Java

You can call generateContent() to generate text from multimodal input of text and a single audio file.

For Java, the methods in this SDK return a ListenableFuture.
    // Initialize the Gemini Developer API backend service
    // Create a `GenerativeModel` instance with a model that supports your use case
    GenerativeModel ai = FirebaseAI.getInstance(GenerativeBackend.googleAI())
            .generativeModel("gemini-2.5-flash");

    // Use the GenerativeModelFutures Java compatibility layer which offers
    // support for ListenableFuture and Publisher APIs
    GenerativeModelFutures model = GenerativeModelFutures.from(ai);

    ContentResolver resolver = getApplicationContext().getContentResolver();

    try (InputStream stream = resolver.openInputStream(audioUri)) {
        File audioFile = new File(new URI(audioUri.toString()));
        int audioSize = (int) audioFile.length();
        byte[] audioBytes = new byte[audioSize];
        if (stream != null) {
            stream.read(audioBytes, 0, audioBytes.length);
            stream.close();

            // Provide a prompt that includes the audio specified above and text
            Content prompt = new Content.Builder()
                    .addInlineData(audioBytes, "audio/mpeg") // Specify the appropriate audio MIME type
                    .addText("Transcribe what's said in this audio recording.")
                    .build();

            // To generate text output, call `generateContent` with the prompt
            ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
            Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
                @Override
                public void onSuccess(GenerateContentResponse result) {
                    String text = result.getText();
                    Log.d(TAG, (text == null) ? "" : text);
                }

                @Override
                public void onFailure(Throwable t) {
                    Log.e(TAG, "Failed to generate a response", t);
                }
            }, executor);
        } else {
            Log.e(TAG, "Error getting input stream for file.");
            // Handle the error appropriately
        }
    } catch (IOException e) {
        Log.e(TAG, "Failed to read the audio file", e);
    } catch (URISyntaxException e) {
        Log.e(TAG, "Invalid audio file", e);
    }
 

Web

You can call generateContent() to generate text from multimodal input of text and a single audio file.

    import { initializeApp } from "firebase/app";
    import { getAI, getGenerativeModel, GoogleAIBackend } from "firebase/ai";

    // TODO(developer) Replace the following with your app's Firebase configuration
    // See: https://firebase.google.com/docs/web/learn-more#config-object
    const firebaseConfig = {
      // ...
    };

    // Initialize FirebaseApp
    const firebaseApp = initializeApp(firebaseConfig);

    // Initialize the Gemini Developer API backend service
    const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });

    // Create a `GenerativeModel` instance with a model that supports your use case
    const model = getGenerativeModel(ai, { model: "gemini-2.5-flash" });

    // Converts a File object to a Part object.
    async function fileToGenerativePart(file) {
      const base64EncodedDataPromise = new Promise((resolve) => {
        const reader = new FileReader();
        // The data URL looks like "data:<mime>;base64,<data>"; keep only the data
        reader.onloadend = () => resolve(reader.result.split(',')[1]);
        reader.readAsDataURL(file);
      });
      return {
        inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
      };
    }

    async function run() {
      // Provide a text prompt to include with the audio
      const prompt = "Transcribe what's said in this audio recording.";

      // Prepare audio for input (the first file selected in the file input)
      const fileInputEl = document.querySelector("input[type=file]");
      const audioPart = await fileToGenerativePart(fileInputEl.files[0]);

      // To generate text output, call `generateContent` with the text and audio
      const result = await model.generateContent([prompt, audioPart]);

      // Log the generated text, handling the case where it might be undefined
      console.log(result.response.text() ?? "No text in response.");
    }

    run();
 

Dart

You can call generateContent() to generate text from multimodal input of text and a single audio file.

    import 'dart:io'; // Provides `File`

    import 'package:firebase_ai/firebase_ai.dart';
    import 'package:firebase_core/firebase_core.dart';
    import 'firebase_options.dart';

    // Initialize FirebaseApp
    await Firebase.initializeApp(
      options: DefaultFirebaseOptions.currentPlatform,
    );

    // Initialize the Gemini Developer API backend service
    // Create a `GenerativeModel` instance with a model that supports your use case
    final model =
        FirebaseAI.googleAI().generativeModel(model: 'gemini-2.5-flash');

    // Provide a text prompt to include with the audio
    final prompt = TextPart("Transcribe what's said in this audio recording.");

    // Prepare audio for input
    final audio = await File('audio0.mp3').readAsBytes();

    // Provide the audio as `Data` with the appropriate audio MIME type
    final audioPart = InlineDataPart('audio/mpeg', audio);

    // To generate text output, call `generateContent` with the text and audio
    final response = await model.generateContent([
      Content.multi([prompt, audioPart])
    ]);

    // Print the generated text
    print(response.text);
 

Unity

You can call GenerateContentAsync() to generate text from multimodal input of text and a single audio file.

    using Firebase;
    using Firebase.AI;

    // Initialize the Gemini Developer API backend service
    var ai = FirebaseAI.GetInstance(FirebaseAI.Backend.GoogleAI());

    // Create a `GenerativeModel` instance with a model that supports your use case
    var model = ai.GetGenerativeModel(modelName: "gemini-2.5-flash");

    // Provide a text prompt to include with the audio
    var prompt = ModelContent.Text("Transcribe what's said in this audio recording.");

    // Provide the audio as `data` with the appropriate audio MIME type
    var audio = ModelContent.InlineData("audio/mpeg",
        System.IO.File.ReadAllBytes(System.IO.Path.Combine(
            UnityEngine.Application.streamingAssetsPath, "audio0.mp3")));

    // To generate text output, call `GenerateContentAsync` with the text and audio
    var response = await model.GenerateContentAsync(new [] { prompt, audio });

    // Print the generated text
    UnityEngine.Debug.Log(response.Text ?? "No text in response.");
 

Learn how to choose a model appropriate for your use case and app.

Stream the response


You can achieve faster interactions by using streaming to handle partial results as they arrive, instead of waiting for the entire result from the model generation. To stream the response, call generateContentStream.
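The general pattern for handling partial results can be sketched without the SDK. The example below assumes each chunk exposes a `text()` method, as the chunks in the Web sample on this page do. The generator `fakeChunkStream` is a hypothetical stand-in for the real stream, and `collectStream` is an illustrative helper, not an SDK function.

```javascript
// Hypothetical stand-in for the stream of partial results.
// Each chunk exposes text(), mirroring the chunks in the Web sample.
async function* fakeChunkStream() {
  for (const piece of ["Hello", ", ", "world."]) {
    yield { text: () => piece };
  }
}

// Handle each partial result as soon as it arrives, and also
// accumulate the pieces so the full text is available at the end.
async function collectStream(stream, onChunk) {
  let fullText = "";
  for await (const chunk of stream) {
    const text = chunk.text() ?? "";
    onChunk(text);
    fullText += text;
  }
  return fullText;
}

collectStream(fakeChunkStream(), (text) => console.log("chunk:", text))
  .then((fullText) => console.log("full:", fullText));
```

Updating the UI inside the `onChunk` callback is what makes the interaction feel faster, because the user sees text before generation finishes.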


Swift

You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.

    import FirebaseAI

    // Initialize the Gemini Developer API backend service
    let ai = FirebaseAI.firebaseAI(backend: .googleAI())

    // Create a `GenerativeModel` instance with a model that supports your use case
    let model = ai.generativeModel(modelName: "gemini-2.5-flash")

    // Provide the audio as `Data`
    guard let audioData = try? Data(contentsOf: audioURL) else {
      print("Error loading audio data.")
      return // Or handle the error appropriately
    }

    // Specify the appropriate audio MIME type
    let audio = InlineDataPart(data: audioData, mimeType: "audio/mpeg")

    // Provide a text prompt to include with the audio
    let prompt = "Transcribe what's said in this audio recording."

    // To stream generated text output, call `generateContentStream` with the audio and text prompt
    let contentStream = try model.generateContentStream(audio, prompt)

    // Print the generated text, handling the case where it might be nil
    for try await chunk in contentStream {
      if let text = chunk.text {
        print(text)
      }
    }
 

Kotlin

You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
    // Initialize the Gemini Developer API backend service
    // Create a `GenerativeModel` instance with a model that supports your use case
    val model = Firebase.ai(backend = GenerativeBackend.googleAI())
        .generativeModel("gemini-2.5-flash")

    val contentResolver = applicationContext.contentResolver

    val inputStream = contentResolver.openInputStream(audioUri)

    if (inputStream != null) { // Check if the audio loaded successfully
        inputStream.use { stream ->
            val bytes = stream.readBytes()

            // Provide a prompt that includes the audio specified above and text
            val prompt = content {
                inlineData(bytes, "audio/mpeg") // Specify the appropriate audio MIME type
                text("Transcribe what's said in this audio recording.")
            }

            // To stream generated text output, call `generateContentStream` with the prompt
            var fullResponse = ""
            model.generateContentStream(prompt).collect { chunk ->
                // Log the generated text, handling the case where it might be null
                Log.d(TAG, chunk.text ?: "")
                fullResponse += chunk.text ?: ""
            }
        }
    } else {
        Log.e(TAG, "Error getting input stream for audio.")
        // Handle the error appropriately
    }
 

Java

You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.

For Java, the streaming methods in this SDK return a Publisher type from the Reactive Streams library.
    // Initialize the Gemini Developer API backend service
    // Create a `GenerativeModel` instance with a model that supports your use case
    GenerativeModel ai = FirebaseAI.getInstance(GenerativeBackend.googleAI())
            .generativeModel("gemini-2.5-flash");

    // Use the GenerativeModelFutures Java compatibility layer which offers
    // support for ListenableFuture and Publisher APIs
    GenerativeModelFutures model = GenerativeModelFutures.from(ai);

    ContentResolver resolver = getApplicationContext().getContentResolver();

    try (InputStream stream = resolver.openInputStream(audioUri)) {
        File audioFile = new File(new URI(audioUri.toString()));
        int audioSize = (int) audioFile.length();
        byte[] audioBytes = new byte[audioSize];
        if (stream != null) {
            stream.read(audioBytes, 0, audioBytes.length);
            stream.close();

            // Provide a prompt that includes the audio specified above and text
            Content prompt = new Content.Builder()
                    .addInlineData(audioBytes, "audio/mpeg") // Specify the appropriate audio MIME type
                    .addText("Transcribe what's said in this audio recording.")
                    .build();

            // To stream generated text output, call `generateContentStream` with the prompt
            Publisher<GenerateContentResponse> streamingResponse =
                    model.generateContentStream(prompt);

            StringBuilder fullResponse = new StringBuilder();

            streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
                @Override
                public void onNext(GenerateContentResponse generateContentResponse) {
                    String chunk = generateContentResponse.getText();
                    String text = (chunk == null) ? "" : chunk;
                    Log.d(TAG, text);
                    fullResponse.append(text);
                }

                @Override
                public void onComplete() {
                    Log.d(TAG, fullResponse.toString());
                }

                @Override
                public void onError(Throwable t) {
                    Log.e(TAG, "Failed to generate a response", t);
                }

                @Override
                public void onSubscribe(Subscription s) {
                    // Reactive Streams publishers emit nothing until demand is signaled
                    s.request(Long.MAX_VALUE);
                }
            });
        } else {
            Log.e(TAG, "Error getting input stream for file.");
            // Handle the error appropriately
        }
    } catch (IOException e) {
        Log.e(TAG, "Failed to read the audio file", e);
    } catch (URISyntaxException e) {
        Log.e(TAG, "Invalid audio file", e);
    }
 

Web

You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.

    import { initializeApp } from "firebase/app";
    import { getAI, getGenerativeModel, GoogleAIBackend } from "firebase/ai";

    // TODO(developer) Replace the following with your app's Firebase configuration
    // See: https://firebase.google.com/docs/web/learn-more#config-object
    const firebaseConfig = {
      // ...
    };

    // Initialize FirebaseApp
    const firebaseApp = initializeApp(firebaseConfig);

    // Initialize the Gemini Developer API backend service
    const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });

    // Create a `GenerativeModel` instance with a model that supports your use case
    const model = getGenerativeModel(ai, { model: "gemini-2.5-flash" });

    // Converts a File object to a Part object.
    async function fileToGenerativePart(file) {
      const base64EncodedDataPromise = new Promise((resolve) => {
        const reader = new FileReader();
        // The data URL looks like "data:<mime>;base64,<data>"; keep only the data
        reader.onloadend = () => resolve(reader.result.split(',')[1]);
        reader.readAsDataURL(file);
      });
      return {
        inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
      };
    }

    async function run() {
      // Provide a text prompt to include with the audio
      const prompt = "Transcribe what's said in this audio recording.";

      // Prepare audio for input (the first file selected in the file input)
      const fileInputEl = document.querySelector("input[type=file]");
      const audioPart = await fileToGenerativePart(fileInputEl.files[0]);

      // To stream generated text output, call `generateContentStream` with the text and audio
      const result = await model.generateContentStream([prompt, audioPart]);

      // Log the generated text
      for await (const chunk of result.stream) {
        const chunkText = chunk.text();
        console.log(chunkText);
      }
    }

    run();
 

Dart

You can call generateContentStream() to stream generated text from multimodal input of text and a single audio file.

    import 'dart:io'; // Provides `File`

    import 'package:firebase_ai/firebase_ai.dart';
    import 'package:firebase_core/firebase_core.dart';
    import 'firebase_options.dart';

    // Initialize FirebaseApp
    await Firebase.initializeApp(
      options: DefaultFirebaseOptions.currentPlatform,
    );

    // Initialize the Gemini Developer API backend service
    // Create a `GenerativeModel` instance with a model that supports your use case
    final model =
        FirebaseAI.googleAI().generativeModel(model: 'gemini-2.5-flash');

    // Provide a text prompt to include with the audio
    final prompt = TextPart("Transcribe what's said in this audio recording.");

    // Prepare audio for input
    final audio = await File('audio0.mp3').readAsBytes();

    // Provide the audio as `Data` with the appropriate audio MIME type
    final audioPart = InlineDataPart('audio/mpeg', audio);

    // To stream generated text output, call `generateContentStream` with the text and audio
    final response = await model.generateContentStream([
      Content.multi([prompt, audioPart])
    ]);

    // Print the generated text
    await for (final chunk in response) {
      print(chunk.text);
    }
 

Unity

You can call GenerateContentStreamAsync() to stream generated text from multimodal input of text and a single audio file.

    using Firebase;
    using Firebase.AI;

    // Initialize the Gemini Developer API backend service
    var ai = FirebaseAI.GetInstance(FirebaseAI.Backend.GoogleAI());

    // Create a `GenerativeModel` instance with a model that supports your use case
    var model = ai.GetGenerativeModel(modelName: "gemini-2.5-flash");

    // Provide a text prompt to include with the audio
    var prompt = ModelContent.Text("Transcribe what's said in this audio recording.");

    // Provide the audio as `data` with the appropriate audio MIME type
    var audio = ModelContent.InlineData("audio/mpeg",
        System.IO.File.ReadAllBytes(System.IO.Path.Combine(
            UnityEngine.Application.streamingAssetsPath, "audio0.mp3")));

    // To stream generated text output, call `GenerateContentStreamAsync` with the text and audio
    var responseStream = model.GenerateContentStreamAsync(new [] { prompt, audio });

    // Print the generated text
    await foreach (var response in responseStream) {
      if (!string.IsNullOrWhiteSpace(response.Text)) {
        UnityEngine.Debug.Log(response.Text);
      }
    }
 

Learn how to choose a model appropriate for your use case and app.



Requirements and recommendations for input audio files

Note that a file provided as inline data is encoded to base64 in transit, which increases the size of the request. You get an HTTP 413 error if a request is too large.
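To estimate how much base64 encoding inflates a file, you can compute the encoded length up front. The sketch below relies only on the standard base64 rule that every 3 input bytes become 4 output characters (with padding). The request size limit is passed in as a parameter because this section doesn't state one, and both function names are hypothetical.

```javascript
// Length of the padded base64 text for a payload of `byteLength` bytes:
// every group of up to 3 input bytes becomes 4 output characters.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Check the encoded size against a caller-supplied limit.
// Assumption: `maxRequestBytes` is whatever limit applies to your provider;
// this ignores the (small) size of the rest of the request.
function fitsInRequest(byteLength, maxRequestBytes) {
  return base64Length(byteLength) <= maxRequestBytes;
}

const fileBytes = 3_000_000;
console.log(base64Length(fileBytes)); // 4000000
console.log(fitsInRequest(fileBytes, 4_000_000)); // true
```

The encoded payload is roughly one third larger than the original file, so a file that is just under the limit on disk can still produce a request that is too large.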

See the "Supported input files and requirements" page for detailed information about input file requirements.

Supported audio MIME types

Gemini multimodal models support the following audio MIME types:

  • AAC - audio/aac
  • FLAC - audio/flac
  • MP3 - audio/mp3
  • MPA - audio/m4a
  • MPEG - audio/mpeg
  • MPGA - audio/mpga
  • MP4 - audio/mp4
  • OPUS - audio/opus
  • PCM - audio/pcm
  • WAV - audio/wav
  • WEBM - audio/webm
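If you accept audio files from users, you may want to check the MIME type before sending a request. The following sketch checks a value against the list above. The set contents come from this page, while the constant and function names are hypothetical.

```javascript
// Audio MIME types listed on this page as supported.
const SUPPORTED_AUDIO_MIME_TYPES = new Set([
  "audio/aac",
  "audio/flac",
  "audio/mp3",
  "audio/m4a",
  "audio/mpeg",
  "audio/mpga",
  "audio/mp4",
  "audio/opus",
  "audio/pcm",
  "audio/wav",
  "audio/webm",
]);

// Returns true when the MIME type is in the supported list.
// The comparison ignores case and surrounding whitespace.
function isSupportedAudioMimeType(mimeType) {
  return SUPPORTED_AUDIO_MIME_TYPES.has(mimeType.trim().toLowerCase());
}

console.log(isSupportedAudioMimeType("audio/mpeg")); // true
console.log(isSupportedAudioMimeType("audio/ogg")); // false
```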

Limits per request

Maximum files per request: 1 audio file
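A client-side guard can catch a request that exceeds this limit before it is sent. This is a minimal sketch that assumes parts are shaped like the Web sample on this page, where text parts are plain strings and audio parts are `{ inlineData: { data, mimeType } }` objects. The function names are hypothetical.

```javascript
// Count the parts that carry inline audio data.
// Plain strings (text prompts) and non-audio parts are ignored.
function countAudioParts(parts) {
  return parts.filter(
    (part) =>
      typeof part === "object" &&
      part !== null &&
      typeof part.inlineData?.mimeType === "string" &&
      part.inlineData.mimeType.startsWith("audio/")
  ).length;
}

// Throw if the request would include more than one audio file.
function assertSingleAudioFile(parts) {
  const count = countAudioParts(parts);
  if (count > 1) {
    throw new RangeError(`Expected at most 1 audio file per request, got ${count}`);
  }
}

const audioPart = { inlineData: { data: "SUQz", mimeType: "audio/mpeg" } };
assertSingleAudioFile(["Transcribe this recording.", audioPart]); // passes
```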



What else can you do?

Try out other capabilities

Learn how to control content generation

You can also experiment with prompts and model configurations, and even get a generated code snippet, using Google AI Studio.

Learn more about the supported models

Learn about the models available for various use cases and their quotas and pricing.



