Send audio and video streams

This document describes how to send audio and video streams to the Live API for real-time, bidirectional communication with Gemini models. Learn how to configure and transmit audio and video data to build dynamic and interactive applications.

Send audio streams

Implementing real-time audio requires strict adherence to sample rate specifications and careful buffer management to ensure low latency and natural interruptibility.

The Live API supports the following audio formats:

  • Input audio: Raw 16-bit PCM audio at 16 kHz, little-endian
  • Output audio: Raw 16-bit PCM audio at 24 kHz, little-endian
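
Microphone capture libraries often deliver floating-point samples, so a conversion step is usually needed before sending. A minimal sketch, assuming float input in the range [-1.0, 1.0]; the helper name is ours, not part of the API:

```python
import numpy as np

def to_pcm16(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1.0, 1.0] to raw 16-bit little-endian PCM."""
    clipped = np.clip(samples, -1.0, 1.0)
    # "<i2" = little-endian signed 16-bit, matching the Live API input format
    return (clipped * 32767.0).astype("<i2").tobytes()

# Three samples become six bytes (2 bytes per 16-bit sample)
chunk_data = to_pcm16(np.array([0.0, 0.5, -0.5]))
```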

The following code sample shows you how to send streaming audio data:

```python
import asyncio

from google.genai import types

# Assumes session is an active Live API session
# and chunk_data contains bytes of raw 16-bit PCM audio at 16 kHz.

# Send audio input data in chunks
await session.send_realtime_input(
    audio=types.Blob(data=chunk_data, mime_type="audio/pcm;rate=16000")
)
```
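The snippet above sends a single chunk; a live client typically slices its capture buffer into short fixed-duration chunks and sends them in a loop. A sketch of the slicing step (the 50 ms chunk size is an assumption, not an API requirement):

```python
RATE = 16000         # required input sample rate (Hz)
SAMPLE_BYTES = 2     # 16-bit PCM
CHUNK_MS = 50        # assumed chunk duration
CHUNK_BYTES = RATE * SAMPLE_BYTES * CHUNK_MS // 1000  # 1600 bytes

def iter_chunks(pcm: bytes, size: int = CHUNK_BYTES):
    """Yield fixed-size slices of a raw PCM buffer for streaming."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Each yielded slice can then be wrapped in a types.Blob exactly as shown above.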

The server streams audio in chunks within server_content messages; the client is responsible for decoding, buffering, and playing that data, which means maintaining a playback buffer.

The following code sample shows you how to process streaming audio data:

```python
import asyncio

import numpy as np

# Assumes session is an active Live API session
# and audio_queue is an asyncio.Queue for buffering audio for playback.
async for msg in session.receive():
    server_content = msg.server_content
    if server_content:
        # 1. Handle interruption
        if server_content.interrupted:
            print("\n[Interrupted] Flushing buffer...")
            # Clear the Python queue
            while not audio_queue.empty():
                try:
                    audio_queue.get_nowait()
                except asyncio.QueueEmpty:
                    break
            # Send signal to worker to reset hardware buffers if needed
            await audio_queue.put(None)
            continue
        # 2. Process audio chunks
        if server_content.model_turn:
            for part in server_content.model_turn.parts:
                if part.inline_data:
                    # Add PCM data to playback queue
                    await audio_queue.put(
                        np.frombuffer(part.inline_data.data, dtype="int16")
                    )
```

Send video streams

Video streaming provides visual context. The Live API expects a sequence of discrete image frames and supports video frame input at 1 frame per second (FPS). For best results, send frames at the native 768x768 resolution at 1 FPS.

The following code sample shows you how to send streaming video data:

```python
import asyncio

from google.genai import types

# Assumes session is an active Live API session
# and chunk_data contains bytes of a JPEG image.

# Send video input data in chunks
await session.send_realtime_input(
    media=types.Blob(data=chunk_data, mime_type="image/jpeg")
)
```

The client implementation captures a frame from the video feed, encodes it as a JPEG blob, and transmits it using the realtime_input message structure.

```python
import asyncio

import cv2
from google.genai import types

async def send_video_stream(session):
    # Open webcam
    cap = cv2.VideoCapture(0)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # 1. Resize to optimal resolution (768x768 max)
        frame = cv2.resize(frame, (768, 768))
        # 2. Encode as JPEG
        _, buffer = cv2.imencode(".jpg", frame)
        # 3. Send as realtime input
        await session.send_realtime_input(
            media=types.Blob(data=buffer.tobytes(), mime_type="image/jpeg")
        )
        # 4. Wait 1 second (1 FPS)
        await asyncio.sleep(1.0)
    cap.release()
```
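Note that resizing a non-square webcam frame directly to (768, 768) stretches the image. If distortion matters for your use case, pad the frame to a square instead. A numpy-only sketch (the helper name is ours, and in practice you would use cv2.resize for the scaling step rather than nearest-neighbour indexing):

```python
import numpy as np

def resize_letterbox(frame: np.ndarray, size: int = 768) -> np.ndarray:
    """Scale the longer side to `size` and pad the rest with black."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(h * scale), int(w * scale)
    # Nearest-neighbour resize via index lookup (stand-in for cv2.resize)
    rows = (np.arange(new_h) / scale).astype(int)
    cols = (np.arange(new_w) / scale).astype(int)
    resized = frame[rows][:, cols]
    # Centre the resized frame on a black square canvas
    out = np.zeros((size, size) + frame.shape[2:], dtype=frame.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```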

Configure media resolution

You can specify the resolution for input media by setting the media_resolution field in the session configuration. Lower resolution reduces token usage and latency, while higher resolution improves detail recognition. Supported values include low, medium, and high.

```python
config = {
    "response_modalities": ["audio"],
    "media_resolution": "low",
}
```
