Transcribe audio with multiple channels

This page describes how to use Speech-to-Text to transcribe audio files that include more than one channel. Multi-channel recognition is available for most, but not all, audio encodings supported by Speech-to-Text. For information about how many channels are recognized in audio files of each encoding type, see audioChannelCount .

Audio data usually includes a channel for each speaker present on the recording. For example, audio of two people talking over the phone might contain two channels, where each line is recorded separately.

To transcribe audio data that includes multiple channels, you must provide the number of channels in your request to the Speech-to-Text API. In your request, set the audioChannelCount field in your request to the number of channels present in your audio.

When you send a request with multiple channels, Speech-to-Text returns a result to you that identifies the different channels present in the audio, labeling the alternatives for each result with the channelTag field.

The following code sample demonstrates how to transcribe audio that contains multiple channels.

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl . The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart .

The following example show how to send a POST request using curl , where the body of the request specifies the number of channels present on the audio sample.

curl  
-X  
POST  
-H  
 "Authorization: Bearer 
 $( 
gcloud  
auth  
application-default  
print-access-token ) 
 " 
  
 \ 
  
-H  
 "Content-Type: application/json; charset=utf-8" 
  
 \ 
  
--data  
 '{ 
 "config": { 
 "encoding": "LINEAR16", 
 "languageCode": "en-US", 
  "audioChannelCount": 2, 
  "enableSeparateRecognitionPerChannel": true 
 }, 
 "audio": { 
 "uri": "gs://cloud-samples-tests/speech/commercial_stereo.wav" 
 } 
 }' 
  
 "https://speech.googleapis.com/v1/speech:recognize" 
 > 
multi-channel.txt

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format, saved to a file named multi-channel.json .

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hi I'd like to buy a Chromecast I'm always wondering whether you could help me with that",
          "confidence": 0.8991147
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": "certainly which color would you like we have blue black and red",
          "confidence": 0.9408236
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " let's go with the black one",
          "confidence": 0.98783094
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " would you like the new Chromecast Ultra model or the regular Chromecast",
          "confidence": 0.9573053
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " regular Chromecast is fine thank you",
          "confidence": 0.9671048
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " okay sure would you like to ship it regular or Express",
          "confidence": 0.9544821
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " express please",
          "confidence": 0.9487205
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " terrific it's on the way thank you",
          "confidence": 0.97655964
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " thank you very much bye",
          "confidence": 0.9735077
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    }
  ]
}

Go

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries . For more information, see the Speech-to-Text Go API reference documentation .

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment .

  // transcribeMultichannel generates a transcript from a multichannel speech file and tags the speech from each channel. 
 func 
  
 transcribeMultichannel 
 ( 
 w 
  
 io 
 . 
 Writer 
 ) 
  
 error 
  
 { 
  
 ctx 
  
 := 
  
 context 
 . 
 Background 
 () 
  
 client 
 , 
  
 err 
  
 := 
  
 speech 
 . 
 NewClient 
 ( 
 ctx 
 ) 
  
 if 
  
 err 
  
 != 
  
 nil 
  
 { 
  
 return 
  
 fmt 
 . 
 Errorf 
 ( 
 "NewClient: %w" 
 , 
  
 err 
 ) 
  
 } 
  
 defer 
  
 client 
 . 
 Close 
 () 
  
 data 
 , 
  
 err 
  
 := 
  
 os 
 . 
 ReadFile 
 ( 
 "../testdata/commercial_stereo.wav" 
 ) 
  
 if 
  
 err 
  
 != 
  
 nil 
  
 { 
  
 return 
  
 fmt 
 . 
 Errorf 
 ( 
 "ReadFile: %w" 
 , 
  
 err 
 ) 
  
 } 
  
 resp 
 , 
  
 err 
  
 := 
  
 client 
 . 
 Recognize 
 ( 
 ctx 
 , 
  
& speechpb 
 . 
 RecognizeRequest 
 { 
  
 Config 
 : 
  
& speechpb 
 . 
 RecognitionConfig 
 { 
  
 Encoding 
 : 
  
 speechpb 
 . 
 RecognitionConfig_LINEAR16 
 , 
  
 SampleRateHertz 
 : 
  
 44100 
 , 
  
 LanguageCode 
 : 
  
 "en-US" 
 , 
  
 AudioChannelCount 
 : 
  
 2 
 , 
  
 EnableSeparateRecognitionPerChannel 
 : 
  
 true 
 , 
  
 }, 
  
 Audio 
 : 
  
& speechpb 
 . 
 RecognitionAudio 
 { 
  
 AudioSource 
 : 
  
& speechpb 
 . 
 RecognitionAudio_Content 
 { 
 Content 
 : 
  
 data 
 }, 
  
 }, 
  
 }) 
  
 if 
  
 err 
  
 != 
  
 nil 
  
 { 
  
 return 
  
 fmt 
 . 
 Errorf 
 ( 
 "Recognize: %w" 
 , 
  
 err 
 ) 
  
 } 
  
 // Print the results. 
  
 for 
  
 _ 
 , 
  
 result 
  
 := 
  
 range 
  
 resp 
 . 
 Results 
  
 { 
  
 for 
  
 _ 
 , 
  
 alt 
  
 := 
  
 range 
  
 result 
 . 
 Alternatives 
  
 { 
  
 fmt 
 . 
 Fprintf 
 ( 
 w 
 , 
  
 "Channel %v: %v\n" 
 , 
  
 result 
 . 
 ChannelTag 
 , 
  
 alt 
 . 
 Transcript 
 ) 
  
 } 
  
 } 
  
 return 
  
 nil 
 } 
 

Java

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries . For more information, see the Speech-to-Text Java API reference documentation .

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment .

  /** 
 * Transcribe a remote audio file with multi-channel recognition 
 * 
 * @param gcsUri the path to the audio file 
 */ 
 public 
  
 static 
  
 void 
  
 transcribeMultiChannelGcs 
 ( 
 String 
  
 gcsUri 
 ) 
  
 throws 
  
 Exception 
  
 { 
  
 try 
  
 ( 
 SpeechClient 
  
 speechClient 
  
 = 
  
 SpeechClient 
 . 
 create 
 ()) 
  
 { 
  
 // Configure request to enable multiple channels 
  
 RecognitionConfig 
  
 config 
  
 = 
  
 RecognitionConfig 
 . 
 newBuilder 
 () 
  
 . 
 setEncoding 
 ( 
 AudioEncoding 
 . 
 LINEAR16 
 ) 
  
 . 
 setLanguageCode 
 ( 
 "en-US" 
 ) 
  
 . 
 setSampleRateHertz 
 ( 
 44100 
 ) 
  
 . 
 setAudioChannelCount 
 ( 
 2 
 ) 
  
 . 
 setEnableSeparateRecognitionPerChannel 
 ( 
 true 
 ) 
  
 . 
 build 
 (); 
  
 // Set the remote path for the audio file 
  
 RecognitionAudio 
  
 audio 
  
 = 
  
 RecognitionAudio 
 . 
 newBuilder 
 (). 
 setUri 
 ( 
 gcsUri 
 ). 
 build 
 (); 
  
 // Use non-blocking call for getting file transcription 
  
 OperationFuture<LongRunningRecognizeResponse 
 , 
  
 LongRunningRecognizeMetadata 
>  
 response 
  
 = 
  
 speechClient 
 . 
 longRunningRecognizeAsync 
 ( 
 config 
 , 
  
 audio 
 ); 
  
 while 
  
 ( 
 ! 
 response 
 . 
 isDone 
 ()) 
  
 { 
  
 System 
 . 
 out 
 . 
 println 
 ( 
 "Waiting for response..." 
 ); 
  
 Thread 
 . 
 sleep 
 ( 
 10000 
 ); 
  
 } 
  
 // Just print the first result here. 
  
 for 
  
 ( 
 SpeechRecognitionResult 
  
 result 
  
 : 
  
 response 
 . 
 get 
 (). 
 getResultsList 
 ()) 
  
 { 
  
 // There can be several alternative transcripts for a given chunk of speech. Just use the 
  
 // first (most likely) one here. 
  
 SpeechRecognitionAlternative 
  
 alternative 
  
 = 
  
 result 
 . 
 getAlternativesList 
 (). 
 get 
 ( 
 0 
 ); 
  
 // Print out the result 
  
 System 
 . 
 out 
 . 
 printf 
 ( 
 "Transcript : %s\n" 
 , 
  
 alternative 
 . 
 getTranscript 
 ()); 
  
 System 
 . 
 out 
 . 
 printf 
 ( 
 "Channel Tag : %s\n" 
 , 
  
 result 
 . 
 getChannelTag 
 ()); 
  
 } 
  
 } 
 } 
 

Node.js

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries . For more information, see the Speech-to-Text Node.js API reference documentation .

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment .

  const 
  
 speech 
  
 = 
  
 require 
 ( 
 ' @google-cloud/speech 
' 
 ). 
 v1 
 ; 
 // Creates a client 
 const 
  
 client 
  
 = 
  
 new 
  
 speech 
 . 
  SpeechClient 
 
 (); 
 const 
  
 config 
  
 = 
  
 { 
  
 encoding 
 : 
  
 'LINEAR16' 
 , 
  
 languageCode 
 : 
  
 'en-US' 
 , 
  
 audioChannelCount 
 : 
  
 2 
 , 
  
 enableSeparateRecognitionPerChannel 
 : 
  
 true 
 , 
 }; 
 const 
  
 audio 
  
 = 
  
 { 
  
 uri 
 : 
  
 gcsUri 
 , 
 }; 
 const 
  
 request 
  
 = 
  
 { 
  
 config 
 : 
  
 config 
 , 
  
 audio 
 : 
  
 audio 
 , 
 }; 
 const 
  
 [ 
 response 
 ] 
  
 = 
  
 await 
  
 client 
 . 
 recognize 
 ( 
 request 
 ); 
 const 
  
 transcription 
  
 = 
  
 response 
 . 
 results 
  
 . 
 map 
 ( 
  
 result 
  
 = 
>  
 ` Channel Tag: 
 ${ 
 result 
 . 
 channelTag 
 } 
  
 ${ 
 result 
 . 
 alternatives 
 [ 
 0 
 ]. 
 transcript 
 } 
 ` 
  
 ) 
  
 . 
 join 
 ( 
 '\n' 
 ); 
 console 
 . 
 log 
 ( 
 `Transcription: \n 
 ${ 
 transcription 
 } 
 ` 
 ); 
 

Python

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries . For more information, see the Speech-to-Text Python API reference documentation .

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment .

  from 
  
 google.cloud 
  
 import 
 speech 
 def 
  
 transcribe_file_with_multichannel 
 ( 
 audio_file 
 : 
 str 
 ) 
 - 
> speech 
 . 
 RecognizeResponse 
 : 
  
 """Transcribe the given audio file synchronously with multi channel. 
 Args: 
 audio_file (str): Path to the local audio file to be transcribed. 
 Example: "resources/multi.wav" 
 Returns: 
 cloud_speech.RecognizeResponse: The full response object which includes the transcription results. 
 """ 
 client 
 = 
 speech 
 . 
 SpeechClient 
 () 
 with 
 open 
 ( 
 audio_file 
 , 
 "rb" 
 ) 
 as 
 f 
 : 
 audio_content 
 = 
 f 
 . 
 read 
 () 
 audio 
 = 
 speech 
 . 
 RecognitionAudio 
 ( 
 content 
 = 
 audio_content 
 ) 
 config 
 = 
 speech 
 . 
 RecognitionConfig 
 ( 
 encoding 
 = 
 speech 
 . 
 RecognitionConfig 
 . 
 AudioEncoding 
 . 
 LINEAR16 
 , 
 sample_rate_hertz 
 = 
 44100 
 , 
 language_code 
 = 
 "en-US" 
 , 
 audio_channel_count 
 = 
 2 
 , 
 enable_separate_recognition_per_channel 
 = 
 True 
 , 
 ) 
 response 
 = 
 client 
 . 
 recognize 
 ( 
 config 
 = 
 config 
 , 
 audio 
 = 
 audio 
 ) 
 for 
 i 
 , 
 result 
 in 
 enumerate 
 ( 
 response 
 . 
 results 
 ): 
 alternative 
 = 
 result 
 . 
 alternatives 
 [ 
 0 
 ] 
 print 
 ( 
 "-" 
 * 
 20 
 ) 
 print 
 ( 
 f 
 "First alternative of result 
 { 
 i 
 } 
 " 
 ) 
 print 
 ( 
 f 
 "Transcript: 
 { 
 alternative 
 . 
 transcript 
 } 
 " 
 ) 
 print 
 ( 
 f 
 "Channel Tag: 
 { 
 result 
 . 
 channel_tag 
 } 
 " 
 ) 
 return 
 result 
 

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.

Design a Mobile Site
View Site in Mobile | Classic
Share by: