Speech-to-Text overview

Speech-to-Text is an API that lets you integrate Google's speech recognition technologies into your developer applications. This document covers the basics of using Speech-to-Text, including the types of requests you can make to Speech-to-Text, how to construct those requests, and how to handle their responses. Before you dive into using the API, read this guide and one of the associated tutorials.

Speech-to-Text recognition requests

Speech-to-Text offers three main methods to perform speech recognition:

  • Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests process audio data of 1 minute or less.

  • Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a long-running operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.

  • Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results as audio is captured. For example, results can appear while a user is still speaking.

Requests contain configuration parameters as well as audio data. Recognition requests can optionally contain a recognizer, a stored and reusable recognition configuration.
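
For example, the following is a minimal sketch of a synchronous request using the google-cloud-speech Python client library (v2 API). The project ID and file name are placeholders, and the request uses the built-in default recognizer (_) rather than a stored one:

    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    client = SpeechClient()

    # Synchronous recognition is limited to audio of 1 minute or less.
    with open("audio.wav", "rb") as f:
        audio_bytes = f.read()

    config = cloud_speech.RecognitionConfig(
        # Let the API deduce the encoding and sample rate from the file header.
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",
    )

    request = cloud_speech.RecognizeRequest(
        # "_" selects the default recognizer; a stored recognizer could be named instead.
        recognizer="projects/my-project/locations/global/recognizers/_",
        config=config,
        content=audio_bytes,
    )

    response = client.recognize(request=request)
    for result in response.results:
        print(result.alternatives[0].transcript)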

For most audio files, the Speech-to-Text API can automatically deduce the audio metadata: Speech-to-Text parses the file's header and decodes the audio according to that information. See the encoding page for supported file types.

For headerless audio files, the Speech-to-Text API lets you specify the audio metadata explicitly in the recognition config. See the encoding page for more details.
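
As a sketch, explicit metadata for headerless audio might look like the following with the v2 Python client, continuing the earlier example; the encoding, sample rate, and channel count shown are illustrative:

    config = cloud_speech.RecognitionConfig(
        explicit_decoding_config=cloud_speech.ExplicitDecodingConfig(
            encoding=cloud_speech.ExplicitDecodingConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=8000,   # match the source material (see the guidance below)
            audio_channel_count=1,
        ),
        language_codes=["en-US"],
        model="telephony",
    )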

If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Sample rates lower than this can impair speech recognition accuracy, and higher rates have no appreciable effect on speech recognition quality.

However, if your audio has already been recorded at a sample rate other than 16000 Hz, don't resample it to 16000 Hz. For example, most legacy telephony audio uses a sample rate of 8000 Hz, which can give less accurate results. If you must use such audio, provide it to the Speech-to-Text API at its original sample rate.

Languages

Speech-to-Text's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.

A full list of supported languages for each feature is available on the Supported languages page.

Recognition features

The Speech-to-Text API has additional recognition features, such as automatic punctuation and word-level confidence. You enable these features in the recognition configuration of a request. See the sample code in the provided links and the languages page for feature availability.
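
For instance, continuing the earlier Python sketch, enabling automatic punctuation and word-level confidence might look like this:

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",
        features=cloud_speech.RecognitionFeatures(
            enable_automatic_punctuation=True,  # insert punctuation into transcripts
            enable_word_confidence=True,        # attach a confidence score to each word
        ),
    )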

Model selection

Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources. See the model selection documentation to learn about the available models and how to select one in your requests.
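
As a sketch, you name the model in the recognition config; "telephony" below is one illustrative choice:

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="telephony",  # e.g., audio that originated from a phone call
    )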

Embedded audio content

You include embedded audio in the speech recognition request by passing a content parameter within the request's audio_source field. For embedded audio that you provide as content within a gRPC request, the audio must be compatible with Proto3 serialization and provided as binary data. For embedded audio that you provide as content within a REST request, the audio must be compatible with JSON serialization and first be Base64-encoded. See Base64 Encoding Your Audio for more information.

When constructing a request using a Google Cloud client library , you generally write out this binary (or Base64-encoded) data directly within the content field.
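
For example, preparing embedded audio for a REST request might look like the following sketch. The v2 endpoint path and camel-case field names shown are assumptions based on the public API surface, and the project ID is a placeholder:

    import base64
    import json

    with open("audio.wav", "rb") as f:
        b64_audio = base64.b64encode(f.read()).decode("utf-8")

    body = json.dumps({
        "config": {"autoDecodingConfig": {}, "languageCodes": ["en-US"], "model": "long"},
        "content": b64_audio,  # REST requires Base64; gRPC takes the raw bytes
    })
    # POST the body to (illustrative):
    # https://speech.googleapis.com/v2/projects/my-project/locations/global/recognizers/_:recognize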

Pass audio referenced by a URI

More typically, you pass a uri parameter within the Speech-to-Text API request's audio_source field, pointing to an audio file (in binary format, not Base64) located on Cloud Storage in the following form:

 gs://bucket-name/path/to/audio/file 
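
Continuing the earlier Python sketch, a request that references Cloud Storage audio might look like this; the bucket and object names are placeholders:

    request = cloud_speech.RecognizeRequest(
        recognizer="projects/my-project/locations/global/recognizers/_",
        config=config,
        uri="gs://bucket-name/path/to/audio/file",  # uri replaces content; they are mutually exclusive
    )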

Speech-to-Text uses a service account to access your files in Cloud Storage. By default, the service account has access to Cloud Storage files in the same project.

The service account email address is the following:

    service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com

To transcribe Cloud Storage files in another project, you can grant this service account the Speech-to-Text Service Agent role in the other project:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
        --role=roles/speech.serviceAgent

More information about project IAM policy is available at Manage access to projects, folders, and organizations.

You can also give the service account more granular access by giving it permission to a specific Cloud Storage bucket:

    gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
        --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
        --role=roles/storage.admin

More information about managing access to Cloud Storage is available at Create and manage access control lists in the Cloud Storage documentation.

Speech-to-Text API responses

After the Speech-to-Text API processes audio, it returns transcription results in SpeechRecognitionResult messages for synchronous and batch requests, and in StreamingRecognitionResult messages for streaming requests. In synchronous and batch requests, the RPC response contains a list of results, and the recognized audio appears in contiguous order. For streaming responses, all results marked as is_final appear in contiguous order.
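
As a sketch of consuming streaming results with the v2 Python client: the first streaming request carries the configuration, subsequent requests carry audio chunks, and results arrive with is_final unset until a segment is finalized. The file-based chunking, chunk size, and interim_results setting are illustrative assumptions:

    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    client = SpeechClient()

    # Split a file into chunks to simulate a live stream (illustrative).
    with open("audio.wav", "rb") as f:
        content = f.read()
    audio_chunks = [content[i:i + 25600] for i in range(0, len(content), 25600)]

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",
    )
    streaming_config = cloud_speech.StreamingRecognitionConfig(
        config=config,
        streaming_features=cloud_speech.StreamingRecognitionFeatures(interim_results=True),
    )

    def requests(chunks):
        # The first request carries only the recognizer and configuration.
        yield cloud_speech.StreamingRecognizeRequest(
            recognizer="projects/my-project/locations/global/recognizers/_",
            streaming_config=streaming_config,
        )
        # Subsequent requests carry raw audio bytes.
        for chunk in chunks:
            yield cloud_speech.StreamingRecognizeRequest(audio=chunk)

    for response in client.streaming_recognize(requests=requests(audio_chunks)):
        for result in response.results:
            if result.is_final:
                print(result.alternatives[0].transcript)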

Select alternatives

Each result within a successful synchronous recognition response can contain one or more alternatives (if max_alternatives is greater than 1). If Speech-to-Text determines that an alternative has a sufficient confidence value, then Speech-to-Text includes that alternative in the response. The first alternative in the response is always the best (most likely) alternative.

Setting max_alternatives to a value greater than 1 doesn't imply or guarantee that multiple alternatives are returned. In general, more than one alternative is most appropriate for providing real-time options to users who get results through a streaming recognition request.
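
In the v2 API, max_alternatives is part of the recognition features; continuing the earlier Python sketch:

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",
        features=cloud_speech.RecognitionFeatures(max_alternatives=3),  # request up to 3 hypotheses
    )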

Handling transcriptions

Each alternative in the response contains a transcript with the recognized text. When you receive sequential results, concatenate their transcripts to reconstruct the full transcription.
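
For example, a sketch of rebuilding the full transcript from a synchronous response (as returned by the earlier recognize sketch):

    transcript = "".join(
        result.alternatives[0].transcript  # top alternative of each sequential result
        for result in response.results
    )
    print(transcript)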

Confidence values

The confidence value is an estimate between 0.0 and 1.0. It's calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates a greater estimated likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis and only for results where is_final=true. For example, you can use the confidence value to decide whether to show alternative results to the user or to ask the user for confirmation.

Be aware, however, that the model determines the "best", top-ranked result based on more signals than the confidence score alone (such as sentence context). Because of this, occasional cases exist where the top result doesn't have the highest confidence score. If you haven't requested multiple alternative results, the single "best" result can have a lower confidence value than anticipated. This can occur, for example, when rare words are used. Even if the system recognizes a rarely used word correctly, it can be assigned a low "likelihood" value. If the model determines the rare word to be the most likely option based on context, it returns that result at the top even if the result's confidence value is lower than alternative options.
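
As an illustrative sketch, a simple policy that accepts the top transcript above an arbitrary threshold and otherwise asks the user to confirm; accept and ask_user_to_confirm are hypothetical application callbacks:

    CONFIDENCE_THRESHOLD = 0.8  # arbitrary; tune for your application

    def accept(text):                # hypothetical: proceed with the transcript
        print("OK:", text)

    def ask_user_to_confirm(text):   # hypothetical: prompt the user to verify
        print("Confirm?", text)

    top = response.results[0].alternatives[0]
    if top.confidence >= CONFIDENCE_THRESHOLD:
        accept(top.transcript)
    else:
        ask_user_to_confirm(top.transcript)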
