Speech-to-Text is an API that lets you integrate Google's speech recognition technologies into your developer applications. This document covers the basics of using Speech-to-Text, including the types of requests you can make to Speech-to-Text, how to construct those requests, and how to handle their responses. Before you dive into using the API, read this guide and one of the associated tutorials.
Speech-to-Text recognition requests
Speech-to-Text has three main methods to perform speech recognition:

- Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.
- Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a long-running operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.
- Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results as audio is captured; for example, results can appear while a user is still speaking.
Requests contain configuration parameters as well as audio data. Recognition requests can optionally contain a recognizer, a stored and reusable recognition configuration.
Audio metadata
For most audio files, the Speech-to-Text API can automatically deduce the audio metadata. Speech-to-Text parses the file's header and decodes the audio according to that information. See the encoding page for supported file types.
For headerless audio files, the Speech-to-Text API lets you specify the audio metadata explicitly in the recognition config. See the encoding page for more details.
If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this can impair speech recognition accuracy, and higher sample rates have no appreciable effect on speech recognition quality.
However, if your audio data was already recorded at a sample rate other than 16000 Hz, don't resample it to 16000 Hz. For example, most legacy telephony audio uses a sample rate of 8000 Hz, which can give less accurate results. If you must use such audio, provide it to the Speech-to-Text API at its native sample rate.
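For example, a headerless 8000 Hz telephony recording can be described explicitly in the recognition config. The following is a minimal sketch assuming the google-cloud-speech v2 Python library; the LINEAR16 encoding, single channel, and telephony model are illustrative assumptions about the source audio, not requirements.

```python
from google.cloud.speech_v2.types import cloud_speech

# Headerless 8000 Hz mono PCM audio (for example, raw legacy telephony
# audio) described explicitly, because there is no file header for
# Speech-to-Text to parse.
config = cloud_speech.RecognitionConfig(
    explicit_decoding_config=cloud_speech.ExplicitDecodingConfig(
        encoding=cloud_speech.ExplicitDecodingConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,  # keep the original rate; don't resample
        audio_channel_count=1,
    ),
    language_codes=["en-US"],
    model="telephony",
)
```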
Languages
Speech-to-Text's recognition engine supports a variety of languages and
dialects. You specify the language (and national or regional dialect) of your
audio within the request configuration's languageCode
field, using a BCP-47
identifier.
A full list of supported languages for each feature is available on the Supported languages page.
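As a brief sketch with the v2 Python client library, where the corresponding field is the repeated language_codes (the en-US value is only an example):

```python
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],  # a BCP-47 identifier such as "en-US"
)
```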
Recognition features
The Speech-to-Text API has additional recognition features, such as automatic punctuation and word-level confidence. You enable these features in the recognition configuration in requests. See the sample code in the provided links and the languages page for feature availability.
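For instance, extending the config sketch above, these two features map to RecognitionFeatures flags in the v2 Python client (an assumption about that client surface, shown for illustration):

```python
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    features=cloud_speech.RecognitionFeatures(
        enable_automatic_punctuation=True,  # punctuate the transcript
        enable_word_confidence=True,        # per-word confidence values
    ),
)
```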
Model selection
Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources. See the model selection documentation to learn about the available models and how to select one in your requests.
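Selecting a model is a single field in the same config; the identifier below is just one example from the model documentation:

```python
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="long",  # example identifier; see the model selection docs
)
```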
Embedded audio content
You include embedded audio in the speech recognition request by passing a content
parameter within the request's audio_source
field. For embedded
audio that you provide as content within a gRPC request, the audio must be
compatible with Proto3
serialization and provided as binary data. For embedded audio that you provide
as content within a REST request, the audio must be compatible with JSON
serialization and first be Base64-encoded. See Base64 Encoding Your Audio
for more information.
When constructing a request using a Google Cloud client
library
, you generally
write out this binary (or Base64-encoded) data directly within the content
field.
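Putting the pieces together, here is a minimal synchronous-recognition sketch with embedded audio, assuming the google-cloud-speech v2 Python library. The project ID, the default recognizer _, the file name, and the model choice are placeholders or assumptions, not requirements.

```python
from google.cloud import speech_v2
from google.cloud.speech_v2.types import cloud_speech

client = speech_v2.SpeechClient()

# Read the audio as raw bytes; the client library handles serialization,
# so you don't Base64-encode the data yourself.
with open("audio.wav", "rb") as f:  # placeholder file name
    audio_bytes = f.read()

request = cloud_speech.RecognizeRequest(
    # "_" refers to the default recognizer; PROJECT_ID is a placeholder.
    recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
    config=cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",
    ),
    content=audio_bytes,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)
```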
Pass audio referenced by a URI
More typically, you pass a uri
parameter within the Speech-to-Text API
request's audio_source
field, pointing to an audio file (in binary format, not
Base64) located on Cloud Storage in the following form:
```
gs://bucket-name/path/to/audio/file
```
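Continuing the earlier v2 Python sketch, the only change for Cloud Storage audio is passing uri instead of content (the bucket and object names are placeholders):

```python
request = cloud_speech.RecognizeRequest(
    recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
    config=config,  # the same RecognitionConfig as in the earlier sketch
    uri="gs://BUCKET_NAME/path/to/audio/file",  # placeholder Cloud Storage URI
)
response = client.recognize(request=request)
```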
Speech-to-Text uses a service account to access your files in Cloud Storage. By default, the service account has access to Cloud Storage files in the same project.
The service account email address is the following:
service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com
To transcribe Cloud Storage files in another project, give this service account the [Speech-to-Text Service Agent][speech-service-agent] role in the other project:

```sh
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
    --role=roles/speech.serviceAgent
```
More information about project IAM policy is available at [Manage access to projects, folders, and organizations][manage-access].
You can also give the service account more granular access by giving it permission to a specific Cloud Storage bucket:
```sh
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
    --role=roles/storage.admin
```
More information about managing access to Cloud Storage is available at [Create and Manage access control lists][buckets-manage-acl] in the Cloud Storage documentation.
Speech-to-Text API responses
After the Speech-to-Text API processes audio, it returns the transcription results in SpeechRecognitionResult messages for synchronous and batch requests, and in StreamingRecognitionResult messages for streaming requests. In synchronous and batch requests, the RPC response contains a list of results, which appear in the order in which the corresponding audio was recognized. For streaming responses, all results marked as is_final appear in contiguous order.
Select alternatives
Each result within a successful synchronous recognition response can contain one or more alternatives (if max_alternatives is greater than 1). If Speech-to-Text determines that an alternative has a sufficient confidence value, then Speech-to-Text includes that alternative in the response. The first alternative in the response is always the best (most likely) alternative.
Setting max_alternatives to a value higher than 1 does not imply or guarantee that multiple alternatives are returned. In general, requesting more than one alternative is most appropriate for providing real-time options to users who get results through a streaming recognition request.
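As an illustration, continuing the v2 Python sketches above (max_alternatives is a RecognitionFeatures field in that client surface, and response is the RecognizeResponse from client.recognize):

```python
config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="long",
    features=cloud_speech.RecognitionFeatures(max_alternatives=3),
)

for result in response.results:
    # alternatives[0] is always the best (most likely) hypothesis; the API
    # may return fewer alternatives than requested.
    for alternative in result.alternatives:
        print(alternative.confidence, alternative.transcript)
```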
Handling transcriptions
Each alternative in the response contains a transcript with the recognized text. When a response contains sequential results, concatenate the transcripts of their top alternatives to reconstruct the full transcription.
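For example, continuing the same sketch:

```python
# Join the top transcript of each sequential result into one string.
# A plain join is usually sufficient; transcripts carry their own spacing.
full_transcript = "".join(
    result.alternatives[0].transcript for result in response.results
)
```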
Confidence values
The confidence value is an estimate between 0.0 and 1.0, calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates a greater estimated likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis and only for results where is_final=true. For example, you can use the confidence value to decide whether to show alternative results or to ask the user for confirmation.
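A hedged illustration, continuing the earlier sketches (the 0.8 threshold is arbitrary and chosen only for demonstration):

```python
CONFIDENCE_THRESHOLD = 0.8  # arbitrary cutoff, for illustration only

for result in response.results:
    best = result.alternatives[0]
    if best.confidence >= CONFIDENCE_THRESHOLD:
        print(best.transcript)
    else:
        # Low confidence: show the alternatives or ask the user to confirm.
        print("Did you mean one of:", [a.transcript for a in result.alternatives])
```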
Be aware, however, that the model determines the "best", top-ranked result based
on more signals than the confidence
score alone (such as sentence context).
Because of this, occasional cases exist where the top result doesn't have the
highest confidence score. If you haven't requested multiple alternative results,
the single "best" result can have a lower confidence value than anticipated.
This can occur, for example, when rare words are used. Even if the system
recognizes a rarely used word correctly, it can be assigned a low "likelihood"
value. If the model determines the rare word to be the most likely option based
on context, it returns that result at the top even if the result's confidence
value is lower than alternative options.
What's next
- Use client libraries to transcribe audio using your favorite programming language.
- Learn how to transcribe short audio files.
- Learn how to transcribe streaming audio.
- Learn how to transcribe long audio files.

