What Is Automatic Speech Recognition?

Automatic speech recognition (ASR) is technology that converts spoken words into text. In short, it’s the first step in enabling voice technologies like Amazon Alexa to respond when we ask, “Alexa, what’s it like outside?”

With ASR, voice technology can detect spoken sounds and recognize them as words. ASR is the cornerstone of the entire voice experience, allowing computers to finally understand us through our most natural form of communication: speech.

Teaching Computers to Understand

Before ASR, our speech was simply an audio recording of peaks and valleys in a computer’s “mind.” With ASR, computers can detect patterns in audio waveforms, match them with the sounds in a given language, and ultimately identify which words we spoke. Like other forms of human-computer interaction, voice services began with only basic functionality, such as automated call-center systems that understood only a limited list of words (think: “Say yes or no”).

Today, voice services have grown by leaps and bounds. They can understand the way you speak, in certain languages, and even in your accent. They can even tell when you’re just mumbling or thinking out loud with a few “um”s. And most importantly, today a computer can talk back to you.

Here are three ways ASR makes it possible to interact with technology via voice:

1. It Feels Fast

For a conversation to feel natural, responses must happen in milliseconds. Modern voice technologies take advantage of cloud computing to convert recorded audio into text that computers can act on right away.

2. It Can Make Educated Decisions

Languages are full of homophones, words that sound alike but are spelled differently. How can computers distinguish between to, two, and too? Today’s leading technologies use statistics about how words typically appear together to determine which word the speaker most likely intended.
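To make the idea of “background statistics” concrete, here is a minimal Python sketch that picks between to, two, and too by counting which word pairs appear in a tiny sample of text. Everything in it is an assumption made for illustration: the six-sentence corpus is hand-written, and the names count_bigrams and pick_homophone are hypothetical. It does not describe how Alexa or any production ASR system works; real systems learn from far more data with far richer models.

```python
from collections import Counter

# Tiny hand-written corpus, purely illustrative (an assumption, not real data).
CORPUS = [
    "i want to go outside",
    "i have two dogs",
    "it is too cold outside",
    "she wants to buy two tickets",
    "that is too much",
    "we need to leave",
]

# Words that sound the same to a recognizer.
HOMOPHONES = ("to", "two", "too")


def count_bigrams(sentences):
    """Count how often each pair of adjacent words occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_homophone(previous_word, next_word, counts, candidates=HOMOPHONES):
    """Return the candidate seen most often between the two neighboring words."""
    def score(candidate):
        # Counter returns 0 for word pairs it has never seen.
        return counts[(previous_word, candidate)] + counts[(candidate, next_word)]
    return max(candidates, key=score)


BIGRAMS = count_bigrams(CORPUS)

print(pick_homophone("want", "go", BIGRAMS))    # to
print(pick_homophone("have", "dogs", BIGRAMS))  # two
print(pick_homophone("is", "cold", BIGRAMS))    # too
```

The sketch scores each candidate by how often it has been seen next to the surrounding words and keeps the highest scorer. With a corpus this small, any phrase it has never seen scores zero for every candidate, which is one reason real systems rely on much larger statistics.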

3. It Helps Voice Get Smarter

ASR is only the first step in voice user interfaces. With additional technology layered on top of ASR, such as natural language understanding, Alexa can also understand the context of what users mean. The same sounds might mean a distance (“four miles”), or they might mean a reminder to buy a gift “for Miles.”

Powering the Next Revolution in Voice

ASR has been making quiet advancements for decades. It’s been used in education to help people learn second languages, in accessibility tools for people who are hard of hearing, and even in hands-free computing.

Today, ASR enables us to have conversations with computers. We don’t have to learn how to use a mouse and keyboard or a touch-screen UI just to set a timer, to look up a sports score, or to call another person. We just have to speak the way we already do in everyday life.

This opens the door to so many other possibilities. Now that computers can understand our language, what else can we teach them to do? What other magical experiences can we build with voice? That part is still up to us.

Start Building with the Alexa Skills Kit

There are many elements to voice design, but you don’t need to be an expert to start designing and building voice experiences. The Alexa Skills Kit (ASK) is a collection of self-service APIs and tools for making Alexa skills. Skills are like apps for Alexa, enabling customers to engage with your content or services naturally with voice.

Join hundreds of thousands of developers who are building Alexa skills to engage and delight customers on hundreds of millions of Alexa devices.
