
Designing for AVS

Determine the right implementation for your AVS-enabled product


Alexa Voice Service (AVS) is Amazon’s intelligent cloud service that allows you to voice-enable connected products that have a microphone and speaker. By integrating AVS, your users immediately gain access to Alexa’s core capabilities and a growing library of third-party skills.

Alexa users expect a familiar experience. Use the design and implementation guidance below to ensure that your AVS integration meets user expectations. As you integrate AVS, make sure to reference our Terms and Agreements and Functional Design Guide.

Typical Application Examples

Alexa allows users to interact with products in the most natural way possible – with their voice. Whether you’re building a hand-held device like a TV remote, a wearable like a smart watch, or a hands-free product like a connected speaker or home intercom, AVS provides a way for your users to speak to Alexa.

There are two ways to initiate an interaction with Alexa: touch and voice. Touch-initiated interactions rely on a physical control, as on the Amazon Fire TV remote or Amazon Tap. Voice-initiated interactions use the wake word “Alexa,” as on Amazon Echo.

This table highlights typical application examples for push-to-talk, tap-to-talk, and voice-initiated products:

| Application | Push-to-talk | Tap-to-talk | Voice-initiated (Wake Word) |
|---|---|---|---|
| Fire TV | ✓ | – | – |
| Mobile Apps | – | ✓ | – |
| Portable Speakers | – | ✓ | ✓ |
| Home Audio | – | – | ✓ |
| Smart Home / Appliances | – | – | ✓ |
| Personal Computers | – | ✓ | ✓ |
| Smart TV / Set Top Boxes | ✓ | – | – |

Automatic Speech Recognition Profiles

Alexa uses a combination of automatic speech recognition (ASR) and natural language understanding (NLU) to understand user speech and respond with precision. ASR converts user speech into text; NLU converts that text into intents for Alexa to act on. At the end of this process, Alexa sends directives to your product instructing it to perform an action, like playing music.

AVS provides a choice of three ASR Profiles tuned for different products, form factors, acoustic environments, and use cases. The profile parameter is sent to Alexa in the payload of each Recognize event and also indicates whether the end of user speech will be determined by your client or in the cloud (how the end of speech is determined is called speech endpointing).
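As a sketch of how the profile travels with each request, the JSON event portion of a Recognize request might be built as follows. Field names follow the SpeechRecognizer interface; treat the exact shape as an assumption and confirm it against the current AVS API reference.

```python
import json
import uuid

def build_recognize_event(profile: str, dialog_request_id: str) -> dict:
    """Build the JSON event portion of a SpeechRecognizer.Recognize request
    (the captured audio travels in a separate part of the multipart message)."""
    assert profile in {"CLOSE_TALK", "NEAR_FIELD", "FAR_FIELD"}
    return {
        "event": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "messageId": str(uuid.uuid4()),
                "dialogRequestId": dialog_request_id,
            },
            "payload": {
                # CLOSE_TALK endpoints speech on the client (e.g., button
                # release); NEAR_FIELD and FAR_FIELD use cloud endpointing.
                "profile": profile,
                "format": "AUDIO_L16_RATE_16000_CHANNELS_1",
            },
        }
    }

event = build_recognize_event("CLOSE_TALK", str(uuid.uuid4()))
print(json.dumps(event, indent=2))
```

A push-to-talk remote would send CLOSE_TALK; a far-field speaker listening for the wake word would send FAR_FIELD.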

The following table highlights which ASR Profiles are commonly associated with different user interactions.

|  | Push-to-talk | Tap-to-talk | Voice-initiated (Wake Word) | Voice-initiated (Wake Word) |
|---|---|---|---|---|
| ASR Profile | CLOSE_TALK | NEAR_FIELD | NEAR_FIELD | FAR_FIELD |
| Listening Range | Up to 2.5 ft. | Up to 5 ft. | Up to 5 ft. | Up to 20 ft. |
| Speech Endpointing | Client | Cloud | Cloud | Cloud |
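For a client that must do its own endpointing (the push-to-talk case, where releasing the button typically marks the end of speech), a simple energy-based detector illustrates the general idea. The function and thresholds below are hypothetical illustrations, not part of the AVS API.

```python
import numpy as np

def detect_speech_end(frames, threshold=0.01, trailing_silence=30):
    """Return the index of the first frame of a run of `trailing_silence`
    consecutive frames whose mean energy stays below `threshold`,
    or None if speech never ends. Thresholds here are illustrative."""
    silent = 0
    for i, frame in enumerate(frames):
        energy = float(np.mean(np.asarray(frame, dtype=float) ** 2))
        if energy < threshold:
            silent += 1
            if silent >= trailing_silence:
                return i - trailing_silence + 1
        else:
            silent = 0
    return None

# 50 "speech" frames followed by 40 near-silent frames of 160 samples each:
rng = np.random.default_rng(0)
frames = [rng.normal(0, 0.5, 160) for _ in range(50)] + \
         [rng.normal(0, 0.001, 160) for _ in range(40)]
print(detect_speech_end(frames))  # → 50
```

Cloud endpointing relieves the client of this logic: the device streams audio until Alexa signals that the user has stopped speaking.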

Hardware and Audio Algorithms

The correct hardware configuration and audio processing algorithms can improve your product’s listening sensitivity for the wake word “Alexa.” This is especially true if your product is designed for music playback or intended for use in noisy environments.

The following table highlights typical configurations:

|  | Push-to-talk | Tap-to-talk | Voice-initiated (Wake Word) | Voice-initiated (Wake Word) |
|---|---|---|---|---|
| Listening Range | Up to 2.5 ft. | Up to 5 ft. | Up to 5 ft. | Up to 20 ft. |
| Wake Word | – | – | ✓ | ✓ |
| # of Microphones | 1 | 1 | 1+ | 2+ |
| Noise Reduction | – | – | ✓ | ✓ |
| Beamforming | – | – | >2 microphones | >2 microphones |

Noise Reduction
Noise reduction traditionally refers to single-channel signal processing techniques that learn the temporal and spectral characteristics of ambient acoustic noise in the captured microphone data; the learned noise profile is then used to suppress that noise. This improves the signal-to-noise ratio (SNR) of the microphone data, which in turn improves the accuracy of the ASR system. Noise reduction algorithms operate in the temporal and/or spectral domains and are often used in the post-filtering stage of a beamformer.

Acoustic Echo Cancellation (AEC)
In speech recognition systems, the term “acoustic echo” refers to the signal that is played out of a loudspeaker and captured by a microphone in the vicinity of the loudspeaker. The acoustic echo is a source of interference for the ASR engine, since it is captured simultaneously with the user’s voice at the microphone. The goal of AEC is to remove the acoustic echo component from the microphone signal so that the user’s voice can be clearly understood by the ASR engine. The AEC algorithm works by adaptively estimating the acoustic echo path (and thereby the acoustic echo) between the loudspeaker and microphone components. The estimated acoustic echo is then subtracted from the microphone signal, yielding a signal that is ideally free of acoustic echo.
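A toy normalized-LMS (NLMS) canceller illustrates the adaptive estimate-and-subtract loop described above; the filter length and step size here are illustrative choices, not tuned values.

```python
import numpy as np

def nlms_aec(far, mic, taps=32, mu=0.5, eps=1e-8):
    """Toy NLMS echo canceller: adaptively estimate the loudspeaker-to-mic
    echo path from the far-end (loudspeaker) signal, subtract the estimated
    echo from the microphone signal, and return the residual."""
    w = np.zeros(taps)               # estimated echo-path filter
    buf = np.zeros(taps)             # most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far[n]
        echo_est = w @ buf           # predicted echo at the mic
        e = mic[n] - echo_est        # error = echo-cancelled output
        w += mu * e * buf / (buf @ buf + eps)   # normalized LMS update
        out[n] = e
    return out

rng = np.random.default_rng(2)
far = rng.normal(0, 1, 4000)                    # loudspeaker signal
echo_path = np.array([0.5, -0.3, 0.2])          # toy room response
mic = np.convolve(far, echo_path)[:4000]        # mic hears only echo here
out = nlms_aec(far, mic)
# After convergence, residual echo power is a small fraction of the original.
```

In a real device the microphone also carries the user’s voice; because the voice is uncorrelated with the far-end signal, the filter converges to the echo path and the voice survives in the residual.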

Beamforming
Beamforming is a signal processing technique for multi-microphone arrays that emphasizes the user’s speech from a desired direction while suppressing audio interference from other directions. These algorithms increase SNR and reduce reverberation in the audio signal from the desired direction, improving the accuracy of speech recognition systems.
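A two-microphone delay-and-sum beamformer is the simplest instance of this technique: compensate the known inter-mic arrival delay for the target direction, then average, so the target speech adds coherently while uncorrelated noise does not. This is a sketch under idealized assumptions (known delay, perfectly correlated target, uncorrelated noise).

```python
import numpy as np

def delay_and_sum(mic1, mic2, delay_samples):
    """Align mic2 to mic1 for a source whose wavefront reaches mic2
    `delay_samples` later, then average the two channels."""
    aligned = np.roll(mic2, -delay_samples)  # advance mic2 (wraps at edges)
    return 0.5 * (mic1 + aligned)

rng = np.random.default_rng(3)
t = np.arange(8000)
speech = np.sin(2 * np.pi * 440 / 16000 * t)   # stand-in for target speech
delay = 4                                      # inter-mic delay in samples
mic1 = speech + rng.normal(0, 0.3, t.size)
mic2 = np.roll(speech, delay) + rng.normal(0, 0.3, t.size)
out = delay_and_sum(mic1, mic2, delay)
# The speech component is preserved while independent noise power is halved.
```

With more microphones the coherent gain grows with the array size, which is why the hardware table above associates beamforming with arrays of more than two microphones.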

Third-Party Resources *

The following are third-party providers with wake word and audio processing solutions. Please refer to each manufacturer’s product information to ensure that a solution meets your needs.

| Solution | Wake Word | Noise Reduction | AEC | Beamforming |
|---|---|---|---|---|
| Sensory – TrulyHandsFree Voice Control | ✓ | – | – | – |
| Snowboy Hotword Detection | ✓ | – | – | – |
| Conexant – Voice Speech Processors | – | ✓ | ✓ | ✓ |
| Conexant – AudioSmart™ 2-Mic Development Kit for Amazon AVS with Sensory Wake Word | ✓ | ✓ | ✓ | ✓ |


Amazon makes no warranty or representation regarding, does not endorse, and is not in any way responsible for any third party solutions or any content or materials provided by such third parties. If you decide to visit any linked website, you do so at your own risk and it is your responsibility to review the terms of use, privacy policy and any other relevant legal notices on such site.