
Audio Hardware Configurations

Using the Alexa Voice Service (AVS) to include Alexa in your product opens up an ever-increasing array of voice-forward experiences and capabilities for engaging customers. This page discusses some of the hardware-related decisions you will face as you design the audio interaction for your AVS product.

A key part of a customer’s enjoyment of an Alexa Built-in device is the quality of the audio interaction with Alexa. When customer utterances are received and understood clearly, and Alexa responses and other audio content are played back in high quality, the customer is much more likely to be delighted with the experience.

As you integrate AVS, refer to the Amazon Developer Services Agreement and the AVS Functional Requirements.

Device Form Factor and Alexa Interaction

One of the primary factors in designing audio solutions is the form factor of your device and how customers will be interacting with Alexa. Will your device have direct audio output? Will your device be near the customer or farther away? Will they hold the device in their hands, or even wear it?

You can design a product to interact with Alexa in two ways: by voice and by touch. Voice-initiated devices allow customers to use the “Alexa” wake word to interact with her. Touch-initiated devices require a customer to either tap or hold a physical affordance, such as a button, to talk to Alexa. We discuss this further in our speaker design page and in our other UX design guidelines.

The table below presents some common device form factors and the interaction types available. Use this table to help you think about what audio hardware and processing choices you will make during your design process.

 
Form | Hold-to-talk (e.g., Fire TV) | Tap-to-talk (e.g., Tap) | Voice-initiated (Wake Word) (e.g., Echo)
Remote Controls
Wearables
Mobile Apps
Smart Home / Appliances
Portable Speakers
Home Audio
Intercoms
Automotive
Personal Computers
Smart TV / Set Top Box Devices

Automatic Speech Recognition Profiles

Alexa uses a combination of automatic speech recognition (ASR) and natural language understanding (NLU) to understand customer speech and respond with precision.

  • ASR converts customer speech into text
  • NLU converts that text into intents for Alexa to act on.

At the end of this process Alexa sends directives to your product instructing it to perform an action, like playing music.

AVS provides a choice of three ASR Profiles tuned for different products, form factors, acoustic environments, and use cases. The profile parameter is sent to Alexa in the payload of each Recognize event. It also indicates whether the end of customer speech is determined by your client or in the cloud; determining the end of speech is called speech endpointing.

The following table highlights which ASR Profiles are commonly associated with different customer interactions. Some products will not fit neatly into a single column, so also consider the conditions your device will be used in and how you plan to test its acoustic performance.

Interaction | Hold-to-talk | Tap-to-talk | Voice-initiated (Wake Word), near-field | Voice-initiated (Wake Word), far-field
Listening Range | 0 to 0.3 m (1 ft) | 0 to 0.9 m (3 ft) | 0 to 0.9 m (3 ft) | 0 to 2.75 m (9 ft)
ASR Profile | "CLOSE_TALK" | "NEAR_FIELD" | "NEAR_FIELD" | "FAR_FIELD"
Speech Endpointing | Client | Cloud | Cloud | Cloud
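
As a rough illustration of the table above, the following Python sketch picks an ASR Profile for an interaction type and places it in the payload of a Recognize event. Only the profile values and the endpointing column come from this page. The interaction keys, the function name, and the header and format fields are assumptions based on a general recollection of the SpeechRecognizer interface, so check them against the current AVS API reference before relying on them.

```python
import uuid

# Interaction type -> (ASR profile, who determines the end of speech),
# taken from the table above. The dictionary keys are hypothetical names.
PROFILES = {
    "hold_to_talk": ("CLOSE_TALK", "client"),
    "tap_to_talk": ("NEAR_FIELD", "cloud"),
    "wake_word_near": ("NEAR_FIELD", "cloud"),
    "wake_word_far": ("FAR_FIELD", "cloud"),
}


def build_recognize_event(interaction):
    """Return (event, endpointing) for the given interaction type."""
    profile, endpointing = PROFILES[interaction]
    event = {
        "event": {
            # Header fields are an assumption about the event envelope.
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "messageId": str(uuid.uuid4()),
                "dialogRequestId": str(uuid.uuid4()),
            },
            "payload": {
                "profile": profile,  # the parameter this page describes
                # Assumed audio format identifier (16 kHz, 16-bit, mono PCM).
                "format": "AUDIO_L16_RATE_16000_CHANNELS_1",
            },
        }
    }
    return event, endpointing


if __name__ == "__main__":
    event, endpointing = build_recognize_event("wake_word_far")
    print(event["event"]["payload"]["profile"], endpointing)
```

Returning the endpointing mode alongside the event is a reminder that a hold-to-talk client has to decide for itself when the customer has stopped speaking (for example, when the button is released), while the other interactions wait for the cloud to signal the end of speech.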

Hardware and Audio Algorithms

Using the correct hardware configuration and audio processing algorithms can improve your product's listening sensitivity for the wake word and customer utterances. This is especially true if your product is designed for music playback or intended for use in noisy environments. Any type of nonlinear processing on the audio input, such as traditional noise reduction algorithms or automatic gain controls, should not be used.

The following table highlights typical configurations:

Interaction | Hold-to-talk | Tap-to-talk | Voice-initiated (Wake Word), near-field | Voice-initiated (Wake Word), far-field
Listening Range | 0 to 0.3 m (1 ft) | 0 to 0.9 m (3 ft) | 0 to 0.9 m (3 ft) | 0 to 2.75 m (9 ft)
Wake Word | | | |
# of Microphones | 1 | 1 | 1+ | 2+
AEC | | | |
Beamforming | | | >2 Microphones | >2 Microphones

Acoustic Echo Cancellation (AEC)

In speech recognition systems, the term “acoustic echo” refers to the signal that is played out of a loudspeaker and captured by a nearby microphone. The acoustic echo interferes with the ASR engine because the microphone captures it together with the customer’s voice. The goal of AEC is to remove the acoustic echo component from the microphone signal so that the ASR engine can clearly understand the customer’s voice. The AEC algorithm adaptively estimates the acoustic echo path between the loudspeaker and the microphone, and from it the acoustic echo itself. The estimated echo is then subtracted from the microphone signal. Ideally, the AEC-processed microphone signal is free of acoustic echo.

However, because of system nonlinearities and room acoustics, some echo typically remains. Only linear AEC should be applied for ASR; do not use nonlinear processing to further suppress the residual echo. In addition, if the output path includes any nonlinear processing, such as compression or limiting, the reference audio sent to the AEC should be taken after that processing so that it matches what the loudspeaker actually plays.
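
The adaptive estimate-and-subtract idea described above can be sketched with a normalized least-mean-squares (NLMS) filter, which is one common linear approach and not necessarily what any particular AVS solution uses. The sketch below is pure Python and assumes a short, purely linear echo path, no customer speech during adaptation, and hypothetical function names and parameter values chosen for illustration.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated linear echo of `ref` from `mic`."""
    w = [0.0] * taps        # current estimate of the echo path (FIR taps)
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for m, r in zip(mic, ref):
        history = [r] + history[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, history))
        error = m - echo_estimate  # echo-reduced output sample
        power = sum(x * x for x in history) + eps
        # NLMS update: step along the reference, scaled by its power.
        w = [wi + mu * error * xi / power for wi, xi in zip(w, history)]
        cleaned.append(error)
    return cleaned, w


def simulate_echo(ref, path):
    """Convolve the reference with a hypothetical linear echo path."""
    return [
        sum(h * ref[n - k] for k, h in enumerate(path) if n - k >= 0)
        for n in range(len(ref))
    ]


def energy(samples):
    return sum(v * v for v in samples)


if __name__ == "__main__":
    random.seed(0)
    ref = [random.uniform(-1, 1) for _ in range(2000)]  # loudspeaker signal
    mic = simulate_echo(ref, [0.6, 0.3, -0.2])          # echo only
    cleaned, w = nlms_echo_cancel(mic, ref)
    print(energy(mic[-200:]), energy(cleaned[-200:]))
```

Because the filter can only model a linear relationship between the reference and the microphone, the reference passed in as `ref` needs to be the signal the loudspeaker actually plays. A production AEC would also need to handle customer speech that overlaps playback, which this sketch does not attempt.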

Beamforming

Beamforming is a signal processing technique for multi-microphone arrays that emphasizes the customer’s speech from a desired direction while suppressing audio interference from other directions. These algorithms increase the signal-to-noise ratio (SNR) and reduce reverberation in the audio signal from the desired direction, which improves the accuracy of speech recognition systems, especially in far-field use. Only linear beamforming should be used for ASR.
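
A delay-and-sum beamformer is probably the simplest linear example of this idea: each microphone channel is time-aligned for the desired direction and the channels are averaged, so speech from that direction adds coherently while sound from other directions partially cancels. The sketch below is pure Python and assumes two microphones, whole-sample steering delays, and hypothetical function names; practical arrays typically use fractional delays and more elaborate filters.

```python
import math


def delay_and_sum(channels, delays):
    """Advance each channel by its steering delay (in samples) and average."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(n)
    ]


def delayed(signal, d):
    """Return `signal` delayed by d samples, padded with leading zeros."""
    return [0.0] * d + signal[: len(signal) - d]


if __name__ == "__main__":
    n = 64
    # Target speech stand-in: reaches mic 1 two samples later than mic 0.
    target = [math.sin(2 * math.pi * i / 16) for i in range(n)]
    # Interferer from another direction: reaches both mics at the same time.
    interferer = [math.sin(2 * math.pi * i / 4) for i in range(n)]
    mic0 = [t + v for t, v in zip(target, interferer)]
    mic1 = [t + v for t, v in zip(delayed(target, 2), interferer)]
    out = delay_and_sum([mic0, mic1], [0, 2])
    residual = max(abs(o - t) for o, t in zip(out, target))
    print(residual)
```

In this contrived case the interferer's period makes it cancel completely once the channels are steered toward the target; with real signals the suppression is partial and depends on frequency, microphone spacing, and the number of microphones.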

Third-Party Resources

Visit Development Kits for AVS for a complete list of third-party providers with wake word and audio processing solutions.

Disclaimer

Amazon makes no warranty or representation regarding, does not endorse, and is not in any way responsible for any third party solutions or any content or materials provided by such third parties. If you decide to visit any linked website, you do so at your own risk and it is your responsibility to review the terms of use, privacy policy and any other relevant legal notices on such site.

References

There is help available for you throughout your development process. This web site has documentation on design, development, and marketing. To start, check out these pages: