Audio Hardware Configurations
One aspect of the user experience with an Alexa Built-in device is the quality of audio interactions between the device and Alexa. When Alexa understands user utterances and returns appropriate responses, these interactions contribute to a positive user experience with your device.
As you design your product, consider the hardware options that affect audio interactions with your device. For guidance on your Alexa Voice Service (AVS) implementation, see the Amazon Developer Services Agreement and the AVS Functional Requirements.
Device form factor and Alexa interaction
One factor in designing audio solutions for your device is determining how you expect users to interact with Alexa. Should your device have direct audio output? Do you expect your device to be physically located near the user or farther away? Do you expect users to wear the device or hold the device in their hands?
A device can interact with Alexa either by user voice or by touch:
- Voice-initiated devices allow users to invoke the "Alexa" wake word to start an interaction.
- Touch-initiated devices require a user to either tap or hold a physical control, such as a button, to talk to Alexa.
For more details about expected user interactions, see the UX design guidelines.
The following table presents some common device form factors and the interaction types available. Use this table to help with your audio hardware and processing choices during your design process.
About Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) profiles
Alexa uses a combination of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) to understand user speech and respond with precision.
- ASR converts customer speech into text. To learn more about ASR, see What Is Automatic Speech Recognition?
- NLU converts that text into intents for Alexa to act on. To learn more about NLU, see What Is Natural Language Understanding?
- Based on the intents, Alexa sends directives to your device with instructions to perform an action, such as playing music.
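To make the directive flow concrete, the following sketch shows how a device client might dispatch incoming directives by their header namespace and name. The envelope shape (a header identifying the interface plus a payload) follows the AVS messaging model, but the handler functions here are hypothetical placeholders, not AVS APIs.

```python
import json

def play_alexa_speech(payload: dict) -> None:
    # Hypothetical stub: a real device would play the attached TTS audio.
    print("Speak directive payload:", payload)

def start_music_playback(payload: dict) -> None:
    # Hypothetical stub: a real device would hand the stream to its media player.
    print("Play directive payload:", payload)

def handle_directive(message: str) -> None:
    """Dispatch an incoming AVS directive by its header namespace and name."""
    directive = json.loads(message)["directive"]
    namespace = directive["header"]["namespace"]
    name = directive["header"]["name"]
    payload = directive.get("payload", {})

    if namespace == "SpeechSynthesizer" and name == "Speak":
        play_alexa_speech(payload)
    elif namespace == "AudioPlayer" and name == "Play":
        start_music_playback(payload)
    else:
        print(f"Unhandled directive: {namespace}.{name}")
```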
AVS provides three ASR Profile options tuned for different product types and their usage:
- Close talk
- Near field
- Far field
A device sends the profile parameter to Alexa in the payload of each Recognize event. The profile also indicates whether the device or Alexa determines the end of user speech, a process called "speech endpointing."
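As a sketch of where the profile value fits, the snippet below assembles a Recognize event envelope. The namespace, name, profile strings, and audio format follow the SpeechRecognizer interface documentation; the initiator is simplified here, so treat the exact payload shape as an assumption and check the current API reference.

```python
import json
import uuid

def build_recognize_event(profile: str, initiator_type: str) -> str:
    """Build the JSON envelope for a SpeechRecognizer.Recognize event (simplified)."""
    event = {
        "event": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "messageId": str(uuid.uuid4()),
                "dialogRequestId": str(uuid.uuid4()),
            },
            "payload": {
                # CLOSE_TALK, NEAR_FIELD, or FAR_FIELD (see the table below).
                "profile": profile,
                "format": "AUDIO_L16_RATE_16000_CHANNELS_1",
                # How the interaction started, for example PRESS_AND_HOLD,
                # TAP, or WAKEWORD.
                "initiator": {"type": initiator_type},
            },
        }
    }
    return json.dumps(event)

print(build_recognize_event("FAR_FIELD", "WAKEWORD"))
```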
The following table highlights which ASR Profiles are commonly associated with different product types and user interactions. Some scenarios overlap across these categories, so keep in mind the usage conditions and acoustic testing considerations that you expect for your device.
| | Hold-to-talk | Tap-to-talk | Voice-initiated (Wake Word) | Voice-initiated (Wake Word) |
| --- | --- | --- | --- | --- |
| Listening Range | 0 to 0.3 m (1 ft) | 0 to 0.9 m (3 ft) | 0 to 0.9 m (3 ft) | 0 to 2.75 m (9 ft) |
| ASR Profile | "CLOSE_TALK" | "NEAR_FIELD" | "NEAR_FIELD" | "FAR_FIELD" |
| Speech endpointing | Device | Alexa | Alexa | Alexa |
Hardware and audio algorithms
Using the correct hardware configuration and audio processing algorithms can improve device listening sensitivity for the wake word and customer utterances, especially if your device focuses on music playback or is intended for use in noisy environments. Don't use any type of nonlinear processing on the audio input, such as traditional noise reduction algorithms or automatic gain control.
The following table highlights typical configurations:
| | Push-to-talk | Tap-to-talk | Voice-initiated (Wake Word) | Voice-initiated (Wake Word) |
| --- | --- | --- | --- | --- |
| Listening Range | 0 to 0.3 m (1 ft) | 0 to 0.9 m (3 ft) | 0 to 0.9 m (3 ft) | 0 to 2.75 m (9 ft) |
| Wake Word | | | ● | ● |
| # of Microphones | 1 | 1 | 1+ | 2+ |
| AEC | | | ● | ● |
| Beamforming | | | >2 Microphones | >2 Microphones |
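One way to use this table during design reviews is to encode it as data and flag gaps in a proposed configuration. The sketch below does exactly that; the interaction-type keys and the check itself are illustrative assumptions, not AVS requirements or APIs.

```python
# Illustrative encoding of the table above: minimum microphone count and the
# processing blocks typically expected for each interaction style.
TYPICAL_CONFIG = {
    "push_to_talk":   {"min_mics": 1, "wake_word": False, "aec": False},
    "tap_to_talk":    {"min_mics": 1, "wake_word": False, "aec": False},
    "wake_word_near": {"min_mics": 1, "wake_word": True,  "aec": True},
    "wake_word_far":  {"min_mics": 2, "wake_word": True,  "aec": True},
}

def check_design(interaction: str, mic_count: int, has_aec: bool, has_beamforming: bool) -> list:
    """Return a list of gaps between a proposed design and the typical configuration."""
    expected = TYPICAL_CONFIG[interaction]
    gaps = []
    if mic_count < expected["min_mics"]:
        gaps.append(f"expected at least {expected['min_mics']} microphone(s)")
    if expected["aec"] and not has_aec:
        gaps.append("AEC is typically needed for wake-word devices")
    # Beamforming typically applies to voice-initiated arrays with >2 microphones.
    if expected["wake_word"] and mic_count > 2 and not has_beamforming:
        gaps.append("beamforming is typically used with more than 2 microphones")
    return gaps

print(check_design("wake_word_far", mic_count=3, has_aec=True, has_beamforming=False))
```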
Acoustic Echo Cancellation (AEC)
In speech recognition systems, the term "acoustic echo" refers to the loudspeaker signal that a microphone captures while the loudspeaker plays it. The acoustic echo is a source of interference for the ASR engine because the microphone captures the echo and the user utterance at the same time. The goal of AEC is to remove the acoustic echo component from the microphone signal so that the ASR engine accurately understands the user utterance. The AEC algorithm adaptively estimates the acoustic echo path between the loudspeaker and microphone and, from that estimate, the acoustic echo itself. The estimated acoustic echo is then subtracted from the microphone signal to obtain a nearly echo-free signal.
However, because of system nonlinearities and room acoustics, AEC typically doesn't remove all of the echo. Always apply linear AEC for ASR, and avoid using nonlinear processing to further clean up the echo. In addition, any type of nonlinear processing on the output path, such as compression or limiting, should be part of the reference audio sent to the AEC.
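To illustrate the linear, adaptive estimation described above, here is a minimal normalized LMS (NLMS) echo canceller sketch. The filter length, step size, and regularization values are illustrative, and a production canceller would add double-talk detection, block or frequency-domain processing, and careful tuning.

```python
import numpy as np

def nlms_aec(reference, mic, filter_len=256, step=0.1, eps=1e-6):
    """Minimal NLMS acoustic echo canceller (linear AEC sketch).

    reference: samples sent to the loudspeaker (the AEC reference signal)
    mic:       microphone capture containing echo plus user speech
    Returns the echo-reduced microphone signal.
    """
    weights = np.zeros(filter_len)   # adaptive estimate of the echo path
    ref_buf = np.zeros(filter_len)   # most recent reference samples
    out = np.zeros(len(mic))

    for n in range(len(mic)):
        # Shift the newest reference sample into the buffer.
        ref_buf = np.roll(ref_buf, 1)
        ref_buf[0] = reference[n]

        echo_estimate = np.dot(weights, ref_buf)
        error = mic[n] - echo_estimate   # user speech plus residual echo
        out[n] = error

        # NLMS update, normalized by the reference energy in the buffer.
        norm = np.dot(ref_buf, ref_buf) + eps
        weights += (step * error / norm) * ref_buf

    return out
```

Note that the reference passed to the canceller must be the audio as actually played, including any compression or limiting applied on the output path, as described above.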
Beamforming
Beamforming is a signal processing technique for multi-microphone arrays that emphasizes user speech from a desired direction while suppressing audio interference from other directions. These algorithms increase the SNR and reduce reverberation in the audio signal from the desired direction, which improves the accuracy of speech recognition systems, especially in far-field conditions. For ASR, always use linear-processing-based beamforming.
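As a concrete example of linear beamforming, the sketch below implements a basic delay-and-sum beamformer for a linear microphone array. The geometry, sample rate, and steering angle are illustrative; real products typically use more sophisticated, but still linear, beamformers.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second

def delay_and_sum(mic_signals, mic_positions, angle_deg, fs=16000):
    """Basic delay-and-sum beamformer for a linear microphone array.

    mic_signals:   array of shape (num_mics, num_samples), one row per microphone
    mic_positions: microphone positions along the array axis, in meters
    angle_deg:     look direction relative to broadside, in degrees
    Returns the beamformed single-channel signal.
    """
    mic_signals = np.asarray(mic_signals, dtype=float)
    num_mics = mic_signals.shape[0]
    angle = np.deg2rad(angle_deg)

    # Relative plane-wave arrival delay at each microphone for the look direction.
    delays = np.asarray(mic_positions) * np.sin(angle) / SPEED_OF_SOUND
    delay_samples = np.round(delays * fs).astype(int)

    # Time-align the channels (rounded to whole samples for simplicity),
    # then average so speech from the look direction adds coherently.
    aligned = [np.roll(mic_signals[m], -delay_samples[m]) for m in range(num_mics)]
    return np.mean(aligned, axis=0)
```

Signals arriving from the look direction add coherently, which raises the SNR, while interference from other directions adds incoherently and is attenuated.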
Developer resources
Visit Development Kits for AVS for a complete list of options for wake word and audio processing solutions.
Amazon makes no warranty or representation regarding, does not endorse, and is not in any way responsible for any third party solutions or any content or materials provided by such third parties. If you decide to visit any linked website, you do so at your own risk and it is your responsibility to review the terms of use, privacy policy and any other relevant legal notices on such site.
References
For additional help with your product development process, see the following pages:
- Development kits are available to help your development effort
- Getting started with AVS, including technical documentation
- Complete UX Guidelines including setup, attention system, and more
- AVS Functional Requirements
- AVS Program Requirements
- Amazon Alexa Brand
- AVS device certification process
- AVS device self-testing process
- Acoustic Testing Guide
- Forum and Knowledge Base
Last updated: Jan 04, 2021