Audio Hardware Configurations

One aspect of the user experience with an Alexa Built-in device is the quality of audio interactions between the device and Alexa. When Alexa understands user utterances and returns appropriate responses, these interactions contribute to a positive user experience with your device.

As you design your product, consider the hardware options that affect audio interactions on your device. For guidance on your AVS implementation, see the Amazon Developer Services Agreement and the AVS Functional Requirements.

Device form factor and Alexa interaction

One factor in designing audio solutions for your device is determining how you expect users to interact with Alexa. Should your device have direct audio output? Do you expect your device to be physically located near the user or farther away? Do you expect users to wear the device or hold the device in their hands?

A device can interact with Alexa either by user voice or by touch:

  • Voice-initiated devices allow users to say the "Alexa" wake word to start an interaction.
  • Touch-initiated devices require a user to either tap or hold a physical control, such as a button, to talk to Alexa.

For more details about expected user interactions, see the UX design guidelines.

The following table presents some common device form factors and the interaction types available. Use this table to help with your audio hardware and processing choices during your design process.

 
Example devices: Fire TV (hold-to-talk), Tap (tap-to-talk), Echo (voice-initiated)
Form    Hold-to-talk    Tap-to-talk    Voice-initiated (Wake Word)
Remote controls    
Wearables
Mobile apps
Smart Home / Appliances
Portable speakers  
Home audio  
Intercoms  
Automotive  
Personal computers  
Smart TV / Set top box devices    

About Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) profiles

Alexa uses a combination of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) to understand user speech and respond with precision.

  1. ASR converts user speech into text. To learn more about ASR, see What Is Automatic Speech Recognition?
  2. NLU converts that text into intents for Alexa to act on. To learn more about NLU, see What Is Natural Language Understanding?
  3. Based on the intents, Alexa sends directives to your device with instructions to perform an action, such as playing music.

AVS provides three ASR Profile options tuned for different product types and their usage:

  • Close talk
  • Near field
  • Far field

A device sends the profile parameter to Alexa in the payload of each Recognize event. The profile also indicates whether the device or Alexa determines the end of user speech, a process called "speech endpointing."

The following table highlights which ASR Profiles are commonly associated with different product types and user interactions. Some scenarios overlap within these definitions. Keep in mind the usage conditions and acoustic testing considerations that you expect to have for your device.

                    Hold-to-talk       Tap-to-talk        Voice-initiated (Wake Word)  Voice-initiated (Wake Word)
Listening Range     0 to 0.3 m (1 ft)  0 to 0.9 m (3 ft)  0 to 0.9 m (3 ft)            0 to 2.75 m (9 ft)
ASR Profile         "CLOSE_TALK"       "NEAR_FIELD"       "NEAR_FIELD"                 "FAR_FIELD"
Speech endpointing  Device             Alexa              Alexa                        Alexa
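
As a rough illustration of the table above, the following Python sketch maps an interaction type to its ASR profile and endpointing owner, then builds the profile portion of a Recognize payload. The dictionary keys, the AsrProfileChoice class, and the helper name are hypothetical and exist only for this example. A real Recognize event carries additional fields that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrProfileChoice:
    """One column of the ASR profile table (hypothetical helper type)."""
    profile: str        # value sent as the `profile` parameter
    endpointing: str    # who detects the end of speech: "device" or "alexa"
    max_range_m: float  # upper bound of the typical listening range


# Keys are illustrative labels for the table columns, not AVS identifiers.
PROFILE_TABLE = {
    "hold_to_talk": AsrProfileChoice("CLOSE_TALK", "device", 0.3),
    "tap_to_talk": AsrProfileChoice("NEAR_FIELD", "alexa", 0.9),
    "wake_word_near": AsrProfileChoice("NEAR_FIELD", "alexa", 0.9),
    "wake_word_far": AsrProfileChoice("FAR_FIELD", "alexa", 2.75),
}


def recognize_payload_fragment(interaction: str) -> dict:
    """Return only the `profile` part of a Recognize payload.

    Other payload fields are intentionally left out of this sketch.
    """
    choice = PROFILE_TABLE[interaction]
    return {"profile": choice.profile}


if __name__ == "__main__":
    for name, choice in PROFILE_TABLE.items():
        print(name, recognize_payload_fragment(name), "endpointing:", choice.endpointing)
```

Keeping the mapping in one table makes it harder for a device to send a profile that disagrees with how it actually ends speech capture.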

Hardware and audio algorithms

Using the correct hardware configuration and audio processing algorithms can improve device listening sensitivity for the wake word and user utterances, especially if your device focuses on music playback or is intended for use in noisy environments. Don't use any type of nonlinear processing on the audio input, such as traditional noise reduction algorithms or automatic gain control.

The following table highlights typical configurations:

                  Hold-to-talk       Tap-to-talk        Voice-initiated (Wake Word)  Voice-initiated (Wake Word)
Listening Range   0 to 0.3 m (1 ft)  0 to 0.9 m (3 ft)  0 to 0.9 m (3 ft)            0 to 2.75 m (9 ft)
Wake Word
# of Microphones  1                  1                  1+                           2+
AEC
Beamforming                                             >2 Microphones               >2 Microphones

Acoustic Echo Cancellation (AEC)

In speech recognition systems, the term "acoustic echo" refers to the loudspeaker signal that the device's own microphone captures. The acoustic echo is a source of interference for the ASR engine because the microphone simultaneously captures the echo and the user utterance. The goal of AEC is to remove the acoustic echo component from the microphone signal so that the ASR engine accurately understands the user utterance. The AEC algorithm adaptively estimates the acoustic echo path between the loudspeaker and the microphone, and from it the acoustic echo. The estimated acoustic echo is then subtracted from the microphone signal to obtain a nearly echo-free microphone signal. An AEC-processed microphone signal should be free from acoustic echo.

However, because of system nonlinearities and room acoustics, not all echo is typically removed. Always apply linear AEC for ASR. Avoid using nonlinear processing to further clean up the echo. In addition, the reference audio sent to the AEC should include any nonlinear processing applied on the output path, such as compression or limiting.
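
To make the adaptive estimate-and-subtract idea concrete, here is a minimal sketch of a linear echo canceller using the normalized least mean squares (NLMS) algorithm. NLMS is chosen here only as a common textbook example of a linear adaptive filter; this document doesn't prescribe a specific algorithm. The function names, tap count, step size, and the simulated echo path are all assumptions made for illustration. The reference argument stands for the audio sent to the loudspeaker after any output-path processing.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Linear NLMS echo canceller (illustrative sketch).

    reference: samples sent to the loudspeaker
    mic:       samples captured by the microphone
    Returns the microphone signal with the estimated echo subtracted.
    """
    weights = [0.0] * taps   # running estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for ref_sample, mic_sample in zip(reference, mic):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        residual = mic_sample - echo_estimate          # echo-reduced output
        power = sum(x * x for x in history) + eps      # normalization term
        step = mu * residual / power
        weights = [w + step * x for w, x in zip(weights, history)]
        cleaned.append(residual)
    return cleaned


def simulate_echo(n=4000, seed=0):
    """Build a toy reference signal and the echo it leaves at the microphone."""
    rng = random.Random(seed)
    echo_path = [0.6, -0.3, 0.1]  # hypothetical loudspeaker-to-mic response
    reference = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mic = [
        sum(h * reference[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
        for i in range(n)
    ]
    return reference, mic


def energy(samples):
    return sum(s * s for s in samples)


if __name__ == "__main__":
    reference, mic = simulate_echo()
    cleaned = nlms_echo_cancel(reference, mic)
    # Ratio of residual echo energy to original echo energy after adaptation.
    print(energy(cleaned[-500:]) / energy(mic[-500:]))
```

This toy setup contains only echo, so the output should decay toward silence as the filter adapts. With a real user utterance present, the speech would remain in the output, and a production canceller would also need to handle simultaneous speech and playback, which this sketch ignores.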

Beamforming

Beamforming is a signal processing technique for multi-microphone arrays that emphasizes user speech from a desired direction while suppressing audio interference from other directions. These algorithms increase the SNR and reduce reverberation in the audio signal from the desired direction, which improves the accuracy of speech recognition systems, especially for far-field use. For ASR, always use beamforming based on linear processing.
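
One of the simplest linear beamformers is delay-and-sum: each microphone channel is time-aligned toward the desired direction and the channels are averaged. The sketch below is an illustration under simplifying assumptions, not a recommended design. It uses whole-sample delays, a four-microphone array, a sine wave in place of speech, and independent noise on each microphone. All names and values are hypothetical.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-length sample lists, one per microphone
    delays:   arrival delay of the desired source at each microphone, in samples
    """
    n = len(channels[0])
    output = []
    for t in range(n):
        total = 0.0
        for channel, delay in zip(channels, delays):
            idx = t + delay          # undo the arrival delay for this microphone
            if 0 <= idx < n:
                total += channel[idx]
        output.append(total / len(channels))
    return output


def simulate_array(n=2000, arrival_delays=(0, 2, 4, 6), seed=1):
    """Toy array capture: a delayed copy of the source plus per-mic noise."""
    rng = random.Random(seed)
    source = [math.sin(2 * math.pi * t / 50) for t in range(n)]
    channels = []
    for delay in arrival_delays:
        channel = []
        for t in range(n):
            direct = source[t - delay] if t - delay >= 0 else 0.0
            channel.append(direct + rng.uniform(-1.0, 1.0))
        channels.append(channel)
    return source, channels


def noise_power(signal, source, start, stop):
    """Mean squared difference between a signal and the clean source."""
    return sum((signal[t] - source[t]) ** 2 for t in range(start, stop)) / (stop - start)


if __name__ == "__main__":
    delays = (0, 2, 4, 6)
    source, channels = simulate_array(arrival_delays=delays)
    beam = delay_and_sum(channels, delays)
    single = noise_power(channels[0], source, 10, len(source) - 10)
    combined = noise_power(beam, source, 10, len(source) - 10)
    print("noise power ratio (beam / single mic):", combined / single)
```

Because the aligned source adds coherently while the independent noise does not, averaging lowers the noise power relative to a single microphone. Real arrays face directional interference and reverberation as well, and typically need fractional delays and more capable linear designs.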

Developer Resources

Visit Development Kits for AVS for a complete list of options for wake word and audio processing solutions.

Disclaimer

Amazon makes no warranty or representation regarding, does not endorse, and is not in any way responsible for any third party solutions or any content or materials provided by such third parties. If you decide to visit any linked website, you do so at your own risk and it is your responsibility to review the terms of use, privacy policy and any other relevant legal notices on such site.

References

For additional help with your product development process, see the following pages: