Overview of the Alexa Voice Service (AVS) Device SDK
The Alexa Voice Service (AVS) Device SDK provides you with a set of C++ libraries to build an Alexa Built-in product. With these libraries, your device has direct access to cloud-based Alexa capabilities and can receive voice responses instantly. Your device can be almost anything – a smartwatch, a speaker, headphones – the choice is yours.
The SDK is modular and abstract. It provides separate components to handle necessary Alexa functionality including processing audio, maintaining persistent connections, and managing Alexa interactions.
- Release notes
- SDK architecture
- Important considerations
For a complete list of releases, updates, and known bugs, see the SDK release notes.
|1.25.0||August 24, 2021|
|1.24.0||June 4, 2021|
|1.23.0||March 29, 2021|
|1.22.0||December 8, 2020|
|1.21.0||October 26, 2020|
|1.20.1||August 6, 2020|
|1.20.0||June 22, 2020|
|1.19.1||April 27, 2020|
|1.19.0||April 13, 2020|
|1.18.0||February 19, 2020|
|1.17.0||December 10, 2019|
|1.16.0||October 25, 2019|
|1.15.0||September 25, 2019|
|1.14.0||July 09, 2019|
|1.13.0||May 05, 2019|
|1.12.1||April 02, 2019|
|1.12.0||February 25, 2019|
|Older versions||SDK release notes|
The following diagram illustrates components of the SDK and how data flows between them.
The green boxes are official components of the SDK – they include the following items:
- Audio Input Processor (AIP)
- Shared Data Stream (SDS)
- Alexa Communications Library (ACL)
- Alexa Directive Sequencer Library (ADSL)
- Activity Focus Manager Library (AFML)
- Capability Agent
The white and blue boxes aren't official components and depend on external libraries – these include the following items:
- Audio Signal Processor (ASP)
- Wake Word Engine (WWE)
- Media Player
For general information about Alexa and client interaction, see the Interaction Model.
Here's an example interaction with the SDK. This process might vary if you've added or removed any components.
- You ask a question, "Alexa, what is the weather?"
- The microphone captures the audio and writes it to the SDS.
- The WWE is always monitoring the SDS. When the WWE detects the wake word Alexa, it sends the audio to the AIP.
- The AIP sends a SpeechRecognizer event to AVS using the ACL.
- AVS processes the event and sends the appropriate directive back down through the ACL. The SDS then picks up the directive and sends it to the ADSL.
- The ADSL examines the header of the payload and determines what Capability Agent it must call.
- When the Capability Agent activates, it requests focus from the AFML.
- The Media Player plays the directive. For this example, Alexa responds with "The weather is nine degrees and cloudy with a chance of rain."
Here are some details about each individual component in the sequence.
Audio Signal Processor (ASP)
The ASP isn't actually a component of the AVS Device SDK. It's Software On a Chip (SOC) or firmware on a dedicated Digital Signal Processor (DSP). Its job is to clean up the audio and create a single audio stream, even if your device uses a multimicrophone array. Techniques used to clean the audio include Acoustic Echo Cancellation (AEC), noise suppression, beam forming, Voice Activity Detection (VAD), Dynamic Range Compression (DRC), and equalization.
Shared Data Stream (SDS)
The SDS is a single-producer, multi-consumer audio input buffer that transports data between a single writer and one or more readers. This ring buffer moves data between the different components of the SDK without duplication. Because it continuously overwrites itself, it keeps the memory footprint small. SDS operates on product-specific and user-specified memory segments, allowing for interprocess communication. Keep in mind that the writer and readers might be in different threads or processes.
SDS handles these key tasks:
- Receives audio from the ASP and then passes it to the WWE.
- Passes the audio from the WWE to the ACL. The ACL then passes the audio to AVS for processing.
- Receives data attachments back from the ACL and passes them to the appropriate Capability Agent.
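The single-writer, multi-reader ring buffer described above can be sketched as follows. This is a minimal, single-threaded illustration, not the SDK's actual stream class: `RingStream` and `StreamReader` are hypothetical names, the capacity is assumed to be greater than zero, and a real implementation would need synchronization between threads or processes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ring buffer with one writer. Assumes capacity > 0.
class RingStream {
public:
    explicit RingStream(std::size_t capacity) : buffer_(capacity) {}

    // The single writer appends samples and overwrites the oldest data when full.
    void write(const std::vector<std::int16_t>& samples) {
        for (std::int16_t sample : samples) {
            buffer_[written_ % buffer_.size()] = sample;
            ++written_;
        }
    }

    // Total number of samples written so far.
    std::uint64_t written() const { return written_; }

    // Index of the oldest sample that has not been overwritten yet.
    std::uint64_t oldest() const {
        return written_ > buffer_.size() ? written_ - buffer_.size() : 0;
    }

    // Sample at an absolute stream index (caller keeps it within [oldest, written)).
    std::int16_t at(std::uint64_t index) const {
        return buffer_[index % buffer_.size()];
    }

private:
    std::vector<std::int16_t> buffer_;
    std::uint64_t written_ = 0;
};

// Each reader owns only a cursor, so the audio itself is stored once.
class StreamReader {
public:
    explicit StreamReader(const RingStream& stream)
        : stream_(stream), cursor_(stream.oldest()) {}

    // Returns up to maxSamples unread samples. A reader that fell too far
    // behind skips ahead to the oldest sample still in the buffer.
    std::vector<std::int16_t> read(std::size_t maxSamples) {
        if (cursor_ < stream_.oldest()) {
            cursor_ = stream_.oldest();
        }
        std::vector<std::int16_t> out;
        while (out.size() < maxSamples && cursor_ < stream_.written()) {
            out.push_back(stream_.at(cursor_));
            ++cursor_;
        }
        return out;
    }

private:
    const RingStream& stream_;
    std::uint64_t cursor_;
};
```

In this sketch, a wake word detector and an audio sender could each hold their own `StreamReader` over the same `RingStream` and read at different speeds without copying the stored audio.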
Wake Word Engine (WWE)
The WWE is software that constantly monitors the SDS, waiting for a preconfigured wake word. When the WWE detects the correct wake word, it notifies the AIP to begin reading the audio. When using the AVS Device SDK, the wake word is always "Alexa."
The WWE consists of the following two binary interfaces:
- Interface 1 – Handles general wake word detection.
- Interface 2 – Handles specific wake word models.
Audio Input Processor (AIP)
Responsibilities of the AIP include reading audio from the SDS and then sending it to AVS for processing. The AIP also includes the logic to switch between different audio input sources. The AIP triggers with the following inputs:
- External audio – Captured with on-device microphones, remote microphones and other audio input sources.
- Tap-to-Talk – Captured with designated Tap-to-Talk inputs.
- Speech directive – Sent from AVS to continue an interaction. For example, multiturn dialog.
When triggered, the AIP continues to stream audio until it receives a Stop directive or times out. AVS can only receive one audio input source at any given time.
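The trigger and stop rules for audio capture can be pictured with a small state holder. The sketch below is an illustration only: `CaptureSession`, `Trigger`, and the timeout parameter are hypothetical names, and the SDK's real AIP tracks more state than a single flag.

```cpp
#include <chrono>

// The three trigger types listed for the AIP.
enum class Trigger { ExternalAudio, TapToTalk, SpeechDirective };

// Hypothetical capture session: one input source at a time, ended by a
// Stop directive or by a timeout.
class CaptureSession {
public:
    using Clock = std::chrono::steady_clock;

    explicit CaptureSession(std::chrono::milliseconds timeout) : timeout_(timeout) {}

    // Starts streaming from the given source. Returns false if another
    // source is already streaming, because only one source is allowed.
    bool start(Trigger source, Clock::time_point now) {
        if (streaming_) {
            return false;
        }
        streaming_ = true;
        source_ = source;
        startedAt_ = now;
        return true;
    }

    // Called when a Stop directive arrives.
    void onStopDirective() { streaming_ = false; }

    // Reports whether audio is still streaming at time `now`, ending the
    // session first if the timeout has elapsed.
    bool isStreaming(Clock::time_point now) {
        if (streaming_ && now - startedAt_ >= timeout_) {
            streaming_ = false;
        }
        return streaming_;
    }

    // Source of the most recently started session.
    Trigger source() const { return source_; }

private:
    std::chrono::milliseconds timeout_;
    bool streaming_ = false;
    Trigger source_ = Trigger::ExternalAudio;
    Clock::time_point startedAt_{};
};
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the sketch easy to test.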
Alexa Communications Library (ACL)
The ACL manages the network connection between the SDK and AVS. The ACL performs the following key functions:
- Establishes and maintains long-lived persistent connections with AVS. ACL adheres to the messaging specification detailed in Managing an HTTP/2 Connection with AVS.
- Provides message sending and receiving capabilities. These capabilities include support for JSON-formatted text and binary audio content. For more details, see Structuring an HTTP/2 Request to AVS.
- Forwards incoming directives to the ADSL.
- Handles disconnections and reconnections. If the device disconnects, the ACL automatically attempts to reconnect for you.
- Manages secure connections.
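The text above doesn't say how the ACL schedules its reconnection attempts. As an assumption for illustration only, the sketch below shows one common approach, capped exponential backoff. The function name `retryDelay` and the default base and cap values are hypothetical, not taken from the SDK.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical retry schedule: the delay doubles after each failed attempt
// until it reaches the cap. Attempt 0 is the first retry.
inline std::chrono::milliseconds retryDelay(
    unsigned attempt,
    std::chrono::milliseconds base = std::chrono::milliseconds(250),
    std::chrono::milliseconds cap = std::chrono::milliseconds(60000)) {
    std::chrono::milliseconds delay = base;
    // Stop doubling once the cap is reached to avoid overflow.
    for (unsigned i = 0; i < attempt && delay < cap; ++i) {
        delay *= 2;
    }
    return std::min(delay, cap);
}
```

A production schedule would typically also add random jitter so that many devices don't reconnect at the same moment.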
Alexa Directive Sequencer Library (ADSL)
The ADSL performs the following key functions:
- Accepts directives from the ACL.
- Manages the lifecycle of each directive, including queuing, reordering, or canceling directives as necessary.
- Forwards directives to the appropriate Capability Agents by examining the directive header and reading the namespace of the interface.
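Routing by namespace can be sketched as a lookup table from namespace to handler. This is a simplified illustration, not the ADSL's real interface: `Directive` and `DirectiveRouter` are hypothetical types, and queuing, reordering, and cancellation are left out.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified directive. Real directives carry more header fields.
struct Directive {
    std::string nameSpace;  // for example "SpeechSynthesizer"
    std::string name;       // for example "Speak"
    std::string payload;    // raw JSON payload, left unparsed in this sketch
};

// Hypothetical router that forwards each directive to the handler
// registered for its namespace.
class DirectiveRouter {
public:
    using Handler = std::function<void(const Directive&)>;

    // Registers (or replaces) the handler for one namespace.
    void registerHandler(const std::string& nameSpace, Handler handler) {
        handlers_[nameSpace] = std::move(handler);
    }

    // Returns true if a handler exists for the directive's namespace and was called.
    bool route(const Directive& directive) const {
        auto it = handlers_.find(directive.nameSpace);
        if (it == handlers_.end()) {
            return false;
        }
        it->second(directive);
        return true;
    }

private:
    std::map<std::string, Handler> handlers_;
};
```

Returning `false` for an unknown namespace lets the caller decide how to report a directive that no handler supports.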
Capability Agents
A Capability Agent performs the desired action on a device. Capability Agents map directly to interfaces supported by AVS. For example, if you ask Alexa to play a song, the Capability Agent loads the song into your media player and plays it. A Capability Agent performs the following two tasks:
- Receives the appropriate directive from the ADSL.
- Reads the payload and performs the requested action on the device.
The following table maps the core AVS Interfaces to their equivalent AVS Device SDK Capability Agents. For a complete list of SDK interfaces, browse the SDK source files on GitHub.
|AVS Interface||SDK Capability Agent||Description|
|Alerts||Alerts||Setting, stopping, and deleting timers and alarms.|
|AudioPlayer||AudioPlayer||Managing and controlling audio playback.|
|Bluetooth||Bluetooth||Managing Bluetooth connections between peer devices and Alexa Built-in products.|
|DoNotDisturb||DoNotDisturb||Enabling the Do Not Disturb feature.|
|EqualizerController||Equalizer||Adjust equalizer settings, such as decibel (dB) levels and modes.|
|InteractionModel||InteractionModel||Enable a client to support complex interactions initiated by Alexa, such as Alexa Routines.|
|Notifications||Notifications||Displaying notification indicators.|
|PlaybackController||PlaybackController||Navigating a playback queue with GUI or buttons.|
|Multi-room Music||Multi-room music||Implement the Multi-room Music (MRM) feature.|
|Speaker||SpeakerManager||Volume control, including mute and unmute.|
|SpeechRecognizer||Audio Input Processor||Speech capture.|
|SpeechSynthesizer||SpeechSynthesizer||Alexa speech output.|
|System||System||Communicating product status/state to AVS.|
|TemplateRuntime||TemplateRuntime||Rendering visual metadata.|
Activity Focus Manager Library (AFML)
The AFML makes sure the SDK handles directives in the correct order. It determines which capability has control over the input and output of the device at any time. For example, if you're playing music and an alarm goes off on your device, the alarm takes focus over the music. The music pauses and the alarm rings.
Focus uses a concept called channels to govern the prioritization of audiovisual inputs and outputs.
Channels exist in the foreground or background. At any given time, only one channel can inherit the foreground state and take focus. If more than one channel is active, a device must respect the following priority order: Dialog > Alerts > Content. When a channel in the foreground becomes inactive, the next active channel in the priority order moves into the foreground.
Focus management isn't specific to Capability Agents or Directive Handlers. Agents that aren't related to Alexa can also use it. When every agent goes through the AFML, focus stays consistent across the device.
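The channel priority rule (Dialog > Alerts > Content) can be sketched as follows. This is an illustration under simplifying assumptions, not the AFML's real interface: `FocusSketch` is a hypothetical class that tracks only whether each channel is active and derives foreground or background from that.

```cpp
#include <array>
#include <cstddef>

// Channels in priority order: a lower value means a higher priority.
enum class Channel { Dialog = 0, Alerts = 1, Content = 2 };

// Hypothetical focus tracker for the three channels named above.
class FocusSketch {
public:
    // Marks a channel as active, for example when an alarm starts ringing.
    void acquire(Channel channel) { active_[index(channel)] = true; }

    // Marks a channel as inactive, for example when the alarm is dismissed.
    void release(Channel channel) { active_[index(channel)] = false; }

    // A channel is in the foreground if it is active and no higher-priority
    // channel is active.
    bool isForeground(Channel channel) const {
        if (!active_[index(channel)]) {
            return false;
        }
        for (std::size_t i = 0; i < index(channel); ++i) {
            if (active_[i]) {
                return false;
            }
        }
        return true;
    }

    // An active channel that is not in the foreground is in the background.
    bool isBackground(Channel channel) const {
        return active_[index(channel)] && !isForeground(channel);
    }

private:
    static std::size_t index(Channel channel) {
        return static_cast<std::size_t>(channel);
    }

    std::array<bool, 3> active_{};  // all channels start inactive
};
```

With music on the Content channel, acquiring Alerts moves Content to the background, and releasing Alerts brings Content back to the foreground, which matches the alarm example above.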
Media Player
The media player isn't actually a component of the AVS Device SDK. The SDK comes with wrappers for GStreamer and Android Media Player. If you want to use a different media player, you must build a wrapper for it with the MediaPlayer interface. For more details about custom media players, see media player.
- Review the AVS Terms and Agreements.
- The sound files – known as earcons – associated with the sample project are for prototyping purposes. For implementation and design guidance for commercial products, see Designing for AVS and AVS UX Guidelines.
Security best practices
All Alexa products should adopt the Security Best Practices for Alexa. When building the AVS Device SDK, you should also adhere to the following security principles.
- Protect configuration parameters, such as those found in the AlexaClientSDKConfig.json file, from tampering and inspection.
- Protect executable files and processes from tampering and inspection.
- Protect persistent states of the SDK from tampering and inspection.
- Your C++ implementation of AVS Device SDK interfaces must not retain locks, crash, stop responding, or throw exceptions.
- Use exploit mitigation flags and memory randomization techniques when you compile your source code, so that buffer overflows and memory corruption are harder to exploit.
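The rule that interface implementations must not retain locks or throw exceptions can be illustrated with a small callback. The sketch below is a hypothetical example: `VolumeObserver` and `onVolumeText` are illustrative names, not SDK interfaces. It shows a scoped lock that is always released and a `try`/`catch` that keeps exceptions from escaping.

```cpp
#include <exception>
#include <mutex>
#include <string>

// Hypothetical observer used only to illustrate the pattern.
class VolumeObserver {
public:
    // noexcept documents the contract; the try/catch enforces it.
    // Returns false instead of throwing when the input can't be parsed.
    bool onVolumeText(const std::string& text) noexcept {
        try {
            // std::stoi may throw std::invalid_argument or std::out_of_range.
            int parsed = std::stoi(text);
            // The lock is released automatically when this scope exits,
            // so the callback never returns while still holding it.
            std::lock_guard<std::mutex> lock(mutex_);
            volume_ = parsed;
            return true;
        } catch (const std::exception&) {
            return false;
        }
    }

    // Returns the last successfully parsed volume.
    int volume() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return volume_;
    }

private:
    mutable std::mutex mutex_;
    int volume_ = 0;
};
```

Parsing before taking the lock keeps the critical section short, which also reduces the chance that the callback stops responding while other threads wait.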