Overview

The Alexa Voice Service (AVS) Device SDK provides you with a set of C++ libraries to build an Alexa Built-in product. With these libraries, your device has direct access to cloud-based Alexa capabilities to receive voice responses instantly. Your device can be almost anything – a smartwatch, a speaker, headphones – the choice is yours.

The SDK is modular and abstract. It provides separate components to handle necessary Alexa functionality including processing audio, maintaining persistent connections, and managing Alexa interactions.

Each component exposes Alexa APIs to customize your device integrations as needed. The SDK also includes a Sample App, so you can test interactions before integration.

Why use the SDK?

  • Free - Source code provided on the AVS Device SDK GitHub page.
  • Modular - Add or remove components as needed.
  • Feature rich - New Alexa features added with each release.
  • Community involvement - The open source model helps drive development.

Release notes

For a complete list of releases, updates, and known bugs, see the SDK release notes.

Version         Release date
1.17.0          December 10, 2019
1.16.0          October 25, 2019
1.15.0          September 25, 2019
1.14.0          July 09, 2019
1.13.0          May 05, 2019
1.12.1          April 02, 2019
1.12.0          February 25, 2019
Older versions  SDK release notes

SDK architecture

The following diagram illustrates components of the AVS Device SDK and how data flows between them.

The green boxes are official components of the SDK – they include the following items:

  • Audio Input Processor (AIP)
  • Shared Data Stream (SDS)
  • Alexa Communication Library (ACL)
  • Alexa Directive Sequencer Library (ADSL)
  • Activity Focus Manager Library (AFML)
  • Capability Agent

The white and blue boxes aren't official components and depend on external libraries – these include the following items:

  • Audio Signal Processor (ASP)
  • Wake Word Engine (WWE)
  • Media Player

For general information about Alexa and client interaction, see the Interaction Model.

[Diagram: AVS Device SDK architecture]

Here's an example interaction with the SDK. This process might vary if you've added or removed any components.

  1. You ask a question, "Alexa, what's the weather?"
  2. The microphone captures the audio and writes it to the SDS.
  3. The WWE is always monitoring the SDS. When the WWE detects the wake word Alexa, it sends the audio to the AIP.
  4. The AIP sends a SpeechRecognizer event to AVS using the ACL.
  5. AVS processes the event and sends the appropriate directive back down through the ACL. The ACL then passes the directive to the ADSL.
  6. The ADSL examines the directive header and determines which Capability Agent it must call.
  7. When the Capability Agent activates, it requests focus from the AFML.
  8. The Media Player plays the audio from the directive. For this example, Alexa responds, "The weather is nine degrees and cloudy with a chance of rain."

Here are some details about each individual component in the sequence.

Audio Signal Processor (ASP)

The ASP isn't actually a component of the AVS Device SDK. It's software that runs on a system on a chip (SoC) or firmware on a dedicated digital signal processor (DSP). Its job is to clean up the audio and produce a single audio stream, even if your device uses a multi-microphone array. Techniques used to clean the audio include acoustic echo cancellation (AEC), noise suppression, beamforming, voice activity detection (VAD), dynamic range compression (DRC), and equalization.

Shared Data Stream (SDS)

The SDS is a single-producer, multi-consumer audio input buffer that transports data between a single writer and one or more readers. This ring buffer moves data through the different components of the SDK without duplication. Because the buffer continuously overwrites itself, it minimizes the memory footprint. The SDS operates on product-specific, user-specified memory segments, allowing for interprocess communication. Keep in mind, the writer and readers might be in different threads or processes.

SDS handles these key tasks:

  1. Receives audio from the ASP and passes it to the WWE.
  2. Passes audio from the WWE to the ACL, which sends it to AVS for processing.
  3. Receives data attachments back from the ACL and passes them to the appropriate Capability Agent.

Wake Word Engine (WWE)

The WWE is software that constantly monitors the SDS, waiting for a preconfigured wake word. When the WWE detects the correct wake word, it notifies the AIP to begin reading the audio. When using the AVS Device SDK, the wake word is always "Alexa." The SDK includes connectors for the KITT.AI and Sensory wake word engines; however, you can use any wake word engine you choose.

The WWE consists of the following two binary interfaces:

  • Interface 1 - Handles general wake word detection.
  • Interface 2 - Handles specific wake word models.

Audio Input Processor (AIP)

The AIP reads audio from the SDS and sends it to AVS for processing. The AIP also includes the logic to switch between different audio input sources. The AIP triggers on the following inputs:

  • External audio - Captured with on-device microphones, remote microphones, and other audio input sources.
  • Tap-to-Talk - Captured with designated Tap-to-Talk inputs.
  • Speech directive - Sent from AVS to continue an interaction, for example, in multiturn dialogue.

When triggered, the AIP continues to stream audio until it receives a Stop directive or times out. AVS can receive only one audio input source at any given time.

Alexa Communications Library (ACL)

The ACL manages the network connection between the SDK and AVS. The ACL performs the following key functions:

  • Establishes and maintains long-lived persistent connections with AVS. ACL adheres to the messaging specification detailed in Managing an HTTP/2 Connection with AVS.
  • Provides message sending and receiving capabilities. These capabilities include support for JSON-formatted text and binary audio content. For more information, see Structuring an HTTP/2 Request to AVS.
  • Forwards incoming directives to the ADSL.
  • Handles disconnects and reconnections. If the device disconnects, the ACL automatically attempts to reconnect for you.
  • Manages secure connections.

Alexa Directive Sequencer Library (ADSL)

The ADSL manages incoming directives, as outlined in the AVS Interaction Model. The ADSL performs the following key functions:

  1. Accepts directives from the ACL.
  2. Manages the lifecycle of each directive, including queuing, reordering, or canceling directives as necessary.
  3. Forwards directives to the appropriate Capability Agents by examining the directive header and reading the namespace of the interface.

Capability Agents

A Capability Agent performs the desired action on a device. Capability Agents map directly to interfaces supported by AVS. For example, if you ask Alexa to play a song, a Capability Agent is what loads the song into your media player and plays it. A Capability Agent performs the following two tasks:

  1. Receives the appropriate directive from the ADSL.
  2. Reads the payload and performs the requested action on the device.

The following table maps AVS Interfaces to their equivalent AVS Device SDK Capability Agents.

AVS Interface        SDK Capability Agent   Description
Alerts               Alerts                 Setting, stopping, and deleting timers and alarms.
AudioPlayer          AudioPlayer            Managing and controlling audio playback.
Bluetooth            Bluetooth              Managing Bluetooth connections between peer devices and Alexa-enabled products.
DoNotDisturb         DoNotDisturb           Enabling the Do Not Disturb feature.
EqualizerController  EqualizerController    Adjusting equalizer settings, such as decibel (dB) levels and modes.
InteractionModel     InteractionModel       Enabling a client to support complex interactions initiated by Alexa, such as Alexa Routines.
Notifications        Notifications          Displaying notification indicators.
PlaybackController   PlaybackController     Navigating a playback queue via GUI or buttons.
Multi-Room Music     Multi-Room Music       Implementing the Multi-Room Music (MRM) feature.
Speaker              SpeakerManager         Controlling volume, including mute and unmute.
SpeechRecognizer     Audio Input Processor  Capturing speech.
SpeechSynthesizer    SpeechSynthesizer      Producing Alexa speech output.
System               System                 Communicating product status/state to AVS.
TemplateRuntime      TemplateRuntime        Rendering visual metadata.

Activity Focus Manager Library (AFML)

The AFML makes sure the SDK handles directives in the correct order. It determines which capability has control over the input and output of the device at any time. For example, if you're playing music and an alarm goes off on your device, the alarm takes focus over the music. The music pauses and the alarm rings.

Focus uses a concept called channels to govern the prioritization of audiovisual inputs and outputs.

Channels exist in the foreground or background. At any given time, only one channel can hold the foreground and take focus. If more than one channel is active, the device must respect the following priority order: Dialog > Alerts > Content. When the foreground channel becomes inactive, the next active channel in the priority order moves into the foreground.

Focus management isn't specific to Capability Agents or Directive Handlers; agents unrelated to Alexa also use it. Focus management enables all agents using the AFML to share a consistent view of focus across a device.

Media Player

The media player isn't actually a component of the AVS Device SDK. The SDK comes with wrappers for GStreamer and the Android Media Player. If you want to use a different media player, you must build a wrapper for it with the MediaPlayer interface.

Important considerations

  • Review the AVS Terms and Agreements.
  • The earcons associated with the sample project are for prototyping purposes only. For implementation and design guidance for commercial products, see Designing for AVS and the AVS UX Guidelines.
  • Use the following contacts for licensing inquiries:
    • Sensory for information about TrulyHandsFree licensing.
    • KITT.AI for information about SnowBoy licensing.
  • IMPORTANT: The Sensory wake word engine referenced in the SDK documents is time-limited: code linked against it stops working when the library expires. The included library has an expiry date of at least 120 days. See Sensory's GitHub page for more information.

Security best practices

All Alexa products should adopt the Security Best Practices for Alexa. When building the AVS Device SDK, you should adhere to the following security principles.

  • Protect configuration parameters, such as those found in the AlexaClientSDKConfig.json file, from tampering and inspection.
  • Protect executable files and processes from tampering and inspection.
  • Protect storage of the SDK's persistent states from tampering and inspection.
  • Your C++ implementation of AVS Device SDK interfaces must not retain locks, crash, hang, or throw exceptions.
  • Use exploit mitigation flags and memory randomization techniques when you compile your source code to prevent vulnerabilities from exploiting buffer overflows and memory corruptions.

Learn more

Watch the getting started tutorial to learn how the SDK works and how to set it up.

Tutorial