Streaming Requirements for Cloud-Based Wake Word Verification

Cloud-Based Wake Word Verification is a feature that improves wake word accuracy for Alexa-enabled products by reducing false wakes caused by words that sound similar to the wake word. For example, here are a few words that may cause a false wake for “Alexa”: “Alex”, “election”, “Alexis”. Cloud-Based Wake Word Verification also detects media mentions of the wake word, such as the mention of “Alexa” in an Amazon commercial.

Overview

Voice-initiated products start to stream user speech to AVS when a wake word is detected by the product’s wake word engine; the stream is closed when the user stops speaking or the user’s intent has been identified and the service returns a StopCapture directive. For cloud-based wake word verification to work, the audio streamed to AVS must include 500 milliseconds of pre-roll, the wake word, and any user speech that is captured until a StopCapture directive is received. This allows AVS to verify the wake word included in the stream, reducing the number of erroneous responses due to false wakes.
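The requirement above means the client must begin streaming from a point 500 milliseconds before the detected wake word start. A minimal sketch of that calculation, assuming a 16 kHz sample rate (the constant names are illustrative, not part of the AVS API):

```python
# Sketch: given a wake word detection index, compute where the streamed audio
# must begin so it includes 500 ms of pre-roll. Assumes 16 kHz sample rate.
SAMPLE_RATE_HZ = 16000
PRE_ROLL_MS = 500

def stream_start_index(wake_word_start: int) -> int:
    """Index (in samples) at which streaming to AVS should begin."""
    pre_roll_samples = PRE_ROLL_MS * SAMPLE_RATE_HZ // 1000  # 8000 samples
    return max(0, wake_word_start - pre_roll_samples)

# Wake word detected starting at sample 24000 -> begin streaming at sample 16000.
```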

  • Pre-roll, or the audio captured prior to the detected start of the wake word, is used to calibrate the ambient noise level of the recording, which enhances speech recognition.
  • Inclusion of the wake word in the stream allows AVS to perform cloud-based wake word verification, which reduces false wakes (the number of times the wake word engine falsely recognizes the wake word).
  • If the wake word is not detected during cloud-based wake word verification, the audio samples are discarded.

This document recommends implementing a shared memory ring buffer for reading and writing audio samples, and describes an update to the SpeechRecognizer.Recognize message structure that adds wake word start and stop indices and a new initiator parameter, which informs AVS how an interaction with Alexa was triggered. Keep in mind, this is only a recommendation; alternative implementations are allowed as long as they provide the requisite functionality to identify 500 milliseconds of pre-roll and the wake word start and stop indices, and to stream the user’s request in its entirety.

Wake Word Indices

The Recognize event is used to stream user speech to AVS and translate that speech into one or more directives. It is a multipart message that includes a JSON-formatted object and a binary audio attachment. As recommended above, the stream should contain the pre-roll, wake word, and the user’s utterance. Chunking is recommended to reduce latency; the stream should contain 10ms of captured audio per chunk (320 bytes).
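The 320-byte figure follows from the audio format: 16 kHz, 16-bit, mono audio yields 160 samples (320 bytes) per 10 ms. A minimal chunking sketch under those assumptions (the function name is illustrative):

```python
# Sketch: split captured PCM audio into 10 ms chunks for streaming.
# Assumes 16 kHz, 16-bit (2-byte), mono audio: 10 ms = 160 samples = 320 bytes.
CHUNK_BYTES = 16000 // 100 * 2  # 160 samples * 2 bytes per sample = 320 bytes

def chunk_audio(pcm: bytes, chunk_size: int = CHUNK_BYTES):
    """Yield successive chunk_size slices; a final partial chunk is kept."""
    for offset in range(0, len(pcm), chunk_size):
        yield pcm[offset:offset + chunk_size]

# One second of audio at 16 kHz / 16-bit mono is 32,000 bytes -> 100 chunks.
chunks = list(chunk_audio(bytes(32000)))
```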

If a user request is initiated using a wake word, the Recognize event must contain the start and stop indices for the occurrence of the wake word. The indices are relative to the start of the audio stream and are expressed in samples.

For example, with a 16 kHz sample rate, a wake word that starts at the 500 millisecond mark and has a duration of 500 milliseconds will have the following indices:

  • startIndexInSamples - 8000
  • endIndexInSamples - 16000
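The conversion from milliseconds to sample indices can be sketched as follows, assuming the required 16 kHz sample rate (the helper name is illustrative):

```python
# Sketch: convert wake word boundaries (milliseconds from the start of the
# stream) into sample indices for the Recognize event. 16 kHz sample rate.
SAMPLE_RATE_HZ = 16000

def ms_to_samples(ms: int, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    return ms * sample_rate // 1000

# Wake word starting at 500 ms with a 500 ms duration:
start_index = ms_to_samples(500)        # startIndexInSamples = 8000
end_index = ms_to_samples(500 + 500)    # endIndexInSamples = 16000
```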

Sample Message

{
  "context": [
    // This is an array of context objects that are used to communicate the
    // state of all client components to Alexa. See Context for details.
  ],
  "event": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "Recognize",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "profile": "{{STRING}}",
      "format": "{{STRING}}",
      "initiator": {
        "type": "{{STRING}}",
        "payload": {
          "wakeWordIndices": {
            "startIndexInSamples": {{LONG}},
            "endIndexInSamples": {{LONG}}
          }   
        }
      }
    }
  }
}
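The JSON portion of a wake-word-initiated Recognize event can be assembled as in the following sketch. The context array and the binary audio attachment are omitted, and the message and dialog request IDs are generated here purely for illustration:

```python
import json
import uuid

# Sketch: assemble the JSON portion of a wake-word-initiated Recognize event.
# The context array (client component states) and the binary audio attachment
# of the multipart message are omitted; IDs are generated for illustration.
def build_recognize_event(start_index: int, end_index: int) -> dict:
    return {
        "context": [],  # populate with all client component states
        "event": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "messageId": str(uuid.uuid4()),
                "dialogRequestId": str(uuid.uuid4()),
            },
            "payload": {
                "profile": "FAR_FIELD",
                "format": "AUDIO_L16_RATE_16000_CHANNELS_1",
                "initiator": {
                    "type": "WAKEWORD",
                    "payload": {
                        "wakeWordIndices": {
                            "startIndexInSamples": start_index,
                            "endIndexInSamples": end_index,
                        }
                    },
                },
            },
        },
    }

event = build_recognize_event(8000, 16000)
body = json.dumps(event)
```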

Context

This event requires your product to report the status of all client component states to Alexa in the context object. For additional information see Context.

Payload Parameters

  • profile (string) - Identifies the Automatic Speech Recognition (ASR) profile associated with your product. AVS supports three distinct ASR profiles optimized for user speech from varying distances. Accepted values: "CLOSE_TALK", "NEAR_FIELD", "FAR_FIELD".
  • format (string) - Identifies the format of captured audio. Accepted value: "AUDIO_L16_RATE_16000_CHANNELS_1".
  • initiator (object) - Includes information about how an interaction with AVS was initiated. IMPORTANT: initiator is required i) for wake word enabled products that use cloud-based wake word verification, and ii) when it is included in an ExpectSpeech directive.
  • initiator.type (string) - Represents the action taken by the user to start streaming audio to AVS. Accepted values: "PRESS_AND_HOLD", "TAP", and "WAKEWORD".
  • initiator.payload (object) - Includes information about the initiator, such as the wake word start and stop indices.
  • initiator.payload.wakeWordIndices (object) - Required only for wake word enabled products that use cloud-based wake word verification. Includes startIndexInSamples and endIndexInSamples.
  • initiator.payload.wakeWordIndices.startIndexInSamples (long) - Represents the index in the audio stream where the wake word starts (in samples). The start index should be accurate to within 50 milliseconds of the start of the detected wake word.
  • initiator.payload.wakeWordIndices.endIndexInSamples (long) - Represents the index in the audio stream where the wake word ends (in samples). The end index should be accurate to within 150 milliseconds of the end of the detected wake word.

For additional information, see Enable Cloud-Based Wake Word Verification.

Shared Memory Ring Buffer

A shared memory ring buffer is one implementation that satisfies the requirement to stream a continuous utterance to AVS that includes 500 milliseconds of pre-roll, the wake word, and the user’s request for cloud-based wake word verification. This approach reduces latency and avoids copying audio samples multiple times.

The diagram below illustrates one approach; alternative implementations are allowed as long as the requisite functionality is maintained:

Ring Buffer
  1. Audio Subsystem - Usually middleware, such as Advanced Linux Sound Architecture (ALSA), that provides APIs to open a stream from a recording device and receive an audio sample stream.
  2. Audio Capture - A process that opens a recording device, and writes audio samples into the shared memory ring buffer.
  3. Shared Memory Ring Buffer - A thread-safe shared memory ring buffer object that is accessible by multiple processes. Access to this memory block should support at least one writer and two readers simultaneously. You may choose to implement this as a shared library.
    • Write API - Used by the recording process to write audio samples to the shared memory ring buffer. Samples should be written in such a way that the oldest audio samples are overwritten. This is an example of a typical API signature: WriteSamples(BufferIn, BufferInLength).
    • Read API - Used by the wake word engine and the AVS client to read from specific locations in the shared memory ring buffer. A typical implementation represents location as a 64-bit integer. For example, reading 200 samples starting at location 100 returns the 100th through 299th samples that were written to the shared memory ring buffer. This is an example of a typical API signature: ReadSamples(Location, BufferOut, BufferOutLength). Note: Reading from an old location that was overwritten with new audio samples will result in an unintended sample read.
  4. Wake Word Engine - The component that reads the audio samples written to the shared memory ring buffer and analyzes the audio samples for occurrences of the wake word. The procedure of reading and processing audio samples occurs in a loop. For example, the loop reads N samples starting at location 0; then analyzes the samples for an occurrence of the wake word. In the next iteration, it reads and analyzes the next N samples. When an occurrence of the wake word is detected, the wake word engine must identify two locations: 1) the start of the wake word, and 2) the end of the wake word. These values are stored as startIndexInSamples and endIndexInSamples.
  5. WakeWordDetectedSignal - A signal sent from the wake word engine to the AVS client when an occurrence of the wake word is detected. This signal includes two locations, the startIndexInSamples and endIndexInSamples.
  6. AVS Client - The client receives the WakeWordDetectedSignal, extracts the startIndexInSamples, then reads and streams audio samples to AVS including 500 milliseconds of pre-roll.
    • The pre-roll is the number of samples corresponding to 500 milliseconds. For example, if the audio samples were recorded at a rate of 16 kilohertz (kHz), then pre-roll is 8,000 samples.
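The write-overwrite and read-by-absolute-location behavior described in steps 3 through 6 can be sketched as follows. This is a single-process illustration of the indexing only; a real implementation would place the buffer in shared memory with appropriate locking, and the class and method names here are illustrative:

```python
# Sketch: a minimal ring buffer keyed by absolute sample location, mirroring
# the Write/Read APIs described above. A production version would live in
# shared memory with proper synchronization; this only shows the indexing.
class RingBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buf = [0] * capacity
        self.total_written = 0  # absolute location of the next sample to write

    def write_samples(self, samples):
        """Append samples, overwriting the oldest samples on wrap-around."""
        for s in samples:
            self.buf[self.total_written % self.capacity] = s
            self.total_written += 1

    def read_samples(self, location: int, count: int):
        """Read `count` samples starting at absolute `location`.

        Raises ValueError if the range was already overwritten or not yet
        written (the "unintended sample read" case noted above)."""
        if location < self.total_written - self.capacity:
            raise ValueError("samples at this location were overwritten")
        if location + count > self.total_written:
            raise ValueError("samples not yet written")
        return [self.buf[i % self.capacity] for i in range(location, location + count)]

rb = RingBuffer(capacity=16000)       # holds 1 second of audio at 16 kHz
rb.write_samples(range(24000))        # the oldest 8000 samples are overwritten
recent = rb.read_samples(20000, 100)  # reads absolute samples 20000..20099
```

On a wake word detection, the AVS client would subtract the 8,000-sample pre-roll from startIndexInSamples and read from that location onward.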

Next Steps