Wake Word Verification Requirements

Important: Alexa Voice Service (AVS) developer tools are no longer generally available for Alexa Built-in. Please visit the Works with Alexa program if you are interested in building devices that connect to Alexa.

Cloud-based wake word verification improves wake word accuracy for devices that implement the Alexa Voice Service (AVS) by reducing false device wakes caused by utterances that sound similar to the wake word. For example, words that might cause a false wake for "Alexa" include "Alex", "election", and "Alexis." Cloud-based wake word verification also detects media mentions of the wake word, such as a mention of "Alexa" in an Amazon commercial.

Requirements for Cloud-Based wake word verification

Voice-initiated devices begin to stream user speech to AVS when the wake word engine detects a spoken wake word, such as "Alexa." The stream closes when the user stops speaking or when AVS identifies a user intent, and AVS returns a StopCapture directive to the device. Cloud-based wake word verification has the following requirements:

Wake word – Include the wake word in the stream so that AVS can perform cloud-based wake word verification, which reduces false wakes. If AVS can't detect the wake word during cloud-based wake word verification, AVS discards the utterance.
500 milliseconds of pre-roll – Pre-roll is the audio captured immediately before the wake word. It helps calibrate the ambient noise level of the recording, which enhances speech recognition.
User speech – Any user speech that the device captures until receiving a StopCapture directive. This allows AVS to verify the wake word included in the stream, reducing the number of erroneous responses due to false wakes.

Read the following sections to learn how to implement a shared memory ring buffer for writing and reading audio samples and to include the start and stop indices for wake word detection in each Recognize event sent to AVS.

Wake Word indexes

The Recognize event streams user speech to AVS, which translates that speech into one or more directives. The event is a multipart message that includes a JSON-formatted object and a binary audio attachment. The stream must contain the pre-roll, the wake word, and the user's utterance. Use chunking to reduce latency; the stream should contain 10 ms of captured audio per chunk (320 bytes).
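The 320-byte figure follows from the audio format: 16,000 samples per second, 2 bytes per sample, one channel, 10 ms per chunk. The following Python sketch shows the arithmetic and one way to split captured audio into chunks. The names `CHUNK_BYTES` and `iter_chunks` are hypothetical and are not part of AVS.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # from AUDIO_L16_RATE_16000_CHANNELS_1
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM, one channel
CHUNK_MS = 10

# 16,000 samples/s * 0.010 s = 160 samples; 160 samples * 2 bytes = 320 bytes
CHUNK_BYTES = SAMPLE_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive chunks of captured audio; the last chunk may be shorter."""
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]
```

Each yielded chunk would then be written to the audio attachment of the Recognize event by whatever HTTP client the device uses; that part is outside the scope of this sketch.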

When the user request is voice-initiated, the initiator.type SHALL be WAKEWORD and the Recognize event must contain the start and stop indices for the occurrence of the wake word. The indices are relative to the start of the audio stream and are expressed in samples.

Important: The pre-roll and wake word start and stop indexes must be precise. The start index should be accurate to within 50ms of wake word detection. The end index should be accurate to within 150ms of the end of the detected wake word.

For example, a wake word that starts at the 500-millisecond mark and has a duration of 500 milliseconds has the following indexes with a 16 kHz sample rate:

startIndexInSamples - 8000
endIndexInSamples - 16000

Note: With 500 milliseconds of pre-roll at a 16 kHz sample rate, the startIndexInSamples will always be 8,000. The endIndexInSamples will vary depending on how long it takes the user to speak the wake word.
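The conversion from milliseconds to sample indexes in the example above can be sketched in Python as follows. The helper names `ms_to_samples` and `wake_word_indices` are hypothetical, and the code assumes the stream begins with exactly the stated pre-roll.

```python
SAMPLE_RATE_HZ = 16_000


def ms_to_samples(milliseconds: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return milliseconds * sample_rate_hz // 1000


def wake_word_indices(pre_roll_ms: int, wake_word_ms: int,
                      sample_rate_hz: int = SAMPLE_RATE_HZ) -> dict:
    """Return stream-relative indexes for a wake word that follows the pre-roll."""
    start = ms_to_samples(pre_roll_ms, sample_rate_hz)
    end = start + ms_to_samples(wake_word_ms, sample_rate_hz)
    return {"startIndexInSamples": start, "endIndexInSamples": end}


# The accuracy requirements expressed in samples at 16 kHz.
START_TOLERANCE_SAMPLES = ms_to_samples(50)    # 800 samples
END_TOLERANCE_SAMPLES = ms_to_samples(150)     # 2,400 samples
```

Calling `wake_word_indices(500, 500)` reproduces the values shown above: a start index of 8000 and an end index of 16000.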

Sample Message

{
  "context": [
    // This is an array of context objects that are used to communicate the
    // state of all client components to Alexa. See Context for details.
  ],
  "event": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "Recognize",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "profile": "{{STRING}}",
      "format": "{{STRING}}",
      "initiator": {
        "type": "{{STRING}}",
        "payload": {
          "wakeWordIndices": {
            "startIndexInSamples": {{LONG}},
            "endIndexInSamples": {{LONG}}
          },
          "token": "{{STRING}}"  
        }
      }
    }
  }
}

Context

This event requires your product to report the status of all client component states to Alexa in the context object.

Payload Parameters

Parameter	Description	Type
profile	Identifies the Automatic Speech Recognition (ASR) profile associated with your product. AVS supports three distinct ASR profiles optimized for user speech from varying distances. Accepted values: `CLOSE_TALK`, `NEAR_FIELD`, `FAR_FIELD`.	string
format	Identifies the format of captured audio. Accepted value: `AUDIO_L16_RATE_16000_CHANNELS_1`	string
initiator	Lets Alexa know how an interaction was initiated.	object
initiator.type	Represents the action taken by a user to initiate an interaction with Alexa. Accepted values: `PRESS_AND_HOLD`, `TAP`, and `WAKEWORD`. Additionally, if an `initiator.type` is provided in an `ExpectSpeech` directive, that string must be returned as `initiator.type` in the following `Recognize` event.	string
initiator.payload	Includes information about the initiator, such as start and stop indices for voice-initiated products.	object
initiator.payload.wakeWordIndices	This object is required when `initiator.type` is set to `WAKEWORD`. `wakeWordIndices` includes the `startIndexInSamples` and `endIndexInSamples`.	object
initiator.payload.wakeWordIndices.startIndexInSamples	Represents the index in the audio stream where the wake word starts (in samples). The start index should be accurate to within 50ms of wake word detection.	long
initiator.payload.wakeWordIndices.endIndexInSamples	Represents the index in the audio stream where the wake word ends (in samples). The end index should be accurate to within 150ms of the end of the detected wake word.	long
initiator.payload.token	This value is only required if present in the payload of a preceding `ExpectSpeech` directive.	string

For additional information, see SpeechRecognizer.Recognize.
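As an illustration, the JSON part of a voice-initiated Recognize event could be assembled as in the following Python sketch. The function name `build_recognize_event` is hypothetical. Generating `messageId` and `dialogRequestId` as UUIDs is an assumption, and the empty `context` array is a placeholder that a real client must fill with its component states. The `token` field is omitted because it is only required when a preceding `ExpectSpeech` directive supplied one.

```python
import json
import uuid


def build_recognize_event(start_index: int, end_index: int,
                          profile: str = "NEAR_FIELD") -> dict:
    """Build the JSON object for a wake-word-initiated Recognize event."""
    if not 0 <= start_index < end_index:
        raise ValueError("expected 0 <= start_index < end_index")
    return {
        "context": [],  # placeholder: report all client component states here
        "event": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "messageId": str(uuid.uuid4()),        # assumption: UUID strings
                "dialogRequestId": str(uuid.uuid4()),  # assumption: UUID strings
            },
            "payload": {
                "profile": profile,
                "format": "AUDIO_L16_RATE_16000_CHANNELS_1",
                "initiator": {
                    "type": "WAKEWORD",
                    "payload": {
                        "wakeWordIndices": {
                            "startIndexInSamples": start_index,
                            "endIndexInSamples": end_index,
                        }
                    },
                },
            },
        },
    }


event_json = json.dumps(build_recognize_event(8000, 16000), indent=2)
```

The resulting string is the JSON-formatted part of the multipart message; the binary audio attachment is sent separately in the same request.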

Shared Memory Ring Buffer

A shared memory ring buffer satisfies the requirement to stream a continuous utterance to AVS that includes 500 milliseconds of pre-roll, the wake word, and the user's request for cloud-based wake word verification. This reduces latency and the need to copy audio samples multiple times.

The following diagram illustrates one approach; alternative implementations are allowed as long as the requisite functionality is maintained:

Audio Subsystem - Middleware, such as Advanced Linux Sound Architecture (ALSA), that provides APIs to open a stream from a recording device, and receive an audio sample stream.
Audio Capture - A process that opens a recording device, and writes audio samples into the shared memory ring buffer.
Shared Memory Ring Buffer - A shared memory ring buffer object that is thread safe, and accessible by multiple processes. Access to this memory block should support at least one writer and two readers simultaneously. You may choose to implement these executables as a shared library.
- Write API - Used by the recording process to write audio samples to the shared memory ring buffer. Once the buffer is full, each write should overwrite the oldest audio samples. This example shows a typical API signature: WriteSamples(BufferIn, BufferInLength).
- Read API - Used by the wake word engine and the AVS client to read from specific locations in the shared memory ring buffer. A typical implementation represents location as a 64-bit integer that counts samples from the first sample ever written. For example, reading 200 samples from location 100 returns the 200 samples that begin at location 100. This is an example of a typical API signature: ReadSamples(BufferIn, BufferInLength). Note: Reading from an old location that was overwritten with new audio samples will result in an unintended sample read.
Wake Word Engine - The component that reads the audio samples written to the shared memory ring buffer and analyzes the audio samples for occurrences of the wake word. The procedure of reading and processing audio samples occurs in a loop. For example, the loop reads N samples starting at location 0; then analyzes the samples for an occurrence of the wake word. In the next iteration, it reads and analyzes the next N samples. When an occurrence of the wake word is detected, the wake word engine must identify two locations: 1) the start of the wake word, and 2) the end of the wake word. These values are stored as startIndexInSamples and endIndexInSamples.
WakeWordDetectedSignal - A signal sent from the wake word engine to the AVS client when an occurrence of the wake word is detected. This signal includes two locations, the startIndexInSamples and endIndexInSamples.
AVS Client - The client receives the WakeWordDetectedSignal, extracts the startIndexInSamples, then reads and streams audio samples to AVS including 500 milliseconds of pre-roll.
- The pre-roll is the number of samples corresponding to 500 milliseconds. For example, if the audio samples were recorded at a rate of 16 kilohertz (kHz), then pre-roll is 8,000 samples.
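The components above can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: the class and function names are hypothetical, the lock makes the buffer thread safe within a single process only (a true shared memory buffer would need an inter-process mechanism), and the read method raises an error for overwritten locations rather than returning unintended samples. The conversion from buffer locations to stream-relative indexes at the end is an interpretation of the requirement that indexes are relative to the start of the audio stream.

```python
import threading
from array import array
from typing import Iterable, List


class RingBuffer:
    """Fixed-capacity buffer of 16-bit samples addressed by absolute location."""

    def __init__(self, capacity: int) -> None:
        self._data = array("h", [0]) * capacity
        self._capacity = capacity
        self._written = 0  # total number of samples ever written
        self._lock = threading.Lock()

    def write_samples(self, samples: Iterable[int]) -> None:
        """Append samples, overwriting the oldest ones once the buffer is full."""
        with self._lock:
            for sample in samples:
                self._data[self._written % self._capacity] = sample
                self._written += 1

    def read_samples(self, location: int, count: int) -> List[int]:
        """Return `count` samples starting at absolute `location`."""
        with self._lock:
            oldest = max(0, self._written - self._capacity)
            if location < oldest:
                raise LookupError("location was already overwritten")
            if location + count > self._written:
                raise LookupError("samples have not been written yet")
            return [self._data[i % self._capacity]
                    for i in range(location, location + count)]


def stream_start_location(wake_word_start: int, pre_roll_samples: int = 8000) -> int:
    """Location where the AVS client starts reading: wake word start minus pre-roll."""
    return max(0, wake_word_start - pre_roll_samples)


# Usage: the capture process writes, the wake word engine reports two
# absolute locations, and the AVS client reads from the pre-roll onward.
buf = RingBuffer(capacity=32_000)
buf.write_samples(range(20_000))            # stand-in for captured audio
wake_start, wake_end = 12_000, 18_000       # hypothetical WakeWordDetectedSignal
stream_start = stream_start_location(wake_start)
audio = buf.read_samples(stream_start, wake_end - stream_start)
# Indexes reported in the Recognize event are relative to the stream start.
indices = (wake_start - stream_start, wake_end - stream_start)
```

Because the stream begins exactly 8,000 samples before the wake word, the stream-relative start index in this usage example is 8,000, which matches the note in the Wake Word indexes section.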
