Wake Word Verification Requirements


Cloud-based wake word verification improves wake word accuracy for devices that implement the Alexa Voice Service (AVS) by reducing false device wakes caused by utterances that sound similar to the wake word. For example, words that might cause a false wake for "Alexa" include "Alex", "election", and "Alexis." Cloud-based wake word verification also detects media mentions of the wake word, such as a mention of "Alexa" in an Amazon commercial.

Requirements for Cloud-Based Wake Word Verification

Voice-initiated devices begin to stream user speech to AVS when the wake word engine detects a spoken wake word, such as "Alexa." The stream closes when the user stops speaking or when AVS identifies a user intent, at which point AVS returns a StopCapture directive to the device. Cloud-based wake word verification has the following requirements:

  • Wake word – Include the wake word in the stream so that AVS can perform cloud-based wake word verification, which reduces false wakes. If AVS can't detect the wake word during cloud-based wake word verification, AVS discards the utterance.
  • 500 milliseconds of pre-roll – Pre-roll is the audio captured immediately before the wake word. It helps calibrate the ambient noise level of the recording, which enhances speech recognition.
  • User speech – Any user speech that the device captures until it receives a StopCapture directive. This allows AVS to verify the wake word included in the stream, reducing the number of erroneous responses due to false wakes.

Read the following sections to learn how to implement a shared memory ring buffer for writing and reading audio samples and to include the start and stop indices for wake word detection in each Recognize event sent to AVS.

Wake Word Indices

The Recognize event streams user speech to AVS, which translates that speech into one or more directives. The event is a multipart message that includes a JSON-formatted object and a binary audio attachment. The stream must contain the pre-roll, the wake word, and the user's utterance. Use chunking to reduce latency; the stream should contain 10 ms of captured audio per chunk (320 bytes).
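As a rough illustration of where the 320-byte figure comes from, the following Python sketch splits captured audio into 10 ms chunks. It assumes 16-bit, 16 kHz, single-channel PCM; the name `iter_chunks` and the constants are hypothetical and are not part of any AVS API.

```python
SAMPLE_RATE_HZ = 16_000   # assumed: 16 kHz capture
BYTES_PER_SAMPLE = 2      # assumed: 16-bit linear PCM, one channel
CHUNK_MS = 10

# 160 samples per 10 ms * 2 bytes per sample = 320 bytes
CHUNK_BYTES = SAMPLE_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield consecutive chunks of captured audio; the last one may be shorter."""
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]
```

Sending small fixed-size chunks as they are captured, rather than waiting for the whole utterance, is what keeps latency low.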

When the user request is voice-initiated, initiator.type must be WAKEWORD and the Recognize event must contain the start and stop indices for the occurrence of the wake word. The indices are relative to the start of the audio stream and are expressed in samples.

For example, at a 16 kHz sample rate, a wake word that starts at the 500-millisecond mark and lasts 500 milliseconds has the following indices:

  • startIndexInSamples - 8000
  • endIndexInSamples - 16000
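
The arithmetic behind these values can be sketched in a few lines of Python. The helper names `ms_to_samples` and `wake_word_indices` are hypothetical, and the 16 kHz default mirrors the example above.

```python
SAMPLE_RATE_HZ = 16_000  # sample rate used in the example above


def ms_to_samples(milliseconds: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a count of audio samples."""
    return milliseconds * sample_rate_hz // 1000


def wake_word_indices(start_ms: int, duration_ms: int) -> dict:
    """Return start and end indices, in samples, relative to the stream start."""
    start = ms_to_samples(start_ms)
    return {
        "startIndexInSamples": start,
        "endIndexInSamples": start + ms_to_samples(duration_ms),
    }
```

Calling `wake_word_indices(500, 500)` yields a start index of 8000 and an end index of 16000, matching the values listed above.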

Sample Message

{
  "context": [
    // This is an array of context objects that are used to communicate the
    // state of all client components to Alexa. See Context for details.
  ],
  "event": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "Recognize",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "profile": "{{STRING}}",
      "format": "{{STRING}}",
      "initiator": {
        "type": "{{STRING}}",
        "payload": {
          "wakeWordIndices": {
            "startIndexInSamples": {{LONG}},
            "endIndexInSamples": {{LONG}}
          },
          "token": "{{STRING}}"  
        }
      }
    }
  }
}
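
A client might assemble the JSON part of this message along the following lines. This is a minimal sketch, not the AVS client implementation: the function name `build_recognize_event`, the use of UUIDs for `messageId` and `dialogRequestId`, and the empty `context` placeholder are all assumptions made for illustration. The `token` field is omitted because it is only needed after an ExpectSpeech directive.

```python
import json
import uuid


def build_recognize_event(start_index: int, end_index: int,
                          profile: str = "FAR_FIELD") -> dict:
    """Build the JSON object for a wake-word-initiated Recognize event."""
    return {
        "context": [],  # placeholder: a real client reports component states here
        "event": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "messageId": str(uuid.uuid4()),        # assumption: any unique string
                "dialogRequestId": str(uuid.uuid4()),  # assumption: any unique string
            },
            "payload": {
                "profile": profile,
                "format": "AUDIO_L16_RATE_16000_CHANNELS_1",
                "initiator": {
                    "type": "WAKEWORD",
                    "payload": {
                        "wakeWordIndices": {
                            "startIndexInSamples": start_index,
                            "endIndexInSamples": end_index,
                        }
                    },
                },
            },
        },
    }
```

For example, `json.dumps(build_recognize_event(8000, 16000))` produces the JSON part of the multipart message; the binary audio attachment is sent separately.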

Context

This event requires your product to report the status of all client component states to Alexa in the context object.

Payload Parameters

| Parameter | Description | Type |
|---|---|---|
| profile | Identifies the Automatic Speech Recognition (ASR) profile associated with your product. AVS supports three distinct ASR profiles optimized for user speech from varying distances. Accepted values: CLOSE_TALK, NEAR_FIELD, FAR_FIELD. | string |
| format | Identifies the format of captured audio. Accepted value: AUDIO_L16_RATE_16000_CHANNELS_1 | string |
| initiator | Lets Alexa know how an interaction was initiated. | object |
| initiator.type | Represents the action taken by a user to initiate an interaction with Alexa. Accepted values: PRESS_AND_HOLD, TAP, and WAKEWORD. Additionally, if an initiator.type is provided in an ExpectSpeech directive, that string must be returned as initiator.type in the following Recognize event. | string |
| initiator.payload | Includes information about the initiator, such as start and stop indices for voice-initiated products. | object |
| initiator.payload.wakeWordIndices | This object is required when initiator.type is set to WAKEWORD. wakeWordIndices includes the startIndexInSamples and endIndexInSamples. | object |
| initiator.payload.wakeWordIndices.startIndexInSamples | Represents the index in the audio stream where the wake word starts (in samples). The start index should be accurate to within 50 ms of wake word detection. | long |
| initiator.payload.wakeWordIndices.endIndexInSamples | Represents the index in the audio stream where the wake word ends (in samples). The end index should be accurate to within 150 ms of the end of the detected wake word. | long |
| initiator.payload.token | This value is only required if present in the payload of a preceding ExpectSpeech directive. | string |

For additional information, see SpeechRecognizer.Recognize.
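
Before sending a Recognize event, a client could run a local sanity check on the payload against the accepted values listed above. The sketch below is illustrative only: `validate_payload` is a hypothetical helper, the `0 <= start < end` check is an assumption rather than a documented AVS rule, and the case where initiator.type echoes a string from an ExpectSpeech directive is not handled.

```python
ACCEPTED_PROFILES = {"CLOSE_TALK", "NEAR_FIELD", "FAR_FIELD"}
ACCEPTED_FORMATS = {"AUDIO_L16_RATE_16000_CHANNELS_1"}
ACCEPTED_INITIATORS = {"PRESS_AND_HOLD", "TAP", "WAKEWORD"}


def validate_payload(payload: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if payload.get("profile") not in ACCEPTED_PROFILES:
        problems.append("profile is not an accepted value")
    if payload.get("format") not in ACCEPTED_FORMATS:
        problems.append("format is not an accepted value")
    initiator = payload.get("initiator", {})
    if initiator.get("type") not in ACCEPTED_INITIATORS:
        problems.append("initiator.type is not an accepted value")
    if initiator.get("type") == "WAKEWORD":
        indices = initiator.get("payload", {}).get("wakeWordIndices")
        if not isinstance(indices, dict):
            problems.append("wakeWordIndices is required for WAKEWORD")
        else:
            start = indices.get("startIndexInSamples")
            end = indices.get("endIndexInSamples")
            if not (isinstance(start, int) and isinstance(end, int)):
                problems.append("wake word indices must be integers")
            elif not 0 <= start < end:  # assumption: start precedes end
                problems.append("expected 0 <= startIndexInSamples < endIndexInSamples")
    return problems
```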

Shared Memory Ring Buffer

A shared memory ring buffer satisfies the requirement to stream a continuous utterance to AVS that includes 500 milliseconds of pre-roll, the wake word, and the user's request for cloud-based wake word verification. This reduces latency and the need to copy audio samples multiple times.

The following diagram illustrates one approach; alternative implementations are allowed as long as the requisite functionality is maintained:

[Diagram: Ring Buffer]
  1. Audio Subsystem - Middleware, such as Advanced Linux Sound Architecture (ALSA), that provides APIs to open a stream from a recording device, and receive an audio sample stream.
  2. Audio Capture - A process that opens a recording device, and writes audio samples into the shared memory ring buffer.
  3. Shared Memory Ring Buffer - A shared memory ring buffer object that is thread safe, and accessible by multiple processes. Access to this memory block should support at least one writer and two readers simultaneously. You may choose to implement these executables as a shared library.
    • Write API - Used by the recording process to write audio samples to the shared memory ring buffer. Writes should advance through the buffer so that new data overwrites the oldest audio samples. This example shows a typical API signature: WriteSamples(BufferIn, BufferInLength).
    • Read API - Used by the wake word engine and the AVS client to read from specific locations in the shared memory ring buffer. A typical implementation represents location as a 64-bit integer. For example, reading 200 samples from location 100 returns samples 100 through 299 of all the samples written to the shared memory ring buffer. This is an example of a typical API signature: ReadSamples(BufferIn, BufferInLength), usually with an additional argument for the read location. Note: Reading from an old location that was overwritten with new audio samples returns samples other than the ones intended.
  4. Wake Word Engine - The component that reads the audio samples written to the shared memory ring buffer and analyzes the audio samples for occurrences of the wake word. The procedure of reading and processing audio samples occurs in a loop. For example, the loop reads N samples starting at location 0, then analyzes the samples for an occurrence of the wake word. In the next iteration, it reads and analyzes the next N samples. When an occurrence of the wake word is detected, the wake word engine must identify two locations: 1) the start of the wake word, and 2) the end of the wake word. These values are stored as startIndexInSamples and endIndexInSamples.
  5. WakeWordDetectedSignal - A signal sent from the wake word engine to the AVS client when an occurrence of the wake word is detected. This signal includes two locations, the startIndexInSamples and endIndexInSamples.
  6. AVS Client - The client receives the WakeWordDetectedSignal, extracts the startIndexInSamples, then reads and streams audio samples to AVS, starting 500 milliseconds before that location so that the stream includes the pre-roll.
    • The pre-roll is the number of samples corresponding to 500 milliseconds. For example, if the audio samples were recorded at a rate of 16 kilohertz (kHz), then pre-roll is 8,000 samples.
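
The ring buffer and the pre-roll calculation described in the list above can be sketched as follows. This is a simplified, in-process stand-in, not a shared memory implementation: a `threading.Lock` takes the place of cross-process synchronization, and the names `SampleRingBuffer`, `write_samples`, `read_samples`, `stream_start`, and `rebase_indices` are hypothetical. Raising an error for an overwritten location is a design choice made here so that stale reads are visible rather than silent.

```python
import threading

PRE_ROLL_SAMPLES = 8_000  # 500 ms at 16 kHz


class SampleRingBuffer:
    """In-process stand-in for the shared memory ring buffer."""

    def __init__(self, capacity: int):
        self._data = [0] * capacity
        self._capacity = capacity
        self._written = 0  # total samples ever written (absolute location)
        self._lock = threading.Lock()

    def write_samples(self, samples) -> None:
        """Append samples, overwriting the oldest ones once the buffer is full."""
        with self._lock:
            for sample in samples:
                self._data[self._written % self._capacity] = sample
                self._written += 1

    def read_samples(self, location: int, count: int) -> list:
        """Read count samples starting at an absolute location."""
        with self._lock:
            oldest = max(0, self._written - self._capacity)
            if location < oldest:
                raise LookupError("location was overwritten")
            if location + count > self._written:
                raise LookupError("samples not written yet")
            return [self._data[i % self._capacity]
                    for i in range(location, location + count)]


def stream_start(start_index_in_samples: int) -> int:
    """Absolute location where the AVS client starts reading (includes pre-roll)."""
    return max(0, start_index_in_samples - PRE_ROLL_SAMPLES)


def rebase_indices(start_index: int, end_index: int) -> tuple:
    """Convert absolute buffer locations to indices relative to the stream start."""
    begin = stream_start(start_index)
    return start_index - begin, end_index - begin
```

Because the wake word indices in the Recognize event are relative to the start of the audio stream, a client that tracks absolute buffer locations needs a rebasing step like `rebase_indices` before sending them. With a wake word detected at absolute locations 20000 to 28000, the stream starts at 12000 and the rebased indices are 8000 and 16000.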

Last updated: Nov 27, 2023