
Enable Cloud-Based Wake Word Verification

Cloud-Based Wake Word Verification is a feature that improves wake word accuracy for Alexa-enabled products by reducing false wakes caused by words that sound similar to the wake word. For example, “Alex”, “election”, and “Alexis” may each cause a false wake for “Alexa”. Cloud-Based Wake Word Verification also detects media mentions of the wake word, such as the mention of “Alexa” in an Amazon commercial.

Initial detection is performed by the wake word engine on the product, then the wake word is verified in the cloud. If a false wake is detected, AVS sends a StopCapture directive to the product on the downchannel, instructing it to close the audio stream and, if applicable, to turn off the blue LEDs to indicate that Alexa has stopped listening.

The following sections detail the work necessary to support this service.

Review the Streaming Requirements for Cloud-Based Wake Word Verification

Voice-initiated products start streaming user speech to AVS when a wake word, such as “Alexa”, is detected by the wake word engine; the stream is closed when the user stops speaking or the user’s intent has been identified and the service returns a StopCapture directive. For cloud-based wake word verification to work, the audio streamed to AVS must include the wake word, 500 milliseconds of pre-roll, and any user speech captured until a StopCapture directive is received. This allows AVS to verify the wake word included in the stream, reducing the number of erroneous responses due to false wakes.

  • Pre-roll, or the audio captured prior to the detection of the wake word, is used to calibrate the ambient noise level of the recording, which enhances speech recognition.
  • Inclusion of the wake word in the stream allows AVS to perform cloud-based wake word verification, which reduces false wakes (the number of times the wake word engine falsely recognizes the wake word).
  • If the wake word is not detected during cloud-based wake word verification, the audio samples are discarded.

The AVS documentation provides a recommendation to implement a shared memory ring buffer for writing and reading audio samples, and the specification for including the start and stop indices of the wake word in each Recognize event sent to AVS.
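
As a rough illustration of that recommendation, the sketch below (plain Python; the class and method names are hypothetical, not from any AVS SDK) keeps an absolute sample counter so the wake word engine can report startIndexInSamples and endIndexInSamples relative to the start of the stream, and shows the 500 ms pre-roll arithmetic at 16 kHz:

# Illustrative ring buffer for 16-bit PCM samples with absolute indexing.
class SampleRingBuffer:
    def __init__(self, capacity_samples):
        self.buf = [0] * capacity_samples
        self.capacity = capacity_samples
        self.total_written = 0  # absolute index of the next sample to write

    def write(self, samples):
        for s in samples:
            self.buf[self.total_written % self.capacity] = s
            self.total_written += 1

    def read(self, start_index, count):
        # Reject ranges that were overwritten or have not been written yet.
        if (start_index < self.total_written - self.capacity
                or start_index + count > self.total_written):
            raise IndexError("requested range is not available")
        return [self.buf[i % self.capacity]
                for i in range(start_index, start_index + count)]

# 500 ms of pre-roll at 16 kHz is 8,000 samples: when the engine reports the
# wake word starting at absolute index wake_start, the upload to AVS should
# begin at wake_start - PRE_ROLL_SAMPLES.
PRE_ROLL_SAMPLES = 8000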

Adjust Client Code for a New Context Object: RecognizerState

Context is a container used to communicate the state of your client components to AVS. To support cloud-based wake word verification, all wake word enabled products, regardless of how an interaction with Alexa is initiated, are required to send a new context object, RecognizerState, with each applicable event.

Sample Message


{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "RecognizerState"
    },
    "payload": {
        "wakeword": "ALEXA"
    }
}

Payload Parameters

| Parameter | Description | Type |
| --- | --- | --- |
| wakeword | Identifies the current wake word. Accepted value: "ALEXA" | string |
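
For illustration, a minimal helper that assembles the object above might look like this (the function name is hypothetical; only the JSON shape comes from this specification):

def recognizer_state(wake_word="ALEXA"):
    # Shape matches the RecognizerState sample message above.
    return {
        "header": {"namespace": "SpeechRecognizer", "name": "RecognizerState"},
        "payload": {"wakeword": wake_word},
    }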


Adjust Client Code for Updated Directives/Events

New key/value pairs have been added to the Recognize event and ExpectSpeech directive to support wake word verification. Please make sure you update your client code accordingly.

SpeechRecognizer.Recognize

The Recognize event has been updated to include the initiator object. It provides AVS with information about the interaction used to trigger Alexa and, if applicable, the start and stop indices required for cloud-based wake word verification.


The Recognize event is used to send user speech to AVS and translate that speech into one or more directives. This event must be sent as a multipart message: the first part is a JSON-formatted object, and the second part is binary audio captured by the product’s microphone. We encourage streaming (chunking) captured audio to the Alexa Voice Service to reduce latency; the stream should contain 10 ms of captured audio per chunk (320 bytes).

Additionally, the Recognize event includes the profile parameter and the initiator object, which tell Alexa which ASR profile should be used to best understand the captured audio and how the interaction with Alexa was initiated.

During multi-turn interactions, where Alexa requires additional information to act on a request, you will receive an ExpectSpeech directive that may include an initiator value. If present in ExpectSpeech, it must be sent to Alexa in the subsequent Recognize event. If initiator is not present in ExpectSpeech, do not include it in the payload of the Recognize event sent in response to ExpectSpeech.

All captured audio sent to AVS should be encoded as:

  • 16-bit linear PCM
  • 16 kHz sample rate
  • Single channel
  • Little endian byte order

For a protocol specific example, see Structuring an HTTP/2 Request.
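
As a sketch of the chunking arithmetic: at a 16 kHz sample rate with 16-bit samples and a single channel, 10 ms of audio is 160 samples, or 320 bytes. A generator along these lines (illustrative only, not SDK code) could slice captured audio for streaming:

CHUNK_BYTES = 320  # 10 ms at 16 kHz, 16-bit, single channel

def audio_chunks(pcm_bytes):
    """Yield 320-byte (10 ms) chunks of little-endian 16-bit PCM audio."""
    for offset in range(0, len(pcm_bytes), CHUNK_BYTES):
        yield pcm_bytes[offset:offset + CHUNK_BYTES]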

Sample Message

{
  "context": [
      // This is an array of context objects that are used to communicate the
      // state of all client components to Alexa. See Context for details.
  ],
  "event": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "Recognize",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "profile": "{{STRING}}",
      "format": "{{STRING}}",
      "initiator": {
        "type": "{{STRING}}",
        "payload": {
          "wakeWordIndices": {
            "startIndexInSamples": {{LONG}},
            "endIndexInSamples": {{LONG}}
          }   
        }
      }
    }
  }
}
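
A minimal sketch of filling in that envelope, assuming the wake word indices come from bookkeeping like the ring buffer sketched earlier (the helper name and defaults are illustrative; the context array that accompanies the event is omitted here and covered below):

import uuid

def recognize_event(start_index, end_index, profile="FAR_FIELD"):
    # dialogRequestId must be unique per Recognize event; see Header Parameters.
    # The surrounding message also carries the context array (see Context below).
    return {
        "event": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "messageId": str(uuid.uuid4()),
                "dialogRequestId": str(uuid.uuid4()),
            },
            "payload": {
                "profile": profile,
                "format": "AUDIO_L16_RATE_16000_CHANNELS_1",
                "initiator": {
                    "type": "WAKEWORD",
                    "payload": {
                        "wakeWordIndices": {
                            "startIndexInSamples": start_index,
                            "endIndexInSamples": end_index,
                        }
                    },
                },
            },
        }
    }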

Binary Audio Attachment

Each Recognize event requires a corresponding binary audio attachment as one part of the multipart message. The following headers are required for each binary audio attachment:

Content-Disposition: form-data; name="audio"
Content-Type: application/octet-stream

{{BINARY AUDIO ATTACHMENT}}
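
A sketch of assembling the two-part body, assuming the metadata part headers described in Structuring an HTTP/2 Request (the boundary token and function name are illustrative):

import json

BOUNDARY = "avs-message-boundary"  # any token not appearing in the payload

def multipart_body(event_json, audio_bytes):
    # Part one: the JSON-formatted event (metadata).
    metadata = (
        "--" + BOUNDARY + "\r\n"
        'Content-Disposition: form-data; name="metadata"\r\n'
        "Content-Type: application/json; charset=UTF-8\r\n\r\n"
        + json.dumps(event_json) + "\r\n"
    )
    # Part two: the binary audio attachment, with the headers shown above.
    # In practice the audio part is streamed chunk by chunk rather than
    # buffered up front as it is in this simplified sketch.
    audio_headers = (
        "--" + BOUNDARY + "\r\n"
        'Content-Disposition: form-data; name="audio"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    closing = "\r\n--" + BOUNDARY + "--\r\n"
    return (metadata.encode("utf-8") + audio_headers.encode("utf-8")
            + audio_bytes + closing.encode("utf-8"))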

Context

This event requires your product to report the status of all client component states to Alexa in the context object. For additional information see Context.

| Context | Required |
| --- | --- |
| AlertsState | Yes |
| PlaybackState | Yes |
| VolumeState | Yes |
| SpeechState | Yes |
| RecognizerState | Optional |
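
For illustration, a context array could be assembled as follows; the namespace and name pairs for the other context objects follow the AVS Context documentation, and the empty payloads are placeholders for product-specific state:

def build_context(recognizer_state):
    # Helper to build one context object; payloads here are placeholders only.
    def obj(namespace, name, payload):
        return {"header": {"namespace": namespace, "name": name},
                "payload": payload}
    return [
        obj("Alerts", "AlertsState", {}),             # required
        obj("AudioPlayer", "PlaybackState", {}),      # required
        obj("Speaker", "VolumeState", {}),            # required
        obj("SpeechSynthesizer", "SpeechState", {}),  # required
        recognizer_state,                             # optional
    ]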

Header Parameters

| Parameter | Description | Type |
| --- | --- | --- |
| messageId | A unique ID used to represent a specific message. | string |
| dialogRequestId | A unique identifier that your client must create for each Recognize event sent to Alexa. This parameter is used to correlate directives sent in response to a specific Recognize event. | string |

Payload Parameters

| Parameter | Description | Type |
| --- | --- | --- |
| profile | Identifies the Automatic Speech Recognition (ASR) profile associated with your product. AVS supports three distinct ASR profiles optimized for user speech from varying distances. Accepted values: "CLOSE_TALK", "NEAR_FIELD", "FAR_FIELD". | string |
| format | Identifies the format of captured audio. Accepted value: "AUDIO_L16_RATE_16000_CHANNELS_1". | string |
| initiator | Includes information about how an interaction with AVS was initiated. IMPORTANT: initiator is required (i) for wake word enabled products that use cloud-based wake word verification, and (ii) when it is included in an ExpectSpeech directive. | object |
| initiator.type | Represents the action taken by the user to start streaming audio to AVS. Accepted values: "PRESS_AND_HOLD", "TAP", and "WAKEWORD". | string |
| initiator.payload | Includes information about the initiator, such as the wake word start and stop indices. | object |
| initiator.payload.wakeWordIndices | Required only for wake word enabled products that use cloud-based wake word verification. Contains startIndexInSamples and endIndexInSamples. | object |
| initiator.payload.wakeWordIndices.startIndexInSamples | Represents the index in the audio stream where the wake word starts (in samples). | long |
| initiator.payload.wakeWordIndices.endIndexInSamples | Represents the index in the audio stream where the wake word ends (in samples). | long |

Profiles

ASR profiles are tuned for different products, form factors, acoustic environments and use cases. Use the table below to learn more about accepted values for the profile parameter.

| Value | Optimal Listening Distance |
| --- | --- |
| CLOSE_TALK | 0 to 2.5 ft. |
| NEAR_FIELD | 0 to 5 ft. |
| FAR_FIELD | 0 to 20+ ft. |

Initiator

The initiator object tells Alexa how an interaction was triggered, and it determines two things:

  1. Whether StopCapture will be sent to your client when the end of speech is detected in the cloud.
  2. Whether cloud-based wake word verification will be performed on the stream.

The following values are accepted:

| Value | Description | Supported Profile(s) | StopCapture Enabled | Wake Word Verification Enabled | Wake Word Indices Required |
| --- | --- | --- | --- | --- | --- |
| PRESS_AND_HOLD | Audio stream initiated by pressing a button (physical or GUI) and terminated by releasing it. | CLOSE_TALK | N | N | N |
| TAP | Audio stream initiated by the tap and release of a button (physical or GUI) and terminated when a StopCapture directive is received. | NEAR_FIELD, FAR_FIELD | Y | N | N |
| WAKEWORD | Audio stream initiated by the use of a wake word and terminated when a StopCapture directive is received. | NEAR_FIELD, FAR_FIELD | Y | Y | Y |
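
The constraints in this table can be expressed as a simple client-side check, sketched below (the function and variable names are illustrative):

SUPPORTED_PROFILES = {
    "PRESS_AND_HOLD": {"CLOSE_TALK"},
    "TAP": {"NEAR_FIELD", "FAR_FIELD"},
    "WAKEWORD": {"NEAR_FIELD", "FAR_FIELD"},
}

def validate_initiator(initiator_type, profile, has_wake_word_indices):
    if profile not in SUPPORTED_PROFILES[initiator_type]:
        raise ValueError(f"{profile} is not supported with {initiator_type}")
    # Wake word indices are required exactly when the stream is wake word initiated.
    if (initiator_type == "WAKEWORD") != has_wake_word_indices:
        raise ValueError("wakeWordIndices is required iff initiator is WAKEWORD")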

SpeechRecognizer.ExpectSpeech

This directive has been updated to include initiator. In a multi-turn scenario, where Alexa requires additional information from the user to complete a request, the initiator sent to a client must be returned to AVS in the subsequent Recognize event.


ExpectSpeech is sent when Alexa requires additional information to fulfill a user’s request. It instructs your client to open the microphone and begin streaming user speech. If the microphone is not opened within the specified timeout window, an ExpectSpeechTimedOut event must be sent from your client to AVS.

During a multi-turn interaction with Alexa, your device will receive at least one SpeechRecognizer.ExpectSpeech directive instructing your client to start listening for user speech. The initiator value included in the payload of the SpeechRecognizer.ExpectSpeech directive must be passed in as the initiator in the subsequent SpeechRecognizer.Recognize event.

For information on the rules that govern audio prioritization, please review the Interaction Model.
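
A rough sketch of this handling, assuming a hypothetical client object that can open the microphone and send events (none of these method names come from the specification):

import threading

def handle_expect_speech(directive, client):
    """Sketch of ExpectSpeech handling; the client methods are hypothetical."""
    payload = directive["directive"]["payload"]
    timeout_s = payload["timeoutInMilliseconds"] / 1000.0
    initiator = payload.get("initiator")

    # Send ExpectSpeechTimedOut if the microphone is not opened in time.
    timer = threading.Timer(timeout_s, client.send_expect_speech_timed_out)
    timer.start()
    if client.open_microphone():
        timer.cancel()
        # Echo the opaque initiator back only if AVS provided one.
        if initiator is not None:
            client.send_recognize(initiator=initiator)
        else:
            client.send_recognize()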

Sample Message

{
  "directive": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "ExpectSpeech",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "timeoutInMilliseconds": {{LONG}},
      "initiator": "{{STRING}}"
    }
  }
}

Header Parameters

| Parameter | Description | Type |
| --- | --- | --- |
| messageId | A unique ID used to represent a specific message. | string |
| dialogRequestId | A unique ID used to correlate directives sent in response to a specific Recognize event. Note: dialogRequestId is only sent in response to a speech request. | string |

Payload Parameters

| Parameter | Description | Type |
| --- | --- | --- |
| timeoutInMilliseconds | Specifies how long the microphone will remain open, in milliseconds, before a timeout is issued. | long |
| initiator | An opaque string passed from AVS to your client. If present, it must be sent back to AVS as the initiator in the subsequent Recognize event. | string |
