
SpeechRecognizer Interface

Overview

SpeechRecognizer is the core interface of the Alexa Voice Service (AVS): every user utterance leverages it. The interface exposes directives and events for capturing user speech and for prompting a client when Alexa needs additional speech input.

Additionally, this interface allows your client to inform AVS of how an interaction with Alexa was initiated (press and hold, tap and release, or voice-initiated with a wake word), and to choose the Automatic Speech Recognition (ASR) profile appropriate for your product, which allows Alexa to understand user speech and respond with precision.

This page covers the following topics:

  • SpeechRecognizer State Diagram
  • SpeechRecognizer Directives and Events

State Diagram

The following diagram illustrates state changes driven by SpeechRecognizer components. Boxes represent SpeechRecognizer states and the connectors indicate state transitions.

SpeechRecognizer has the following states:

IDLE: Prior to capturing user speech, SpeechRecognizer should be in an idle state. SpeechRecognizer should also return to an idle state after a speech interaction with AVS has concluded. This can occur when a speech request has been successfully processed or when the ExpectSpeech timeout window has elapsed and an ExpectSpeechTimedOut event has been sent.

Additionally, SpeechRecognizer may return to an idle state during a multi-turn interaction; if additional speech is then required from the user, it should transition from the idle state to the expecting speech state without the user starting a new interaction.

RECOGNIZING: When a user begins interacting with your client, specifically when captured audio is streamed to AVS, SpeechRecognizer should transition from the idle state to the recognizing state. It should remain in the recognizing state until the client stops recording speech (or streaming is complete), at which point your SpeechRecognizer component should transition from the recognizing state to the busy state.

BUSY: While processing the speech request, SpeechRecognizer should be in the busy state. You cannot start another speech request until the component transitions out of the busy state. From the busy state, SpeechRecognizer will transition to the idle state if the request is successfully processed (completed) or to the expecting speech state if Alexa requires additional speech input from the user.

EXPECTING SPEECH: SpeechRecognizer should be in the expecting speech state when additional audio input is required from a user. From expecting speech, SpeechRecognizer should transition to the recognizing state when a user interaction occurs or the interaction is automatically started on the user’s behalf. It should transition to the idle state if no user interaction is detected within the specified timeout window.
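
These transitions can be modeled as a small state machine. The following is a minimal Python sketch; the class and transition table are illustrative, not part of AVS:

from enum import Enum, auto

class RecognizerState(Enum):
    IDLE = auto()
    RECOGNIZING = auto()
    BUSY = auto()
    EXPECTING_SPEECH = auto()

# Legal transitions, taken from the state descriptions above.
TRANSITIONS = {
    RecognizerState.IDLE: {RecognizerState.RECOGNIZING,
                           RecognizerState.EXPECTING_SPEECH},
    RecognizerState.RECOGNIZING: {RecognizerState.BUSY},
    RecognizerState.BUSY: {RecognizerState.IDLE,
                           RecognizerState.EXPECTING_SPEECH},
    RecognizerState.EXPECTING_SPEECH: {RecognizerState.RECOGNIZING,
                                       RecognizerState.IDLE},
}

class SpeechRecognizerStateMachine:
    def __init__(self):
        self.state = RecognizerState.IDLE

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state.name} -> {new_state.name}")
        self.state = new_state

# Example: a simple completed interaction.
sm = SpeechRecognizerStateMachine()
sm.transition(RecognizerState.RECOGNIZING)  # user starts speaking
sm.transition(RecognizerState.BUSY)         # capture complete, request in flight
sm.transition(RecognizerState.IDLE)         # request successfully processed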

[Figure: SpeechRecognizer state diagram]

Recognize Event

The Recognize event is used to send user speech to AVS and translate that speech into one or more directives. This event must be sent as a multipart message: the first part is a JSON-formatted object, and the second is the binary audio captured by the product’s microphone. We encourage streaming (chunking) captured audio to the Alexa Voice Service to reduce latency; the stream should contain 10ms of captured audio per chunk (320 bytes).
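
At the required format (16-bit samples, 16kHz, single channel), 10ms of audio works out to 16,000 x 2 x 0.010 = 320 bytes. A minimal chunking sketch, assuming audio_source is any file-like object producing raw PCM:

# samples/s * bytes/sample * channels * ms / 1000 -> 320 bytes
CHUNK_SIZE = 16000 * 2 * 1 * 10 // 1000

def audio_chunks(audio_source):
    """Yield 10ms (320-byte) chunks of captured audio for streaming to AVS."""
    while True:
        chunk = audio_source.read(CHUNK_SIZE)
        if not chunk:
            break
        yield chunk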

After an interaction with Alexa is initiated, the microphone must remain open until one of the following occurs (see the capture-loop sketch after this list):

  • A StopCapture directive is received.
  • The stream is closed by the Alexa service.
  • The user manually closes the microphone, for example, in a press-and-hold implementation.
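
A sketch of a capture loop honoring these three conditions; the three events are hypothetical signals that your directive handler, stream code, and button handler would set:

import threading

stop_capture_received = threading.Event()  # set when a StopCapture directive arrives
stream_closed = threading.Event()          # set when the Alexa service closes the stream
user_closed_mic = threading.Event()        # set on button release (press and hold)

def stream_microphone(mic, send_chunk):
    """Stream 10ms chunks until one of the three close conditions is met."""
    while not (stop_capture_received.is_set()
               or stream_closed.is_set()
               or user_closed_mic.is_set()):
        send_chunk(mic.read(320))  # 10ms of captured audio per chunk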

The profile parameter and initiator object tell Alexa which ASR profile should be used to best understand the captured audio being sent, and how the interaction with Alexa was initiated.

If your product is wake word enabled, and you are using wake word verification, make sure that your client adheres to the Streaming Requirements for Cloud-Based Wake Word Verification.

All captured audio sent to AVS should be encoded as follows (a capture-configuration sketch appears after this list):

  • 16bit Linear PCM
  • 16kHz sample rate
  • Single channel
  • Little endian byte order
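
As an illustration only, the third-party PyAudio library can capture audio in this format; 160 frames per buffer corresponds to one 10ms chunk. This sketch assumes PyAudio is installed and a default input device is available, and that the platform is little endian (PyAudio delivers samples in native byte order):

import pyaudio

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16,   # 16bit Linear PCM
                 channels=1,               # single channel
                 rate=16000,               # 16kHz sample rate
                 input=True,
                 frames_per_buffer=160)    # 160 samples = 10ms at 16kHz

chunk = stream.read(160)  # one 10ms chunk: 160 samples * 2 bytes = 320 bytes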

For a protocol-specific example, see Structuring an HTTP/2 Request.

Sample Message

{
  "context": [
      {{...}}
  ],
  "event": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "Recognize",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "profile": "{{STRING}}",
      "format": "{{STRING}}",
      "initiator": {
        "type": "{{STRING}}",
        "payload": {
          "wakeWordIndices": {
            "startIndexInSamples": {{LONG}},
            "endIndexInSamples": {{LONG}}
          }
        }
      }
    }
  }
}

Binary Audio Attachment

Each Recognize event requires a corresponding binary audio attachment as one part of the multipart message. The following headers are required for each binary audio attachment:

Content-Disposition: form-data; name="audio"
Content-Type: application/octet-stream

{{BINARY AUDIO ATTACHMENT}}
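
The following sketch assembles such a two-part body. The boundary token and the "metadata" part name are assumptions here; see Structuring an HTTP/2 Request for the authoritative layout, including the request-level Content-Type header that must carry the same boundary:

import json

BOUNDARY = "avs-example-boundary"  # illustrative; any token absent from the payload works

def build_multipart(event: dict, audio: bytes) -> bytes:
    """Assemble a two-part body: JSON metadata first, binary audio second."""
    json_part = (
        f"--{BOUNDARY}\r\n"
        'Content-Disposition: form-data; name="metadata"\r\n'
        "Content-Type: application/json; charset=UTF-8\r\n\r\n"
    ).encode() + json.dumps(event).encode()

    audio_part = (
        f"--{BOUNDARY}\r\n"
        'Content-Disposition: form-data; name="audio"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio

    return json_part + b"\r\n" + audio_part + f"\r\n--{BOUNDARY}--\r\n".encode()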

Context

The Recognize event requires your client to send the status of its component states. For additional information, see Context.

Header Parameters

messageId (string): A unique ID used to represent a specific message.
dialogRequestId (string): A unique identifier that your client must create for each Recognize event sent to Alexa. This parameter is used to correlate directives sent in response to a specific Recognize event.
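
Both IDs are commonly generated as UUIDs; a sketch of building the header (uuid is Python's standard library, and AVS only requires that the values be unique):

import uuid

header = {
    "namespace": "SpeechRecognizer",
    "name": "Recognize",
    "messageId": str(uuid.uuid4()),        # unique per message
    "dialogRequestId": str(uuid.uuid4()),  # unique per Recognize event
}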

Payload Parameters

profile (string): Identifies the Automatic Speech Recognition (ASR) profile associated with your product. AVS supports three distinct ASR profiles optimized for user speech from varying distances. Accepted values: "CLOSE_TALK", "NEAR_FIELD", "FAR_FIELD".
format (string): Identifies the format of captured audio. Accepted value: "AUDIO_L16_RATE_16000_CHANNELS_1".
initiator (object): Includes information about how an interaction with AVS was initiated. IMPORTANT: initiator is required (1) for wake word enabled products that use cloud-based wake word verification, and (2) when it is included in an ExpectSpeech directive.
initiator.type (string): Represents the action taken by the user to start streaming audio to AVS. Accepted values: "PRESS_AND_HOLD", "TAP", and "WAKEWORD".
initiator.payload (object): Includes information about the initiator, such as start and stop indices.
initiator.payload.wakeWordIndices (object): Required only for wake word enabled products that use cloud-based wake word verification. Contains startIndexInSamples and endIndexInSamples.
initiator.payload.wakeWordIndices.startIndexInSamples (long): Represents the index in the audio stream where the wake word starts (in samples). The start index should be accurate to within 50ms of wake word detection.
initiator.payload.wakeWordIndices.endIndexInSamples (long): Represents the index in the audio stream where the wake word ends (in samples). The end index should be accurate to within 150ms of the end of the detected wake word.
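
If your wake word engine reports detection boundaries as timestamps, converting them to sample indices is a multiplication by the sample rate. A sketch, with the detection times as hypothetical inputs:

SAMPLE_RATE_HZ = 16000  # matches AUDIO_L16_RATE_16000_CHANNELS_1

def to_sample_index(seconds: float) -> int:
    """Convert a position in the audio stream (seconds) to a sample index."""
    return int(seconds * SAMPLE_RATE_HZ)

# e.g. wake word detected between 0.50s and 1.25s into the stream
wakeWordIndices = {
    "startIndexInSamples": to_sample_index(0.50),  # 8000
    "endIndexInSamples": to_sample_index(1.25),    # 20000
}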

Profiles

ASR profiles are tuned for different products, form factors, acoustic environments, and use cases. The accepted values for the profile parameter, with their optimal listening distances, are:

CLOSE_TALK: 0 to 2.5 ft.
NEAR_FIELD: 0 to 5 ft.
FAR_FIELD: 0 to 20+ ft.

Initiator

The initiator parameter tells AVS how an interaction with Alexa was triggered, and determines two things:

  1. Whether StopCapture will be sent to your client when the end of speech is detected in the cloud.
  2. Whether cloud-based wake word verification will be performed on the stream.

initiator must be included in the payload of each SpeechRecognizer.Recognize event. The following values are accepted:

PRESS_AND_HOLD: Audio stream initiated by pressing a button (physical or GUI) and terminated by releasing it.
  Supported profile: CLOSE_TALK
  StopCapture enabled: No
  Wake word verification enabled: No
  Wake word indices required: No

TAP: Audio stream initiated by the tap and release of a button (physical or GUI) and terminated when a StopCapture directive is received.
  Supported profiles: NEAR_FIELD, FAR_FIELD
  StopCapture enabled: Yes
  Wake word verification enabled: No
  Wake word indices required: No

WAKEWORD: Audio stream initiated by the use of a wake word and terminated when a StopCapture directive is received.
  Supported profiles: NEAR_FIELD, FAR_FIELD
  StopCapture enabled: Yes
  Wake word verification enabled: Yes
  Wake word indices required: Yes
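
A sketch of building the initiator object for each type. The helper name is hypothetical, and it assumes the payload may be omitted for the non-wake-word types, since indices are only required for cloud-based wake word verification:

def build_initiator(initiator_type, start_sample=None, end_sample=None):
    """Build the initiator object for a Recognize event payload."""
    initiator = {"type": initiator_type}
    if initiator_type == "WAKEWORD":
        initiator["payload"] = {
            "wakeWordIndices": {
                "startIndexInSamples": start_sample,
                "endIndexInSamples": end_sample,
            }
        }
    return initiator

# Examples:
# build_initiator("PRESS_AND_HOLD")        -> {"type": "PRESS_AND_HOLD"}
# build_initiator("WAKEWORD", 8000, 20000) -> includes wakeWordIndices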

StopCapture Directive

This directive instructs your client to stop capturing a user’s speech after AVS has identified the user’s intent or when end of speech is detected. When this directive is received, your client must immediately close the microphone and stop listening for the user’s speech.
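
A minimal dispatch sketch: on StopCapture, the client closes the microphone at once. close_microphone is a hypothetical callback into your audio capture code, and directive is the full parsed message shown in the sample below:

def handle_directive(directive, close_microphone):
    """React to a parsed directive message; only StopCapture is handled here."""
    header = directive["directive"]["header"]
    if header["namespace"] == "SpeechRecognizer" and header["name"] == "StopCapture":
        close_microphone()  # stop capturing and streaming user speech immediately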

Sample Message

{
  "directive": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "StopCapture",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
    }
  }
}

Header Parameters

messageId (string): A unique ID used to represent a specific message.
dialogRequestId (string): A unique ID used to correlate directives sent in response to a specific Recognize event. Note: dialogRequestId is only sent in response to a speech request; it is not included in directives sent to your client on the downchannel stream.

ExpectSpeech Directive

ExpectSpeech is sent when Alexa requires additional information to fulfill a user’s request. It instructs your client to open the microphone and begin streaming user speech. If the microphone is not opened within the specified timeout window, an ExpectSpeechTimedOut event must be sent from your client to AVS.

During a multi-turn interaction with Alexa, your device will receive at least one SpeechRecognizer.ExpectSpeech directive instructing your client to start listening for user speech. The initiator object included in the payload of the SpeechRecognizer.ExpectSpeech directive must be passed in as the initiator object in the subsequent SpeechRecognizer.Recognize event.
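
A sketch of handling this directive. open_microphone and send_timed_out_event are hypothetical helpers; the initiator, if present, is carried over to the next Recognize event:

import threading

def handle_expect_speech(directive, open_microphone, send_timed_out_event):
    """Open the microphone within the timeout window or report a timeout."""
    payload = directive["directive"]["payload"]
    initiator = payload.get("initiator")  # echo back in the subsequent Recognize event
    timeout_s = payload["timeoutInMilliseconds"] / 1000.0

    # If the microphone has not opened when the timer fires,
    # an ExpectSpeechTimedOut event must be sent to AVS.
    timer = threading.Timer(timeout_s, send_timed_out_event)
    timer.start()

    if open_microphone(initiator):  # hypothetical; returns True once the mic is open
        timer.cancel()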

For information on the rules that govern audio prioritization, please review the Interaction Model.

Sample Message

{
  "directive": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "ExpectSpeech",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "timeoutInMilliseconds": {{LONG}},
      "initiator": "{{STRING}}"
    }
  }
}

Header Parameters

messageId (string): A unique ID used to represent a specific message.
dialogRequestId (string): A unique ID used to correlate directives sent in response to a specific Recognize event.

Payload Parameters

timeoutInMilliseconds (long): Specifies, in milliseconds, how long your client should wait for the microphone to open and begin streaming user speech to AVS. If the microphone is not opened within the specified timeout window, the ExpectSpeechTimedOut event must be sent. The primary use case for this behavior is a PRESS_AND_HOLD implementation.
initiator (string): An opaque string passed from AVS to your client. If present, it must be sent back to AVS as the initiator in the subsequent Recognize event.

ExpectSpeechTimedOut Event

This event must be sent to AVS if an ExpectSpeech directive was received, but was not satisfied within the specified timeout window.

Sample Message

{
  "event": {
    "header": {
      "namespace": "SpeechRecognizer",
      "name": "ExpectSpeechTimedOut",
      "messageId": "{{STRING}}"
    },
    "payload": {
    }
  }
}

Header Parameters

messageId (string): A unique ID used to represent a specific message.

Payload Parameters

An empty payload should be sent.
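
Constructing the event is straightforward; a sketch mirroring the sample message above (uuid generates the required unique messageId):

import uuid

expect_speech_timed_out = {
    "event": {
        "header": {
            "namespace": "SpeechRecognizer",
            "name": "ExpectSpeechTimedOut",
            "messageId": str(uuid.uuid4()),
        },
        "payload": {},  # an empty payload, as required
    }
}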
