SpeechRecognizer Interface

    Every user utterance leverages SpeechRecognizer. It is the core interface of the Alexa Voice Service (AVS). It exposes directives and events for capturing user speech and prompting a client when Alexa needs additional speech input.

    Additionally, this interface allows your client to inform AVS of how an interaction with Alexa was initiated (press and hold, tap and release, voice-initiated/wake word enabled), and choose the appropriate Automatic Speech Recognition (ASR) profile for your product, which allows Alexa to understand user speech and respond with precision.

    Version Changes

    SpeechRecognizer 2.3 introduces the following additions, each documented below:

    • The SetWakeWords directive and the WakeWordsReport and WakeWordsChanged events, along with the wakeWords configuration in the Capabilities API declaration.
    • The SetWakeWordConfirmation and SetSpeechConfirmation directives and their corresponding report and change events.
    • The SetEndOfSpeechOffset directive and the startOfSpeechTimestamp field in the Recognize event, which support device-side user-perceived latency (UPL) calculations, plus the wakeWord field in the Recognize initiator payload.

    State Diagram

    The following diagram illustrates state changes driven by SpeechRecognizer components. Boxes represent SpeechRecognizer states and the connectors indicate state transitions.

    SpeechRecognizer has the following states:

    IDLE: Prior to capturing user speech, SpeechRecognizer should be in an idle state. SpeechRecognizer should also return to an idle state after a speech interaction with AVS has concluded. This can occur when a speech request has been successfully processed or when the timeout window for an ExpectSpeech directive has elapsed and an ExpectSpeechTimedOut event has been sent.

    Additionally, SpeechRecognizer may return to an idle state during a multi-turn interaction. In that case, if additional speech is required from the user, it should transition from the idle state to the expecting speech state without the user starting a new interaction.

    RECOGNIZING: When a user begins interacting with your client, specifically when captured audio is streamed to AVS, SpeechRecognizer should transition from the idle state to the recognizing state. It should remain in the recognizing state until the client stops recording speech (or streaming is complete), at which point your SpeechRecognizer component should transition from the recognizing state to the busy state.

    BUSY: While processing the speech request, SpeechRecognizer should be in the busy state. You cannot start another speech request until the component transitions out of the busy state. From the busy state, SpeechRecognizer will transition to the idle state if the request is successfully processed (completed) or to the expecting speech state if Alexa requires additional speech input from the user.

    EXPECTING SPEECH: SpeechRecognizer should be in the expecting speech state when additional audio input is required from a user. From expecting speech, SpeechRecognizer should transition to the recognizing state when a user interaction occurs or the interaction is automatically started on the user's behalf. It should transition to the idle state if no user interaction is detected within the specified timeout window.

    SpeechRecognizer State Diagram

    Capabilities API

    To use version 2.3 of the SpeechRecognizer interface, declare it in your call to the Capabilities API. Note that SpeechRecognizer 2.3 depends on version 2.0 or higher of the System interface.

    Sample Object

    {
      "type": "AlexaInterface",
      "interface": "SpeechRecognizer",
      "version": "2.3",
      "configurations": {
        "wakeWords": [
          {
            "scopes": ["DEFAULT"],
            "values": [
              ["ALEXA"]
            ]
          }
        ]
      }
    }
    

    wakeWords

    This list informs Alexa of the wake words that the device can be set to listen for through the SetWakeWords, WakeWordsReport, and WakeWordsChanged messages.

    Currently, the only wake word available for AVS devices is ALEXA, which is applicable to every locale the device might be set to. Therefore, specify ALEXA in the global DEFAULT scope, as shown above.

    Recognize Event

    The Recognize event is used to send user speech to AVS and translate that speech into one or more directives. This event must be sent as a multipart message, consisting of two parts:

    • A JSON-formatted object
    • The binary audio captured by the product's microphone.

    Captured audio that is streamed to AVS should be separated into chunks in order to reduce latency; the stream should contain 10ms of captured audio per chunk (320 bytes).
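
    At 16kHz, 16-bit, single-channel PCM, 10ms of audio works out to 320 bytes (16,000 samples/s × 2 bytes/sample × 0.010 s). A minimal sketch of this chunking in Python, assuming the capture is available as a bytes-like PCM buffer:

    # 16,000 samples/s * 2 bytes/sample * 0.010 s = 320 bytes per 10ms chunk.
    SAMPLE_RATE_HZ = 16000
    BYTES_PER_SAMPLE = 2
    CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 10 // 1000  # 320

    def pcm_chunks(pcm_buffer: bytes):
        """Yield successive 10ms (320-byte) chunks of captured PCM audio."""
        for offset in range(0, len(pcm_buffer), CHUNK_BYTES):
            yield pcm_buffer[offset:offset + CHUNK_BYTES]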

    After an interaction with Alexa is initiated, the microphone must remain open until:

    • A StopCapture directive is received.
    • The stream is closed by the Alexa service.
    • The user manually closes the microphone (for example, in a press-and-hold implementation).

    The profile parameter and initiator object tell Alexa which ASR profile should be used to best understand the captured audio, and how the interaction was initiated.

    All captured audio must be sent to AVS in either PCM or Opus, and adhere to the following specifications:

    PCM:
    • 16bit Linear PCM
    • 16kHz sample rate
    • Single channel
    • Little endian byte order

    Opus:
    • 16bit Opus
    • 16kHz sample rate
    • 32k bit rate
    • Little endian byte order

    For a protocol-specific example, see Structuring an HTTP/2 Request.

    Sample Message

    {
      "context": [
        // This is an array of context objects that are used to communicate the
        // state of all client components to Alexa. See Context for details.      
      ],
      "event": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "Recognize",
          "messageId": "{{STRING}}",
          "dialogRequestId": "{{STRING}}"
        },
        "payload": {
          "profile": "{{STRING}}",
          "format": "{{STRING}}",
          "initiator": {
            "type": "{{STRING}}",
            "payload": {
              "wakeWordIndices": {
                "startIndexInSamples": {{LONG}},
                "endIndexInSamples": {{LONG}}
              },
              "wakeWord": "{{STRING}}",
              "token": "{{STRING}}"
            }
          },
          "startOfSpeechTimestamp": "{{STRING}}"
        }
      }
    }
    

    Binary Audio Attachment

    Each Recognize event requires a corresponding binary audio attachment as one part of the multipart message. The following headers are required for each binary audio attachment:

    Content-Disposition: form-data; name="audio"
    Content-Type: application/octet-stream
    
    {{BINARY AUDIO ATTACHMENT}}
    

    Context

    This event requires your product to report the status of all client component states to Alexa in the context object. For additional information see Context.

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string
    dialogRequestId A unique identifier that your client must create for each Recognize event sent to Alexa. This parameter is used to correlate directives sent in response to a specific Recognize event. string
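
    Both identifiers can be ordinary UUIDs, as in this sketch using Python's standard uuid module; generate a fresh messageId for every message and a fresh dialogRequestId for every Recognize event:

    import uuid

    message_id = str(uuid.uuid4())         # unique per message
    dialog_request_id = str(uuid.uuid4())  # unique per Recognize event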

    Payload Parameters

    Parameter Description Type
    profile Identifies the Automatic Speech Recognition (ASR) profile associated with your product. AVS supports three distinct ASR profiles optimized for user speech from varying distances.
    Accepted values: CLOSE_TALK, NEAR_FIELD, FAR_FIELD.
    string
    format Identifies the format of captured audio.
    Accepted Values: AUDIO_L16_RATE_16000_CHANNELS_1 (PCM), OPUS
    string
    initiator Lets Alexa know how an interaction was initiated.

    This object is required when an interaction is originated by the end user (wake word, tap, press and hold).

    If initiator is present in an ExpectSpeech directive then it must be returned in the following Recognize event. If initiator is absent from the ExpectSpeech directive, then it should not be included in the following Recognize event.
    object
    initiator.type Represents the action taken by a user to initiate an interaction with Alexa.
    Accepted Values: PRESS_AND_HOLD, TAP, and WAKEWORD

    If an initiator.type is provided in an ExpectSpeech directive, that string must be returned as initiator.type in the following Recognize event.
    string
    initiator.payload Includes information about the initiator. object
    initiator.payload.wakeWordIndices This object is required when initiator.type is set to WAKEWORD.
    wakeWordIndices includes the startIndexInSamples and endIndexInSamples. For additional details, see Requirements for Cloud-Based Wake Word Verification.
    object
    initiator.payload.wakeWordIndices.startIndexInSamples Represents the index in the audio stream where the wake word starts (in samples). The start index should be accurate to within 50ms of wake word detection. long
    initiator.payload.wakeWordIndices.endIndexInSamples Represents the index in the audio stream where the wake word ends (in samples). The end index should be accurate to within 150ms of the end of the detected wake word. long
    initiator.payload.wakeWord The wake word that woke the device, in all capital letters.

    Accepted Values: ALEXA
    string
    initiator.payload.token An opaque string. This value is only required if present in the payload of a preceding ExpectSpeech directive. string
    startOfSpeechTimestamp The timestamp of the start of the user's speech. If provided, this optional field will be returned verbatim by Alexa in a subsequent SetEndOfSpeechOffset directive for use in device-side UPL calculations. Because the value is opaque to Alexa, it can be in any timestamp format. string
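
    Because the wake word indices are expressed in samples, a detector that reports offsets in milliseconds needs a conversion at the 16kHz capture rate. A sketch (the example offsets are hypothetical):

    SAMPLE_RATE_HZ = 16000

    def wake_word_indices(start_ms: int, end_ms: int) -> dict:
        """Convert wake word offsets (ms into the stream) to sample indices."""
        return {
            "startIndexInSamples": start_ms * SAMPLE_RATE_HZ // 1000,
            "endIndexInSamples": end_ms * SAMPLE_RATE_HZ // 1000,
        }

    # A wake word detected between 500ms and 1250ms of capture:
    # {'startIndexInSamples': 8000, 'endIndexInSamples': 20000}
    print(wake_word_indices(500, 1250))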

    Profiles

    ASR profiles are tuned for different products, form factors, acoustic environments and use cases. Use the table below to learn more about accepted values for the profile parameter.

    Value Optimal Listening Distance
    CLOSE_TALK 0 to 2.5 ft.
    NEAR_FIELD 0 to 5 ft.
    FAR_FIELD 0 to 20+ ft.
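
    As a sketch, a client might pick the profile from the product's expected maximum listening distance, using the ranges above:

    def asr_profile(max_distance_ft: float) -> str:
        """Map the device's expected listening distance to an ASR profile."""
        if max_distance_ft <= 2.5:
            return "CLOSE_TALK"
        if max_distance_ft <= 5.0:
            return "NEAR_FIELD"
        return "FAR_FIELD"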

    Initiator

    The initiator parameter tells AVS how an interaction with Alexa was triggered and determines two things:

    1. Whether StopCapture is sent to your client when the end of speech is detected in the cloud.
    2. Whether cloud-based wake word verification is performed on the stream.

    initiator must be included in the payload of each SpeechRecognizer.Recognize event. The following values are accepted:

    PRESS_AND_HOLD: Audio stream initiated by pressing a button (physical or GUI) and terminated by releasing it.
    Supported profile: CLOSE_TALK. StopCapture enabled: No. Wake word verification enabled: No. Wake word indices required: No.

    TAP: Audio stream initiated by the tap and release of a button (physical or GUI) and terminated when a StopCapture directive is received.
    Supported profiles: NEAR_FIELD, FAR_FIELD. StopCapture enabled: Yes. Wake word verification enabled: No. Wake word indices required: No.

    WAKEWORD: Audio stream initiated by the use of a wake word and terminated when a StopCapture directive is received.
    Supported profiles: NEAR_FIELD, FAR_FIELD. StopCapture enabled: Yes. Wake word verification enabled: Yes. Wake word indices required: Yes.

    StopCapture Directive

    This directive instructs your client to stop capturing a user’s speech after AVS has identified the user’s intent or when end of speech is detected. When this directive is received, your client must immediately close the microphone and stop listening for the user’s speech.

    Sample Message

    {
        "directive": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "StopCapture",
                "messageId": "{{STRING}}",
                "dialogRequestId": "{{STRING}}"
            },
            "payload": {
            }
        }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string
    dialogRequestId A unique ID used to correlate directives sent in response to a specific Recognize event. string

    ExpectSpeech Directive

    ExpectSpeech is sent when Alexa requires additional information to fulfill a user's request. It instructs your client to open the microphone and begin streaming user speech. If the microphone is not opened within the specified timeout window, an ExpectSpeechTimedOut event must be sent from your client to AVS.

    During a multi-turn interaction with Alexa, your device will receive at least one ExpectSpeech directive instructing your client to start listening for user speech. If present, the initiator object included in the payload of the ExpectSpeech directive must be passed back to Alexa as the initiator object in the following Recognize event. If initiator is absent from the payload, the following Recognize event should not include initiator.

    For information on the rules that govern audio prioritization, please review the Interaction Model.

    Sample Message

    {
        "directive": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "ExpectSpeech",
                "messageId": "{{STRING}}",
                "dialogRequestId": "{{STRING}}"
            },
            "payload": {
                "timeoutInMilliseconds": {{LONG}},
                "initiator": {
                    "type": "{{STRING}}",
                    "payload": {
                        "token": "{{STRING}}"
                    }
                }
            }
        }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string
    dialogRequestId A unique ID used to correlate directives sent in response to a specific Recognize event. string

    Payload Parameters

    Parameter Description Type
    timeoutInMilliseconds Specifies, in milliseconds, how long your client should wait for the microphone to open and begin streaming user speech to AVS. If the microphone is not opened within the specified timeout window, then the ExpectSpeechTimedOut event must be sent. The primary use case for this behavior is a PRESS_AND_HOLD implementation. long
    initiator Contains information about the interaction. If present it must be sent back to Alexa in the following Recognize event. object
    initiator.type An opaque string. If present it must be sent back to Alexa in the following Recognize event. string
    initiator.payload Includes information about the initiator. object
    initiator.payload.token An opaque string. If present it must be sent back to Alexa in the following Recognize event. string
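
    A minimal sketch of the timeout behavior, where microphone_opened and send_event are hypothetical client hooks rather than part of this interface:

    import time

    def handle_expect_speech(payload: dict, microphone_opened, send_event) -> None:
        """Wait for the microphone within the directive's timeout window."""
        deadline = time.monotonic() + payload["timeoutInMilliseconds"] / 1000.0
        while time.monotonic() < deadline:
            if microphone_opened():
                # The following Recognize event must echo payload["initiator"]
                # if one was present in this directive.
                return
            time.sleep(0.01)
        send_event("ExpectSpeechTimedOut", {})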

    ExpectSpeechTimedOut Event

    This event must be sent to AVS if an ExpectSpeech directive was received, but was not satisfied within the specified timeout window.

    Sample Message

    {
        "event": {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "ExpectSpeechTimedOut",
                "messageId": "{{STRING}}",
            },
            "payload": {
            }
        }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    An empty payload should be sent.

    SetEndOfSpeechOffset Directive

    The SetEndOfSpeechOffset directive provides the device with the values necessary to calculate the user-perceived latency (UPL) associated with a user's utterance.

    UPL is defined as the duration between the end of a user's speech and the beginning of the resulting activity, such as audio output from a Speak directive.

    The device stops streaming audio to Alexa as part of the Recognize event only after receiving the StopCapture directive. As a result, it cannot determine the duration of the user's speech in the audio stream on its own.

    The SetEndOfSpeechOffset directive supplies that duration in the endOfSpeechOffsetInMilliseconds field. By adding this duration to the timestamp at which speech started, the device can calculate the timestamp of the end of speech, and from there the duration between the end of speech and the beginning of the resulting activity.

    Example UPL Calculation

    1. User says "Alexa, What's the weather?"
    2. Device stores timestamp of the beginning of speech. Call this t0.
    3. Device sends the Recognize event with the user's speech.
    4. The device receives the SetEndOfSpeechOffset directive with the endOfSpeechOffsetInMilliseconds value. Call this speech duration d.
    5. The device receives a Speak directive with the audio stream containing the weather.
    6. The device begins playing the audio at timestamp t1.
    7. The device calculates the UPL as t1 - (t0 + d).
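
    The same calculation as a sketch, with all timestamps in milliseconds:

    def user_perceived_latency_ms(t0_ms: int, d_ms: int, t1_ms: int) -> int:
        """UPL = t1 - (t0 + d): end of user speech to start of the response."""
        return t1_ms - (t0_ms + d_ms)

    # Speech began at t0 = 10,000ms, Alexa reported d = 1,500ms of speech,
    # and the Speak audio started at t1 = 12,250ms: UPL = 750ms.
    print(user_perceived_latency_ms(10_000, 1_500, 12_250))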

    Sample Message

    {
      "directive": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "SetEndOfSpeechOffset",
          "messageId": "{{STRING}}",
          "dialogRequestId": "{{STRING}}"
        },
        "payload": {
          "endOfSpeechOffsetInMilliseconds": {{LONG}},
          "startOfSpeechTimestamp": "{{STRING}}"
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string
    dialogRequestId A unique ID used to correlate directives sent in response to a specific Recognize event. string

    Payload Parameters

    Parameter Description Type
    endOfSpeechOffsetInMilliseconds The duration, in milliseconds, of the user's speech sent in the corresponding Recognize event. long
    startOfSpeechTimestamp If provided in the corresponding Recognize event, this is the same value passed back. By adding the endOfSpeechOffsetInMilliseconds to it, the timestamp of the end of the user's speech can be directly calculated. string

    SetWakeWords Directive

    Alexa will send the SetWakeWords directive to the device to instruct it to set its wake word(s). This may result from an end user's setting the wake word(s) in the Alexa companion app. The device must send the WakeWordsReport event in response to this directive.

    Sample Message

    {
      "directive": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "SetWakeWords",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "wakeWords": ["{{STRING}}", ...]
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    wakeWords The list of wake words the device should set to active. If multiple values are present, the device must load the required wake word models to respond to any of the listed wake words. list
    wakeWords[i] A wake word that the device must recognize to begin streaming audio to Alexa through the Recognize event.

    Possible Values: ALEXA
    string

    WakeWordsReport Event

    The device sends the WakeWordsReport event in response to a SetWakeWords directive sent by Alexa. (For wake word changes initiated by the device, including when the change is first triggered by a peripheral such as a companion app, the WakeWordsChanged event must be sent instead.)

    This event must be sent both in cases of success and failure, reporting the wake word(s) actually set on the device after processing the SetWakeWords directive.
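
    A sketch of that handshake, where load_wake_word_models and send_event are hypothetical client hooks:

    def handle_set_wake_words(payload: dict, load_wake_word_models, send_event) -> None:
        """Activate the requested wake words, then report the actual state."""
        requested = payload["wakeWords"]           # e.g. ["ALEXA"]
        active = load_wake_word_models(requested)  # returns the wake words
                                                   # actually set, even on failure
        send_event("WakeWordsReport", {"wakeWords": active})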

    The event object, without the messageId in its header, must be included in the StateReport event's states list when responding to the ReportState directive.
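
    For example, the corresponding entry in the states list would be shaped like this (a sketch; the ReportState directive and StateReport event belong to the System interface):

    {
      "header": {
        "namespace": "SpeechRecognizer",
        "name": "WakeWordsReport"
      },
      "payload": {
        "wakeWords": ["ALEXA"]
      }
    }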

    Sample Message

    {
      "event": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "WakeWordsReport",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "wakeWords": ["{{STRING}}", ...]
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    wakeWords The list of wake words active on the device. If the device responds to multiple wake words, they must each be listed. list
    wakeWords[i] A wake word the device currently recognizes to begin streaming audio to Alexa through the Recognize event.

    Accepted Values: ALEXA
    string

    WakeWordsChanged Event

    The device sends the WakeWordsChanged event when a wake word change is initiated by the device. Such changes include those triggered by peripherals, such as third-party companion apps that instruct the device to change its wake word(s) without otherwise informing Alexa directly. (For wake word changes initiated by Alexa via the SetWakeWords directive, the WakeWordsReport event must be sent instead.)

    Sample Message

    {
      "event": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "WakeWordsChanged",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "wakeWords": ["{{STRING}}", ...]
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    wakeWords The list of wake words active on the device. If the device responds to multiple wake words, they must each be listed. list
    wakeWords[i] A wake word the device currently recognizes to begin streaming audio to Alexa through the Recognize event.

    Accepted Values: ALEXA
    string

    SetWakeWordConfirmation Directive

    Alexa will send the SetWakeWordConfirmation directive to the device to instruct it whether to play a tone when it detects a user speaking a wake word. This may result from an end user's setting the behavior in the Alexa companion app. The device must send the WakeWordConfirmationReport event in response to this directive.

    Sample Message

    {
      "directive": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "SetWakeWordConfirmation",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "wakeWordConfirmation": "{{STRING}}"
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    wakeWordConfirmation The expected behavior when the wake word has been detected.

    Possible Values:
    TONE: The device must emit an audible tone when the wake word is detected.
    NONE: The device must not emit an audible tone when the wake word is detected.
    string
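
    A minimal sketch of applying the setting, where set_wake_word_tone and send_event are hypothetical client hooks; the same report-what-was-actually-set pattern applies to SetSpeechConfirmation below:

    def handle_set_wake_word_confirmation(payload: dict, set_wake_word_tone,
                                          send_event) -> None:
        """Apply TONE/NONE, then report the value actually in effect."""
        requested = payload["wakeWordConfirmation"]  # "TONE" or "NONE"
        applied = set_wake_word_tone(requested)      # returns the effective value
        send_event("WakeWordConfirmationReport",
                   {"wakeWordConfirmation": applied})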

    WakeWordConfirmationReport Event

    The device sends the WakeWordConfirmationReport event in response to a SetWakeWordConfirmation directive sent by Alexa. (For setting changes initiated by the device, including when the change is first triggered by a peripheral such as a companion app, the WakeWordConfirmationChanged event must be sent instead.)

    This event must be sent both in cases of success and failure, reporting the value actually set on the device after processing the SetWakeWordConfirmation directive.

    The event object, without the messageId in its header, must be included in the StateReport event's states list when responding to the ReportState directive.

    Sample Message

    {
      "event": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "WakeWordConfirmationReport",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "wakeWordConfirmation": "{{STRING}}"
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    wakeWordConfirmation The behavior when the wake word has been detected.

    Accepted Values:
    TONE: The device emits an audible tone when the wake word is detected.
    NONE: The device does not emit an audible tone when the wake word is detected.
    string

    WakeWordConfirmationChanged Event

    The device sends the WakeWordConfirmationChanged event when a setting change is initiated by the device. Such changes include those triggered by peripherals, such as third-party companion apps that instruct the device to change the setting without otherwise informing Alexa directly. (For changes initiated by Alexa via the SetWakeWordConfirmation directive, the WakeWordConfirmationReport event must be sent instead.)

    Sample Message

    {
      "event": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "WakeWordsChanged",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "wakeWordConfirmation": "{{STRING}}"
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    wakeWordConfirmation The behavior when the wake word has been detected.

    Accepted Values:
    TONE: The device emits an audible tone when the wake word is detected.
    NONE: The device does not emit an audible tone when the wake word is detected.
    string

    SetSpeechConfirmation Directive

    Alexa will send the SetSpeechConfirmation directive to the device to instruct it whether to play a tone when it stops capturing user audio input. This may result from an end user's setting the behavior in the Alexa companion app. The device must send the SpeechConfirmationReport event in response to this directive.

    Sample Message

    {
      "directive": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "SetSpeechConfirmation",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "speechConfirmation": "{{STRING}}"
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    speechConfirmation The expected behavior when the device stops capturing user audio.

    Possible Values:
    TONE: The device must emit an audible tone when it stops capturing audio.
    NONE: The device must not emit an audible tone when it stops capturing audio.
    string

    SpeechConfirmationReport Event

    The device sends the SpeechConfirmationReport event in response to a SetSpeechConfirmation directive sent by Alexa. (For setting changes initiated by the device, including when the change is first triggered by a peripheral such as a companion app, the SpeechConfirmationChanged event must be sent instead.)

    This event must be sent both in cases of success and failure, reporting the value actually set on the device after processing the SetSpeechConfirmation directive.

    The event object, without the messageId in its header, must be included in the StateReport event's states list when responding to the ReportState directive.

    Sample Message

    {
      "event": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "SpeechConfirmationReport",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "speechConfirmation": "{{STRING}}"
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    speechConfirmation The behavior when the device stops capturing user audio.

    Accepted Values:
    TONE: The device emits an audible tone when it stops capturing user audio.
    NONE: The device does not emit an audible tone when it stops capturing user audio.
    string

    SpeechConfirmationChanged Event

    The device sends the SpeechConfirmationChanged event when a setting change is initiated by the device. Such changes include those triggered by peripherals, such as third-party companion apps that instruct the device to change the setting without otherwise informing Alexa directly. (For changes initiated by Alexa via the SetSpeechConfirmation directive, the SpeechConfirmationReport event must be sent instead.)

    Sample Message

    {
      "event": {
        "header": {
          "namespace": "SpeechRecognizer",
          "name": "SpeechConfirmationChanged",
          "messageId": "{{STRING}}"
        },
        "payload": {
          "speechConfirmation": "{{STRING}}"
        }
      }
    }
    

    Header Parameters

    Parameter Description Type
    messageId A unique ID used to represent a specific message. string

    Payload Parameters

    Parameter Description Type
    speechConfirmation The behavior when the device stops capturing user audio.

    Accepted Values:
    TONE: The device emits an audible tone when it stops capturing user audio.
    NONE: The device does not emit an audible tone when it stops capturing user audio.
    string
