SpeechSynthesizer Interface

    When you ask Alexa a question, the SpeechSynthesizer interface returns the appropriate speech response.

    For example, if you ask Alexa "What's the weather in Seattle?", your client receives a SpeechSynthesizer.Speak directive from the Alexa Voice Service (AVS). This directive contains a binary audio attachment with the appropriate answer, which your client must process and play.

    The following sections cover SpeechSynthesizer directives and events.

    Version changes

    • Support for user interruption of text-to-speech (TTS) output
    • Support for cloud-initiated interruption of TTS output
      • ADDED playBehavior field to the Speak directive
    • Support for captions for TTS
      • ADDED caption field to the Speak directive

    States

    SpeechSynthesizer has the following states (a minimal state-tracking sketch follows the list):

    • PLAYING - When Alexa speaks, SpeechSynthesizer is in the PLAYING state. SpeechSynthesizer transitions to the FINISHED state when speech playback completes.
    • FINISHED - When Alexa finishes speaking, SpeechSynthesizer transitions to the FINISHED state with a SpeechFinished event.
    • INTERRUPTED - When Alexa is speaking and playback is interrupted, SpeechSynthesizer transitions to the INTERRUPTED state. Interruptions occur through voice barge-in, physical Tap-to-Talk, or a Speak directive with a REPLACE_ALL playBehavior. INTERRUPTED is temporary until the next Speak directive starts.
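
    As a rough illustration, a client might mirror these states with a small tracker like the one below; the class and method names are hypothetical and not part of the AVS API.

    from enum import Enum

    class SpeechSynthesizerState(Enum):
        PLAYING = "PLAYING"
        FINISHED = "FINISHED"
        INTERRUPTED = "INTERRUPTED"

    class SpeechStateTracker:
        """Hypothetical helper that mirrors the SpeechSynthesizer state machine."""

        def __init__(self):
            # Before any Speak directive has been handled, report FINISHED.
            self.state = SpeechSynthesizerState.FINISHED

        def on_playback_started(self):
            # A Speak directive's audio has started rendering.
            self.state = SpeechSynthesizerState.PLAYING

        def on_playback_finished(self):
            # TTS rendered to completion; pair this with a SpeechFinished event.
            self.state = SpeechSynthesizerState.FINISHED

        def on_barge_in(self):
            # Voice or Tap-to-Talk barge-in, or a REPLACE_ALL Speak directive;
            # pair this with a SpeechInterrupted event.
            self.state = SpeechSynthesizerState.INTERRUPTED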

    Capability Assertion

    SpeechSynthesizer 1.3 may be implemented by the device on its own behalf, but not on behalf of any connected endpoints.

    New AVS integrations must assert support through Alexa.Discovery, but Alexa will continue to support existing integrations using the Capabilities API.

    Sample Object

    {
        "type": "AlexaInterface",
        "interface": "SpeechSynthesizer",
        "version": "1.3"
    }
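
    As a rough sketch, the capability object can be built once and added to whatever list of capability objects your client publishes through Alexa.Discovery; the function and list names below are hypothetical, and the discovery envelope itself is omitted.

    def speech_synthesizer_capability(version: str = "1.3") -> dict:
        # Matches the Sample Object above.
        return {
            "type": "AlexaInterface",
            "interface": "SpeechSynthesizer",
            "version": version
        }

    # Hypothetical list of capability objects reported during discovery.
    device_capabilities = [speech_synthesizer_capability()]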
    

    SpeechSynthesizer context

    When your client reports context, it must include the playerActivity and offsetInMilliseconds for the TTS that is currently playing, if any.

    To learn more about reporting Context, see Context Overview.

    Sample Message

    {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechState"
        },
        "payload": {
            "token": "{{STRING}}",
            "offsetInMilliseconds": {{LONG}},
            "playerActivity": "{{STRING}}"
        }
    }
    

    Payload Parameters

    • token (string): An opaque token provided in the Speak directive.
    • offsetInMilliseconds (long): Identifies the current TTS offset in milliseconds.
    • playerActivity (string): Identifies the component state of SpeechSynthesizer. Accepted values: PLAYING, FINISHED, or INTERRUPTED.

    Player activity values:

    • PLAYING: Speech is playing.
    • FINISHED: Speech finished playing.
    • INTERRUPTED: Speech was interrupted. Interruptions occur through voice barge-in, physical Tap-to-Talk, or a Speak directive with a REPLACE_ALL playBehavior.
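
    The following sketch shows one way to assemble this context entry before sending an event such as Recognize; the function name and the example values are hypothetical, and only the JSON shape comes from the sample message above.

    def build_speech_state_context(token: str, offset_ms: int, player_activity: str) -> dict:
        # Shape mirrors the SpeechState sample message above.
        return {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechState"
            },
            "payload": {
                "token": token,
                "offsetInMilliseconds": offset_ms,
                "playerActivity": player_activity  # PLAYING, FINISHED, or INTERRUPTED
            }
        }

    # Example: report a TTS response that is 1.5 seconds into playback.
    context_entry = build_speech_state_context("token-from-speak-directive", 1500, "PLAYING")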

    Speak directive

    AVS sends a Speak directive to your client every time Alexa delivers a speech response. Your client receives a Speak directive in two ways:

    1. When a user makes a voice request, such as asking Alexa a question. AVS sends a Speak directive to your client after it receives a Recognize event.
    2. When a user performs an action, such as setting a timer. First, the timer is set with the SetAlert directive. Then AVS sends a Speak directive to your client, notifying you that the timer started.

    Sample Message

    The Speak directive is a multipart message with two parts: a JSON-formatted directive and a binary audio attachment.

    JSON

    {
      "directive": {
        "header": {
          "namespace": "SpeechSynthesizer",
          "name": "Speak",
          "messageId": "{{STRING}}",
          "dialogRequestId": "{{STRING}}"
        },
        "payload": {
          "url": "{{STRING}}",
          "format": "{{STRING}}",
          "token": "{{STRING}}",
          "playBehavior": "{{STRING}}",
          "caption": {
            "content": "{{STRING}}",
            "type":  "{{STRING}}"
          }
        }
      }
    }
    

    Binary Audio Attachment

    The following multipart headers precede the binary audio attachment.

    Content-Type: application/octet-stream
    Content-ID: {{Audio Item CID}}
    
    {{BINARY AUDIO ATTACHMENT}}
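
    The sketch below shows one way to pair the JSON directive with its audio once your HTTP layer has split the multipart response into parts. Here, parts is a hypothetical list of (headers, body) tuples, and stripping angle brackets from Content-ID is an assumption about how your transport exposes that header; only the cid: matching convention comes from this document.

    import json

    def find_speak_audio(parts):
        """Return (speak_directive, audio_bytes) from pre-split multipart parts."""
        directive, audio_by_cid = None, {}
        for headers, body in parts:
            content_type = headers.get("Content-Type", "")
            if content_type.startswith("application/json"):
                message = json.loads(body)
                if message.get("directive", {}).get("header", {}).get("name") == "Speak":
                    directive = message
            elif content_type.startswith("application/octet-stream"):
                # Content-ID values are commonly wrapped in angle brackets;
                # strip them before comparing against the cid: URL.
                cid = headers.get("Content-ID", "").strip("<>")
                audio_by_cid[cid] = body
        if directive is None:
            return None, None
        wanted_cid = directive["directive"]["payload"]["url"].removeprefix("cid:")
        return directive, audio_by_cid.get(wanted_cid)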
    

    Header Parameters

    • messageId (string): A unique ID used to represent a specific message.
    • dialogRequestId (string): A unique ID used to correlate directives sent in response to a specific Recognize event.

    Payload Parameters

    • url (string): A unique identifier for the audio content. The URL always has the cid: prefix, for example cid:{{STRING}}.
    • format (string): The format of the returned audio. Accepted value: AUDIO_MPEG.
    • token (string): An opaque token that represents the current text-to-speech (TTS) object.
    • playBehavior (string): Specifies the desired playback behavior when the device receives more than one Speak directive. Possible values:
      • REPLACE_ALL: The device must (1) send a SpeechInterrupted event, (2) transition the playerActivity to INTERRUPTED, (3) stop any TTS that's playing, (4) clear the enqueued Speak directives, and (5) start playing the new audio from this Speak directive.
      • ENQUEUE: Play the new Speak content after all previously submitted and enqueued Speak directives have rendered.
      • REPLACE_ENQUEUED: Replace all directives in the queue with this Speak directive, but don't interrupt the currently playing TTS.
    • caption (object): If AVS includes this object, the device can use it to generate captions for the attached TTS content.
    • caption.type (string): The caption format. Possible value: WEBVTT.
    • caption.content (string): The time-encoded caption text for the attached TTS.
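
    A minimal sketch of applying playBehavior, assuming a simple local queue: the SpeakQueue class, the callback names, and the default of treating a missing playBehavior as ENQUEUE are all assumptions, not part of the AVS specification.

    from collections import deque

    class SpeakQueue:
        def __init__(self, stop_current_tts, send_speech_interrupted):
            self.queue = deque()
            self.stop_current_tts = stop_current_tts                 # stops local playback
            self.send_speech_interrupted = send_speech_interrupted   # sends the event to AVS

        def handle_speak(self, speak_directive, currently_playing_token=None):
            payload = speak_directive["directive"]["payload"]
            # Treating a missing playBehavior as ENQUEUE is an assumption.
            behavior = payload.get("playBehavior", "ENQUEUE")
            if behavior == "REPLACE_ALL":
                if currently_playing_token is not None:
                    # Report the interruption, then stop the TTS that's playing.
                    self.send_speech_interrupted(currently_playing_token)
                    self.stop_current_tts()
                self.queue.clear()
                self.queue.append(speak_directive)
            elif behavior == "REPLACE_ENQUEUED":
                # Only the pending queue changes; the current TTS keeps playing.
                self.queue.clear()
                self.queue.append(speak_directive)
            else:  # ENQUEUE
                self.queue.append(speak_directive)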

    SpeechStarted event

    Send the SpeechStarted event to AVS after your client processes the Speak directive and begins playback of synthesized speech.

    Sample Message

    {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechStarted",
                "messageId": "{{STRING}}"
            },
            "payload": {
                "token": "{{STRING}}"
            }
        }
    }
    

    Header Parameters

    • messageId (string): A unique ID used to represent a specific message.

    Payload Parameters

    • token (string): The opaque token provided by the Speak directive.
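
    One way to build this event once playback begins is sketched below; generate_message_id and the surrounding send path are stand-ins for your client's messaging layer, and only the JSON shape comes from the sample message above.

    import uuid

    def generate_message_id() -> str:
        # Any unique string works; a UUID is one common choice.
        return str(uuid.uuid4())

    def build_speech_started(token: str) -> dict:
        return {
            "event": {
                "header": {
                    "namespace": "SpeechSynthesizer",
                    "name": "SpeechStarted",
                    "messageId": generate_message_id()
                },
                "payload": {
                    "token": token
                }
            }
        }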

    SpeechFinished event

    When Alexa finishes speaking, send the SpeechFinished event. Send the event only after your client fully processes the Speak directive and finishes rendering the TTS. Don't send the SpeechFinished event if TTS playback is cancelled or interrupted. For example, if a user interrupts the Alexa TTS with "Alexa, stop," don't send a SpeechFinished event.

    Sample Message

    {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechFinished",
                "messageId": "{{STRING}}"
            },
            "payload": {
                "token": "{{STRING}}"
            }
        }
    }
    

    Header Parameters

    • messageId (string): A unique ID used to represent a specific message.

    Payload Parameters

    • token (string): The opaque token provided by the Speak directive.

    SpeechInterrupted event

    When Alexa is speaking and a user barges in to make a new voice request, the device must do the following:

    1. Transition the playerActivity to INTERRUPTED.
    2. Send the SpeechInterrupted event to AVS.

    Note: An interruption may come from a wake word detection, a physical button press on a Tap-to-Talk device, or a Speak directive with a REPLACE_ALL playBehavior. The INTERRUPTED playerActivity is temporary until the next Speak directive starts.

    Sample Message

    {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechInterrupted",
                "messageId": {{STRING}},
            },
            "payload": {
                "token": {{STRING}},
                "offsetInMilliseconds": {{LONG}}
            }
        }
    }
    

    Header Parameters

    • messageId (string): A unique ID used to represent a specific message.

    Payload Parameters

    • token (string): The value of the token field from the Speak directive that was interrupted.
    • offsetInMilliseconds (long): The offset between when TTS starts and when the interruption occurs. For example, if a user interrupts Alexa 4.124 seconds into playback, the value is 4124.
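
    The sketch below ties the steps together for a barge-in: read the playback offset, transition the reported playerActivity, stop the TTS, and send SpeechInterrupted. The player, tracker, and send_event objects are hypothetical stand-ins for your client's audio and messaging layers.

    import uuid

    def build_speech_interrupted(token: str, offset_ms: int) -> dict:
        return {
            "event": {
                "header": {
                    "namespace": "SpeechSynthesizer",
                    "name": "SpeechInterrupted",
                    "messageId": str(uuid.uuid4())
                },
                "payload": {
                    "token": token,
                    "offsetInMilliseconds": offset_ms  # e.g. 4124 for 4.124 seconds
                }
            }
        }

    def on_barge_in(player, tracker, send_event):
        offset_ms = player.current_offset_ms()  # hypothetical playback-position query
        tracker.on_barge_in()                   # reported playerActivity -> INTERRUPTED
        player.stop()
        send_event(build_speech_interrupted(player.current_token, offset_ms))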
