SpeechSynthesizer 1.3

When you ask Alexa a question, the SpeechSynthesizer interface returns the appropriate speech response.

For example, if you ask Alexa "What's the weather in Seattle?", your client receives a Speak directive from the Alexa Voice Service (AVS). This directive contains a binary audio attachment with the appropriate answer, which you must process and play.

Version changes

  • Support for user interruption of text-to-speech (TTS) output
  • Support for cloud-initiated interruption of TTS output
    • ADDED playBehavior field to the Speak directive
  • Support for captions for TTS
    • ADDED caption field to the Speak directive

States

SpeechSynthesizer has the following states:

  • PLAYING - When Alexa speaks, SpeechSynthesizer is in the PLAYING state. SpeechSynthesizer transitions to the FINISHED state when speech playback completes.
  • FINISHED - When Alexa finishes speaking, SpeechSynthesizer transitions to the FINISHED state with a SpeechFinished event.
  • INTERRUPTED - When Alexa is speaking and gets interrupted, SpeechSynthesizer transitions to the INTERRUPTED state. Interruptions occur through voice, a physical Tap-to-Talk button press, or a Speak directive with a REPLACE_ALL playBehavior. The INTERRUPTED state is temporary and lasts until the next Speak directive starts.
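
The transitions above can be sketched as a small state machine. This is an illustrative sketch only; the class and method names are not part of the AVS API.

```python
# Sketch of the SpeechSynthesizer state machine described above.
# Class and method names are illustrative, not part of the AVS API.

PLAYING, FINISHED, INTERRUPTED = "PLAYING", "FINISHED", "INTERRUPTED"

class SpeechSynthesizerState:
    def __init__(self):
        self.state = FINISHED  # assume idle before the first Speak directive

    def on_speak_started(self):
        # A Speak directive's audio begins rendering.
        self.state = PLAYING

    def on_speech_finished(self):
        # Playback completed normally; pair this with a SpeechFinished event.
        if self.state == PLAYING:
            self.state = FINISHED

    def on_barge_in(self):
        # Voice, Tap-to-Talk, or a REPLACE_ALL Speak directive interrupts TTS;
        # pair this with a SpeechInterrupted event. The INTERRUPTED state lasts
        # until the next Speak directive starts.
        if self.state == PLAYING:
            self.state = INTERRUPTED
```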

Capability assertion

A device can implement SpeechSynthesizer 1.3 on its own behalf, but not on behalf of any connected endpoints.

New AVS integrations must assert support through Alexa.Discovery. Alexa continues to support existing integrations using the Capabilities API.

Sample object

{
    "type": "AlexaInterface",
    "interface": "SpeechSynthesizer",
    "version": "1.3"
}

Context

For each TTS playback that requires context, your client must report playerActivity and offsetInMilliseconds.

To learn more about reporting Context, see Context Overview.

Sample message

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SpeechState"
    },
    "payload": {
        "token": "{{STRING}}",
        "offsetInMilliseconds": {{LONG}},
        "playerActivity": "{{STRING}}"
    }
}
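
A context entry matching the sample above could be assembled like this. The helper name is hypothetical; only the message shape comes from the documentation.

```python
def build_speech_state(token, offset_ms, player_activity):
    """Build the SpeechSynthesizer.SpeechState context entry shown above.

    Hypothetical helper; only the message shape is defined by AVS.
    """
    assert player_activity in ("PLAYING", "FINISHED", "INTERRUPTED")
    return {
        "header": {"namespace": "SpeechSynthesizer", "name": "SpeechState"},
        "payload": {
            "token": token,
            "offsetInMilliseconds": offset_ms,
            "playerActivity": player_activity,
        },
    }
```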

Payload parameters

Parameter | Description | Type
token | An opaque token provided in the Speak directive. | string
offsetInMilliseconds | Identifies the current TTS offset in milliseconds. | long
playerActivity | Identifies the component state of SpeechSynthesizer. Accepted values: PLAYING, FINISHED, or INTERRUPTED. | string

Player Activity | Description
PLAYING | Speech is playing.
FINISHED | Speech finished playing.
INTERRUPTED | Speech was interrupted. Interruptions occur through voice, physical Tap-to-Talk, or a Speak directive with a REPLACE_ALL playBehavior.

Directives

Speak

AVS sends a Speak directive to your client every time Alexa delivers a speech response. Your client receives a Speak directive in two scenarios:

  • When a user makes a voice request, such as asking Alexa a question. AVS sends a Speak directive to your client after it receives a Recognize event.
  • When a user performs an action, such as setting a timer. First, the timer starts with the SetAlert directive. Second, AVS sends a Speak directive to your client, notifying you that the timer started.

Sample message

The Speak directive is a multipart message containing two different formats – one JSON-formatted directive and one binary audio attachment.

JSON

{
  "directive": {
    "header": {
      "namespace": "SpeechSynthesizer",
      "name": "Speak",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "url": "{{STRING}}",
      "format": "{{STRING}}",
      "token": "{{STRING}}",
      "playBehavior": "{{STRING}}",
      "caption": {
        "content": "{{STRING}}",
        "type":  "{{STRING}}"
      }
    }
  }
}

Binary audio attachment

The following multipart headers precede the binary audio attachment.

Content-Type: application/octet-stream
Content-ID: {{Audio Item CID}}

{{BINARY AUDIO ATTACHMENT}}
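
One way to pair the JSON directive with its audio attachment is to strip the cid: prefix from the payload url and compare it against each part's Content-ID header. This is a sketch under the assumption that, as is common for MIME, the Content-ID value may be wrapped in angle brackets; parsing the multipart stream itself is not shown.

```python
def content_id_matches(url, content_id):
    """Return True if a Speak payload 'url' refers to the attachment part
    carrying this Content-ID header value.

    The url carries a 'cid:' prefix; MIME Content-ID values are commonly
    wrapped in angle brackets, so strip both before comparing.
    (Sketch only -- framing of the AVS multipart stream is not shown.)
    """
    if not url.startswith("cid:"):
        return False
    return url[len("cid:"):] == content_id.strip("<>")
```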

Header parameters

Parameter | Description | Type
messageId | A unique ID used to represent a specific message. | string
dialogRequestId | A unique ID used to correlate directives sent in response to a specific Recognize event. | string

Payload parameters

Parameter | Description | Type
url | A unique identifier for the audio content. The URL always follows the prefix cid:. Example: cid: | string
format | The format of the returned audio. Accepted value: AUDIO_MPEG | string
token | An opaque token that represents the current text-to-speech (TTS) object. | string
playBehavior | Specifies the desired playback behavior when the device receives more than one Speak directive. See the accepted values below. | string
caption | If AVS includes this object, the device can use it to generate captions for the attached TTS content. | object
caption.type | The caption format. Possible value: WEBVTT | string
caption.content | The time-encoded caption text for the attached TTS. | string

Accepted playBehavior values:

  • REPLACE_ALL - The device must (1) send a SpeechInterrupted event, (2) transition the playerActivity to INTERRUPTED, (3) stop any TTS that's playing, (4) clear the enqueued Speak directives, and (5) start playing the new audio from this Speak directive.
  • ENQUEUE - Play the new Speak content after all previously submitted and enqueued Speak directives have rendered.
  • REPLACE_ENQUEUED - Replace all directives in the queue with this Speak directive, but don't interrupt the TTS that's playing.
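
The three playBehavior values can be sketched against a simple TTS queue. This is illustrative only: event sending and audio control are stubbed out, and the class name is not part of the AVS API.

```python
from collections import deque

class TtsQueue:
    """Sketch of the queue semantics for the three playBehavior values.

    Illustrative only: sending events and controlling audio are omitted.
    """

    def __init__(self):
        self.queue = deque()   # tokens of enqueued Speak directives
        self.playing = None    # token of the currently rendering TTS, if any

    def handle_speak(self, token, play_behavior):
        if play_behavior == "REPLACE_ALL":
            # Stop the playing TTS (send SpeechInterrupted, playerActivity ->
            # INTERRUPTED), drop the queue, and start the new audio at once.
            self.queue.clear()
            self.playing = token
        elif play_behavior == "ENQUEUE":
            # Render after everything already submitted and enqueued.
            if self.playing is None:
                self.playing = token
            else:
                self.queue.append(token)
        elif play_behavior == "REPLACE_ENQUEUED":
            # Replace the queue but leave the playing TTS uninterrupted.
            self.queue.clear()
            if self.playing is None:
                self.playing = token
            else:
                self.queue.append(token)
```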

Events

SpeechStarted

Send the SpeechStarted event to AVS after your client processes the Speak directive and begins playback of synthesized speech.

Sample message

{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechStarted",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}"
        }
    }
}

Header parameters

Parameter | Description | Type
messageId | A unique ID used to represent a specific message. | string

Payload parameters

Parameter | Description | Type
token | The opaque token provided by the Speak directive. | string

SpeechFinished

When Alexa finishes speaking, send the SpeechFinished event. Send the event only after Alexa fully processes the Speak directive and finishes rendering the TTS. If a user cancels TTS playback, don't send a SpeechFinished event. For example, if a user interrupts the Alexa TTS with "Alexa, stop," send a SpeechInterrupted event instead.

Sample message

{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechFinished",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}"
        }
    }
}

Header parameters

Parameter | Description | Type
messageId | A unique ID used to represent a specific message. | string

Payload parameters

Parameter | Description | Type
token | The opaque token provided by the Speak directive. | string

SpeechInterrupted

When Alexa is in a PLAYING state and a user barges in to make a new voice request, the device must do the following:

  1. Transition the playback state to INTERRUPTED.
  2. Send the SpeechInterrupted event to AVS.

A new voice request can come from wake word detection, a physical button press on a Tap-to-Talk device, or a Speak directive with a REPLACE_ALL playBehavior. The INTERRUPTED playback state is temporary and lasts until the next Speak directive starts.

Sample message

{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechInterrupted",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}",
            "offsetInMilliseconds": {{LONG}}
        }
    }
}

Header parameters

Parameter Description Type
messageId A unique ID used to represent a specific message. string

Payload parameters

Parameter | Description | Type
token | The value of the token field from the Speak directive that was interrupted. | string
offsetInMilliseconds | The offset between when TTS starts and when the interruption occurs. For example, if a user interrupts Alexa 4.124 seconds into the speech, the value is 4124. | long
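
Putting the payload together: the offset is the elapsed milliseconds between the start of TTS playback and the barge-in. The helper below is a sketch; in practice the timestamps come from your audio pipeline, and the helper name is hypothetical.

```python
import uuid

def build_speech_interrupted(token, started_at_ms, interrupted_at_ms):
    """Assemble the SpeechInterrupted event described above (sketch).

    Hypothetical helper; only the message shape is defined by AVS.
    """
    return {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechInterrupted",
                "messageId": str(uuid.uuid4()),  # any unique message ID
            },
            "payload": {
                "token": token,
                # Elapsed milliseconds between TTS start and the interruption.
                "offsetInMilliseconds": interrupted_at_ms - started_at_ms,
            },
        }
    }
```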