SpeechSynthesizer 1.3

When you ask Alexa a question, the SpeechSynthesizer interface returns the appropriate speech response.

For example, if you ask Alexa "What's the weather in Seattle?", your client receives a Speak directive from the Alexa Voice Service (AVS). This directive contains a binary audio attachment with the appropriate answer, which you must process and play.

Version changes

  • Support for user interruption of text-to-speech (TTS) output
  • Support for cloud-initiated interruption of TTS output
    • ADDED playBehavior field to the Speak directive
  • Support for captions for TTS
    • ADDED caption field to the Speak directive

States

SpeechSynthesizer has the following states:

  • PLAYING - When Alexa speaks, SpeechSynthesizer is in the PLAYING state. SpeechSynthesizer transitions to the FINISHED state when speech playback completes.
  • FINISHED - When Alexa finishes speaking, SpeechSynthesizer transitions to the FINISHED state with a SpeechFinished event.
  • INTERRUPTED - When Alexa is speaking and gets interrupted, SpeechSynthesizer transitions to the INTERRUPTED state. Interruptions occur through voice, a physical Tap-to-Talk button press, or a Speak directive with a REPLACE_ALL playBehavior. The INTERRUPTED state is temporary and lasts until the next Speak directive starts.
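
The transitions above can be sketched as a small state machine. This is an illustrative sketch only; the class and method names are not part of the AVS API.

```python
# Sketch of the SpeechSynthesizer state machine described above.
# Class and method names are illustrative, not part of the AVS API.

PLAYING, FINISHED, INTERRUPTED = "PLAYING", "FINISHED", "INTERRUPTED"

class SpeechSynthesizerState:
    def __init__(self):
        self.state = FINISHED  # assume idle before the first Speak directive

    def on_speak_started(self):
        # A Speak directive's audio begins rendering.
        self.state = PLAYING

    def on_speech_finished(self):
        # Playback completed normally; pair this with a SpeechFinished event.
        if self.state == PLAYING:
            self.state = FINISHED

    def on_barge_in(self):
        # Voice, Tap-to-Talk, or a REPLACE_ALL Speak directive interrupts TTS;
        # pair this with a SpeechInterrupted event. The INTERRUPTED state lasts
        # until the next Speak directive starts.
        if self.state == PLAYING:
            self.state = INTERRUPTED
```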

Capability assertion

A device can implement SpeechSynthesizer 1.3 on its own behalf, but not on behalf of any connected endpoints.

New AVS integrations must assert support through Alexa.Discovery. Alexa continues to support existing integrations using the Capabilities API.

Sample object

{
    "type": "AlexaInterface",
    "interface": "SpeechSynthesizer",
    "version": "1.3"
}

Context

For each TTS playback that requires context, your client must report playerActivity and offsetInMilliseconds.

To learn more about reporting Context, see Context Overview.

Sample message

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SpeechState"
    },
    "payload": {
        "token": "{{STRING}}",
        "offsetInMilliseconds": {{LONG}},
        "playerActivity": "{{STRING}}"
    }
}
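
A context entry matching the sample above could be assembled like this. The helper name is hypothetical; only the message shape comes from the documentation.

```python
def build_speech_state(token, offset_ms, player_activity):
    """Build the SpeechSynthesizer.SpeechState context entry shown above.

    Hypothetical helper; only the message shape is defined by AVS.
    """
    assert player_activity in ("PLAYING", "FINISHED", "INTERRUPTED")
    return {
        "header": {"namespace": "SpeechSynthesizer", "name": "SpeechState"},
        "payload": {
            "token": token,
            "offsetInMilliseconds": offset_ms,
            "playerActivity": player_activity,
        },
    }
```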

Payload parameters

Parameter | Description | Type
token | An opaque token provided in the Speak directive. | string
offsetInMilliseconds | Identifies the current TTS offset in milliseconds. | long
playerActivity | Identifies the component state of SpeechSynthesizer. Accepted values: PLAYING, FINISHED, or INTERRUPTED. | string

Player Activity | Description
PLAYING | Speech is playing.
FINISHED | Speech finished playing.
INTERRUPTED | Speech was interrupted. Interruptions occur through voice, physical Tap-to-Talk, or a Speak directive with a REPLACE_ALL playBehavior.

Directives

Speak

AVS sends a Speak directive to your client every time Alexa delivers a speech response. Your client receives a Speak directive in two scenarios:

  • When a user makes a voice request, such as asking Alexa a question. AVS sends a Speak directive to your client after it receives a Recognize event.
  • When a user performs an action, such as setting a timer. First, the timer starts with the SetAlert directive. Second, AVS sends a Speak directive to your client, notifying you that the timer started.

Sample message

The Speak directive is a multipart message containing two different formats – one JSON-formatted directive and one binary audio attachment.

JSON

{
  "directive": {
    "header": {
      "namespace": "SpeechSynthesizer",
      "name": "Speak",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "url": "{{STRING}}",
      "format": "{{STRING}}",
      "token": "{{STRING}}",
      "playBehavior": "{{STRING}}",
      "caption": {
        "content": "{{STRING}}",
        "type":  "{{STRING}}"
      }
    }
  }
}

Binary audio attachment

The following multipart headers precede the binary audio attachment.

Content-Type: application/octet-stream
Content-ID: {{Audio Item CID}}

{{BINARY AUDIO ATTACHMENT}}
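
One way to pair the JSON directive with its audio attachment is to strip the cid: prefix from the payload url and compare it against each part's Content-ID header. This is a sketch under the assumption that, as is common for MIME, the Content-ID value may be wrapped in angle brackets; parsing the multipart stream itself is not shown.

```python
def content_id_matches(url, content_id):
    """Return True if a Speak payload 'url' refers to the attachment part
    carrying this Content-ID header value.

    The url carries a 'cid:' prefix; MIME Content-ID values are commonly
    wrapped in angle brackets, so strip both before comparing.
    (Sketch only -- framing of the AVS multipart stream is not shown.)
    """
    if not url.startswith("cid:"):
        return False
    return url[len("cid:"):] == content_id.strip("<>")
```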

Header parameters

Parameter | Description | Type
messageId | A unique ID used to represent a specific message. | string
dialogRequestId | A unique ID used to correlate directives sent in response to a specific Recognize event. | string

Payload parameters

Parameter | Description | Type
url | A unique identifier for the audio content. The URL always follows the prefix cid:. Example: cid: | string
format | The format of the returned audio. Accepted value: AUDIO_MPEG | string
token | An opaque token that represents the current text-to-speech (TTS) object. | string
playBehavior | Specifies the desired playback behavior when the device receives more than one Speak directive. See the accepted values below. | string
caption | If AVS includes this object, the device can use it to generate captions for the attached TTS content. | object
caption.type | The caption format. Possible value: WEBVTT | string
caption.content | The time-encoded caption text for the attached TTS. | string

Accepted playBehavior values:

  • REPLACE_ALL - The device must (1) send a SpeechInterrupted event, (2) transition the playerActivity to INTERRUPTED, (3) stop any TTS that's playing, (4) clear the enqueued Speak directives, and (5) start playing the new audio from this Speak directive.
  • ENQUEUE - Play the new Speak content after all previously submitted and enqueued Speak directives have rendered.
  • REPLACE_ENQUEUED - Replace all directives in the queue with this Speak directive, but don't interrupt the TTS that's playing.
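
The three playBehavior values can be sketched against a simple TTS queue. This is illustrative only: event sending and audio control are stubbed out, and the class name is not part of the AVS API.

```python
from collections import deque

class TtsQueue:
    """Sketch of the queue semantics for the three playBehavior values.

    Illustrative only: sending events and controlling audio are omitted.
    """

    def __init__(self):
        self.queue = deque()   # tokens of enqueued Speak directives
        self.playing = None    # token of the currently rendering TTS, if any

    def handle_speak(self, token, play_behavior):
        if play_behavior == "REPLACE_ALL":
            # Stop the playing TTS (send SpeechInterrupted, playerActivity ->
            # INTERRUPTED), drop the queue, and start the new audio at once.
            self.queue.clear()
            self.playing = token
        elif play_behavior == "ENQUEUE":
            # Render after everything already submitted and enqueued.
            if self.playing is None:
                self.playing = token
            else:
                self.queue.append(token)
        elif play_behavior == "REPLACE_ENQUEUED":
            # Replace the queue but leave the playing TTS uninterrupted.
            self.queue.clear()
            if self.playing is None:
                self.playing = token
            else:
                self.queue.append(token)
```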

Events

SpeechStarted

Send the SpeechStarted event to AVS after your client processes the Speak directive and begins playback of synthesized speech.

Sample message

{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechStarted",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}"
        }
    }
}

Header parameters

Parameter | Description | Type
messageId | A unique ID used to represent a specific message. | string

Payload parameters

Parameter | Description | Type
token | The opaque token provided by the Speak directive. | string

SpeechFinished

When Alexa finishes speaking, send the SpeechFinished event. Send the event only after Alexa fully processes the Speak directive and finishes rendering the TTS. If a user cancels TTS playback, don't send a SpeechFinished event. For example, if a user interrupts the Alexa TTS with "Alexa, stop," send a SpeechInterrupted event instead.

Sample message

{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechFinished",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}"
        }
    }
}

Header parameters

Parameter | Description | Type
messageId | A unique ID used to represent a specific message. | string

Payload parameters

Parameter | Description | Type
token | The opaque token provided by the Speak directive. | string

SpeechInterrupted

When Alexa is in a PLAYING state and a user barges in to make a new voice request, the device must do the following:

  1. Transition the playback state to INTERRUPTED.
  2. Send the SpeechInterrupted event to AVS.

A new voice request can come from wake word detection, a physical button press on a Tap-to-Talk device, or a Speak directive with a REPLACE_ALL playBehavior. The INTERRUPTED playback state is temporary and lasts until the next Speak directive starts.

Sample message

{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechInterrupted",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}",
            "offsetInMilliseconds": {{LONG}}
        }
    }
}

Header parameters

Parameter Description Type
messageId A unique ID used to represent a specific message. string

Payload parameters

Parameter | Description | Type
token | The value of the token field from the Speak directive that was interrupted. | string
offsetInMilliseconds | The offset between when TTS starts and when the interruption occurs. For example, if a user interrupts Alexa 4.124 seconds into the speech, the value is 4124. | long
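
Putting the payload together: the offset is the elapsed milliseconds between the start of TTS playback and the barge-in. The helper below is a sketch; in practice the timestamps come from your audio pipeline, and the helper name is hypothetical.

```python
import uuid

def build_speech_interrupted(token, started_at_ms, interrupted_at_ms):
    """Assemble the SpeechInterrupted event described above (sketch).

    Hypothetical helper; only the message shape is defined by AVS.
    """
    return {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechInterrupted",
                "messageId": str(uuid.uuid4()),  # any unique message ID
            },
            "payload": {
                "token": token,
                # Elapsed milliseconds between TTS start and the interruption.
                "offsetInMilliseconds": interrupted_at_ms - started_at_ms,
            },
        }
    }
```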