SpeechSynthesizer Interface
When you ask Alexa a question, the SpeechSynthesizer interface returns the appropriate speech response.
For example, if you ask Alexa "What's the weather in Seattle?," your client receives a SpeechSynthesizer.Speak directive from the Alexa Voice Service (AVS). This directive contains a binary audio attachment with the appropriate answer, which you must process and play.
The following sections cover SpeechSynthesizer directives and events.
Version changes
- Support for user interruption of text-to-speech (TTS) output
  - ADDED SpeechInterrupted event
- Support for cloud-initiated interruption of TTS output
  - ADDED playBehavior field to the Speak directive
- Support for captions for TTS
  - ADDED caption field to the Speak directive
States
SpeechSynthesizer has the following states:
- PLAYING - When Alexa speaks, SpeechSynthesizer is in the PLAYING state. SpeechSynthesizer transitions to the FINISHED state when speech playback completes.
- FINISHED - When Alexa finishes speaking, SpeechSynthesizer transitions to the FINISHED state with a SpeechFinished event.
- INTERRUPTED - When Alexa speaks and gets interrupted, SpeechSynthesizer transitions to the INTERRUPTED state. Interruptions occur through use of voice, a physical Tap-to-Talk press, or a Speak directive with a REPLACE_ALL playBehavior. INTERRUPTED is temporary until the next Speak directive starts.
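The state descriptions above can be read as a small transition table. The following Python sketch is one way to model it, not an official AVS implementation. The trigger names (`playback_completed`, `interrupted`, `speak_started`) are hypothetical labels chosen for this example, and the FINISHED to PLAYING transition on a new Speak directive is an assumption.

```python
from enum import Enum


class PlayerActivity(Enum):
    PLAYING = "PLAYING"
    FINISHED = "FINISHED"
    INTERRUPTED = "INTERRUPTED"


# Transitions read off the state descriptions. The trigger strings are
# hypothetical names for this sketch; AVS does not define them.
TRANSITIONS = {
    (PlayerActivity.PLAYING, "playback_completed"): PlayerActivity.FINISHED,
    (PlayerActivity.PLAYING, "interrupted"): PlayerActivity.INTERRUPTED,
    # Assumption: a new Speak directive starts playback again after FINISHED.
    (PlayerActivity.FINISHED, "speak_started"): PlayerActivity.PLAYING,
    # INTERRUPTED is temporary until the next Speak directive starts.
    (PlayerActivity.INTERRUPTED, "speak_started"): PlayerActivity.PLAYING,
}


def next_state(current, trigger):
    """Return the state that follows `current` on `trigger`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, trigger)]
    except KeyError:
        raise ValueError(f"no transition from {current.name} on {trigger!r}") from None
```

For example, `next_state(PlayerActivity.PLAYING, "interrupted")` yields `PlayerActivity.INTERRUPTED`, while an unlisted pair such as FINISHED with `"interrupted"` raises `ValueError`.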
Capability Assertion
SpeechSynthesizer 1.3 may be implemented by the device on its own behalf, but not on behalf of any connected endpoints.
New AVS integrations must assert support through Alexa.Discovery, but Alexa will continue to support existing integrations using the Capabilities API.
Sample Object
{ "type": "AlexaInterface", "interface": "SpeechSynthesizer", "version": "1.3" }
SpeechSynthesizer context
For each currently playing TTS that requires context, your client must report playerActivity and offsetInMilliseconds.
To learn more about reporting Context, see Context Overview.
Sample Message
{ "header": { "namespace": "SpeechSynthesizer", "name": "SpeechState" }, "payload": { "token": "{{STRING}}", "offsetInMilliseconds": {{LONG}}, "playerActivity": "{{STRING}}" } }
Payload Parameters
Parameter | Description | Type |
---|---|---|
token | An opaque token provided in the Speak directive. | string |
offsetInMilliseconds | Identifies the current TTS offset in milliseconds. | long |
playerActivity | Identifies the component state of SpeechSynthesizer. Accepted values: PLAYING, FINISHED, or INTERRUPTED | string |
Player Activity | Description |
---|---|
PLAYING | Speech is playing. |
FINISHED | Speech finished playing. |
INTERRUPTED | Speech was interrupted. Interruptions occur through use of voice, a physical Tap-to-Talk press, or a Speak directive with a REPLACE_ALL playBehavior. |
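As an illustration of the SpeechState context entry described above, the following Python sketch assembles the object from the three payload parameters. The function name `speech_state_context` and the validation it performs are assumptions of this example, not part of AVS.

```python
import json

# Accepted playerActivity values, taken from the payload parameters table.
VALID_ACTIVITIES = {"PLAYING", "FINISHED", "INTERRUPTED"}


def speech_state_context(token, offset_ms, player_activity):
    """Build a SpeechState context entry (hypothetical helper for this sketch)."""
    if player_activity not in VALID_ACTIVITIES:
        raise ValueError(f"unknown playerActivity: {player_activity!r}")
    if offset_ms < 0:
        raise ValueError("offsetInMilliseconds must not be negative")
    return {
        "header": {"namespace": "SpeechSynthesizer", "name": "SpeechState"},
        "payload": {
            "token": token,
            "offsetInMilliseconds": int(offset_ms),  # reported as a long
            "playerActivity": player_activity,
        },
    }


if __name__ == "__main__":
    # Example: TTS identified by a placeholder token, 1.5 seconds into playback.
    print(json.dumps(speech_state_context("example-token", 1500, "PLAYING"), indent=2))
```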
Speak directive
AVS sends a Speak directive to your client every time Alexa delivers a speech response. Your client receives a Speak directive in two situations:
- When a user makes a voice request, such as asking Alexa a question. AVS sends a Speak directive to your client after it receives a Recognize event.
- When a user performs an action, such as setting a timer. First, the timer starts with the SetAlert directive. Second, AVS sends a Speak directive to your client, notifying you that the timer started.
Sample Message
The Speak directive is a multipart message containing two different formats: one JSON-formatted directive and one binary audio attachment.
JSON
{ "directive": { "header": { "namespace": "SpeechSynthesizer", "name": "Speak", "messageId": "{{STRING}}", "dialogRequestId": "{{STRING}}" }, "payload": { "url": "{{STRING}}", "format": "{{STRING}}", "token": "{{STRING}}", "playBehavior": "{{STRING}}", "caption": { "content": "{{STRING}}", "type": "{{STRING}}" } } } }
Binary Audio Attachment
The following multipart headers precede the binary audio attachment.
Content-Type: application/octet-stream
Content-ID: {{Audio Item CID}}

{{BINARY AUDIO ATTACHMENT}}
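The url field in the JSON part carries a cid: prefix (see the payload parameters for this directive), and the client has to pair it with the Content-ID header of the audio attachment. The Python sketch below shows one way to do that. It assumes the multipart body has already been parsed into a dictionary that maps Content-ID header values to audio bytes; that dictionary shape, the function names, and the tolerance for angle brackets around the Content-ID are all assumptions of this example.

```python
CID_PREFIX = "cid:"


def content_id_from_url(url):
    """Strip the cid: prefix from the Speak payload's url field."""
    if not url.startswith(CID_PREFIX):
        raise ValueError(f"expected a cid: URL, got {url!r}")
    return url[len(CID_PREFIX):]


def find_attachment(url, attachments):
    """Return the audio bytes whose Content-ID matches the directive's url.

    `attachments` maps Content-ID header values to bytes (a hypothetical
    shape for this sketch). Surrounding angle brackets, which MIME
    Content-ID headers commonly use, are ignored when comparing.
    """
    wanted = content_id_from_url(url)
    for content_id, audio in attachments.items():
        if content_id.strip().strip("<>") == wanted:
            return audio
    raise KeyError(f"no attachment with Content-ID {wanted!r}")
```

A caller would pass the url value from the directive payload together with the parsed attachments, then hand the returned bytes to the audio player.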
Header Parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
dialogRequestId | A unique ID used to correlate directives sent in response to a specific Recognize event. | string |
Payload Parameters
Parameter | Description | Type |
---|---|---|
url | A unique identifier for audio content. The URL always follows the prefix cid:. Example: cid: | string |
format | Provides the format of returned audio. Accepted value: "AUDIO_MPEG" | string |
token | An opaque token that represents the current text-to-speech (TTS) object. | string |
playBehavior | Specifies the desired playback behavior when the device receives more than one Speak directive. Possible values: REPLACE_ALL: The device must (1) send a SpeechInterrupted event, (2) transition the playerActivity to INTERRUPTED, (3) stop any TTS that's playing, (4) clear the enqueued Speak directives, and (5) start playing the new audio from this Speak directive. ENQUEUE: Play the new Speak content after all previously submitted and enqueued Speak directives have rendered. REPLACE_ENQUEUED: Replace all directives in the queue with this Speak directive, but do not interrupt the currently playing TTS. | string |
caption | If AVS includes this object, the device can use it to generate captions for the attached TTS content. | object |
caption.type | The caption format. Possible value: WEBVTT | string |
caption.content | The time-encoded caption text for the attached TTS. | string |
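The three playBehavior values amount to different operations on a queue of Speak directives. The Python sketch below models only the bookkeeping, tracking directives by token; audio playback and event delivery are left to the caller. The class name `SpeakQueue`, its attributes, and the choice to start playback immediately when nothing is playing are assumptions of this example.

```python
from collections import deque


class SpeakQueue:
    """Token-level model of playBehavior handling (illustrative only)."""

    def __init__(self):
        self.current = None      # token of the TTS that is playing, if any
        self.pending = deque()   # tokens of enqueued Speak directives
        self.interrupted = []    # tokens that need a SpeechInterrupted event

    def on_speak(self, token, play_behavior):
        if play_behavior == "REPLACE_ALL":
            # Interrupt the playing TTS, clear the queue, start the new audio.
            if self.current is not None:
                self.interrupted.append(self.current)
            self.pending.clear()
            self.current = token
        elif play_behavior == "REPLACE_ENQUEUED":
            # Drop queued directives but leave the playing TTS alone.
            self.pending.clear()
            self._enqueue(token)
        elif play_behavior == "ENQUEUE":
            # Play after everything already submitted.
            self._enqueue(token)
        else:
            raise ValueError(f"unknown playBehavior: {play_behavior!r}")

    def _enqueue(self, token):
        # Assumption: with nothing playing, the directive starts right away.
        if self.current is None:
            self.current = token
        else:
            self.pending.append(token)

    def on_playback_completed(self):
        """Advance to the next enqueued directive, if there is one."""
        self.current = self.pending.popleft() if self.pending else None
```

After each `on_speak` call, the caller would send a SpeechInterrupted event for every token in `interrupted` and begin playing `current` if it changed.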
SpeechStarted event
Send the SpeechStarted event to AVS after your client processes the Speak directive and begins playback of synthesized speech.
Sample Message
{ "event": { "header": { "namespace": "SpeechSynthesizer", "name": "SpeechStarted", "messageId": "{{STRING}}" }, "payload": { "token": "{{STRING}}" } } }
Header Parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload Parameters
Parameter | Description | Type |
---|---|---|
token | The opaque token provided by the Speak directive. | string |
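A client might assemble this event with a small helper such as the Python sketch below. The function name `speech_event` is hypothetical, and generating the messageId with a random UUID is an assumption; the parameters above only require that the ID be unique.

```python
import json
import uuid


def speech_event(name, token):
    """Build a SpeechSynthesizer event whose payload carries only a token."""
    return {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": name,
                # Assumption: a random UUID serves as the unique message ID.
                "messageId": str(uuid.uuid4()),
            },
            "payload": {"token": token},
        }
    }


if __name__ == "__main__":
    # "example-token" stands in for the token received in the Speak directive.
    print(json.dumps(speech_event("SpeechStarted", "example-token"), indent=2))
```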
SpeechFinished event
When Alexa finishes speaking, send the SpeechFinished event. Send the event only after Alexa fully processes the Speak directive and finishes rendering the TTS. If a user cancels TTS playback, for example by interrupting the Alexa TTS with "Alexa, stop," don't send a SpeechFinished event.
Sample Message
{ "event": { "header": { "namespace": "SpeechSynthesizer", "name": "SpeechFinished", "messageId": "{{STRING}}" }, "payload": { "token": "{{STRING}}" } } }
Header Parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload Parameters
Parameter | Description | Type |
---|---|---|
token | The opaque token provided by the Speak directive. | string |
SpeechInterrupted event
When Alexa is speaking and a user barges in to make a new voice request, the device must do the following:
- Transition playerActivity to INTERRUPTED.
- Send the SpeechInterrupted event to AVS.
Note: The interruption may come from a wake word detection, a physical button press on a Tap-to-Talk device, or a Speak directive with a REPLACE_ALL playBehavior. The INTERRUPTED playerActivity is temporary until the next Speak directive starts.
Sample Message
{ "event": { "header": { "namespace": "SpeechSynthesizer", "name": "SpeechInterrupted", "messageId": "{{STRING}}" }, "payload": { "token": "{{STRING}}", "offsetInMilliseconds": {{LONG}} } } }
Header Parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload Parameters
Parameter | Description | Type |
---|---|---|
token | The value of the token field from the Speak directive that was interrupted. | string |
offsetInMilliseconds | The offset between when TTS starts and when the interruption occurs. For example, if a user interrupts Alexa 4.124 seconds after TTS starts, the value is 4124. | long |
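The offset is the elapsed playback time converted to whole milliseconds. The Python sketch below computes it from two timestamps in seconds and builds the event. The function name and its parameters are hypothetical, the random UUID for messageId is an assumption, and a real client would likely take the timestamps from a monotonic clock or from the audio player's position.

```python
import uuid


def speech_interrupted_event(token, started_at_s, interrupted_at_s):
    """Build a SpeechInterrupted event from start and interruption times in seconds."""
    if interrupted_at_s < started_at_s:
        raise ValueError("interruption cannot precede the start of TTS")
    # Convert elapsed seconds to an integer number of milliseconds.
    offset_ms = round((interrupted_at_s - started_at_s) * 1000)
    return {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechInterrupted",
                # Assumption: a random UUID serves as the unique message ID.
                "messageId": str(uuid.uuid4()),
            },
            "payload": {"token": token, "offsetInMilliseconds": offset_ms},
        }
    }
```

With a start time of 10.0 seconds and an interruption at 14.124 seconds, the payload carries an offsetInMilliseconds of 4124, matching the example in the table.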
When you ask Alexa a question, the SpeechSynthesizer interface returns the appropriate speech response.
For example, if you ask Alexa "What's the weather in Seattle?," your client receives a SpeechSynthesizer.Speak directive from the Alexa Voice Service (AVS). This directive contains a binary audio attachment with the appropriate answer, which you must process and play.
The following sections cover SpeechSynthesizer directives and events.
States
SpeechSynthesizer has the following states:
- PLAYING - When Alexa speaks, SpeechSynthesizer is in the PLAYING state. SpeechSynthesizer transitions to the FINISHED state when speech playback completes.
- FINISHED - When Alexa finishes speaking, SpeechSynthesizer transitions to the FINISHED state with a SpeechFinished event.
Capability Assertion
SpeechSynthesizer 1.0 may be implemented by the device on its own behalf, but not on behalf of any connected endpoints.
New AVS integrations must assert support through Alexa.Discovery, but Alexa will continue to support existing integrations using the Capabilities API.
Sample Object
{ "type": "AlexaInterface", "interface": "SpeechSynthesizer", "version": "1.0" }
SpeechSynthesizer context
For each currently playing TTS that requires context, your client must report playerActivity and offsetInMilliseconds.
To learn more about reporting Context, see Context Overview.
Sample Message
{ "header": { "namespace": "SpeechSynthesizer", "name": "SpeechState" }, "payload": { "token": "{{STRING}}", "offsetInMilliseconds": {{LONG}}, "playerActivity": "{{STRING}}" } }
Payload Parameters
Parameter | Description | Type |
---|---|---|
token | An opaque token provided in the Speak directive. | string |
offsetInMilliseconds | Identifies the current TTS offset in milliseconds. | long |
playerActivity | Identifies the component state of SpeechSynthesizer. Accepted values: PLAYING, FINISHED, or INTERRUPTED | string |
Player Activity | Description |
---|---|
PLAYING | Speech is playing. |
FINISHED | Speech finished playing. |
Speak directive
AVS sends a Speak directive to your client every time Alexa delivers a speech response. Your client receives a Speak directive in two situations:
- When a user makes a voice request, such as asking Alexa a question. AVS sends a Speak directive to your client after it receives a Recognize event.
- When a user performs an action, such as setting a timer. First, the timer starts with the SetAlert directive. Second, AVS sends a Speak directive to your client, notifying you that the timer started.
Sample Message
The Speak directive is a multipart message containing two different formats: one JSON-formatted directive and one binary audio attachment.
JSON
{ "directive": { "header": { "namespace": "SpeechSynthesizer", "name": "Speak", "messageId": "{{STRING}}", "dialogRequestId": "{{STRING}}" }, "payload": { "url": "{{STRING}}", "format": "{{STRING}}", "token": "{{STRING}}" } } }
Binary Audio Attachment
The following multipart headers precede the binary audio attachment.
Content-Type: application/octet-stream
Content-ID: {{Audio Item CID}}

{{BINARY AUDIO ATTACHMENT}}
Header Parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
dialogRequestId | A unique ID used to correlate directives sent in response to a specific Recognize event. | string |
Payload Parameters
Parameter | Description | Type |
---|---|---|
url | A unique identifier for audio content. The URL always follows the prefix cid:. Example: cid:{{STRING}} | string |
format | Provides the format of returned audio. Accepted value: "AUDIO_MPEG" | string |
token | An opaque token that represents the current text-to-speech (TTS) object. | string |
SpeechStarted event
Send the SpeechStarted event to AVS after your client processes the Speak directive and begins playback of synthesized speech.
Sample Message
{ "event": { "header": { "namespace": "SpeechSynthesizer", "name": "SpeechStarted", "messageId": "{{STRING}}" }, "payload": { "token": "{{STRING}}" } } }
Header Parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload Parameters
Parameter | Description | Type |
---|---|---|
token | The opaque token provided by the Speak directive. | string |
SpeechFinished event
When Alexa finishes speaking, send the SpeechFinished event. Send the event only after Alexa fully processes the Speak directive and finishes rendering the TTS. If a user cancels TTS playback, for example by interrupting the Alexa TTS with "Alexa, stop," don't send a SpeechFinished event.
Sample Message
{ "event": { "header": { "namespace": "SpeechSynthesizer", "name": "SpeechFinished", "messageId": "{{STRING}}" }, "payload": { "token": "{{STRING}}" } } }
Header Parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload Parameters
Parameter | Description | Type |
---|---|---|
token | The opaque token provided by the Speak directive. | string |