SpeechSynthesizer 1.3
When you ask Alexa a question, the SpeechSynthesizer interface returns the appropriate speech response.
For example, if you ask Alexa "What's the weather in Seattle?", your client receives a Speak
directive from the Alexa Voice Service (AVS). This directive contains a binary audio attachment with the appropriate answer, which you must process and play.
Version changes
- Support for user interruption of Text-To-Speech (TTS) output.
  - ADDED SpeechInterrupted event.
- Support for cloud-initiated interruption of TTS output.
  - ADDED playBehavior field to the Speak directive.
- Support for captions for TTS.
  - ADDED caption field to the Speak directive.
States
SpeechSynthesizer has the following states:
- PLAYING – When Alexa speaks, SpeechSynthesizer is in the PLAYING state. SpeechSynthesizer transitions to the FINISHED state when speech playback completes.
- FINISHED – When Alexa finishes speaking, SpeechSynthesizer transitions to the FINISHED state with a SpeechFinished event.
- INTERRUPTED – When Alexa speaks and gets interrupted, SpeechSynthesizer transitions to the INTERRUPTED state. Interruptions occur through voice, a physical Tap-to-Talk button press, or a Speak directive with a REPLACE_ALL playBehavior. INTERRUPTED is temporary until the next Speak directive starts.
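The transitions described above can be sketched as a small state tracker. This is only an illustration: the `SpeechStateTracker` class and its method names are hypothetical, and the choice to start with no state before the first Speak directive is an assumption, not something the interface specifies.

```python
from enum import Enum


class PlayerActivity(Enum):
    PLAYING = "PLAYING"
    FINISHED = "FINISHED"
    INTERRUPTED = "INTERRUPTED"


class SpeechStateTracker:
    """Hypothetical helper that tracks the SpeechSynthesizer state."""

    def __init__(self):
        # Assumption: no state is reported until the first TTS starts.
        self.activity = None

    def on_speech_started(self):
        # A new Speak directive starting playback ends FINISHED or INTERRUPTED.
        self.activity = PlayerActivity.PLAYING

    def on_playback_completed(self):
        if self.activity is not PlayerActivity.PLAYING:
            raise RuntimeError("only playing speech can finish")
        self.activity = PlayerActivity.FINISHED

    def on_interrupted(self):
        if self.activity is not PlayerActivity.PLAYING:
            raise RuntimeError("only playing speech can be interrupted")
        self.activity = PlayerActivity.INTERRUPTED


tracker = SpeechStateTracker()
tracker.on_speech_started()      # Alexa starts speaking
tracker.on_interrupted()         # user barges in
tracker.on_speech_started()      # the next Speak directive starts
tracker.on_playback_completed()  # playback runs to the end
print(tracker.activity.value)
```

Rejecting an interruption or completion outside PLAYING is a design choice of this sketch; a real client might prefer to log and ignore such calls.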
Capability assertion
A device can implement SpeechSynthesizer 1.3 on its own behalf, but not on behalf of any connected endpoints.
New AVS integrations must assert support through Alexa.Discovery. Alexa continues to support existing integrations using the Capabilities API.
Sample object
{ "type": "AlexaInterface", "interface": "SpeechSynthesizer", "version": "1.3" }
Context
For each playing TTS that requires context, your client must report playerActivity and offsetInMilliseconds.
To learn more about reporting Context, see Context Overview.
Example message
{
  "header": {
    "namespace": "SpeechSynthesizer",
    "name": "SpeechState"
  },
  "payload": {
    "token": "{{STRING}}",
    "offsetInMilliseconds": {{LONG}},
    "playerActivity": "{{STRING}}"
  }
}
Payload parameters
Parameter | Description | Type |
---|---|---|
token | An opaque token provided in the Speak directive. | string |
offsetInMilliseconds | Identifies the current TTS offset in milliseconds. | long |
playerActivity | Identifies the component state of SpeechSynthesizer. Accepted values: PLAYING, FINISHED, or INTERRUPTED. | string |
Player Activity | Description |
---|---|
PLAYING | Speech is playing. |
FINISHED | Speech finished playing. |
INTERRUPTED | Speech was interrupted. Interruptions occur through voice, a physical Tap-to-Talk button press, or a Speak directive with a REPLACE_ALL playBehavior. |
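As a rough illustration of the SpeechState context entry, the following sketch assembles the header and payload from the three documented parameters. The function name `build_speech_state` and the sample token are hypothetical; how the entry is attached to an outgoing event is covered by the Context Overview and is not shown here.

```python
import json

ALLOWED_ACTIVITIES = {"PLAYING", "FINISHED", "INTERRUPTED"}


def build_speech_state(token, offset_ms, player_activity):
    """Hypothetical helper: build a SpeechState context entry as a dict."""
    if player_activity not in ALLOWED_ACTIVITIES:
        raise ValueError(f"unexpected playerActivity: {player_activity!r}")
    return {
        "header": {"namespace": "SpeechSynthesizer", "name": "SpeechState"},
        "payload": {
            "token": token,
            # The payload expects a whole number of milliseconds.
            "offsetInMilliseconds": int(offset_ms),
            "playerActivity": player_activity,
        },
    }


state = build_speech_state("example-token", 1250, "PLAYING")
print(json.dumps(state, indent=2))
```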
Directives
Speak
AVS sends a Speak directive to your client every time Alexa delivers a speech response. Your client can receive a Speak directive in two ways:
- When a user makes a voice request, such as asking Alexa a question. AVS sends a Speak directive to your client after it receives a Recognize event.
- When a user performs an action, such as setting a timer. First, the timer starts with the SetAlert directive. Second, AVS sends a Speak directive to your client, notifying you that the timer started.
Example message
The Speak directive is a multipart message containing two parts: one JSON-formatted directive and one binary audio attachment.
JSON
{
  "directive": {
    "header": {
      "namespace": "SpeechSynthesizer",
      "name": "Speak",
      "messageId": "{{STRING}}",
      "dialogRequestId": "{{STRING}}"
    },
    "payload": {
      "url": "{{STRING}}",
      "format": "{{STRING}}",
      "token": "{{STRING}}",
      "playBehavior": "{{STRING}}",
      "caption": {
        "content": "{{STRING}}",
        "type": "{{STRING}}"
      }
    }
  }
}
Binary audio attachment
The following multipart headers precede the binary audio attachment.
Content-Type: application/octet-stream
Content-ID: {{Audio Item CID}}

{{BINARY AUDIO ATTACHMENT}}
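The url field in the JSON part refers to the audio attachment by its Content-ID. A minimal sketch of pairing the two is shown below. The function names are hypothetical, and the handling of angle brackets around the header value is an assumption based on the general RFC 2392 convention for cid: URLs, not something this page states.

```python
def content_id_from_url(url):
    """Strip the cid: prefix from a Speak payload url."""
    prefix = "cid:"
    if not url.startswith(prefix):
        raise ValueError("expected a url starting with cid:")
    return url[len(prefix):]


def attachment_matches(url, content_id_header):
    """Return True if the attachment's Content-ID belongs to the given url."""
    header_id = content_id_header.strip()
    # Assumption: the header value may be wrapped in angle brackets.
    if header_id.startswith("<") and header_id.endswith(">"):
        header_id = header_id[1:-1]
    return content_id_from_url(url) == header_id


print(attachment_matches("cid:audio-123", "<audio-123>"))
```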
Header parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
dialogRequestId | A unique ID used to correlate directives sent in response to a specific Recognize event. | string |
Payload parameters
Parameter | Description | Type |
---|---|---|
url | A unique identifier for audio content. The URL always follows the prefix "cid:". Example: cid: | string |
format | Provides the format of returned audio. Accepted value: AUDIO_MPEG | string |
token | An opaque token that represents the current text-to-speech (TTS) object. | string |
playBehavior | Specifies the desired playback behavior when the device receives more than one Speak directive. Possible values: • REPLACE_ALL: The device must (1) send a SpeechInterrupted event, (2) transition the playerActivity to INTERRUPTED, (3) stop any TTS that's playing, (4) clear the enqueued Speak directives, and (5) start playing the new audio from this Speak directive. • ENQUEUE: Play the new Speak content after all previously submitted and enqueued Speak directives have rendered. • REPLACE_ENQUEUED: Replace all directives in the queue with this Speak directive, but don't interrupt the playing TTS. | string |
caption | If AVS includes this object, the device can use it to generate captions for the attached TTS content. | object |
caption.type | The caption format. Possible value: WEBVTT | string |
caption.content | The time-encoded caption text for the attached TTS. | string |
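The three playBehavior values can be modeled with a small in-memory queue. This is a sketch under stated assumptions: the `SpeakQueue` class is hypothetical, tokens stand in for whole directives, events are recorded in a list instead of being sent to AVS, and skipping SpeechInterrupted when nothing is playing is a choice made here rather than a documented rule.

```python
from collections import deque


class SpeakQueue:
    """Hypothetical model of Speak directive scheduling by playBehavior."""

    def __init__(self):
        self.playing = None      # token of the TTS playing now, if any
        self.pending = deque()   # tokens of enqueued Speak directives
        self.sent_events = []    # (event name, token) pairs, in send order

    def on_speak(self, token, play_behavior):
        if play_behavior == "REPLACE_ALL":
            if self.playing is not None:
                # Report the interruption, then stop the current TTS.
                self.sent_events.append(("SpeechInterrupted", self.playing))
                self.playing = None
            self.pending.clear()
        elif play_behavior == "REPLACE_ENQUEUED":
            # Drop queued directives but leave the playing TTS alone.
            self.pending.clear()
        elif play_behavior != "ENQUEUE":
            raise ValueError(f"unknown playBehavior: {play_behavior!r}")
        self.pending.append(token)
        self._start_next_if_idle()

    def on_playback_completed(self):
        if self.playing is None:
            raise RuntimeError("nothing is playing")
        self.sent_events.append(("SpeechFinished", self.playing))
        self.playing = None
        self._start_next_if_idle()

    def _start_next_if_idle(self):
        if self.playing is None and self.pending:
            self.playing = self.pending.popleft()
            self.sent_events.append(("SpeechStarted", self.playing))


queue = SpeakQueue()
queue.on_speak("a", "ENQUEUE")           # starts playing "a"
queue.on_speak("b", "ENQUEUE")           # waits behind "a"
queue.on_speak("c", "REPLACE_ENQUEUED")  # replaces "b" in the queue
queue.on_speak("d", "REPLACE_ALL")       # interrupts "a" and drops "c"
print(queue.playing, list(queue.pending))
```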
Events
SpeechStarted
Send the SpeechStarted event to AVS after your client processes the Speak directive and begins playback of synthesized speech.
Example message
{
  "event": {
    "header": {
      "namespace": "SpeechSynthesizer",
      "name": "SpeechStarted",
      "messageId": "{{STRING}}"
    },
    "payload": {
      "token": "{{STRING}}"
    }
  }
}
Header parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload parameters
Parameter | Description | Type |
---|---|---|
token | The opaque token provided by the Speak directive. | string |
SpeechFinished
When Alexa finishes speaking, send the SpeechFinished event.
Send this event only after Alexa fully processes the Speak directive and finishes rendering the TTS.
If a user cancels TTS playback, don't send the SpeechFinished event. For example, if a user interrupts the Alexa TTS with "Alexa, stop", don't send a SpeechFinished event. Send the SpeechInterrupted event instead.
Example message
{
  "event": {
    "header": {
      "namespace": "SpeechSynthesizer",
      "name": "SpeechFinished",
      "messageId": "{{STRING}}"
    },
    "payload": {
      "token": "{{STRING}}"
    }
  }
}
Header parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload parameters
Parameter | Description | Type |
---|---|---|
token | The opaque token provided by the Speak directive. | string |
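SpeechStarted and SpeechFinished share the same shape, so one builder can cover both. The sketch below is illustrative only: `build_speech_event` is a hypothetical name, and generating messageId with a random UUID is an assumption, since the interface only requires a unique ID.

```python
import uuid


def build_speech_event(name, token):
    """Hypothetical helper: build a SpeechStarted or SpeechFinished event."""
    if name not in ("SpeechStarted", "SpeechFinished"):
        raise ValueError(f"unsupported event name: {name!r}")
    return {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": name,
                # Assumption: a random UUID is unique enough for messageId.
                "messageId": str(uuid.uuid4()),
            },
            # Echo the token from the Speak directive being played.
            "payload": {"token": token},
        }
    }


started = build_speech_event("SpeechStarted", "example-token")
finished = build_speech_event("SpeechFinished", "example-token")
print(started["event"]["header"]["name"], finished["event"]["header"]["name"])
```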
SpeechInterrupted
When Alexa is interrupted, send the SpeechInterrupted event.
When Alexa is in the PLAYING state and a user barges in to make a new voice request, the device must do the following:
- Transition the playback state to INTERRUPTED.
- Send the SpeechInterrupted event to AVS.
An interruption can come from a wake word detection, a physical button press on a Tap-to-Talk device, or a Speak directive with a REPLACE_ALL playBehavior. The INTERRUPTED playback state is temporary until the next Speak directive starts.
Example message
{
  "event": {
    "header": {
      "namespace": "SpeechSynthesizer",
      "name": "SpeechInterrupted",
      "messageId": "{{STRING}}"
    },
    "payload": {
      "token": "{{STRING}}",
      "offsetInMilliseconds": {{LONG}}
    }
  }
}
Header parameters
Parameter | Description | Type |
---|---|---|
messageId | A unique ID used to represent a specific message. | string |
Payload parameters
Parameter | Description | Type |
---|---|---|
token | The value of the token field from the interrupted Speak directive. | string |
offsetInMilliseconds | The offset between when TTS starts and when the interruption occurs. For example, if a user interrupts Alexa 4.124 seconds after Alexa starts speaking, the value is 4124. | long |
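The offset calculation can be sketched as follows, reusing the 4.124-second example from the table. The function name, the timestamp parameters, and the UUID-based messageId are all assumptions made for illustration.

```python
import uuid


def build_speech_interrupted(token, started_at_s, interrupted_at_s):
    """Hypothetical helper: build a SpeechInterrupted event.

    Both timestamps are in seconds and must come from the same clock.
    """
    if interrupted_at_s < started_at_s:
        raise ValueError("interruption cannot precede the start of TTS")
    # Round to the nearest millisecond to avoid floating-point drift.
    offset_ms = round((interrupted_at_s - started_at_s) * 1000)
    return {
        "event": {
            "header": {
                "namespace": "SpeechSynthesizer",
                "name": "SpeechInterrupted",
                "messageId": str(uuid.uuid4()),  # assumption: any unique ID
            },
            "payload": {"token": token, "offsetInMilliseconds": offset_ms},
        }
    }


event = build_speech_interrupted("example-token", 10.0, 14.124)
print(event["event"]["payload"]["offsetInMilliseconds"])
```

A monotonic clock such as `time.monotonic()` is a reasonable source for both timestamps, because it is not affected by wall-clock adjustments.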
Last updated: Oct 14, 2020