Alexa.Gadget.SpeechData Interface

Note: On December 31, 2021, we paused support for third-party device makers working with Alexa Gadgets, while we work to create an even better developer and customer experience. Please stay tuned to the Amazon developer portal for updates. In the interim, please visit the landing pages for Alexa Voice Service, Alexa Connect Kit, and Alexa Skills Kit to discover ways you can provide new customer experiences with voice.

This interface sends your gadget speechmark data. Speechmarks are metadata that enable your gadget to synchronize speech with visual experiences. For example, your gadget can use this data to lip sync to Alexa's text-to-speech (TTS).

Note: Each directive and event is a compilation of three separate .proto files: a header, a payload, and a file that combines the two. You can download the .proto files from the Alexa Gadgets Sample Code GitHub repository. In this topic, the .proto files combine all fields into one file for descriptive purposes.

Supporting this interface

To support this interface, the gadget must respond to the Echo device's Discover directive with a Discover.Response event that includes the following entry in its array of Capabilities:

{
   "type": "AlexaInterface",
   "interface": "Alexa.Gadget.SpeechData",
   "version": "1.0",
   "configurations": {
      "supportedTypes": [
        {
           "name":"viseme"
        }
      ]
    }
}
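
As a rough illustration, the capability entry above can be assembled as a plain Python dictionary before it is added to the `Discover.Response` event. This is only a sketch: the helper name `speech_data_capability` is hypothetical, and how the event is actually encoded and sent depends on your gadget's firmware or SDK.

```python
import json


def speech_data_capability():
    """Return the Alexa.Gadget.SpeechData capability entry (hypothetical helper)."""
    return {
        "type": "AlexaInterface",
        "interface": "Alexa.Gadget.SpeechData",
        "version": "1.0",
        "configurations": {
            "supportedTypes": [{"name": "viseme"}],
        },
    }


# The entry is one element of the capabilities array in Discover.Response.
capabilities = [speech_data_capability()]
print(json.dumps(capabilities, indent=2))
```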

Directives

This interface includes one directive: Speechmarks, as described next.

Speechmarks directive

This directive provides speechmark data to your gadget. The .proto file contents are as follows:

message SpeechmarksDirectiveProto {
   Directive directive = 1;
   message Directive {
      alexaGadgetSpeechData.SpeechmarksDirectivePayloadProto payload = 2;
      header.DirectiveHeaderProto header = 1;
   }
}

message DirectiveHeaderProto {
   string namespace = 1; 
   string name = 2; 
   string messageId = 3; 
   string dialogRequestId = 4;      
}

message SpeechmarksDirectivePayloadProto {
   repeated SpeechmarksData speechmarksData = 2;
   message SpeechmarksData {
      int32 startOffsetInMilliSeconds = 3;
      string type = 2;
      string value = 1;
   }
   int32 playerOffsetInMilliseconds = 1;
}

SpeechmarksDirectiveProto

The fields in this message are as follows:

Field	Description	Type
`directive`	Contains a complete `Speechmarks` directive.	`Directive`

Directive

The fields of the message are as follows:

Field	Description	Type
`header`	Contains the header for this directive.	`DirectiveHeaderProto`
`payload`	Contains the payload for this directive.	`SpeechmarksDirectivePayloadProto`

DirectiveHeaderProto

The fields of the message are as follows:

Field	Description	Type
`namespace`	The namespace of this directive, which is `Alexa.Gadget.SpeechData`.	`string`
`name`	The name of this directive, which is `Speechmarks`.	`string`
`messageId`	An ID that uniquely identifies an instance of this directive. This string can be empty.	`string`
`dialogRequestId`	A unique ID that correlates this directive with a specific voice interaction from a user. You can ignore this field.	`string`
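
A gadget that handles several interfaces would typically check the header's `namespace` and `name` before decoding the payload. The following is a minimal sketch of that check. The `DirectiveHeader` dataclass is a hypothetical stand-in for a decoded `DirectiveHeaderProto`, not a class generated from the .proto files.

```python
from dataclasses import dataclass


@dataclass
class DirectiveHeader:
    """Hypothetical stand-in for a decoded DirectiveHeaderProto."""
    namespace: str
    name: str
    messageId: str = ""        # may be empty
    dialogRequestId: str = ""  # can be ignored


def is_speechmarks_directive(header):
    """Return True if the header identifies a Speechmarks directive."""
    return (header.namespace == "Alexa.Gadget.SpeechData"
            and header.name == "Speechmarks")


header = DirectiveHeader("Alexa.Gadget.SpeechData", "Speechmarks")
print(is_speechmarks_directive(header))
```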

SpeechmarksDirectivePayloadProto

The fields of the message are as follows:

Field	Description	Type
`speechmarksData`	A list of objects that represent speechmark data. Each object specifies the type of data, its value, and its start offset.	`repeated SpeechmarksData`
`playerOffsetInMilliseconds`	Where the speech currently is in its stream, in milliseconds.	`int32`

SpeechmarksData

The fields of the message are as follows:

Field	Description	Type
`type`	The type of speechmark data that this directive contains. Currently, the only possible value is `"VISEME"`. A viseme is a mouth position that corresponds to a spoken sound.	`string`
`value`	The value of the speechmark.	`string`
`startOffsetInMilliSeconds`	The start offset of the `value`, in milliseconds. To determine how to sync speechmark data to Alexa's speech, use `startOffsetInMilliSeconds` minus `playerOffsetInMilliseconds`.	`int32`

For example, say your gadget receives the following viseme speechmark data at `playerOffsetInMilliseconds` = `7000`:

- `value`: `"t"`, `startOffsetInMilliSeconds`: `3000`
- `value`: `"a"`, `startOffsetInMilliSeconds`: `5000`
- `value`: `"p"`, `startOffsetInMilliSeconds`: `9000`
- `value`: `"e"`, `startOffsetInMilliSeconds`: `11000`

Because `playerOffsetInMilliseconds` is `7000`, your gadget should start considering the values at `"p"` and ignore the earlier values. If `playerOffsetInMilliseconds` and `startOffsetInMilliSeconds` are both zero, the gadget should process the data immediately.
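
The offset arithmetic described for `startOffsetInMilliSeconds` can be sketched in Python as follows. The `Speechmark` dataclass and the `pending_speechmarks` function are hypothetical names used for illustration. Treating the difference as a delay measured from the moment the directive arrives is an assumption of this sketch, not something the interface specifies.

```python
from dataclasses import dataclass


@dataclass
class Speechmark:
    """Hypothetical stand-in for one decoded SpeechmarksData entry."""
    value: str
    start_offset_ms: int


def pending_speechmarks(speechmarks, player_offset_ms):
    """Return (value, delay_ms) pairs for speechmarks the player has not passed yet.

    delay_ms is startOffsetInMilliSeconds minus playerOffsetInMilliseconds.
    Entries with a negative difference are already in the past and are skipped.
    A difference of zero means the value should be processed immediately.
    """
    pending = []
    for mark in speechmarks:
        delay_ms = mark.start_offset_ms - player_offset_ms
        if delay_ms >= 0:
            pending.append((mark.value, delay_ms))
    return pending


marks = [
    Speechmark("t", 3000),
    Speechmark("a", 5000),
    Speechmark("p", 9000),
    Speechmark("e", 11000),
]
print(pending_speechmarks(marks, 7000))  # [('p', 2000), ('e', 4000)]
```

With a player offset of `7000`, the `"t"` and `"a"` entries are dropped, and `"p"` and `"e"` are due 2000 and 4000 milliseconds later, respectively.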

Last updated: Mar 31, 2022