Alexa.Gadget.SpeechData Interface

This interface sends your gadget speechmark data. Speechmarks are metadata that enable your gadget to synchronize speech with visual experiences. One example of an action that your gadget can take based on this data is to lip sync to Alexa's text-to-speech (TTS).

Supporting this interface

To support this interface, the gadget must respond to the Echo device's Discover directive with a Discover.Response event that includes the following entry in its array of Capabilities:

   "type": "AlexaInterface",
   "interface": "Alexa.Gadget.SpeechData",
   "version": "1.0",
   "configurations": {
      "supportedTypes": [


This interface includes one directive: Speechmarks, as described next.

Speechmarks directive

This directive provides speechmark data to your gadget. The .proto file contents are as follows:

message SpeechmarksDirectiveProto {
   Directive directive = 1;
   message Directive {
      alexaGadgetSpeechData.SpeechmarksDirectivePayloadProto payload = 2;
      header.DirectiveHeaderProto header = 1;
   }
}

message DirectiveHeaderProto {
   string namespace = 1;
   string name = 2;
   string messageId = 3;
   string dialogRequestId = 4;
}

message SpeechmarksDirectivePayloadProto {
   repeated SpeechmarksData speechmarksData = 2;
   message SpeechmarksData {
      int32 startOffsetInMilliSeconds = 3;
      string type = 2;
      string value = 1;
   }
   int32 playerOffsetInMilliseconds = 1;
}


The fields of the SpeechmarksDirectiveProto message are as follows:

Field | Description | Type
directive | Contains a complete Speechmarks directive. | Directive


The fields of the Directive message are as follows:

Field | Description | Type
header | Contains the header for this directive. | DirectiveHeaderProto
payload | Contains the payload for this directive. | SpeechmarksDirectivePayloadProto


The fields of the DirectiveHeaderProto message are as follows:

Field | Description | Type
namespace | The namespace of this directive, which is Alexa.Gadget.SpeechData. | string
name | The name of this directive, which is Speechmarks. | string
messageId | An ID that uniquely identifies an instance of this directive. This string can be empty. | string
dialogRequestId | A unique ID that correlates this directive with a specific voice interaction from a user. You can ignore this field. | string
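
Because the header carries the namespace and name, a gadget can inspect those two fields to decide whether an incoming message is a Speechmarks directive. The following Python snippet is a minimal sketch of that check, assuming the header has already been decoded into a plain object. The DirectiveHeader class and the is_speechmarks_directive function are hypothetical names used for illustration and are not part of any Alexa SDK.

```python
from dataclasses import dataclass

# Values stated in the header field descriptions.
SPEECH_DATA_NAMESPACE = "Alexa.Gadget.SpeechData"
SPEECHMARKS_NAME = "Speechmarks"


@dataclass
class DirectiveHeader:
    """Hypothetical stand-in for a decoded DirectiveHeaderProto."""
    namespace: str
    name: str
    messageId: str = ""        # may be empty
    dialogRequestId: str = ""  # can be ignored


def is_speechmarks_directive(header: DirectiveHeader) -> bool:
    """Return True if the header identifies a Speechmarks directive."""
    return (header.namespace == SPEECH_DATA_NAMESPACE
            and header.name == SPEECHMARKS_NAME)
```

The check deliberately ignores messageId and dialogRequestId, since the first can be empty and the second can be ignored.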


The fields of the SpeechmarksDirectivePayloadProto message are as follows:

Field | Description | Type
speechmarksData | A list of objects that represent speechmark data. Each object specifies the type of data, its value, and its start offset. | repeated SpeechmarksData
playerOffsetInMilliseconds | Where the speech currently is in its stream, in milliseconds. | int32


The fields of the SpeechmarksData message are as follows:

Field | Description | Type
type | The type of speechmark data that this directive contains. Currently, the only possible value is "VISEME". A viseme is a mouth position that corresponds to a spoken sound. | string
value | The value of the speechmark. | string
startOffsetInMilliSeconds | The start offset of the value, in milliseconds. | int32

To determine how to sync speechmark data to Alexa's speech, use startOffsetInMilliSeconds minus playerOffsetInMilliseconds.

For example, say your gadget receives the following viseme speechmark data at playerOffsetInMilliseconds = 7000:

value: "t", startOffsetInMilliSeconds: 3000
value: "a", startOffsetInMilliSeconds: 5000
value: "p", startOffsetInMilliSeconds: 9000
value: "e", startOffsetInMilliSeconds: 11000

Because playerOffsetInMilliseconds is 7000, your gadget should start considering the values at "p", and ignore the earlier values.

If playerOffsetInMilliseconds and startOffsetInMilliSeconds are both zero, the gadget should process the data immediately.
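
The offset arithmetic above can be sketched in Python. This is one interpretation under stated assumptions, not official gadget code: Speechmark and schedule_speechmarks are hypothetical names, and the sketch assumes that a speechmark whose start offset is earlier than the player offset has already been spoken and can be skipped, as the example suggests.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Speechmark:
    """Hypothetical stand-in for one decoded SpeechmarksData entry."""
    value: str
    start_offset_ms: int  # startOffsetInMilliSeconds
    type: str = "VISEME"


def schedule_speechmarks(marks: List[Speechmark],
                         player_offset_ms: int) -> List[Tuple[str, int]]:
    """Return (value, delay_ms) pairs for speechmarks that are still ahead.

    delay_ms is startOffsetInMilliSeconds minus playerOffsetInMilliseconds.
    Entries with a negative delay are assumed to be in the past and dropped.
    """
    schedule = []
    for mark in marks:
        delay_ms = mark.start_offset_ms - player_offset_ms
        if delay_ms >= 0:
            schedule.append((mark.value, delay_ms))
    return schedule


# The example data from the text, received at playerOffsetInMilliseconds = 7000.
marks = [
    Speechmark("t", 3000),
    Speechmark("a", 5000),
    Speechmark("p", 9000),
    Speechmark("e", 11000),
]
print(schedule_speechmarks(marks, player_offset_ms=7000))
# [('p', 2000), ('e', 4000)]
```

With both offsets at zero the computed delay is 0, which matches the rule that the gadget should process the data immediately.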

Last updated: Mar 31, 2022