Interaction Model Overview
A client interacting with the Alexa Voice Service will regularly encounter events/directives that produce competing audio. For instance, a user may ask a question while Alexa is speaking or a previously scheduled alarm may trigger while music is streaming. The rules that govern the prioritization and handling of these inputs and outputs are referred to as the interaction model. In the following sections we’ll cover:
- InteractionModel API
- Interfaces, Directives, and Events
- Client Interaction with AVS
- Processing JSON
- Voice Request Lifecycle
- Testing the Interaction Model
- Next Steps
InteractionModel API
Declaring the InteractionModel 1.0 interface via the Capabilities API enables Alexa Routines for your product. InteractionModel 1.0 includes the new NewDialogRequest directive and modifications to the AVS interaction model voice request lifecycle.
Interfaces, Directives, and Events
The Alexa Voice Service (AVS) API is an aggregation of various fine-grained interfaces. Each interface is a collection of directives and events, which correspond to specific client functionality.
- Directives are messages sent from AVS telling a client to perform a specific action like playing audio from a distinct URL or setting an alarm.
- Events are messages sent from a client to AVS notifying Alexa something has occurred. The most common event is a speech request from your user.
This table provides a brief description of each interface exposed by the AVS API:
|Interface|Description|
|---|---|
|Alerts|The interface for setting, stopping, and deleting timers and alarms. For a conceptual overview, see Alerts Overview.|
|AudioActivityTracker|The interface that is used to inform Alexa which interface last occupied an audio channel.|
|AudioPlayer|The interface for managing and controlling audio playback that originates from an Alexa-managed queue. For a conceptual overview, see AudioPlayer Overview.|
|Bluetooth|The interface for managing connections with peer Bluetooth devices, such as smartphones and speakers.|
|EqualizerController|The interface that allows a product to adjust equalizer settings using Alexa, such as decibel (dB) levels and modes.|
|InputController|The interface that enables selecting and switching inputs on an Alexa-enabled product.|
|Notifications|The interface that delivers visual and audio indicators when notifications are available. For a conceptual overview, see Notifications Overview.|
|PlaybackController|The interface for navigating a playback queue via button presses or GUI affordances.|
|Settings|The interface that is used to manage the Alexa settings on your product, such as locale.|
|Speaker|The interface for controlling the volume of Alexa-originated content on your product, including mute and unmute.|
|SpeechRecognizer|The core interface for the Alexa Voice Service. Each user utterance leverages the Recognize event.|
|SpeechSynthesizer|The interface that returns Alexa speech (TTS).|
|System|The interface that is used to send Alexa information about your product.|
|TemplateRuntime|The interface for rendering visual metadata. For a conceptual overview, see Display Cards Overview.|
|VisualActivityTracker|The interface that is used to inform Alexa when content is actively displayed to an end user.|
Client Interaction with AVS
Interactions between your client and AVS are initiated in two ways:
- In a typical interaction, your client sends an event to AVS. The event is processed, and AVS returns zero or more directives to your client as a result. For example, when a customer asks Alexa, "What time is it?", the client streams the captured audio to the Alexa Voice Service. Once Alexa has processed the event, a directive is returned instructing your client to output speech; in this case, Alexa might respond, "It is 10:00 a.m."
- In a cloud-initiated interaction, the client may receive directives without any preceding client events. For example, when a user adjusts client volume from the Amazon Alexa app there is no event sent directly from the client to Alexa. Alexa interprets the action taken on the Amazon Alexa app and sends a directive to the client, which the client then acts upon.
You must send each event to the cloud in its own event stream. Directives and corresponding audio attachments may be returned to your client from the cloud in the same stream or in a separate downchannel stream. The downchannel is a stream primarily used to deliver cloud-initiated directives to your client. The downchannel remains open in a half-closed state from the device and open from the Alexa Voice Service for the life of a connection. Event streams and the downchannel can be implemented in various ways depending on transport protocol. We provide guidance to help you establish both over HTTP/2. For more information, see Managing an HTTP/2 Connection.
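The stream model above can be sketched in code. This is a minimal, transport-free illustration (not the AVS Device SDK and not a real HTTP/2 client): it only shows that every event occupies its own fresh stream while a single long-lived downchannel stream is held open for cloud-initiated directives. All class and method names here are hypothetical.

```python
# Illustrative sketch of the AVS stream model: one long-lived downchannel
# plus a fresh stream per event. No networking is performed; the names
# Connection, send_event, and downchannel_id are invented for this sketch.

class Connection:
    def __init__(self):
        self._next_stream_id = 1
        # The downchannel is opened once, when the connection is
        # established, and stays open for the life of the connection.
        self.downchannel_id = self._open_stream()

    def _open_stream(self):
        stream_id = self._next_stream_id
        self._next_stream_id += 2  # HTTP/2 client-initiated streams use odd IDs
        return stream_id

    def send_event(self, event_name):
        # Every event goes out on its own stream; directives sent in
        # direct response may arrive on this same stream.
        stream_id = self._open_stream()
        return {"stream": stream_id, "event": event_name}

conn = Connection()
a = conn.send_event("SpeechRecognizer.Recognize")
b = conn.send_event("System.SynchronizeState")
# Each event stream is distinct from the downchannel and from other events.
assert conn.downchannel_id == 1 and a["stream"] == 3 and b["stream"] == 5
```

The real mechanics (stream multiplexing, half-closing the request side of the downchannel) are handled by your HTTP/2 library; see Managing an HTTP/2 Connection for the authoritative details.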
Processing JSON
When new features are introduced, AVS may add new properties to a directive's JSON payload while maintaining backward compatibility for existing properties. Your code must be resilient to such changes; for example, your JSON parsing must not break when a new, unknown property is encountered. See MessageInterpreter.cpp in the AVS Device SDK for example code.
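One way to satisfy this rule is to read only the fields your client understands and ignore everything else. The sketch below uses a simplified, hypothetical Speak-style payload (the property names `url`, `format`, and `brandNewProperty` are illustrative, not the exact AVS schema):

```python
import json

# A hedged sketch of tolerant JSON parsing: the client reads known fields
# and silently ignores properties added after it was written.
raw = """{
  "directive": {
    "header": {"namespace": "SpeechSynthesizer", "name": "Speak"},
    "payload": {
      "url": "cid:example-audio",
      "format": "AUDIO_MPEG",
      "brandNewProperty": {"added": "later"}
    }
  }
}"""

directive = json.loads(raw)
payload = directive["directive"]["payload"]

# .get() tolerates missing keys; unknown keys are simply never consulted,
# so "brandNewProperty" cannot break this code path.
url = payload.get("url")
audio_format = payload.get("format", "AUDIO_MPEG")
```

The key point is that unknown properties are neither enumerated nor validated against a closed schema, so additions remain invisible to existing code.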
Voice Request Lifecycle
When designing your client, you must ensure that only one voice request is active at any given time. This means that your client must create a unique dialogRequestId for each Recognize event that is sent to the cloud, and keep track of the active dialogRequestId during a session. Each Recognize event must have a unique dialogRequestId; the dialogRequestId is used to correlate the Recognize event with directives sent to your client from the cloud.

When the next Recognize event is sent to the cloud, whether a new request or a response to an ExpectSpeech directive, it must have a unique dialogRequestId (it cannot match the dialogRequestId of the previous Recognize event in that session); all directives associated with the previously active dialogRequestId must be dropped.

You must also make sure that your client supports interactions generated by Alexa. The dialogRequestId in the payload of an InteractionModel.NewDialogRequest directive must be set to active and acted on immediately when received from Alexa, and all directives associated with the previously active dialogRequestId must be dropped.
If an interaction is initiated by Alexa, a NewDialogRequest directive is sent with a dialogRequestId in the payload. This dialogRequestId replaces any previously set dialogRequestId, and directives associated with the previous dialogRequestId must be dropped.
When events are sent to Alexa, these rules must be enforced by your client:
- For each Recognize event, you must create a unique dialogRequestId; dialogRequestIds cannot be reused during a session.
- The dialogRequestId must be included in the Recognize event's header.
- The dialogRequestId must remain active until the next Recognize event is sent to the cloud. When this occurs, all directives associated with the previously active dialogRequestId must be dropped.
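The event-side rules above can be sketched with a small tracker that mints a fresh identifier for every Recognize event. This is an illustrative sketch, not SDK code: the class name DialogRequestTracker and the use of a UUID are assumptions (AVS only requires uniqueness within the session; a UUID is one easy way to get it).

```python
import uuid

# Hedged sketch: create a unique dialogRequestId per Recognize event and
# track which one is active. Names here are illustrative, not from AVS.
class DialogRequestTracker:
    def __init__(self):
        self.active_id = None
        self.used_ids = set()

    def new_recognize_event(self):
        new_id = str(uuid.uuid4())      # unique, so never reused in a session
        self.used_ids.add(new_id)
        self.active_id = new_id         # the previous id is superseded
        return {
            "header": {
                "namespace": "SpeechRecognizer",
                "name": "Recognize",
                "dialogRequestId": new_id,   # carried in the event header
            },
            "payload": {},
        }

tracker = DialogRequestTracker()
first = tracker.new_recognize_event()
second = tracker.new_recognize_event()
# The second Recognize event gets a new id, which becomes the active one.
assert first["header"]["dialogRequestId"] != second["header"]["dialogRequestId"]
assert tracker.active_id == second["header"]["dialogRequestId"]
```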
When directives are received from Alexa, the following rules must be enforced by your client:
- Directives sent from the cloud with a dialogRequestId in the header that matches the active dialogRequestId must be processed in sequence.
- The dialogRequestId in the payload of an InteractionModel.NewDialogRequest directive must be set to active and acted on immediately. Directives with a dialogRequestId in the header that matches the dialogRequestId delivered by the NewDialogRequest directive must be processed in sequence.
- Directives without a dialogRequestId must be executed immediately.
- When new, unknown directives are encountered, your client must send an ExceptionEncountered event to Alexa. Important: receiving an unknown directive must not break your code.
- If you receive a Speak directive (which is issued when Alexa returns spoken text), you must fully play back the associated audio before processing subsequent directives.
See AudioInputProcessor.cpp in the AVS Device SDK for example code.
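The directive-side rules can be sketched as a small router that sorts incoming directives into "process in sequence", "process immediately", and "drop". This is a minimal illustration under stated assumptions, not the SDK's DirectiveSequencer: the class DirectiveRouter and the dialog ids "dialog-0"/"dialog-1"/"dialog-2" are invented for the sketch.

```python
# Hedged sketch of the directive-handling rules: directives matching the
# active dialogRequestId are queued in order, directives with no id run
# immediately, and directives tied to a superseded id are dropped.
class DirectiveRouter:
    def __init__(self):
        self.active_id = None
        self.sequenced = []   # processed in arrival order
        self.immediate = []   # processed as they arrive
        self.dropped = []

    def on_new_dialog_request(self, dialog_request_id):
        # InteractionModel.NewDialogRequest: the payload id becomes active,
        # and anything still queued for the old id is discarded.
        self.active_id = dialog_request_id
        self.sequenced = [d for d in self.sequenced
                          if d["header"].get("dialogRequestId") == dialog_request_id]

    def on_directive(self, directive):
        did = directive["header"].get("dialogRequestId")
        if did is None:
            self.immediate.append(directive)   # e.g. a cloud-initiated volume change
        elif did == self.active_id:
            self.sequenced.append(directive)   # handle in sequence
        else:
            self.dropped.append(directive)     # stale dialog: drop

router = DirectiveRouter()
router.on_new_dialog_request("dialog-1")
router.on_directive({"header": {"name": "Speak", "dialogRequestId": "dialog-1"}})
router.on_directive({"header": {"name": "SetVolume"}})                             # no id
router.on_directive({"header": {"name": "Speak", "dialogRequestId": "dialog-0"}})  # stale
router.on_new_dialog_request("dialog-2")   # supersedes dialog-1; its queue is cleared

assert [d["header"]["name"] for d in router.immediate] == ["SetVolume"]
assert len(router.dropped) == 1 and not router.sequenced
```

A real client would also block the sequenced queue while a Speak directive's audio plays out, per the rule above.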
All audio handled by your client can be organized into three categories, called channels: the Dialog channel, the Alerts channel, and the Content channel. Channels govern how your client prioritizes audio inputs and outputs. Each channel is associated with one or more AVS interfaces, and each channel can be active or inactive.
For example, SpeechSynthesizer is associated with the Dialog channel, and when Alexa returns a Speak directive to your client, the Dialog channel is active and remains active until Alexa has finished responding. Similarly, when a timer goes off, the Alerts channel becomes active and remains active until the timer is cancelled.
Each channel is active under the following conditions:
- The Dialog channel is active when either a user or Alexa is speaking.
- The Alerts channel is active when a timer or alarm is sounding.
- The Content channel is active when your client is playing media, such as audio streams.
It is possible for multiple channels to be active at once. For instance, if a user is listening to music and asks Alexa a question, the Content and Dialog channels are concurrently active as long as the user or Alexa is speaking.
Channels can either be in the foreground or background. At any given time, only one channel can be in the foreground. If multiple channels are active, you need to respect the following priority order: Dialog > Alerts > Content. When a channel that is in the foreground becomes inactive, the next active channel in the priority order moves into the foreground.
The following rules govern how channels interact:
- Inactive channels are always in the background.
- The Dialog channel is always in the foreground when active.
- The Alerts channel is only in the foreground if the Dialog channel is inactive.
- The Content channel is only in the foreground if the other channels are inactive.
- When a channel in the foreground becomes inactive, the next active channel in the priority order moves into the foreground.
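The foreground-selection rules above reduce to picking the highest-priority active channel. The sketch below is a minimal illustration of that rule (the function name and the representation of channels as a set of strings are assumptions, not part of the AVS API):

```python
# Hedged sketch: the foreground channel is the highest-priority active
# channel, using the fixed ordering Dialog > Alerts > Content.
PRIORITY = ["Dialog", "Alerts", "Content"]

def foreground_channel(active_channels):
    for channel in PRIORITY:
        if channel in active_channels:
            return channel
    return None  # no channel is active

# Music playing while the user talks to Alexa: Dialog wins the foreground.
assert foreground_channel({"Content", "Dialog"}) == "Dialog"
# A timer sounding over music: Alerts wins until the timer is stopped.
assert foreground_channel({"Alerts", "Content"}) == "Alerts"
assert foreground_channel(set()) is None
```

When the foreground channel becomes inactive, simply re-evaluating this function over the remaining active channels yields the next foreground channel.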
If Alexa returns an ExpectSpeech directive in response to a Recognize event, prompting a user for additional speech, the Dialog channel should remain active until all directives associated with the request/response scenario are processed.
How you handle a directive for a given interface depends on the state of the associated channel (active or inactive; foreground or background). For instance, if the Dialog channel is in the foreground and an alarm sounds, the alarm should play as a short alert for as long as the Dialog channel is active. If an alarm sounds while the Dialog channel is inactive, a long alert should play.
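The alarm rule just described is a one-line decision, sketched here for clarity (the function name and string return values are illustrative, not AVS terms):

```python
# Hedged sketch: an alarm plays as a short alert while the Dialog channel
# is active, and as a long alert otherwise.
def alert_mode(dialog_channel_active):
    return "short" if dialog_channel_active else "long"

assert alert_mode(True) == "short"   # Alexa or the user is speaking
assert alert_mode(False) == "long"   # no dialog in progress
```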
For specifics on how to handle each directive, see AVS API Overview.
Testing the Interaction Model
Here are a few tests you can run to ensure that your implementation is working. Each of these can be tested on an Amazon Echo device or using the AVS Device SDK.
- Ask Alexa to set a timer for 5 seconds. Once Alexa notifies you that the timer has been set, ask Alexa for the weather forecast. As Alexa provides you with the forecast, the timer should go off as a short alert until Alexa has finished speaking. This is because the Dialog channel is active, which means that the Alerts channel must be in the background. Once Alexa has finished speaking, the Alerts channel moves to the foreground, and a long alert should continue to play until a user stops the timer.
- Ask Alexa to set a timer for 1 minute. When Alexa notifies you that the timer has been set, ask Alexa to play your favorite song. The song should begin playing, and approximately 1 minute into playback the music should be backgrounded* while your timer goes off. This is because the Content channel can only be in the foreground if the other channels are inactive. The music should remain backgrounded until a user stops the timer, at which point, your favorite song should return to normal volume or resume from a paused state.
- Ask Alexa to play your favorite song (if you’re not bored by it yet!). Once playback begins, ask Alexa for local news. The music should be backgrounded for your entire voice request and Alexa’s response. This is because the Dialog channel is always in the foreground while active. When Alexa is finished responding, music should return to normal volume or resume from a paused state.
*When the Content channel is backgrounded, audio playback is paused or attenuated.
Next Steps
- Alerts Overview
- AudioPlayer Overview
- Display Cards Overview
- Notifications Overview
- Recommended Media Support