Process Overview for Implementing VSK on Multimodal Devices

Learn about how to enable an existing video skill to stream content on multimodal devices. For a description of all possible voice capabilities on multimodal devices, see the Video Skills Kit for Multimodal Devices Overview.

General Architecture

The Overview guide provides a High-level Workflow. This section walks you through the workflow in greater detail.

On multimodal devices, voice interaction results for video skills are sent back directly to Alexa to drive the experience on the device, unlike the Fire TV app video skill architecture where requests must be sent from the Lambda to your Fire TV app through Amazon Device Messaging (ADM).

The following diagram presents the video skill workflow on multimodal devices:

Video Skill diagram and workflow for multimodal devices

As a developer, you do not have to take action at every stage of this process. However, a clear picture of what happens behind the scenes helps you provide a successful user experience.

User utters a phrase

Alexa listens for natural language commands from users. Supported utterances include phrases for searching, playing, changing the channel, transport controls such as fast-forwarding and rewinding, and more.

Multimodal device sends utterance to Alexa cloud for processing/intent determination

The multimodal device sends these utterances to Alexa in the cloud. In the cloud, Alexa processes the user's utterances using automatic speech recognition and converts the speech to text. As a developer, you get all of this language processing and interpretation for free.

Alexa in the cloud then parses and interprets the user's utterance. The result is a "request": a set of data and instructions, expressed as a JSON object, that provides direction on how to respond to the user's utterance.

For example, when a user says "Play Bosch," Alexa in the cloud converts this into a GetPlayableItems directive with a specific JSON structure, like this:

    {
        "directive": {
            "profile": null,
            "payload": {
                "minResultLimit": 1,
                "entities": [
                    {
                        "externalIds": null,
                        "type": "MediaType",
                        "value": "MOVIE",
                        "entityMetadata": null,
                        "mergedGroupId": 0
                    },
                    {
                        "externalIds": {
                            "catalog_name": "123456"
                        },
                        "type": "Video",
                        "value": "Bosch",
                        "entityMetadata": null,
                        "mergedGroupId": 1
                    }
                ],
                "timeWindow": null,
                "locale": "en-US",
                "contentType": null,
                "maxResultLimit": 40
            },
            "endpoint": {
                "cookie": {},
                "scope": {
                    "token": "1dc32f5e-1694-38a0-1af6-e948f45adad9",
                    "type": "BearerToken"
                }
            },
            "header": {
                "payloadVersion": "3",
                "messageId": "01c46fa2-fcca-4c24-93bd-e6bed03ef906",
                "namespace": "Alexa.VideoContentProvider",
                "name": "GetPlayableItems",
                "correlationToken": null
            }
        }
    }
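Your Lambda can read the entities array in this payload to determine what the user asked for. As an illustrative Node.js sketch (the helper and variable names are my own, not part of the Video Skill API):

```javascript
// Illustrative sketch: pull the spoken title out of a GetPlayableItems
// directive payload shaped like the example above.
function getTitleEntity(directive) {
  const entities = (directive.payload && directive.payload.entities) || [];
  // "Video" entities carry the spoken title; "MediaType" entities carry
  // hints such as MOVIE.
  return entities.find((entity) => entity.type === "Video") || null;
}

const exampleDirective = {
  header: { namespace: "Alexa.VideoContentProvider", name: "GetPlayableItems" },
  payload: {
    entities: [
      { type: "MediaType", value: "MOVIE" },
      { type: "Video", value: "Bosch", externalIds: { catalog_name: "123456" } },
    ],
  },
};

console.log(getTitleEntity(exampleDirective).value); // prints "Bosch"
```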

The following table lists the kinds of requests Alexa generates:

Feature              Sample Utterances
Quick Play           "Alexa, play <TV show> on <video provider>", "Alexa, watch <TV show> on <video provider>"
Channel Navigation   "Alexa, tune to <channel>"
Playback Controls    "Alexa, pause", "Alexa, fast forward"
Search               "Alexa, find comedies on <video provider>"
Browse               "Alexa, show me videos", "Alexa, open <video provider>"
Video Home           "Alexa, show me videos", "Alexa, go to Video Home", "Alexa, Video Home"

Alexa cloud determines the video skill target and sends intent payload to partner Lambda

Alexa identifies which video skill to target with the request, and then sends the request to your AWS Lambda function through the Video Skill API. Your video skill's configuration provides the resource ID of your Lambda function.

Lambda is an AWS service that runs code in the cloud without requiring you to host it on a server (serverless computing). You can write your Lambda function in a variety of programming languages; however, the sample Lambda code uses Node.js. You are responsible for developing the logic in your Lambda function, but the sample code speeds up this process.

The request sent from Alexa cloud contains an intent payload that your Lambda acts upon.
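A common pattern is to dispatch on the directive name in the request header. A minimal Node.js sketch, assuming the directive names shown in this guide (the handler body is a placeholder, not a real implementation):

```javascript
// Illustrative dispatch skeleton for a Node.js video skill Lambda.
// The directive name comes from event.directive.header; the handler
// body here is a stub you would replace with real catalog logic.
async function handleGetPlayableItems(directive) {
  // In a real skill, look up the requested title in your catalog here.
  return {
    event: {
      header: {
        name: "GetPlayableItemsResponse",
        namespace: "Alexa.VideoContentProvider",
        payloadVersion: "3",
      },
      payload: { mediaItems: [] },
    },
  };
}

const handler = async (event) => {
  const { name } = event.directive.header;
  switch (name) {
    case "GetPlayableItems":
      return handleGetPlayableItems(event.directive);
    default:
      throw new Error(`Unsupported directive: ${name}`);
  }
};

exports.handler = handler;
```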

Partner Lambda discovery is validated and video skill is invoked on the multimodal device

After Alexa cloud connects with your Lambda, your video skill gets invoked on the multimodal device. The video skill contains a URI for your web player so that the skill knows which web player to load on the customer's device.

Partner Lambda further processes the request and sends payload to partner services

Your Lambda function contacts backend services for information on processing the request. For example, depending on the request your Lambda receives, you might need to do lookups, queries, or other information retrieval functions.
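As a minimal sketch of the lookup step, using an in-memory object as a stand-in for a real catalog service (in practice this would be a call to your own backend; all names here are illustrative):

```javascript
// In-memory stand-in for a partner catalog service; a real Lambda would
// make an HTTPS request to your backend here.
const catalog = {
  Bosch: { id: "videoId://amzn1.av.rp.1234-2345-63434-asdf", playable: true },
};

async function lookupTitle(title) {
  // Replace this lookup with a query against your catalog/search service.
  return catalog[title] || null;
}
```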

Partner services process the request and respond to Lambda

Your backend services process the request. Exactly what your backend looks like differs from partner to partner, and describing these processes is beyond the scope of this documentation. However your processes work, your Lambda needs to get the requested information and return it to Alexa.

Alexa JavaScript library receives payload from Lambda and executes request (playback, search or transport)

Your web player integrates with the Alexa JavaScript library. This library enables communication with Alexa cloud, manages lifecycle events, and more.

Partner web player environment receives assets/stream URL from partner services

Your web player receives information from your backend services about the content to stream, such as assets, URLs, and so on. For example, executing on the request to play a certain title, your web player might receive thumbnails related to the title and a URL to stream this title.
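As a hedged sketch, the web player might map a backend playback response onto the fields it needs. The response shape here (streamUrl, thumbnails) is hypothetical, not a defined API:

```javascript
// Map a (hypothetical) backend playback response to what the player
// needs: a stream URL plus artwork to display while loading.
function toPlayerSource(playbackInfo) {
  return {
    src: playbackInfo.streamUrl,
    poster: (playbackInfo.thumbnails && playbackInfo.thumbnails[0]) || null,
  };
}
```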

Alexa JavaScript Library communicates response to partner Lambda

The Alexa JavaScript library communicates a response back to your Lambda, confirming receipt of assets and URLs and reporting any other actions taken.

Partner Lambda sends response to Alexa cloud

Your Lambda sends a response to Alexa cloud. The response needs to conform to a specific JSON structure (defined in the reference documentation) depending on the request received.

For example, if your Lambda receives a GetPlayableItems request, your Lambda could send a GetPlayableItemsResponse that looks as follows:

    {
        "event": {
            "header": {
                "correlationToken": "dFMb0z+PgpgdDmluhJ1LddFvSqZ/jCc8ptlAKulUj90jSqg==",
                "messageId": "5f0a0546-caad-416f-a617-80cf083a05cd",
                "name": "GetPlayableItemsResponse",
                "namespace": "Alexa.VideoContentProvider",
                "payloadVersion": "3"
            },
            "payload": {
                "nextToken": "fvkjbr20dvjbkwOpqStr",
                "mediaItems": [{
                    "mediaIdentifier": {
                        "id": "videoId://amzn1.av.rp.1234-2345-63434-asdf"
                    }
                }]
            }
        }
    }

More details about what responses are allowed for the requests your Lambda receives are described in the reference documentation.

Note that sending information back to Alexa (rather than pushing instructions to an app through ADM) is the main difference in workflows between video skills on Fire TV apps and multimodal devices. With Fire TV apps, your Lambda simply provides a brief status message about having received the request and then sends the instruction to your app through ADM. But with multimodal devices, you actually provide the requested information back to Alexa.

Alexa cloud processes response and provides appropriate text-to-speech (TTS) back through multimodal device (if required)

Alexa communicates with the multimodal device to fulfill the user's request, most likely by supplying a content URI for media playback or search results. The response might also include text-to-speech that Alexa speaks to the user. For example, Alexa might ask the user to disambiguate among several similar matches.

Implementation Steps

The following sections describe the process for implementing a video skill on a multimodal device.


The initial implementation steps here provide a quickstart setup with a sample Lambda function, web player, catalog, and skill assets so you can see the basic flow. This basic setup lets you say "Alexa, play the movie Big Buck Bunny" to play a video on an Echo Show device.

Next Steps

Get started on building out your video skill by going to Step 1: Create Your Video Skill and Lambda Function.