Process Overview for Implementing VSK on Multimodal Devices

Follow this step-by-step guide to enable an existing video skill to stream content to a multimodal device. For a description of all possible voice capabilities on multimodal devices, see the Video Skills Kit for Multimodal Devices Overview.

Architecture Overview

The Overview provided a High-level Workflow. Now that you're digging deeper into implementation, let's walk through the workflow in greater detail.

Unlike the Fire TV app video skill architecture where requests must be sent from the Lambda to your Fire TV app through Amazon Device Messaging (ADM), results for video skills on multimodal devices are sent back directly to Alexa to drive the experience on the device.

The following diagram shows the video skill workflow on multimodal devices:

Video Skill diagram and workflow for multimodal devices

Here's the high-level description of the workflow:

  1. User utters a phrase such as "Alexa, Play Bosch."
  2. Multimodal device sends utterance to Alexa cloud for processing/intent determination.
  3. Alexa cloud determines video skill target and sends intent payload to partner Lambda.
  4. Partner Lambda discovery is validated, and video skill is invoked on the multimodal device.
  5. Partner Lambda further processes the request and sends payload to partner services.
  6. Partner services processes request and responds to Lambda with appropriate response.
  7. Alexa JavaScript library receives payload from Lambda and executes request (playback/search/transport).
  8. Partner web player environment receives assets/stream URL from partner services.
  9. Alexa JS Library communicates response to partner Lambda.
  10. Partner Lambda sends response to Alexa cloud.
  11. Alexa cloud processes response and provides appropriate text-to-speech (TTS) back through multimodal device (if required).

Detailed Workflow Explanation

More detail about each workflow step is described in the following sections.

User utters a phrase such as "Alexa, Play Bosch"

On your multimodal devices, Alexa listens for natural language commands from users. Supported utterances (as they're called) include phrases that involve search, play, changing the channel, fast-forwarding or rewinding (transport controls), and more. For example, "Play Bosch."

Multimodal device sends utterance to Alexa cloud for processing/intent determination

The multimodal device sends these utterances to Alexa in the cloud. In the cloud, Alexa processes the user's utterances using automatic speech recognition and converts the speech to text. Alexa also processes the commands with natural language understanding to recognize the intent of the text. As an app developer, you get all of this language processing and interpretation for free.

The output from Alexa in the cloud, which handles the parsing and interpretation of the user's utterances/commands, is a "request." A request is a set of data and instructions, expressed as a JSON object, that provides direction on how to respond to the user's utterances.

For example, when a user says "Play Bosch," Alexa in the cloud converts this into a GetPlayableItems directive that has a specific JSON structure, like this:

 {
     "directive": {
         "profile": null,
         "payload": {
             "minResultLimit": 1,
             "entities": [
                 {
                     "externalIds": null,
                     "type": "MediaType",
                     "value": "MOVIE",
                     "entityMetadata": null,
                     "mergedGroupId": 0
                 },
                 {
                     "externalIds": {
                         "catalog_name": "123456"
                     },
                     "type": "Video",
                     "value": "Bosch",
                     "entityMetadata": null,
                     "mergedGroupId": 1
                 }
             ],
             "timeWindow": null,
             "locale": "en-US",
             "contentType": null,
             "maxResultLimit": 40
         },
         "endpoint": {
             "cookie": {},
             "endpointId": "ALEXA_VOICE_SERVICE_EXTERNAL_MEDIA_PLAYER_VIDEO_PROVIDER",
             "scope": {
                 "token": "1dc32f5e-1694-38a0-1af6-e948f45adad9",
                 "type": "BearerToken"
             }
         },
         "header": {
             "payloadVersion": "3",
             "messageId": "01c46fa2-fcca-4c24-93bd-e6bed03ef906",
             "namespace": "Alexa.VideoContentProvider",
             "name": "GetPlayableItems",
             "correlationToken": null
         }
     }
 }

The following table lists the kinds of requests Alexa generates:

Feature            | Sample Utterances
Quick Play         | "Alexa, play <TV show> on <video provider>", "Alexa, watch <TV show> on <video provider>"
Channel Navigation | "Alexa, tune to <channel>"
Playback Controls  | "Alexa, pause", "Alexa, fast forward"
Search             | "Alexa, find comedies on <video provider>"
Browse             | "Alexa, show me videos", "Alexa, open <video provider>"
Video Home         | "Alexa, show me videos", "Alexa, go to Video Home", "Alexa, Video Home"

You can read more details about each request in Capabilities Provided with Video Skills for Multimodal Devices.

Alexa cloud determines the video skill target and sends intent payload to partner Lambda

Alexa determines which video skill to target with the request and then sends this request to your AWS Lambda function using the Video Skill API. The video skill's configuration provides the resource ID of your Lambda function, which tells Alexa where to send the request.

Lambda is an AWS service that runs code in the cloud without requiring you to provision a server to host the code (serverless computing). Lambda is a key component that interfaces with other services as it processes the request. Your Lambda function can be coded in a variety of programming languages, but the sample Lambda code in this documentation uses Node.js. Note that you are responsible for programming the logic in your Lambda function.

The request sent from Alexa cloud contains an intent payload that your Lambda will act upon.
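To make this concrete, here is a minimal sketch (in Node.js, matching the sample code in this documentation) of a Lambda entry point that routes incoming directives by name. The handler function names are placeholders for your own logic; only the directive shape (header and payload) comes from the request format shown above.

    // Minimal sketch of a video skill Lambda entry point (Node.js).
    // The two handler functions below are placeholders for your own logic.

    async function handleGetPlayableItems(directive) {
        // Resolve directive.payload.entities to playable content identifiers.
        // A full GetPlayableItemsResponse example appears later on this page.
        return {};
    }

    async function handleGetDisplayableItems(directive) {
        // Resolve search/browse entities to items to display on screen.
        return {};
    }

    exports.handler = async (event) => {
        const directive = event.directive;

        switch (directive.header.name) {
            case 'GetPlayableItems':
                return handleGetPlayableItems(directive);
            case 'GetDisplayableItems':
                return handleGetDisplayableItems(directive);
            default:
                // Log anything you don't handle yet so you can add support later.
                console.log('Unhandled directive:', directive.header.name);
                return {};
        }
    };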

Partner Lambda discovery is validated and video skill is invoked on the multimodal device

After Alexa cloud connects with your Lambda, your video skill is invoked on the multimodal device. The video skill's configuration contains the URI for your web player, so Alexa knows which web player to load on the device.

Partner Lambda further processes the request and sends payload to partner services

Your Lambda function will need to communicate with a services backend for the information it needs to process the request. The Lambda is somewhat like a brain that sends signals to other parts of your body for action. For example, depending on the request your Lambda receives, you might need to perform lookups, queries, or other information retrieval operations.
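For example, a lookup might look something like the following sketch, again in Node.js. The catalog endpoint, query parameter, authorization token, and response shape are hypothetical stand-ins for whatever your own services backend exposes.

    // Hypothetical sketch: look up a title in a partner catalog service.
    // The endpoint, token, and response shape are placeholders for your own backend.
    async function lookupTitle(videoName) {
        const url = 'https://api.example.com/catalog/search?q=' +
            encodeURIComponent(videoName);

        // fetch is available globally in the Node.js 18+ Lambda runtimes.
        const response = await fetch(url, {
            headers: { Authorization: 'Bearer ' + process.env.CATALOG_API_TOKEN }
        });

        if (!response.ok) {
            throw new Error('Catalog lookup failed: ' + response.status);
        }

        // Hypothetical response shape: { items: [{ id: '...', title: '...' }] }
        return response.json();
    }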

Partner services processes request and responds to Lambda with appropriate response

Your services backend processes requests based on the preferences and logic of your particular cloud environment. Exactly what your services backend looks like differs from partner to partner, and it is beyond the scope of this documentation to describe those processes here. Basically, your Lambda function might need to consult other services in your environment to provide the needed information. However your processes work, your Lambda essentially needs to get the requested information and return it to Alexa.

Alexa JavaScript library receives payload from Lambda and executes request (playback/search/transport)

Your web player integrates with the Alexa JavaScript library. This library enables communication with Alexa cloud, manages lifecycle events, and more.
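Conceptually, the integration looks something like the sketch below. The controller object and handler names here are illustrative placeholders only, not the Alexa JavaScript library's actual interface; consult the library reference for the real initialization call and handler names.

    // Illustrative shape only -- the controller object and handler names below
    // are placeholders, not the Alexa JavaScript library's actual API.
    // Conceptually, your web player registers callbacks, and the library calls
    // them when directives arrive from your Lambda.
    function integrateWithAlexa(alexaController, videoElement) {
        // Called when Alexa asks the player to load and play content.
        alexaController.onLoadContent((contentUri) => {
            videoElement.src = contentUri;
            videoElement.play();
        });

        // Called for transport controls such as "Alexa, pause."
        alexaController.onPause(() => videoElement.pause());
        alexaController.onResume(() => videoElement.play());

        // Report playback state back so Alexa can manage the device experience.
        videoElement.addEventListener('ended', () => {
            alexaController.reportPlaybackEnded();
        });
    }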

Partner web player environment receives assets/stream URL from partner services

Your web player receives assets and URLs for content to stream from your backend services. For example, when executing a request to play a certain title, your web player might receive thumbnails related to the title and a URL for streaming it.
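A sketch of that hand-off might look like the following, where the playback-info endpoint and its response fields (streamUrl, thumbnailUrl) are hypothetical stand-ins for your own backend.

    // Hypothetical sketch: fetch playback assets for a title from your backend
    // and hand them to an HTML5 video element in the web player.
    async function loadTitle(contentId, videoElement) {
        const response = await fetch(
            'https://api.example.com/playback-info/' + encodeURIComponent(contentId));
        const info = await response.json();

        // Hypothetical response fields: thumbnailUrl and streamUrl.
        videoElement.poster = info.thumbnailUrl;
        videoElement.src = info.streamUrl;
        await videoElement.play();
    }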

Alexa JavaScript Library communicates response to partner Lambda

The Alexa JavaScript Library communicates a response back to your partner Lambda, confirming receipt of the assets and URLs and any other actions taken.

Partner Lambda sends response to Alexa cloud

Your Lambda sends a response to Alexa cloud. The response needs to conform to a specific JSON structure (defined in the reference documentation) depending on the request received.

For example, if your Lambda receives a GetPlayableItems request, your Lambda could respond with a GetPlayableItemsResponse response that looks as follows:

 {
     "event": {
         "header": {
             "correlationToken": "dFMb0z+PgpgdDmluhJ1LddFvSqZ/jCc8ptlAKulUj90jSqg==",
             "messageId": "5f0a0546-caad-416f-a617-80cf083a05cd",
             "name": "GetPlayableItemsResponse",
             "namespace": "Alexa.VideoContentProvider",
             "payloadVersion": "3"
         },
         "payload": {
             "nextToken": "fvkjbr20dvjbkwOpqStr",
             "mediaItems": [{
                 "mediaIdentifier": {
                     "id": "videoId://amzn1.av.rp.1234-2345-63434-asdf"
                 }
             }]
         }
     }
 }

More details about what responses are allowed for the requests your Lambda receives are described in the reference documentation.
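As a rough sketch of how a Lambda handler might assemble that response in Node.js (findContentId below is a placeholder for your own catalog lookup, and the hard-coded ID simply reuses the example value above):

    // Sketch of building a GetPlayableItemsResponse in the Lambda (Node.js).
    // The header and payload fields mirror the response structure shown above.
    const { randomUUID } = require('crypto');

    // Placeholder: in a real skill you would resolve the requested entities
    // against your catalog; here we return the example ID from above.
    async function findContentId(entities) {
        return 'videoId://amzn1.av.rp.1234-2345-63434-asdf';
    }

    async function handleGetPlayableItems(directive) {
        const contentId = await findContentId(directive.payload.entities);

        return {
            event: {
                header: {
                    // Echo back the correlation token from the incoming directive.
                    correlationToken: directive.header.correlationToken,
                    messageId: randomUUID(),
                    name: 'GetPlayableItemsResponse',
                    namespace: 'Alexa.VideoContentProvider',
                    payloadVersion: '3'
                },
                payload: {
                    mediaItems: [
                        { mediaIdentifier: { id: contentId } }
                    ]
                }
            }
        };
    }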

Note that sending information back to Alexa (rather than pushing instructions to your app through ADM) is the main difference in workflows between video skills on Fire TV apps and multimodal devices. With Fire TV apps, your Lambda simply provides a brief status message about having received the request and then sends the instruction to your app through ADM. But with multimodal devices, you actually provide the requested information back to Alexa.

Alexa cloud processes response and provides appropriate text-to-speech (TTS) back through multimodal device (if required)

Alexa then communicates with the multimodal device to fulfill the user's request. Most likely this involves supplying a content URI for media playback, search results, or other information. The response might include text-to-speech commands that Alexa communicates to the user. For example, Alexa might ask the user to disambiguate among several similar matching titles.

Implementation Steps

The process for implementing a video skill on a multimodal device is broken out as follows:

Quickstart Within Implementation

The initial implementation steps here provide a quickstart setup with a sample Lambda function, web player, catalog, and skill assets so you can see the basic flow. This basic setup lets you say "Alexa, play the movie Big Buck Bunny" to your Echo Show to play a video.

These initial steps give you a sense of how video gets delivered on multimodal devices. You can get through the first three steps in about an hour. (These initial steps are the equivalent of the sample app in the Fire TV apps documentation.) The later steps of the implementation involve more extensive coding tasks, as you customize your Lambda and web player code to interact with your backend services and media. The advanced customization can take several weeks of development.

Next Steps

Get started on building out your video skill by going to Step 1: Create Your Video Skill and Lambda Function.