Process Overview for Creating Video Skills for Multimodal Devices

Follow this step-by-step guide to enable an existing video skill to stream content to a multimodal device.

The benefits of the multimodal implementation over a traditional video skill implementation include top-level utterance support for natural invocation, more precise content selection, and a better overall user experience on the multimodal device. For the full collection of video features, see the Introduction.

Architecture overview

The traditional video skill architecture uses the Alexa.RemoteVideoPlayer API to send commands through the video provider's cloud service to a video client. For multimodal devices, the API sends commands through a channel established by Alexa, which allows for a new interaction model using the same API. The API continues to send search directives and play directives to the AWS Lambda function for the skill. However, unlike the traditional architecture, where the developer must forward directives from the Lambda function to a separate device, a video skill on a multimodal device sends its results directly back to Alexa, which drives the experience on the device.

Multimodal Device API Architecture

In simpler form, an end-to-end interaction for a video skill on a multimodal device includes the following sequence:

  • Customer speaks into multimodal device
  • Structured directives arrive in AWS Lambda
  • Lambda sends structured responses back to Alexa
  • Alexa displays search results on screen (if needed)
  • Alexa communicates with web player to render playback on screen (if needed)
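The middle of this sequence, where a directive arrives in AWS Lambda and a structured response goes back to Alexa, can be sketched as a small dispatcher. This is a minimal illustration, not the required implementation: the directive names come from the Alexa.RemoteVideoPlayer interface, but the response names, the payload fields (`searchResults`, `playableItems`), and the helper functions are placeholders assumed for the example. The Video Skill API reference defines the actual response shapes.

```python
import uuid


def make_event(name, payload):
    """Wrap a payload in an Alexa-style event envelope (header layout assumed)."""
    return {
        "event": {
            "header": {
                "namespace": "Alexa",
                "name": name,
                "messageId": str(uuid.uuid4()),
                "payloadVersion": "3",
            },
            "payload": payload,
        }
    }


def handle_search(payload):
    # Placeholder: a real skill would call its backend search service here.
    return make_event("Response", {"searchResults": []})


def handle_play(payload):
    # Placeholder: a real skill would resolve the requested content here.
    return make_event("Response", {"playableItems": []})


# Map (namespace, name) pairs from the directive header to handler functions.
HANDLERS = {
    ("Alexa.RemoteVideoPlayer", "SearchAndDisplayResults"): handle_search,
    ("Alexa.RemoteVideoPlayer", "SearchAndPlay"): handle_play,
}


def lambda_handler(event, context):
    """Entry point: route the incoming directive and return the result to Alexa."""
    directive = event["directive"]
    header = directive["header"]
    handler = HANDLERS.get((header["namespace"], header["name"]))
    if handler is None:
        # Unknown directive: report an error instead of failing silently.
        return make_event(
            "ErrorResponse",
            {"type": "INVALID_DIRECTIVE", "message": "Unsupported directive"},
        )
    return handler(directive.get("payload", {}))
```

The function returns its result to the caller rather than pushing it to a separate device, which mirrors the difference from the traditional architecture described earlier.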

Primary components

AWS Lambda

The AWS Lambda function configured in your skill definition is the interface between Alexa and your backend services. To support streaming content to Alexa endpoints such as a multimodal device, you need to implement a separate set of APIs in your AWS Lambda function.

Backend services

The experience you intend to deliver determines the breadth and depth of required supporting services. Common backend services that accompany a video skill include content metadata retrieval, category lookup, and several forms of search.
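As a rough illustration of the three services named above, the sketch below models metadata retrieval, category lookup, and a simple title search over an in-memory catalog. Everything in it is hypothetical: the `Video` type, the sample catalog, and the function names are assumptions made for the example, and a production backend would query a real content database or search index instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    video_id: str
    title: str
    category: str


# Hypothetical in-memory catalog standing in for a real content database.
CATALOG = [
    Video("v1", "Ocean Life", "Documentary"),
    Video("v2", "Space Race", "Documentary"),
    Video("v3", "City Chase", "Action"),
]


def get_metadata(video_id):
    """Content metadata retrieval: return the matching Video, or None."""
    return next((v for v in CATALOG if v.video_id == video_id), None)


def list_category(category):
    """Category lookup: case-insensitive match on the category name."""
    return [v for v in CATALOG if v.category.lower() == category.lower()]


def search_titles(query):
    """Keyword search: every word in the query must appear in the title."""
    words = query.lower().split()
    return [v for v in CATALOG if all(w in v.title.lower() for w in words)]
```

A Lambda handler for a search directive could call `search_titles` with the spoken query, while a handler for a play directive could call `get_metadata` to resolve a specific item.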

Web player

When a user plays content from your service, a URL that you provide in your skill definition opens your multimodal-optimized web player. The web player receives voice commands through a JavaScript library provided by Alexa and included in the HTML for your web player. AWS Lambda and the JavaScript library pass playback lifecycle events and the metadata of the current content to Alexa to provide users with the full set of video features.

The web player opens in a web browser that supports the following codecs, formats, and standards:

Video:

  • MP4 H.264
  • Widevine DRM Level 1
  • Encrypted Media Extensions (EME)
  • Media Source Extensions (MSE)

Audio:

  • MP4 with AAC
  • WebM with Vorbis
  • WebM with Opus
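A backend might use the list above to filter out renditions the browser cannot play before returning results. The following is a minimal sketch under stated assumptions: the codec identifiers (`h264`, `aac`, and so on) and the `is_supported` helper are hypothetical names chosen for the example, and the check conservatively treats anything not listed above as unsupported.

```python
# Container/codec pairs taken from the supported list; identifiers are assumed.
SUPPORTED_VIDEO = {("mp4", "h264")}
SUPPORTED_AUDIO = {("mp4", "aac"), ("webm", "vorbis"), ("webm", "opus")}


def is_supported(container, video_codec=None, audio_codec=None):
    """Return True if each codec given is listed for the container."""
    container = container.lower()
    if video_codec is not None:
        if (container, video_codec.lower()) not in SUPPORTED_VIDEO:
            return False
    if audio_codec is not None:
        if (container, audio_codec.lower()) not in SUPPORTED_AUDIO:
            return False
    return True
```

For example, `is_supported("mp4", "h264", "aac")` passes, while `is_supported("mp4", audio_codec="opus")` does not, because Opus appears in the list only with WebM.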

Steps to Create a Video Skill Web Player

At a high level, you will complete the following steps to integrate your content with a video skill for a multimodal device:

  • Register and configure a video skill
  • Develop an AWS Lambda function
  • Implement Video Skill API responses in AWS Lambda
  • Develop a web player
  • Integrate the web player with the Alexa Video JavaScript Library
  • Implement event handlers in the web player
  • Implement account linking (or use Login With Amazon)
  • Enable the skill in the Alexa app, and perform an end-to-end test

The documentation breaks out these steps in the following topics: