VSK for Multimodal Devices Overview

Multimodal devices, such as Echo Show, refer to devices with interfaces that offer both voice and screen-based experiences. Even though multimodal devices are app-less, you can deliver your video content by implementing the VSK and setting up a web app player. This guide describes the features for VSK on multimodal devices that you can build and those features that are built-in.

What Are Multimodal Devices?

Multimodal devices refer to devices with interfaces that offer both voice and screen-based experiences. Each input mode (voice or screen) changes the way a customer can interact with the experience, but the two modes work together fluidly.

Multimodal devices are "always on" devices, typically located in high-traffic areas such as the kitchen or the living room. They are app-less devices that leverage cloud-based skills (such as video skills from the Alexa Skills Kit) and generic on-device components.

A common scenario for using a video skill on a multimodal device might be a user, cooking in the kitchen, who says "Alexa, play Bosch" to her Echo Show sitting on an adjacent countertop.

Sample scenario for using a multimodal device

Supported multimodal devices currently include the Echo Show (1st and 2nd generation), Echo Show 5, Echo Show Mode on Fire tablets, and select third-party devices. Echo Spot is not supported.

Both Echo Show generations are rated above 4 stars, with more than 15,000 reviews. The Echo Show (Gen 2) launched in October 2018 and has a 10.1-inch HD screen. The Echo Show 5 launched in June 2019 and has a more compact 5.5-inch screen. Looking beyond Echo Show, given the current market direction, we anticipate thousands of Alexa-enabled devices with screens, manufactured by both Amazon and third-party companies.

Customers engage twice as much with Alexa skills on multimodal devices as on other Echo devices, and more than 75% of multimodal customers use video at least once a month. Overall, video and voice are a great match, and multimodal devices take the customer experience a step further in ways single-mode devices can't.

How Video Content Gets Delivered on App-less Multimodal Devices

Multimodal devices don't support apps. Instead, this is an "app-less" model that relies on both a Lambda that responds to requests from Alexa and a web app player that you provide. At a high level, here's how it works:

  • Customers authenticate with your video skill through Account Linking.
  • As a video partner, you use generic but customizable templates to provide Search and Browse experiences for users. The templates are rendered on-device, with content that you populate.
  • Playback takes place using a custom web player that you provide and own.

Prerequisite: Catalog Integration

As with Fire TV apps, your content must be catalog-integrated before you can incorporate the VSK on a multimodal device. Catalog integration refers to the process of describing your app's media according to Amazon's Catalog Data Format (CDF), which is an XML schema, and regularly uploading your catalog into an S3 bucket following the processes described in the catalog documentation.

Catalog integration is restricted to apps that have long-form movies or episodic TV shows significant enough to be matched to IMDb, Amazon Video, or Gracenote. If your catalog consists of content that might not be included in these sources, reach out to your Amazon Business contact for guidance.

If you don't qualify for catalog integration, then you cannot implement the VSK for a multimodal device.

Supported Countries

The VSK for multimodal devices is not supported in every country. If your country isn't supported, you cannot implement the VSK for your multimodal device.

Additionally, the AWS regions you must use for your Lambda function are strictly enforced rather than optional. For example, if you're in the UK, you must use the EU (Ireland) region in AWS for your Lambda function.

For a detailed list of countries and support, see Supported Countries for VSK for Multimodal Devices. See also AWS Regions and the VSK in that same topic.

What You'll Need

You will need the following to implement the VSK for your multimodal device:

  • AWS Account
  • Amazon Developer Account
  • Multimodal device (e.g., Echo Show)
  • Logo image
  • Background image
  • AWS Lambda
  • Web player optimized for the device form factor (see Web Player Requirements for more details)
  • Support for Account Linking if required to view your content
  • Catalog-integrated media and catalog name

In contrast to Fire TV, with multimodal devices you must provide a web player on your own web server, along with backend services. The experience you intend to deliver on multimodal devices determines the breadth and depth of the required supporting services. Common backend services that accompany the VSK include content metadata retrieval, category lookup, and several forms of search. These backend services are required for integration with Echo Show, and they are distinct from your Lambda: the backend serves the multimodal device and the web player.
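
The exact shape of these services is up to you. As a rough sketch, a minimal backend for a video skill might expose operations like the following (all names here are hypothetical, not part of the VSK):

```typescript
// Hypothetical backend interface for a video skill. All names here are
// illustrative; the VSK does not prescribe this shape. Your Lambda calls
// these services to answer directives, and your web player calls them to
// stream content.
interface ContentSummary {
  id: string;    // your catalog identifier
  title: string;
}

interface VideoBackend {
  // Content metadata retrieval: details and stream URI for one title.
  getContentMetadata(contentId: string): Promise<ContentSummary & { streamUri: string }>;

  // Category lookup: titles for a landing-page row such as "Trending Now".
  getCategory(categoryId: string): Promise<ContentSummary[]>;

  // Search: match a title or non-title query (genre, actor, media type).
  search(query: string): Promise<ContentSummary[]>;
}
```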

High-level Workflow

At a high level, to integrate the VSK for your multimodal device, you first create a video skill in the Alexa Developer Console and associate it with a Lambda function on AWS. When users interact with your skill through voice, Alexa voice services in the cloud convert the user's commands into JSON objects, called directives.

Alexa sends these directives to your Lambda function. Your Lambda function inspects the request and then usually interacts with a backend service (doing lookups, queries, etc.) to retrieve the needed information. The needed information might be the URI for the requested content, or available titles matching the request. Once your Lambda retrieves this information, Lambda responds back to Alexa with the information.
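
As a sketch of this routing, the Lambda entry point typically branches on the directive name in the request header. The envelope below is simplified and the handler bodies are placeholders; the Search section of this guide names the GetDisplayableItemsMetadata directive, and the other directive names follow the same pattern in the VSK API reference:

```typescript
// Minimal sketch of a VSK Lambda entry point. The envelope here is
// simplified; consult the VSK API reference for the full header and
// payload schemas. The handle* functions stand in for your own logic.
interface VskRequest {
  directive: {
    header: { name: string; correlationToken?: string };
    payload: Record<string, unknown>;
  };
}

export async function handler(event: VskRequest): Promise<unknown> {
  switch (event.directive.header.name) {
    case "GetPlayableItems":            // quick play: resolve an utterance to content
      return handleGetPlayableItems(event.directive.payload);
    case "GetDisplayableItems":         // search/browse: resolve a query to results
      return handleGetDisplayableItems(event.directive.payload);
    case "GetDisplayableItemsMetadata": // metadata used to render result templates
      return handleGetDisplayableItemsMetadata(event.directive.payload);
    default:
      throw new Error(`Unhandled directive: ${event.directive.header.name}`);
  }
}

// Placeholder implementations; each would query your backend services.
async function handleGetPlayableItems(_p: Record<string, unknown>) { return { payload: {} }; }
async function handleGetDisplayableItems(_p: Record<string, unknown>) { return { payload: {} }; }
async function handleGetDisplayableItemsMetadata(_p: Record<string, unknown>) { return { payload: {} }; }
```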

For a more detailed workflow, see Architecture Overview.

Capabilities Provided through the VSK for Multimodal Devices

Integrating the VSK for a multimodal device gives customers the following capabilities:

Feature | Sample Utterance
Login and Skill Enablement | No utterances needed. When users say "Alexa, open <video provider>" or "Alexa, play <title> on <video provider>," Alexa prompts customers to enable the video skill.
Quick Play | "Alexa, play <TV show> on <video provider>", "Alexa, watch <TV show> on <video provider>"
Channel Navigation | "Alexa, tune to <channel>"
Playback Controls | "Alexa, pause", "Alexa, fast forward"
Search | "Alexa, find comedies on <video provider>"
Browse | "Alexa, show me videos", "Alexa, open <video provider>"
Video Home | "Alexa, show me videos", "Alexa, go to Video Home", "Alexa, Video Home."

Login and Skill Enablement

Customers must enable your video skill on their multimodal device before they can access your video content. As a video partner, you choose whether account linking is required for your customers to enable your skill. When you require account linking, the customer must log in to your service to access your content.

The VSK uses OAuth 2.0 to enable account linking on multimodal devices. You provide an authorization URL that Alexa surfaces as a web view for customers to log in. For information about OAuth 2.0, see Understand Account Linking.

You can add to your login page an option for customers to start a new subscription or create an account. (If you do this, you should ensure that this upsell is not made available on iOS.) The customer can disable your video skill at any time in the Alexa app.

On the device, customers can enable a video skill in the following ways:

  • By saying, "Alexa, open <video provider>."
  • By saying, "Alexa, play <title> on <video provider>." When the customer explicitly targets the video provider in a search, play, or channel navigation utterance, Alexa prompts customers to enable the video skill.
  • By tapping the video provider's icon on the Video Home page of the device.

In the Alexa app, customers can enable video skills through the Music, Video & Books section or in the Alexa Skills Store. See Enable Alexa Skills for details. If account linking is required, the app prompts the customer to log in. If account linking is not required, the app asks the customer to confirm they want to enable the skill.

The customer needs to sign in only once to enable the video skill on all multimodal devices in their Alexa account.

Quick Play

Customers can play content from a provider by title or by non-title attributes (genre, actor, media type, etc.), using either explicit or implicit requests. Explicit means the customer's request includes the provider; implicit requests do not.

Playback takes place on a web player you control. If a customer has your skill enabled on a multimodal device, Alexa makes a call to your AWS Lambda with a play request. You receive play requests for the following scenarios:

  • Explicit play requests: "Alexa, play Bosch on Prime Video."
  • Implicit play requests when the video provider is active: "Alexa, play Bosch."
  • Implicit play requests when the video provider is exclusive (available only through one provider): "Alexa, play Bosch."

If the customer makes an implicit play request (not referencing the provider name), and the title is available on multiple providers, Alexa asks the customer to disambiguate the request. Alexa responds, "I can play that on <video provider 1> or <video provider 2>, which would you like?"
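
For illustration, a quick-play handler might resolve the spoken title against your catalog-backed search and return an identifier your web player can load. The payload and response field names below (entities, mediaIdentifier) are simplified assumptions rather than the exact VSK schema:

```typescript
// Simplified sketch of resolving a quick-play request. The payload and
// response shapes are assumptions; see the VSK directive reference for
// the real contract.
const backend = {
  async search(query: string) {
    return [{ id: "bosch-s1", title: query }]; // stub for your search service
  },
};

async function handlePlayRequest(payload: {
  entities: { type: string; value: string }[];
}) {
  // Pull the requested title out of the resolved entities.
  const title = payload.entities.find((e) => e.type === "Video")?.value;
  if (!title) return { errorType: "CONTENT_NOT_FOUND" }; // assumed error shape

  const matches = await backend.search(title);
  if (matches.length === 0) return { errorType: "CONTENT_NOT_FOUND" };

  // Return the identifier your web player will use to start playback.
  return { payload: { mediaIdentifier: { id: matches[0].id } } };
}
```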

Channel Navigation

Customers can navigate to the different channels that a video provider offers with explicit or implicit utterances. Customers can navigate to a linear channel you offer, even if your offering includes only one channel.

Playback takes place on a web player you control. If a customer has your skill enabled on a multimodal device, Alexa makes a call to your AWS Lambda with a channel navigation request. You receive channel navigation requests for the following scenarios (assuming the content is in your catalog):

  • Explicit requests for channels: "Alexa, tune to <channel> on <video provider>."
  • Implicit requests for channels when the video provider is active: "Alexa, tune to <channel>."

When more than one video provider can tune to a channel, Alexa disambiguates between providers. For example, Alexa responds: "I can play that on <video provider 1> or <video provider 2>. Which would you like?" Alexa makes a call to your AWS Lambda when the customer chooses your skill because of the disambiguation prompt.
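
In practice, resolving a channel request mostly amounts to matching the spoken channel against your lineup. The lineup data and matching rule below are hypothetical:

```typescript
// Hypothetical channel lineup and lookup for channel-navigation requests.
// Real directives carry channel metadata (such as call signs) in the payload.
const channelLineup = [
  { callSign: "PBS", name: "Public Broadcasting Service", streamId: "pbs-live" },
  { callSign: "CNN", name: "Cable News Network", streamId: "cnn-live" },
];

function resolveChannel(spoken: string) {
  const wanted = spoken.trim().toLowerCase();
  return channelLineup.find(
    (c) => c.callSign.toLowerCase() === wanted || c.name.toLowerCase() === wanted
  );
}

console.log(resolveChannel("PBS")?.streamId); // "pbs-live"
```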

Playback Controls

Customers can use both voice and touch to control playback. The following table shows all the playback controls Alexa supports on multimodal devices.

Playback Control | User Experience
"Alexa, pause," "Alexa, stop" | Pauses playback
"Alexa, play," "Alexa, resume" | Plays or resumes playback
"Alexa, fast-forward" | Fast-forwards 10 seconds
"Alexa, rewind" | Rewinds 10 seconds
"Alexa, fast-forward [duration]" | Fast-forwards by the specified amount. For example, "Alexa, fast-forward 2 minutes."
"Alexa, rewind [duration]" | Rewinds by the specified amount. For example, "Alexa, rewind 2 minutes."
"Alexa, next" | Video provider decides the title that is played
"Alexa, previous" | Video provider decides the title that is played
"Alexa, closed captions on", "Alexa, closed captions off" | Turns closed captions on and off

Customers can also initiate these commands through on-screen controls in the web player.

Search

Customers can search for content from a video provider, by title or by non-title attributes (media type, genre, actor, etc.), using either explicit or implicit requests.

If a customer has your skill enabled on a multimodal device, Alexa makes a call to your AWS Lambda with a search request. You receive search requests for the following scenarios (assuming the content is available in your catalog):

  • Explicit search requests for titles: "Alexa, search for Bosch on Prime Video."
  • Explicit search requests without titles: "Alexa, show me TV shows on Prime Video."
  • Implicit search requests with titles when the video provider is active: "Alexa, show me Bosch."
  • Implicit search requests without titles when the video provider is active: "Alexa, show me TV shows."
  • Exclusive search requests for titles (available only through one provider): "Alexa, show me Bosch."

When more than one video provider offers the requested title or non-title, Alexa disambiguates between providers. For example, Alexa responds, "I can find that on <video provider 1> or <video provider 2>, which would you like?" Alexa makes a call to your AWS Lambda when the customer chooses your skill as a result of the disambiguation prompt.

Additional considerations for search:

  • For each search query you receive, you must return a list of titles (with their respective metadata) in the order in which those titles should be displayed to users; a sketch follows this list. The titles you return are rendered in a Search Results template, which you can customize with your logo. For an example of the expected response, see the GetDisplayableItemsMetadata Directives.
  • Customers can select search results by means of voice and touch. The search results template also offers numbered results, so that the customer can ask Alexa to "Play number 3" from the list of results.
  • Alexa does not show results that combine titles from multiple video providers. If more than one video provider can offer search results in one category (for example, "Alexa, show me movies"), Alexa disambiguates across providers and shows search results from the provider that the customer chooses.
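
To make "an ordered list of titles with metadata" concrete, the object below sketches a plausible response body. The field names are placeholders; the actual schema is defined by the GetDisplayableItemsMetadata reference:

```typescript
// Illustrative search response: items in display order, with the metadata
// the Search Results template needs. Field names are placeholders; check
// the GetDisplayableItemsMetadata reference for the actual schema.
const searchResponse = {
  payload: {
    searchResults: [
      {
        id: "bosch-s1",                                        // your content identifier
        name: "Bosch",
        contentType: "TV_SERIES",
        thumbnailUri: "https://example.com/images/bosch.jpg",  // hypothetical URL
      },
      // ...more items, in the order they should be displayed
    ],
  },
};
```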

Browse

Customers can browse your video content with utterances such as "Alexa, open <video provider>", or by tapping the video provider's icon on the Video Home page. Both actions take customers to your landing page.

Browsing facilitates content discovery. Although you use templates to render this landing page, you control its content and branding. Customers can navigate the landing page through voice or touch. With voice, they can say "Alexa, scroll down." Or they can make selections such as "Alexa, go to Trending now."

Note the following about the landing page:

  • This landing page uses a voice- and touch-optimized template that you can configure with your logo, a hero title, and a list of categories of your choosing.
  • This landing page also includes a More Categories section that the customer can access by voice or touch. More Categories displays tiles of video categories that you provide. Examples of categories you can provide include Recently Added, Watch List, Trending Now, Because you watched <title>, seasonal content, etc.
  • When the customer selects a category, Alexa displays the individual titles included in that category, in the same template Alexa uses for Search Results. (A sketch of this category lookup follows the list.)
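
To make the category flow concrete, the sketch below pairs a hypothetical category list with a lookup that returns the titles Alexa would render in the Search Results template; all identifiers are illustrative:

```typescript
// Hypothetical landing-page categories and a category-to-titles lookup.
// Alexa renders whatever categories and titles you return through the
// browse directives; these names are illustrative only.
const categories = [
  { id: "recently-added", displayName: "Recently Added" },
  { id: "trending-now", displayName: "Trending Now" },
  { id: "because-you-watched-bosch", displayName: "Because you watched Bosch" },
];

const titlesByCategory: Record<string, { id: string; name: string }[]> = {
  "trending-now": [
    { id: "bosch-s1", name: "Bosch" },
    { id: "goliath-s1", name: "Goliath" },
  ],
};

function getCategoryTitles(categoryId: string) {
  return titlesByCategory[categoryId] ?? [];
}

console.log(getCategoryTitles("trending-now").map((t) => t.name)); // ["Bosch", "Goliath"]
```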

The following shows an example landing page for a video skill:

Sample landing page

Video Home

Video Home is a feature automatically built into multimodal devices. Customers bring up the Video Home page with utterances such as "Alexa, show me videos" or "Alexa, go to Video Home." The Video Home page shows a list of all video skills on the device.

The Video Home page has a tile for each video provider, which responds to both voice and touch interactions. If the customer selects a video provider's skill that he or she has enabled, Alexa launches that provider's landing page. If the customer selects a disabled skill, Alexa kicks off the skill-enablement process.

PIN-Protected Playback

PIN-Protected Playback is also a feature automatically built into multimodal devices. PIN-protected playback allows customers to set a personal identification number (PIN) to restrict playback of content from any video provider.

Customers can use this PIN to set parental controls. If the customer sets a PIN for any video provider, Alexa prompts the customer to enter a PIN every time he or she wants to access content from that provider.

Estimated Development Time

It can take anywhere from several weeks to several months to fully integrate the VSK for a multimodal device. Assuming that your content is already catalog-integrated, the bulk of the development work for the VSK involves creating logic in your Lambda function to handle the incoming directives from Alexa.

The process for integrating the VSK for your multimodal device app is broken out into a series of steps. See Implementation Steps in "Process Overview for Implementing VSK on Multimodal Devices" for details.

You can complete the initial integration steps (Step 1: Create Your Video Skill and Lambda Function and Step 2: Enable your Video Skill on an Echo Device and Test) in about two hours. These steps let you see the directives sent from Alexa to your Lambda function in the cloud, which will give you a better sense of the scope of the implementation.

Primary Components

The primary components of the multimodal implementation are as follows:

AWS Lambda
The AWS Lambda function configured in your skill definition is the interface between Alexa and your backend services. To support streaming content to Alexa endpoints such as a multimodal device, you need to implement a separate set of APIs in your AWS Lambda function.
Backend services
The experience you intend to deliver determines the breadth and depth of required supporting services. Common backend services that accompany a video skill include content metadata retrieval, category lookup, and several forms of search.
Web player
When a user plays content from your service, a URL that you provide in your skill definition opens your multimodal-optimized web player. The web player receives voice commands through a JavaScript library provided by Alexa and included in the HTML for your web player. AWS Lambda and the JavaScript library pass playback lifecycle events and the metadata of the current content to Alexa to provide users with the full set of video features.
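
As an illustration of this division of labor, the sketch below maps voice-initiated commands onto standard HTML5 video operations. The Alexa-provided JavaScript library defines the real initialization and event contract; the handler shape here is an assumed stand-in for it:

```typescript
// Sketch of mapping voice commands to an HTML5 video element. The actual
// Alexa JavaScript library defines its own initialization and handler
// contract; this handler object is an illustrative stand-in.
function wireVoiceControls(video: HTMLVideoElement) {
  return {
    onResume: () => void video.play(),
    onPause: () => video.pause(),
    // "Alexa, fast-forward 2 minutes" arrives as a relative seek.
    onAdjustSeekPosition: (deltaMs: number) => {
      video.currentTime = Math.max(0, video.currentTime + deltaMs / 1000);
    },
    onSetSeekPosition: (positionMs: number) => {
      video.currentTime = positionMs / 1000;
    },
  };
}

// Usage: wire the handlers to your player's <video> element at startup.
const handlers = wireVoiceControls(document.querySelector("video")!);
handlers.onAdjustSeekPosition(120_000); // e.g., fast-forward 2 minutes
```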

Web Player Requirements

The web player must open in a web browser and support the following codecs, formats, and standards:

Video requirements:

  • HLS/MPEG-DASH
  • MP4 H.264
  • Widevine DRM Level 1
  • Encrypted Media Extensions (EME)
  • Media Source Extensions (MSE)

Audio requirements:

  • MP4 with AAC
  • WebM with Vorbis
  • WebM with Opus
  • AAC-LC is supported, but AAC-SBR is not. Media should follow the audio specifications defined for Chromium to ensure proper playback; for more information, see Audio/Video on the Chromium Projects site. (A capability-check sketch follows this list.)
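
Because playback depends on this stack, your web player can verify support up front using standard browser APIs, as in the following sketch (this confirms that Widevine is available, though not its security level):

```typescript
// Startup capability check for a web player: MSE for adaptive streaming
// and EME with Widevine for DRM. These are standard browser APIs.
async function checkPlaybackSupport(): Promise<boolean> {
  const mseOk =
    "MediaSource" in window &&
    MediaSource.isTypeSupported('video/mp4; codecs="avc1.42E01E, mp4a.40.2"'); // H.264 + AAC-LC

  try {
    await navigator.requestMediaKeySystemAccess("com.widevine.alpha", [
      {
        initDataTypes: ["cenc"],
        videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.42E01E"' }],
        audioCapabilities: [{ contentType: 'audio/mp4; codecs="mp4a.40.2"' }],
      },
    ]);
    return mseOk;
  } catch {
    return false; // Widevine not available on this device/browser
  }
}
```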

Image Asset Requirements

Background Image

  • Aspect ratio: 16/9, +/- 0.05
  • Minimum size: 512px wide x 288px high
  • Maximum size: 1280px wide x 720px high

The Background image appears on the Video Home screen (say "Alexa, go to Video Home" to view this screen). Video Home shows the various video skills represented by their Background images. In the following screenshot, "Video Skill Logo" represents one of these Background images.

Example of Background image

Logo Image

  • Aspect ratio: 1.0 - 5.0
  • Minimum size: 48px wide x 48px high
  • Maximum size: 1280px wide x 720px high

The Logo image appears in the upper-right corner of the search and browse templates (say "Alexa, open Prime Video" to open the category template). Consider using a transparent background in your Logo image. In the following screenshot, the Logo image says "Prime Video."

Example of Logo image

Video Skill API versus Custom Skills with Screen Displays

The Video Skill API is intended for video providers whose catalog content is often in IMDb (or for device manufacturers making their devices voice interactive). The implementation involves handling directives from Alexa with Lambda and your own video service, so that you can support requests such as "Alexa, play Interstellar."

In contrast, if you just want to provide accompanying visuals for your Alexa skill (for example, images, short video clips, or text displayed on a screen), create a custom skill (rather than a video skill) and render the visual experiences on display templates using the Alexa Presentation Language (APL). For example, you might want to show text or images related to a quiz skill on an Echo Show screen. If that's what you're trying to build, rather than the more involved interactive voice experience with your video content that the Video Skill API enables, see Create Skills for Alexa-Enabled Devices with a Screen. The implementation process for custom skills with screen displays is simpler and does not require extensive developer expertise.

Next Steps

To get started building a video skill for multimodal devices, see Process Overview for Creating Video Skills for Multimodal Devices.