How to Choose the Right Multimodal Technology for your Alexa Skill

Joe Muoio Aug 20, 2020
Design Build Smart Screens Multimodal

There are many types of Alexa-enabled devices beyond speakers. Alexa is available on devices that give customers multiple modalities to interact with, including devices with screens from Amazon, like the FireTV and Echo Show, and devices from other manufacturers, such as some LG and Samsung TVs, or Portal from Facebook. When customers use Alexa, they expect the full range of capabilities from their device, with or without a screen. For example, the Echo Dot with Clock has a seven-segment character display, letting customers view the time or simple text at a glance. Alexa skill developers can use this functionality in a number of ways to make their skills more engaging. This post goes over the different ways you can enhance your skill with the multimodal features available in the Alexa Skills Kit.

Reach All Multimodal Devices with Alexa Presentation Language

Alexa Presentation Language (APL) is the best way to create a custom visual experience that reaches all customers on all Alexa-enabled devices. Customers prefer Alexa experiences that are quick to launch, glanceable, and available with a similar level of visual and audio fidelity on every Alexa device they use. APL was built from the ground up with this in mind. The open source, C++ based library renders at near-native speed on device and enables APL visuals to render the same way across all Alexa-enabled devices. This also lets device makers integrate APL consistently and without much additional cost. For instance, Portal from Facebook will render your visuals the same way as an Echo Show 8. This matters because, as a developer, you do not need to build for each Alexa-enabled device individually to develop a multimodal Alexa skill. To handle different screen resolutions, you can create responsive visuals either with viewport profiles, when defining your own visual style, or with our responsive components and templates. The rules for how these behave are the same across all devices.
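As a rough illustration of the viewport-profile approach, here is a minimal Python sketch that builds an `Alexa.Presentation.APL.RenderDocument` directive as a plain dictionary. The helper names, the `groceries` data source, the token, and the `alexa-viewport-profiles` package version are assumptions made for this example, so check them against the APL reference before relying on them.

```python
import json


def build_grocery_document():
    """Return an APL document (a plain dict) that adapts to the viewport."""
    return {
        "type": "APL",
        "version": "1.4",
        # Assumed package version; it provides resources such as @hubRoundSmall.
        "import": [{"name": "alexa-viewport-profiles", "version": "1.1.0"}],
        "mainTemplate": {
            "parameters": ["payload"],
            # APL inflates the first item whose "when" evaluates to true.
            "items": [
                {
                    # Small round hubs: show only the title, centered.
                    "when": "${@viewportProfile == @hubRoundSmall}",
                    "type": "Text",
                    "text": "${payload.groceries.title}",
                    "textAlign": "center",
                    "textAlignVertical": "center",
                },
                {
                    # Every other viewport: title plus a scrolling list.
                    "type": "Container",
                    "width": "100vw",
                    "height": "100vh",
                    "items": [
                        {"type": "Text", "text": "${payload.groceries.title}"},
                        {
                            "type": "Sequence",
                            "grow": 1,
                            "data": "${payload.groceries.items}",
                            "items": [{"type": "Text", "text": "${data}"}],
                        },
                    ],
                },
            ],
        },
    }


def build_render_directive(grocery_items, token="groceryListToken"):
    """Wrap the document and its data source in a RenderDocument directive."""
    return {
        "type": "Alexa.Presentation.APL.RenderDocument",
        "token": token,  # hypothetical token name
        "document": build_grocery_document(),
        "datasources": {
            "groceries": {"title": "Grocery list", "items": list(grocery_items)}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_render_directive(["Milk", "Eggs", "Bread"]), indent=2))
```

The same document is sent to every device; the `when` expression, not your skill code, decides which layout is inflated.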

An image showing a text list of grocery items with a gray or green background across a large variety of Alexa enabled devices.

Some examples of an APL List Template across different devices (above)

APL is very rich in terms of what you can do. The basics of displaying text, images, and video are covered, as is applying animations to these components via commands. You can display vector images and animations with Alexa Vector Graphics (AVG, a subset of SVG). In addition, there are components that help lay out other components, such as the ScrollView or Pager, giving you a lot of flexibility in visual design. APL gives you the tools to decouple your presentation from your content with data sources, and it includes additional information, like time and math primitives, in the data binding context. Part of the design of APL is deep integration with voice and speech technology. There is built-in voice navigation, along with features like speech/text synchronization. APL defines transformers that can do tasks such as converting on-screen text into speech, further deepening the integration with Alexa-enabled devices. APL 1.4 has brought more to this space with more responsive components and templates, an improved AVG format, new components and commands, and support for user gestures, to name a few of the new features. It has been nearly a year since APL became generally available, and I am excited to see how much richer this product has become. It is only getting better.
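To make the data source and transformer ideas more concrete, here is a small Python sketch, again using plain dictionaries. It assumes a hypothetical `factData` data source and component id: the `textToSpeech` transformer turns the text into speech, the `Text` component binds to both, and a `SpeakItem` command reads the component aloud while highlighting it. Treat the property names as a best-effort reading of the APL documentation, not a verified implementation.

```python
def build_fact_datasources(fact_text):
    """Data source whose transformer converts plain text into speech."""
    return {
        "factData": {  # hypothetical data source name
            "type": "object",
            "properties": {"factText": fact_text},
            "transformers": [
                {
                    "inputPath": "factText",
                    "outputName": "factSpeech",
                    "transformer": "textToSpeech",
                }
            ],
        }
    }


# The document only describes presentation; the content arrives via "payload".
FACT_DOCUMENT = {
    "type": "APL",
    "version": "1.4",
    "mainTemplate": {
        "parameters": ["payload"],
        "items": [
            {
                "type": "Text",
                "id": "factText",  # hypothetical component id
                "text": "${payload.factData.properties.factText}",
                "speech": "${payload.factData.properties.factSpeech}",
            }
        ],
    },
}


def build_speak_item_directive(token, component_id="factText"):
    """ExecuteCommands directive that speaks the component line by line."""
    return {
        "type": "Alexa.Presentation.APL.ExecuteCommands",
        "token": token,  # must match the token used when rendering the document
        "commands": [
            {
                "type": "SpeakItem",
                "componentId": component_id,
                "highlightMode": "line",
            }
        ],
    }
```

Because the content lives in the data source, you can change the fact text on each request without touching the document itself.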

A showcase of different skills using APL visuals.

Some examples of APL visuals (above)

APL has the widest reach and the fastest rendering speed, and it keeps growing richer. If you have never used APL before or want a refresher, we have a step-by-step tutorial explaining the basic concepts behind creating your own APL visuals, which I highly encourage you to follow. We also have some great code samples, like the pet tales skill and the responsive layouts example skill. If you are making an Alexa skill with visuals to complement the voice experience, APL is a great tool for the job.

NOTE: Display Templates were the first way to create visuals on Alexa. The seven templates are very rigid and allow little customization of layout and presentation. While they are supported on some devices (Echo Show, Echo Spot, FireTV, and some Fire Tablets), APL gives you much greater flexibility and control over the design and works on all Alexa-enabled devices with screens. If you like the look of a display template and want to use it in your skill, or if you are migrating from Display Templates to APL, I encourage you to use a responsive template or start from one of the pre-defined starter templates in the APL authoring tool.

An image of the APL authoring tool starting page with the title: Choose how you'd like to create visuals for your skill. The template options listed from left to right are: Image Display Sample, Long Text Sample, Short Text Sample, Image Right Detail Sample, Image Left Detail Sample, Image Forward List Sample, Text Forward List Sample, Start from Scratch, Upload Code

APL Authoring tool starting page (above) has customizable samples to start from.

Reach Character Only Displays with APL for Text

On that note, APL applies to more than just screen-based visuals. The Echo Dot with Clock is also a multimodal device: it has a seven-segment alphanumeric character display. You can use the same APL paradigm to reach these devices as you would for screen devices, but note that the supported components differ. If your skill relies heavily on numbers or time, consider developing for this device as well. You can display text, but the character set is limited by the segments the device supports. Treat this as a useful addition to your APL visuals. Check out the sample in the Alexa cookbook to get started.
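For a sense of how similar the paradigm is, here is a minimal Python sketch of a render directive for a character display. The directive type, the `APLT` document type, and the `FOUR_CHARACTER_CLOCK` target profile reflect my understanding of the APL for Text documentation; the helper name and token are made up for this example, so verify the details before shipping.

```python
def build_character_display_directive(text, token="clockToken"):
    """Build an APL for Text (APLT) RenderDocument directive as a dict.

    The device decides how to render characters it cannot show on its
    segments, so keep the text short and mostly numeric.
    """
    return {
        "type": "Alexa.Presentation.APLT.RenderDocument",
        "token": token,  # hypothetical token name
        "targetProfile": "FOUR_CHARACTER_CLOCK",  # assumed profile name
        "document": {
            "type": "APLT",
            "version": "1.0",
            "mainTemplate": {
                "parameters": ["payload"],
                "items": [
                    {"type": "Text", "text": "${payload.display.text}"}
                ],
            },
        },
        "datasources": {"display": {"text": text}},
    }


if __name__ == "__main__":
    # Example: show a countdown value on the clock face.
    print(build_character_display_directive("1:30"))
```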

An image of an Echo Dot with Clock device.

NOTE: APL for Audio (beta) also follows the same APL paradigm, but is used for creating rich, dynamic audio. Learn more at the announcement blog post.

Build Immersive Games with Alexa Web API for Games

The Alexa Web API for Games, fresh out of developer preview, allows you to create multimodal Alexa games using existing web technology. It consists of a local JS API which is loaded on device, the ability to launch a web application from your Alexa skill backend, and a communication bridge between the locally running web application and the backend Alexa skill handler code. This enables new forms of game experiences which are not possible with APL alone. For graphics options, you can use WebGL with 3D images and animations, apply custom shaders, and use HTML elements (with CSS) to create the layout of your choice. On the game play side, you have access to local touch handlers and web sockets, letting you create non-conversational experiences. You can even bring your favorite web framework like Three.js or Vue.js to help create your experience.
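On the skill backend, launching the web application and talking to it both come down to returning directives. The Python sketch below builds the two directives as plain dictionaries. The helper names, the example URL, and the message payload are hypothetical, and the field names follow my reading of the `Alexa.Presentation.HTML` interface, so confirm them against the Web API for Games reference.

```python
def build_start_directive(game_url, initial_data=None, timeout_seconds=300):
    """Directive that launches the web application on the device."""
    # Assumption: the web app must be served over HTTPS.
    if not game_url.startswith("https://"):
        raise ValueError("game_url must be an HTTPS URL")
    return {
        "type": "Alexa.Presentation.HTML.Start",
        "request": {"uri": game_url, "method": "GET"},
        "configuration": {"timeoutInSeconds": timeout_seconds},
        # Startup data handed to the web app when it loads.
        "data": dict(initial_data or {}),
    }


def build_handle_message_directive(message):
    """Directive that sends a message from the skill to the running web app."""
    return {
        "type": "Alexa.Presentation.HTML.HandleMessage",
        "message": message,
    }


if __name__ == "__main__":
    # Hypothetical URL and payloads, for illustration only.
    start = build_start_directive(
        "https://example.com/game/index.html", {"level": 1}
    )
    update = build_handle_message_directive({"action": "spawnEnemy", "count": 3})
    print(start)
    print(update)
```

Messages in the other direction, from the web application to the skill, arrive at your backend as requests, which is where your existing handler code takes over.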

Because the visual content for an Alexa Web API game is served from your web endpoint, Alexa endpoints must fetch the voice and visuals separately. On average, it takes about 2 seconds to start seeing visuals when content is served from a content delivery network, compared to about half a second to render visuals using APL. This tradeoff is worth it for an immersive game where the customer is in a long-running experience. Customers have more patience for load times in a game, but not in other kinds of skills where the visual is complementary to the conversation, such as a trivia or weather skill. Beyond the slower start, the Alexa Web API for Games reaches only the following devices:

  • FireTV Stick (3rd Gen)
  • FireTV Stick 4K
  • FireTV Cube (all generations)
  • Echo Show family of devices

Customers on devices not in the list will not be able to see the visual experience you built. That said, APL cannot match the richness and the ability to create novel experiences that the Alexa Web API for Games offers, which may lead you to choose it for your next Alexa game skill. However, if what you want to create can be done using APL alone, consider using APL for latency and reach reasons, even if it is a game. You can also have the best of both worlds by adding APL visuals as a fallback, so that customers on Alexa-enabled screen devices outside this list still get a visual experience.
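One way to apply this fallback idea is to inspect the interfaces the requesting device reports and pick the richest option available. The Python sketch below works on the raw request JSON as a dictionary. The function names and the returned labels are invented for this example; the interface names are the ones I understand the device to report under `context.System.device.supportedInterfaces`.

```python
def get_supported_interfaces(request_envelope):
    """Return the supportedInterfaces dict from a raw request envelope."""
    return (
        request_envelope.get("context", {})
        .get("System", {})
        .get("device", {})
        .get("supportedInterfaces", {})
    )


def choose_visual_technology(request_envelope, is_game=False):
    """Pick the richest visual technology the requesting device supports."""
    interfaces = get_supported_interfaces(request_envelope)
    if is_game and "Alexa.Presentation.HTML" in interfaces:
        return "web_api_for_games"
    if "Alexa.Presentation.APL" in interfaces:
        return "apl"  # widest reach on screen devices
    if "Alexa.Presentation.APLT" in interfaces:
        return "apl_for_text"  # character displays
    return "voice_only"


if __name__ == "__main__":
    # Hypothetical request from a screen device without Web API support.
    envelope = {
        "context": {
            "System": {
                "device": {
                    "supportedInterfaces": {"Alexa.Presentation.APL": {}}
                }
            }
        }
    }
    print(choose_visual_technology(envelope, is_game=True))
```

With a helper like this, a game skill can launch the web application where it is supported and fall back to APL, or to voice alone, everywhere else.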


Technology | Supported Devices | Capabilities | Recommended Use
---------- | ----------------- | ------------ | ---------------
APL | All Alexa-enabled screen devices | Text, Images, Videos, Vector Graphics, Animations, speech/visual integrations | Complementary visuals for all skills
APL for character displays | Echo Dot with Clock | Numbers and some text characters | Supplementary experience for number/time based skills
Alexa Web API for Games | Echo Show Family, Select Fire TV devices | HTML, CSS, JS, WebGL | Immersive game skills
Display Templates | Echo Show Family, Echo Spot, Fire TV devices | Rigid, pre-made templates that do not reach all devices | Consider APL instead


For most Alexa skill experiences, APL is the best technology to use for complementary visuals. If you are primarily displaying numbers, consider also adding support for APL for character displays. And if you want to create immersive games with more than just complementary visuals using web technologies, use the Alexa Web API for Games and consider using APL as a fallback to reach even more customers. Let me know what you are making @JoeMoCode on Twitter.

