How to Build a Multimodal Alexa Skill

Why Build Multimodal Alexa Skills?

Table of Contents

Multimodal is the addition of other forms of communication, such as visual aids, to the voice experience that Alexa already provides. While Alexa is always voice first, adding visuals as a secondary mode can greatly enrich your customer’s experience for Alexa-enabled devices with screens. There are more than 100 million Alexa devices, including Echo devices such as the Echo Spot and Echo Show, FireTV, Fire tablets, and devices from other manufacturers with Alexa built-in like the Lenovo Smart Tab devices and LG TVs. Adding another mode to your Alexa Skill can enhance the experience for your customers on these devices in the following ways.

Increase the Level of Detail In Your Skill’s Response

Multimodal skills can provide more information through visual aids leading to a better customer experience. Voice is a fantastic way to interact with users because it’s intuitive and efficient; however, it’s not a great choice for presenting complex or large amounts of information at once. Instead, keep the voice response as succinct as possible while providing more detailed information in your visuals. For example, a weather skill that provides the daily forecast could tell you the temperature and precipitation, while on a screen, it could display a breakdown richer weather data such as humidity, wind speed, and temperature hour by hour. During the voice response, it could display a breakdown of detailed weather data such as: hour by hour temperature, humidity, and wind speed. All of the information provided is potentially important, but it would take far too long for Alexa to read out all of these stats without many of the customers getting frustrated and bored. If the requesting device does not have a screen, more information such as humidity and wind speed can be read out to the customer. In both cases, you would want to allow them to ask for more details. Even if a user has a multimodal device, we can’t assume they’re always looking at the screen. Later in this course, we will cover the specifics of how to determine if the requesting Alexa device has a screen.

Complementary Visual Aids

Multimodal devices provide an opportunity to add visual branding to your experience. You can add your own brand logo, color palette and styling to create a unique visual experience for your customers. For example, if you have a business skill, you might want to show a sales summary chart with your skill’s logo at the top and set colors representing metrics such as gains and losses. High quality visuals improves your brand image for your skill in addition to making a more well-rounded customer experience. For instance, it’s common for music skills to display visuals such as album art on devices with a screen while playing music. If the customer looks at the screen, they immediately know which service is playing the music as well as the artist, album, and song. Music skills run for long periods of time so having a visual aid can act as a constant, non-obtrusive reminder of the experience to the customer. High quality visuals also provide an important way to teach your customers how to interact with your skill. One way is to use hints, a short (one sentence) message at the bottom of the screen that describes how you can perform specific actions. Hint messages can be specific to the ongoing interaction that the customer is currently having. For example, in a weather skill you may want to provide a hint that describes how the customer can ask for a breakdown of today’s wind speed by hour. By using the screen to convey tangential information, you are not impeding the customer or overloading them with information by voice alone.

Rich Media Experiences

While you always need a voice experience, some skills are used primarily as a way to showcase media. Multimodal interfaces give you the ability to provide videos, images, and animations in conjunction with voice. Consider a skill which displays user photos, videos, and photo metadata from a user’s hosted account. On a device without a screen, you’d only have it read out the metadata or play the audio portion of a video file. However, on a device with a screen you’d be able to display, play, and search for visual content easily. Multimodal devices give you the opportunity to do this which does not exist otherwise.

Alexa Presentation Language

The Alexa Presentation Language (APL), is designed for rendering visuals across the ever increasing category of Alexa-enabled multimodal devices, allowing you to add graphics, images, slideshows, video, and animations to create a consistent visual experience for your skill. APL gives you reach with a design language that scales across the wide range of Alexa-enabled device types without individually targeting devices. In addition to screens, APL has variations to target non-screen multimodal devices such as the Echo dot with clock. As the types of devices grows, APL grows with it. By understanding and learning to use Alexa Presentation Language, you will have the knowledge and tools needed to reach the most multimodal devices.

APL skill flow

APL Skill Flow diagram

Alexa skills all follow a diagram similar to the one above.

  1. Customers talk to their Alexa-enabled device.
  2. The speech is sent to the Alexa service in the cloud to translate that voice into text, then text into an intention.
  3. The intent (with slots) goes to the correct backend which formulates an appropriate speakOutput with optional additional directives. These directives each tell the device to perform an action.
  4. This response through Alexa services to perform text to speech from the skill’s speakOutput and renders visuals on the device if the appropriate directive (RenderDocument) is sent.

Some events can trigger requests to the skill’s backend in response to user actions. These are similar to the handlers created for an intent or launch request.

APL documents

An APL document holds all of the definitions of UI elements and their visual hierarchy in a RenderDocument directive. It also holds styles associated with those components, and points where you can bind data. The visual elements on the screen are made up of APL components, and layouts. Components are the most basic building blocks of APL and represent small, self-contained UI elements which display on the viewportLayouts are used like components, but are not primitive elements. Rather, they combine other layouts and primitive components to create a UI pattern. You can create your own layouts or import pre-defined layouts from other sources. Every APL document has a "mainTemplate". This simply represents the start state of the APL screen to be rendered. Here is an example blank APL document with all of its top-level properties:

    "type": "APL",
    "version": "1.1",
    "settings": {},
    "theme": "dark",
    "import": [],
    "resources": [],
    "styles": {},
    "onMount": [],
    "graphics": {},
    "commands": {},
    "layouts": {},
    "mainTemplate": {
        "parameters": [
        "items": []

As you progress through the course, we’ll cover the parts of an APL document in more detail. Continue to section 2 where you’ll use the APL authoring tool to create a simple APL document for the Cake Time skill’s LaunchRequest.