Voice Expert Q&A: How Discovery Designs Multimodal Alexa Skills

Cami Williams May 20, 2019

Customers embrace voice because it’s simple, natural, and conversational. Adding visual elements and touch to deliver multimodal, voice-first experiences can make your Alexa skill even more engaging and easy to use. Developers are already building multimodal skills using the Alexa Presentation Language (APL), creating immersive visuals with information that complements the voice experience.

We had the opportunity to speak with one voice leader—Tim McElreath, director of technology for mobile and emerging platforms at Discovery, Inc.—to learn more about how Discovery is leveraging voice, explore his team’s process for building multimodal skills, and dive deep into their Food Network Alexa skill.

Senior Solutions Architect Akersh Srivastava and I sat down with Tim during Alexa Live, a free online conference for voice developers. Below is a recap of our discussion, which has been edited for brevity and clarity. You can also watch the full 45-minute interview below.

Akersh Srivastava: Tim, tell us about what you do. If you had to pitch yourself to the community, how would you do it?

Tim McElreath: I come from both a design and an engineering background. I'm a graduate of an art and design school, but I also grew up around computers. The way I see myself is trying to bridge the gap between product design and engineering, and developing a user experience with a focus on how users really want to interact with digital interfaces. Now, at Discovery, I work very closely with the Food Network and HGTV. We also have brands like Motor Trend and Animal Planet, brands that help people build their lives around the things that are important to them, like how they eat, how they create their homes, and their pastimes. There's a lot of content and experiences to play with.

Akersh: How did you discover Alexa and the Alexa community?

Tim: I started working with Alexa back in 2016, so fairly early on. We have a great product team at Discovery and they recognized, because of the rate of adoption of Amazon Echo devices, that voice was going to be much more than just a novelty. This was a new way that customers could engage with our content now, and we didn’t want to wait around to see how the technology would evolve and jump in later. We wanted to start exploring how we could use voice interfaces and conversational interfaces to deliver our content, our experiences, our personalities, and our information in a more direct way to our customers. We started building a Food Network skill back in 2016 and we've been expanding on that ever since.

Cami Williams: Here at Alexa, we’re spearheading a voice-first initiative, but many skills also include some sort of component that would require you to think about multimodal experiences. I think it depends on the brand, the brand’s content, and how their customers typically engage. It's important to not only consider your voice-first approach but also previous generations of technology, like web and mobile, and recognize their influence within the voice community. With that in mind, what makes you most excited about voice?

Tim: We're in the beginning of a shift in the way humans interact with digital interfaces. We went from the early days of the PC, to the web, to mobile about 10 years ago. With each of those shifts, we had to re-teach ourselves how to interact with digital interfaces. Now the expectation is that digital interfaces are going to understand us. But as engineers and designers, we're going to do the heavy lifting so that users can talk in their most natural language. For me, it's really an entirely new way of connecting with customers and users, and we're still figuring it out. That's really the exciting part. We don't know exactly what those expectations are going to be in the future, so being involved in it now feels very exploratory and very innovative.

Cami: Interacting with touch- and screen-based devices has become second nature. With the Alexa Presentation Language, we're excited to see how developers marry touch, screens, and voice, bringing conversation to an experience that already feels second nature. When you think about developing multimodal skills for Discovery, how can you marry the voice experience with the visual experience? And what's your perspective on multiple modalities for voice interfaces?

Tim: I think it's a fascinating challenge because one of the shifts in application design is that you're creating a single application that is meant to be delivered on anything, from an Echo Dot, to a small speaker, to a smart screen on your counter, to a connected TV, to auto, to headphones, and the list goes on. It's all the same experience but you have to tailor that to not only the device capabilities and the device modality, but the way the users are expected to be using that device in their current situation.

When you're thinking about delivering a response through Alexa to a customer on a particular device, how do you change that response to make it fit their situation if it's on their night table or if they're standing six feet away from it on a kitchen counter? And how much attention are they going to be paying to that screen? For example, if you're delivering a response to a connected TV, you can expect that they're going to be actually paying attention to that screen because they're in "lean back" mode. However, if it's a smart screen on a kitchen counter, they may not be looking at that screen at all. You have to make sure that you're giving the information through your speech response, just in case they're not fully engaged with that screen in that particular context. If there's no screen at all, you have to be able to give them the complete information of what they're looking for via voice alone. You have to pay attention to what the user is asking for and what the device is capable of presenting. It's about adapting your interface to the user to make it as easy as possible for the user to get what they need.
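As a rough illustration of the adaptation Tim describes (a sketch, not Discovery's actual implementation), a skill's backend can inspect the request envelope's `context.System.device.supportedInterfaces` to decide whether to attach an APL render directive, while always putting the complete answer in the speech response:

```python
def build_response(request_envelope: dict, speech: str, apl_document: dict) -> dict:
    """Return an Alexa response, attaching visuals only when the device supports APL."""
    interfaces = (
        request_envelope.get("context", {})
        .get("System", {})
        .get("device", {})
        .get("supportedInterfaces", {})
    )
    response = {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
        },
    }
    # Attach the APL directive only on screened devices. The speech output
    # carries the full answer either way, since even on a smart screen the
    # user may not be looking at the display.
    if "Alexa.Presentation.APL" in interfaces:
        response["response"]["directives"] = [
            {
                "type": "Alexa.Presentation.APL.RenderDocument",
                "document": apl_document,
            }
        ]
    return response
```

On an Echo Dot the same handler falls back to voice alone, so the response still stands on its own without the visual layer.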

Cami: What’s the skill-building process like for you and your team?

Tim: We start by approaching every interface as a conversational interface. Meaning, if we’re building a system, we think of every interaction as part of an ongoing conversation with context and history. We start by designing every interaction from that point of view, rather than starting with the visual UI or system design. We actually get people into a room and we role play. One person will be the application and knows a certain set of information and can communicate it. How would you talk to that application, that person, in a way that most naturally gives you that information using the minimum visual feedback that's necessary to give you what you need? With the minimal text input and the minimal haptic input, what is the easiest way to use people's natural language to fulfill some utility, entertainment, or need?

Our engineers participate in the process as well. They're closest to how the technology can actually work and how we can design it from a technical point of view. They have more insight on some of the features that could assist with some of those conversational patterns. It's a combination of engineering, interaction design, the language being used in order to fulfill requests, and how we break those requests up into intents and slot values.
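To make the intents-and-slots breakdown concrete, here is a minimal sketch (the intent name and slot names are hypothetical, not from the Food Network skill) of pulling the spoken slot values out of an Alexa `IntentRequest`:

```python
def get_slot_values(intent_request: dict) -> dict:
    """Map filled slot names to their spoken values for an Alexa IntentRequest."""
    slots = intent_request.get("intent", {}).get("slots", {})
    # Slots the user did not fill have no "value" key, so skip them.
    return {name: slot["value"] for name, slot in slots.items() if slot.get("value")}


# Hypothetical request: "find me a lasagna recipe" broken into an intent
# with a filled "dish" slot and an unfilled "cuisine" slot.
request = {
    "intent": {
        "name": "SearchRecipeIntent",
        "slots": {
            "dish": {"name": "dish", "value": "lasagna"},
            "cuisine": {"name": "cuisine"},
        },
    }
}
```

Calling `get_slot_values(request)` here would yield only the filled slot, which the fulfillment logic can then use to look up content.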

During the second half of the interview (starting at 23:22 in the video), we asked Tim to walk us through how his team designed the voice, visual, and touch experience for the Food Network skill. We loved having a chance to chat with Tim and enjoyed learning how a large brand is getting in early with voice to further engage and delight customers.

If you’re excited to start building multimodal voice experiences, check out our resources below.
