Friction is any variable that impedes your progress toward a goal, whether it’s purchasing a product or navigating traffic to make your 9 a.m. meeting on time.
Amazon is obsessively focused on reducing or eliminating friction – think one-click ordering, Amazon Prime, or Amazon Go.
This morning, I am delivering a keynote talk at the World Wide Web Conference in Lyon, France, titled “Conversational AI for Interacting with the Digital and Physical World.” In my presentation, I’ll emphasize that while today’s computers are optimized to provide audiovisual output and receive tactile and motor skill input, we are on the cusp of voice becoming the primary input. This is significant as we evolve to a world of ambient computing, where we are surrounded at home, at work and on the go by devices with internet connectivity and the ability to interact with cloud-based services via natural language understanding. Our goal is to enable more natural interaction with all of these IoT devices, and for these devices to engage with us more proactively.
The mobile computing era provides many benefits; we wouldn’t all be tethered to our phones if it didn’t. But when you think about it, what’s changed primarily with the phone is the form factor; the screen is smaller but we interact with our phones much the same way we do our PCs. It’s great to have a computing device wherever we go, yet we are still attached to a screen, touching, typing and swiping. With voice, you’re truly mobile. I’m often in the kitchen cooking, cleaning or putting groceries in the fridge, and without diverting my attention I can ask Alexa to play a song, or provide a weather update. Rarely am I looking directly at my Echo device when I ask a question, or make a request. In a sense, voice-enabled devices set me free. The profound difference in this emerging era is that with the benefit of AI and machine-learning technologies, Alexa and similar services can learn about you, and conform to your needs, instead of you having to conform to the system’s interaction model.
Alexa is similar to any other Amazon service. It is about removing friction in our customers’ interactions with the physical and digital world. The Alexa Brain initiative, which I lead, is one of many within the Alexa organization focused on making Alexa smarter and more natural to engage with. Our goals are to make it easier for users to discover and interact with the more than 40,000 third-party skills that developers have created for Alexa, and to improve Alexa’s ability to track context and memory within and across dialog sessions.
In my talk today, I’ll be updating conference goers on our progress against these goals, and outlining the challenges that still exist in making interaction with Alexa more natural. I’ll also be highlighting three new capabilities we’ll soon make available to our customers.
We are always looking for ways to make it easier for customers to find and engage with skills. One of our approaches to this is the ability for Alexa to dynamically arbitrate among skills using machine learning. In the coming weeks, we’re rolling out a new capability that allows customers in the U.S. to automatically discover, enable and launch skills using natural phrases and requests. For example, using an Echo Show device, I recently asked: “Alexa, how do I remove an oil stain from my shirt?” She replied: “Here is Tide Stain Remover.” This beta experience was friction-free; the skill just walked me through the process of removing an oil stain from my shirt. Previously, I would have had to discover the skill on my own to use it. This is just one example, but it gives you a sense of how this capability will provide customers frictionless direct access to, and interaction with, third-party skills. We’re excited about what we’ve learned from our early beta users and will gradually make this capability available to more skills and customers in the U.S.
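To make the idea of arbitrating among skills concrete, here is a minimal sketch in Python. It is a toy stand-in, not how Alexa does it: where the real system uses machine learning, this example ranks skills by simple word overlap between the request and each skill's sample phrases. The `Skill` class, the `score` and `arbitrate` functions, the threshold, the sample phrases and the "Pasta Helper" and "Car Care" skills are all assumptions invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Skill:
    name: str
    sample_phrases: list[str]  # hypothetical phrases a skill might respond to


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def score(utterance: str, skill: Skill) -> float:
    """Best Jaccard overlap between the utterance and any sample phrase."""
    words = tokenize(utterance)
    best = 0.0
    for phrase in skill.sample_phrases:
        phrase_words = tokenize(phrase)
        union = words | phrase_words
        if union:
            best = max(best, len(words & phrase_words) / len(union))
    return best


def arbitrate(utterance: str, skills: list[Skill],
              threshold: float = 0.3) -> Optional[Skill]:
    """Return the highest-scoring skill, or None if nothing clears the threshold."""
    if not skills:
        return None
    best = max(skills, key=lambda s: score(utterance, s))
    return best if score(utterance, best) >= threshold else None


# Hypothetical catalog; the sample phrases are invented for illustration.
CATALOG = [
    Skill("Tide Stain Remover",
          ["remove a stain", "how do I get a stain out of my shirt"]),
    Skill("Pasta Helper", ["how do I cook pasta"]),
    Skill("Car Care", ["change my oil"]),
]

chosen = arbitrate("How do I remove an oil stain from my shirt?", CATALOG)
print(chosen.name if chosen else "no skill matched")
```

The threshold matters as much as the ranking: when no skill is a plausible match, the sketch returns nothing rather than launching the least-bad candidate.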
Soon, we will improve our understanding of multi-turn utterances, or what we refer to as context carryover. Initially, we will make this capability available to all of our customers in the U.S., U.K., and Germany. Previously, we supported two-turn interactions with explicit pronoun references. For example: “Alexa, what was Adele’s first album?” → “Alexa, play it.” We are expanding beyond this to include utterances without pronouns. For example: “Alexa, how is the weather in Seattle?” → “What about this weekend?” We are also supporting context across domains. For example: “Alexa, how’s the weather in Portland?” → “How long does it take to get there?” We are providing this more natural way of engaging with Alexa by adding deep learning models to our spoken language understanding (SLU) pipeline that allow us to carry customers’ intents and entities within and across domains (e.g., between weather and traffic).
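The two carryover examples above can be sketched in a few lines of Python. This is a rule-based illustration under stated assumptions, not the deep learning approach described here: the `Turn` class, the slot names (`location`, `date`, `destination`) and the `SLOT_ALIASES` table are all hypothetical, and a learned model would decide which intents and entities to carry rather than follow a fixed table.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    intent: Optional[str]                       # None when the turn omits it
    slots: dict[str, str] = field(default_factory=dict)


# Hypothetical table: how a slot from one domain maps into another.
SLOT_ALIASES = {
    ("weather", "traffic"): {"location": "destination"},
}


def carry_over(previous: Turn, current: Turn) -> Turn:
    """Fill in what the current turn leaves unsaid from the previous turn."""
    intent = current.intent or previous.intent
    aliases = SLOT_ALIASES.get((previous.intent, intent), {})
    merged: dict[str, str] = {}
    for name, value in previous.slots.items():
        if intent == previous.intent:
            merged[name] = value                # same domain: keep the slot
        elif name in aliases:
            merged[aliases[name]] = value       # new domain: rename the slot
    merged.update(current.slots)                # explicit values always win
    return Turn(intent, merged)


# "How is the weather in Seattle?" -> "What about this weekend?"
same_domain = carry_over(
    Turn("weather", {"location": "Seattle"}),
    Turn(None, {"date": "this weekend"}),
)
print(same_domain)

# "How's the weather in Portland?" -> "How long does it take to get there?"
cross_domain = carry_over(
    Turn("weather", {"location": "Portland"}),
    Turn("traffic"),
)
print(cross_domain)
```

In the first case the follow-up inherits both the weather intent and the city. In the second, the intent changes, so only the entity that has a meaning in the new domain is carried across.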
In the U.S., we will also soon begin to roll out a new memory feature. With this capability, Alexa can store arbitrary information for you and retrieve it later, so that you never forget it. For example, a customer might ask: “Alexa, remember that Sean’s birthday is June 20th.” Alexa will reply: “Okay, I’ll remember that Sean’s birthday is June 20th.” This memory feature is the first of many launches this year that will make Alexa more personalized. It’s early days, but with this initial release we will make it easier for customers to save information, as well as provide a natural way to recall that information later.
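As a rough illustration of the save-then-recall pattern, here is a small Python sketch. It assumes a deliberately naive design: facts are kept as plain strings and recalled by counting shared words with the question. The `MemoryStore` class and its `remember` and `recall` methods are hypothetical names for this example and say nothing about how the actual feature is built.

```python
import re
from typing import Optional


def _tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MemoryStore:
    """Toy memory: stores free-form facts and recalls the best keyword match."""

    def __init__(self) -> None:
        self._facts: list[str] = []

    def remember(self, utterance: str) -> str:
        # Strip the leading "remember that" and any trailing period.
        fact = re.sub(r"^\s*remember that\s+", "", utterance,
                      flags=re.IGNORECASE).rstrip(".")
        self._facts.append(fact)
        return f"Okay, I'll remember that {fact}."

    def recall(self, query: str) -> Optional[str]:
        # Return the stored fact sharing the most words with the query.
        query_words = _tokens(query)
        best, best_overlap = None, 0
        for fact in self._facts:
            overlap = len(query_words & _tokens(fact))
            if overlap > best_overlap:
                best, best_overlap = fact, overlap
        return best


memory = MemoryStore()
reply = memory.remember("Remember that Sean's birthday is June 20th.")
memory.remember("Remember that my parking spot is B12.")  # invented second fact
print(reply)
print(memory.recall("When is Sean's birthday?"))
```

Word overlap is enough to tell these two facts apart, but it would break down quickly with many similar facts, which is one reason recall is the harder half of the problem.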
The Challenges Ahead
The work of our science and engineering teams to make Alexa smarter and more engaging has been extraordinary. It requires significant changes to Alexa’s existing architecture and incorporates contextual cues and customer preferences across all components of our system.
We have many challenges still to address, such as how to scale these new experiences across languages and different devices, how to scale skill arbitration across the tens of thousands of Alexa skills, and how to measure experience quality. Additionally, there are component-level technology challenges that span automatic speech recognition, spoken language understanding, dialog management, natural language generation, text-to-speech synthesis, and personalization.
As Rohit Prasad, vice president and head scientist of the Alexa Machine Learning team, said in a recent interview, we’ve only begun to scratch the surface of what’s possible. Skill arbitration, context carryover and the memory feature are early instances of a class of work Amazon scientists and engineers are doing to make engaging with Alexa more friction-free. We’re on a multi-year journey to fundamentally change human-computer interaction, and as we like to say at Amazon, it’s still Day 1.
Ruhi Sarikaya is director of applied science, Alexa AI. You can follow him on Twitter @Ruhi_Sarikaya.