Today’s voice-first technologies are built with natural language understanding (NLU) and automatic speech recognition (ASR), forms of artificial intelligence centered on recognizing patterns and meaning within human language. While technologists have been using NLU and ASR for decades, today they are more consumable and accessible to developers via tools like the Alexa Skills Kit (ASK). With ASK, you can create an Alexa skill that takes what a customer says, turns it into a structured request, and handles that request to produce a meaningful response.
The best way to leverage this technology for conversational voice design is through experimentation and practice. That’s why the Alexa team has released its own best practices and key concepts you can apply to create standout skills.
Here are some core principles we’ve uncovered in conversational voice design, along with advanced skill-building tips for creating natural voice experiences.
Conversing and interpreting verbal responses are among the first things we learn. When you have a conversation with someone, they often ask questions to gather information, then interpret what they hear. The same is true when you develop a skill: before Alexa can respond, your skill must gather the information the response requires, making sure every box is checked and every variable is filled.
In voice design, we call this a “multi-turn dialog.” The conversation is tied to a specific intent representing the customer’s overall request, and Alexa asks follow-up questions to fill slots until it has everything it needs to give an appropriate response.
As an example, let’s look at Pet Match, a skill we built to match a customer’s dog preferences to a specific breed. The skill needs to learn what temperament, size, and energy level the customer wants in a pet before it can formulate a complete response. In your skill code, you can prompt the customer for each piece of information; the variables that need to be filled are the slot values. As Alexa elicits more slots, it can ask the customer to confirm values and handle any updates. When all the slots are filled, Alexa can make the final API call to handle the customer’s ultimate request: being matched to a pet.
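Concretely, a skill can check each turn which required slots are still empty and prompt for the next one. Here is a minimal sketch using the ASK SDK for Python; the intent name, slot names, and prompts are hypothetical stand-ins for Pet Match’s actual interaction model.

```python
# A minimal sketch of eliciting slots one turn at a time.
# "PetMatchIntent" and the slot names are hypothetical stand-ins.
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_core.utils import is_intent_name
from ask_sdk_model import Response
from ask_sdk_model.dialog import ElicitSlotDirective

REQUIRED_SLOTS = ["temperament", "size", "energy"]

class PetMatchIntentHandler(AbstractRequestHandler):
    def can_handle(self, handler_input: HandlerInput) -> bool:
        return is_intent_name("PetMatchIntent")(handler_input)

    def handle(self, handler_input: HandlerInput) -> Response:
        slots = handler_input.request_envelope.request.intent.slots
        # Find the first required slot the customer hasn't filled yet.
        for name in REQUIRED_SLOTS:
            slot = slots.get(name)
            if not (slot and slot.value):
                prompt = f"What {name} would you like in a dog?"
                return (handler_input.response_builder
                        .speak(prompt)
                        .ask(prompt)
                        .add_directive(ElicitSlotDirective(slot_to_elicit=name))
                        .response)
        # All slots filled: formulate the final response.
        return (handler_input.response_builder
                .speak("Great, let me find your match.")
                .response)
```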
Multi-turn dialogs are an important concept because of the conversational experience they enable. On the web, a customer might happily fill out a form field titled “Name.” With voice, if Alexa just said, “Name,” a customer might be confused, or even intimidated by the command. Multi-turn dialogs allow Alexa to sound more conversational. At their core, they are an approach to gathering the information a request needs in a way customers can easily understand.
Multi-turn dialogs can easily grow into a large tree based on the information you need to fulfill a request. The worst-case scenario is a customer who finds the line of questioning too long, forgets what they have already answered, or forgets what they wanted to say. Fortunately, ASK has capabilities for richer dialog support.
We talk about graph-based UI versus frame-based UI to illustrate the difference. A graph-based UI models a flow chart or decision tree. Not only does it invite the customer frustrations described above, but it is also a lot for developers to keep track of. You have to rely on the customer’s memory, and the form they are filling out turn by turn can easily backfire. Furthermore, building out a decision tree this way is not conversational. A customer should be able to tell you the information they know they need to provide up front and at any time.
To resolve this, we introduced the concept of frame-based UI. With this model, there is an entrance criterion: the request that brings the customer to the point in the skill where you need to gather information. There is also an exit criterion: the information you need to gather before moving on to the next part of the skill. Performing this collection of information is called “dialog management.”
With dialog management, a customer can provide information to Alexa at any point, regardless of what was asked. Alexa will interpret the values, assign them to the appropriate slots, and elicit any remaining slots, according to the exit criteria, from the customer.
With the Pet Match example, the entrance criterion is asking to be matched with a new pet. The exit criteria are the size, energy, and temperament the customer wants in a dog. A customer can provide this information at any point. If Alexa asks, “What size of dog would you like?” and the customer responds, “I want a small, family-friendly dog,” Alexa will fill the {size} and {temperament} slots with those values and then prompt for the remaining {energy} slot. Once the exit criteria are met, the service calls petMatchAPI() and sends the customer a response with an appropriate pet match.
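In code, you can let Alexa manage this collection for you. The sketch below, again using the ASK SDK for Python, assumes the interaction model marks size, energy, and temperament as required slots on a hypothetical PetMatchIntent; the skill delegates each turn to Alexa until the dialog completes, then calls a stand-in for petMatchAPI().

```python
# A sketch of frame-based dialog management via delegation.
# pet_match_api() below is a hypothetical stand-in for the real service call.
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_core.utils import is_intent_name
from ask_sdk_model import Response
from ask_sdk_model.dialog import DelegateDirective
from ask_sdk_model.dialog_state import DialogState

def pet_match_api(size, energy, temperament):
    """Hypothetical stand-in for the Pet Match API."""
    return "labrador retriever"  # placeholder result

class PetMatchIntentHandler(AbstractRequestHandler):
    def can_handle(self, handler_input: HandlerInput) -> bool:
        return is_intent_name("PetMatchIntent")(handler_input)

    def handle(self, handler_input: HandlerInput) -> Response:
        request = handler_input.request_envelope.request
        if request.dialog_state != DialogState.COMPLETED:
            # Exit criteria not yet met: let Alexa elicit the remaining
            # slots, no matter which ones the customer filled first.
            return (handler_input.response_builder
                    .add_directive(DelegateDirective())
                    .response)
        slots = request.intent.slots
        match = pet_match_api(size=slots["size"].value,
                              energy=slots["energy"].value,
                              temperament=slots["temperament"].value)
        return (handler_input.response_builder
                .speak(f"I recommend a {match}.")
                .response)
```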
All this being said, graph-based UI is sometimes unavoidable. In practice, try to decompose portions of your graph UI into a frame UI. This will create a more conversational overall experience for your customers.
When we break down ASK, the easiest way to understand what is required of a developer is to look at the dialog. You provide what the customer says and how Alexa responds; ASK handles everything else. However, you probably can’t think of everything a customer could possibly say, nor should you try. Think of your utterances as training data for Alexa, and recognize that with any training data, more is not necessarily better. Focus on representative phrases and the different rearrangements of each phrase, and incorporate them into your skill.
You can use entity resolution to accomplish this. With entity resolution, a developer assigns synonyms to slot values. When a customer uses a synonym, it resolves to the canonical slot value, and your skill performs the same logic.
When a customer uses Pet Match and Alexa prompts them for an energy level, they might not know exactly what they want. If they instead say something like, “I want a dog I can run with,” Alexa should be able to interpret that phrase as a high-energy pet. Thus, the phrase “that I can run with” resolves to high energy, and Alexa can send that value to the Pet Match API. A synonym can be one word or a whole phrase. Think outside the box about what customers might say to express an idea. For example, if they want a low-energy pet, they might talk about their lifestyle and say something like, “I don’t exercise.”
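The synonyms themselves live in your interaction model; at runtime, your code reads the canonical value the synonym resolved to. Here is a sketch with the ASK SDK for Python, assuming an energy slot whose “high” value lists synonyms like “that I can run with.”

```python
# A sketch of reading an entity-resolved slot value. It assumes the
# interaction model defines an "energy" slot type whose value "high"
# lists synonyms such as "that I can run with".
from ask_sdk_model.slu.entityresolution import StatusCode

def resolved_value(handler_input, slot_name):
    """Return the canonical value a slot's synonym resolved to, if any."""
    slot = handler_input.request_envelope.request.intent.slots.get(slot_name)
    if slot and slot.resolutions:
        for authority in slot.resolutions.resolutions_per_authority:
            if authority.status.code == StatusCode.ER_SUCCESS_MATCH:
                # e.g. "that I can run with" -> "high"
                return authority.values[0].value.name
    # Fall back to the raw spoken value when nothing resolved.
    return slot.value if slot else None
```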
While entity resolution helps handle many things a customer might say, you should also formulate Alexa’s prompts to steer the customer toward a recognizable response. Be direct, as this reduces how many phrases and slot values you need to train. If you want energy to resolve to “low,” “medium,” or “high,” give the customer a choice. For example, Alexa could say, “Would you rather have a dog that is lazy or one that is energetic?”
Memory is a concept common to most technical mediums. With voice, interpret the word “memory” less as storage or caching and more as recollection and remembrance. If we have a conversation, walk away, and you bring the topic up again moments later, it would be a bad conversational experience if I completely forgot what we were discussing and repeated exactly what I said before.
Conversation is different every time, in large or small ways. The same conversational principle applies to Alexa. A customer who uses your skill continuously need not hear instructions in the opening message every time they invoke it. You don’t want the customer to tune out what Alexa is saying. The skill should remember their previous choices and vary what Alexa says according to the customer’s usage.
There is, however, a lot to be said for consistency, and it can easily be lost with variance. If a customer enjoyed their experience in your skill previously, you will want to deliver the same enjoyable experience each time they invoke it. Think of variance as a complement to memory. Recalling a customer’s name by saying, “Welcome back, Cami,” is a delightful addition to the skill, but by no means changes the experience.
Memory can be achieved within your skill via the customer’s userId. The userId can serve as a primary key in your database to store user-specific contextual information across skill sessions. These stored values are called persistent attributes.
Within Pet Match, the persistent attributes are the previous matches a customer has received. When the customer invokes the skill, they can either hear their previous matches or start a new search. Either way, this lets the customer reflect on what they were previously told. The service code hosted on AWS Lambda calls DynamoDB when a new search result is produced or when a customer wants to hear their previous searches. This call adds minimal latency and gives the skill a new level of depth.
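A minimal sketch of this pattern with the ASK SDK for Python is below; the DynamoDB table name is an assumption, and by default the adapter keys each record on the customer’s userId.

```python
# A sketch of persistent attributes backed by DynamoDB. The table name
# "PetMatchTable" is an assumption for illustration.
from ask_sdk_core.skill_builder import CustomSkillBuilder
from ask_sdk_dynamodb.adapter import DynamoDbAdapter

sb = CustomSkillBuilder(
    persistence_adapter=DynamoDbAdapter(table_name="PetMatchTable",
                                        create_table=True))

def remember_match(handler_input, match):
    """Append the latest match so the customer can hear it next session."""
    attrs = handler_input.attributes_manager.persistent_attributes
    attrs.setdefault("previous_matches", []).append(match)
    handler_input.attributes_manager.persistent_attributes = attrs
    handler_input.attributes_manager.save_persistent_attributes()
```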
With these principles in hand, I hope you can develop standout Alexa skills. But we are learning along with you; the more developers building for voice, the greater the learning and earning potential.
Check out these resources to find out more about the art and science of Alexa communication and conversational voice design:
You can make money through Alexa skills using in-skill purchasing or Amazon Pay for Alexa Skills. You can also make money for eligible skills that drive some of the highest customer engagement with Alexa Developer Rewards. Download our guide to learn which product best meets your needs.