Welcome to module 2 of our introductory tutorial on building an engaging Alexa skill. In this module, we'll discuss how to design a voice user interface for your skill.
Time required: 15 - 30 minutes
What you’ll learn:
A user wakes an Alexa-enabled device with the wake word (“Alexa”) and asks a question or makes a request. For Alexa-enabled devices with a screen, a user can also touch the screen to interact with Alexa.
To create a voice user interface for your skill, you need to understand key voice design concepts.
Wake word: The wake word tells Alexa to start listening to your commands.
Launch word: A launch word is a transitional action word that signals to Alexa that a skill invocation will likely follow. Sample launch words include tell, ask, open, launch, and use.
Invocation name: To begin interacting with a skill, a user says the skill's invocation name. For example, to use the Daily Horoscope skill, the user could say, "Alexa, read my daily horoscope."
Utterance: Simply put, an utterance is a user's spoken request. These spoken requests can invoke a skill, provide inputs for a skill, confirm an action for Alexa, and so on. Consider the many ways a user could form their request.
Prompt: A string of text that should be spoken to the customer to ask for information. You include the prompt text in your response to a customer's request.
Intent: An intent represents an action that fulfills a user's spoken request. Intents can optionally have arguments called slots.
Slot value: Slots are input values provided in a user's spoken request. These values help Alexa figure out the user's intent.
In the example below, the user provides one piece of input: the travel date, Friday. This value is a slot of the intent, which Alexa passes to Lambda for your skill code to process.
Slots can be defined with different types. The travel date slot in the above example uses Amazon's built-in AMAZON.DATE type to convert words that indicate dates (such as "today" and "next Friday") into a date format, while both the from-city and to-city slots use the built-in AMAZON.US_CITY slot type.
If you extended this skill to ask the user what activities they plan to do on the trip, you might add a custom LIST_OF_ACTIVITIES slot type to reference a list of activities such as hiking, shopping, skiing, and so on.
Look at the utterances in the table, and note the words or phrases that represent variable information. These will become the intent's slots.
|Utterance|Slots|
|---|---|
|"I am going on a trip Friday."|TRAVEL_DATE|
|"I want to visit Portland."|TO_CITY|
|"I want to travel from Seattle to Portland next Friday."|FROM_CITY, TO_CITY, and TRAVEL_DATE|
|"I'm driving to Portland to go hiking."|MODE_OF_TRAVEL, TO_CITY, and ACTIVITIES|
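To make the idea of "variable information" concrete, here is a rough Python sketch. It assumes hypothetical sample-utterance templates in which `{NAME}` marks a slot, and it uses a simple regular expression to pull the slot values out of a spoken request. This only illustrates the concept: Alexa's own language understanding does not work this way, and `SAMPLES` and `extract_slots` are names made up for this example.

```python
import re

# Hypothetical sample utterances; {NAME} marks variable information (a slot).
SAMPLES = [
    "i want to travel from {FROM_CITY} to {TO_CITY} {TRAVEL_DATE}",
    "i am going on a trip {TRAVEL_DATE}",
    "i want to visit {TO_CITY}",
]


def template_to_regex(template: str) -> re.Pattern:
    """Turn a template into a regex with one named group per slot."""
    parts = re.split(r"\{(\w+)\}", template)
    # After the split, even indexes are literal text and odd indexes are slot names.
    pattern = "".join(
        f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$")


def extract_slots(utterance: str) -> dict:
    """Return the slot values from the first template that matches, else {}."""
    text = utterance.lower().strip(" .")
    for template in SAMPLES:
        match = template_to_regex(template).match(text)
        if match:
            return match.groupdict()
    return {}


print(extract_slots("I want to travel from Seattle to Portland next Friday."))
```

The fixed words in each template stay the same across requests, while the named groups capture whatever the user says in the slot positions.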
Advanced voice design tips: If your skill is complex and involves a lot of back-and-forth (multi-turn) conversation, create a dialog model for the skill. A dialog model is a structure that identifies the steps for a multi-turn conversation between your skill and the user to collect all the information needed to fulfill each intent. This simplifies the code you need to write to ask the user for information.
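The core idea behind a dialog model can be sketched in a few lines of Python: keep a list of the slots an intent requires, and on each turn ask for the first one that is still missing. The slot names and prompt wording below are assumptions made for illustration, and `next_prompt` is a hypothetical helper, not part of any Alexa SDK. In a real skill, the dialog model is declared alongside the interaction model.

```python
from typing import Optional

# Hypothetical required slots for a trip-planning intent, each paired with
# the prompt the skill would speak to collect it.
REQUIRED_SLOTS = {
    "FROM_CITY": "Which city are you leaving from?",
    "TO_CITY": "Where are you going?",
    "TRAVEL_DATE": "When do you want to travel?",
}


def next_prompt(filled: dict) -> Optional[str]:
    """Return the prompt for the first missing slot, or None when all are filled."""
    for slot, prompt in REQUIRED_SLOTS.items():
        if not filled.get(slot):
            return prompt
    return None


# The user said "I want to visit Portland", so only TO_CITY is known so far.
print(next_prompt({"TO_CITY": "Portland"}))
```

Once `next_prompt` returns `None`, the skill has everything it needs to fulfill the intent.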
Now that you know what the components of a skill are, it is easier to understand what an interaction model is. An interaction model is simply a combination of utterances, intents, and slots that you identify for your skill.
To create an interaction model, define the requests (intents) and the words (sample utterances). Your Lambda skill code then determines how your skill handles each intent. You can start by defining the intents and utterances on paper, then iterate on them to cover as many ways as possible that a user might interact with the skill.
Then, go to the Alexa developer console and start creating the intents, utterances, and slots. The console generates the JSON for your interaction model. You can also write the interaction model in JSON yourself using any JSON tool and then copy and paste it into the developer console.
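As a rough idea of what that JSON looks like, the Python sketch below builds a small interaction model for the trip example and prints it. The invocation name, intent name, and sample utterances are illustrative assumptions, and the model is deliberately incomplete: a real model typically also declares built-in intents such as AMAZON.HelpIntent and AMAZON.StopIntent.

```python
import json

# Hypothetical skill: the invocation name, intent name, and samples are
# illustrative assumptions, not taken from a published skill.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "plan my trip",
            "intents": [
                {
                    "name": "PlanMyTripIntent",
                    "slots": [
                        {"name": "FROM_CITY", "type": "AMAZON.US_CITY"},
                        {"name": "TO_CITY", "type": "AMAZON.US_CITY"},
                        {"name": "TRAVEL_DATE", "type": "AMAZON.DATE"},
                        {"name": "ACTIVITY", "type": "LIST_OF_ACTIVITIES"},
                    ],
                    "samples": [
                        "i am going on a trip {TRAVEL_DATE}",
                        "i want to visit {TO_CITY}",
                        "i want to travel from {FROM_CITY} to {TO_CITY} {TRAVEL_DATE}",
                        "i am going to {TO_CITY} to go {ACTIVITY}",
                    ],
                }
            ],
            # Custom slot type listing the activities a user might mention.
            "types": [
                {
                    "name": "LIST_OF_ACTIVITIES",
                    "values": [
                        {"name": {"value": activity}}
                        for activity in ("hiking", "shopping", "skiing")
                    ],
                }
            ],
        }
    }
}

model_json = json.dumps(interaction_model, indent=2)
print(model_json)
```

Building the model as a Python dictionary first makes it easy to generate repetitive parts, such as the list of custom slot values, before serializing to JSON.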
A major part of the experience is designing your skill to mimic human conversation well. Before you write a single line of code, think carefully through how your customers will interact with your skill. Skipping this step results in a poorly designed skill that does not work well for your users.
While it may be tempting to use a flow chart to represent how a conversation may branch, don't! Flow charts are not conversational. They are complicated, impossible to read, and tend to lead to an inferior experience not unlike a phone tree. No one likes calling customer support and diving into a phone tree, so let's avoid that. Instead of flow charts, you should use situational design.
Situational design is a voice-first method for designing a voice user interface. You start with a simple dialog, which helps keep the focus on the conversation. Each interaction between your customer and the skill represents a turn. Each turn has a situation that represents the context. If it's the customer's first time interacting with the skill, some data is not yet known. Once the skill has stored that information, it can use it the next time the user interacts with the skill.
With situational design, you start with the conversation and work backward to your solution. In the example below, the situation is that the user's birthday is unknown, so the skill needs to ask for it.
Practice: The script below shows how the skill “Cake Time” asks the user for their birthday and remembers it. Later, it can tell them the number of days until their next birthday and wish them a happy birthday on the day itself.
Each turn can be represented as a card that contains the user utterance, the situation, and Alexa's response. Combine these cards to form a storyboard that shows how the user will progress through the skill over time. Storyboards are conversational; flow charts are not.
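One way to picture a storyboard card in code is as a small record with those three fields. The Python sketch below models a Cake Time turn under some assumptions: the response wording is invented for illustration, `TurnCard` and `respond` are hypothetical names, and a February 29 birthday is treated as March 1 in non-leap years to keep the date arithmetic simple.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TurnCard:
    utterance: str   # what the user said
    situation: str   # the context the skill is in
    response: str    # what Alexa says back


def _birthday_in(year: int, birthday: date) -> date:
    try:
        return birthday.replace(year=year)
    except ValueError:  # Feb 29 in a non-leap year: this sketch uses Mar 1
        return date(year, 3, 1)


def days_until_birthday(birthday: date, today: date) -> int:
    """Days from today to the next occurrence of the birthday (0 on the day)."""
    upcoming = _birthday_in(today.year, birthday)
    if upcoming < today:
        upcoming = _birthday_in(today.year + 1, birthday)
    return (upcoming - today).days


def respond(utterance: str, birthday: Optional[date], today: date) -> TurnCard:
    """Pick the response from the situation, not from a fixed flow."""
    if birthday is None:
        return TurnCard(utterance, "birthday unknown",
                        "Welcome to Cake Time. When is your birthday?")
    days = days_until_birthday(birthday, today)
    if days == 0:
        return TurnCard(utterance, "birthday is today", "Happy birthday!")
    unit = "day" if days == 1 else "days"
    return TurnCard(utterance, "birthday known",
                    f"Welcome back. There are {days} {unit} until your next birthday.")


print(respond("open cake time", None, date(2024, 12, 20)))
```

The same utterance produces a different card depending on the situation, which is the point of designing from situations rather than from a branching flow chart.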
When talking with a machine, a user should not be required to learn a new language or remember the rules. A machine should conform to the user's paradigm, not the other way around.
Your skill’s VUI should offer an easy way to cut through layers of information hierarchy by using voice commands to find important information.
Voice interfaces should allow a user to perform tasks while their eyes and hands are occupied.
Voice experiences let users collaborate, contribute, or play together through natural conversation. For example, a family could play a game together on an Alexa-enabled device.
Humans have been learning, evolving, and defining language and norms for communication for thousands of years. However, the machines we interact with have had a much shorter time frame to learn how to talk with us. There are inherent challenges with voice interfaces, including context switching or ambiguity in the conversation, discovering intent, and being unaware of the user's current state or mood. For a good user experience, you should plan for these challenges when developing your skill.
The following videos show a few examples of how things could go wrong if you don’t carefully design a VUI for your skill.
In this example, the user provides all the needed information in a single utterance, but the skill cannot parse it. This doesn’t mean that Alexa is unable to comprehend what the user says; rather, the skill's VUI is not designed to infer information from the natural way a person speaks.
In this example, Alexa fails to recognize that she already has the answer she needs from context. Again, the VUI design fails to infer information from the context of the situation and is rather rigid on getting the answer for a specific question. This can be quite frustrating to a user.
The two examples show it is important to design the VUI to be as similar as possible to a natural conversation that might take place between two human beings. A good VUI dramatically increases the ease of use and user satisfaction for any given skill.
Designing a good voice user interface for a skill involves writing natural dialog, engaging the user throughout the skill, and staying true to Alexa's personality. Consider these five design best practices to help you design an engaging VUI:
Alexa's personality is friendly, upbeat, and helpful. She's honest about anything blocking her way but also fun, personable, and able to make small talk without being obtrusive or inappropriate.
Try to keep the tone of your skill’s VUI as close to Alexa’s persona as possible. One way to do this is by keeping the VUI natural and conversational.
Slightly vary Alexa's wording for common responses such as "thank you" and "sorry". Engaging the user with questions is also a good technique for a well-designed VUI.
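A simple way to vary wording is to keep a few phrasings for each kind of response and pick one at random, avoiding an immediate repeat. The Python sketch below assumes a hypothetical list of apology phrasings; `pick_variant` is a name made up for this example.

```python
import random

# Hypothetical phrasings for an apology; write your own in your skill's voice.
SORRY_VARIANTS = [
    "Sorry, I didn't catch that.",
    "Sorry, could you say that again?",
    "I missed that. What was it?",
]


def pick_variant(variants, last_used=None):
    """Pick a phrasing at random, avoiding an immediate repeat when possible."""
    choices = [v for v in variants if v != last_used] or list(variants)
    return random.choice(choices)


first = pick_variant(SORRY_VARIANTS)
second = pick_variant(SORRY_VARIANTS, last_used=first)
print(first)
print(second)
```

Passing the previous phrasing as `last_used` keeps Alexa from saying the exact same sentence twice in a row.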
Alexa should be helpful by providing the correct answer. The following is an example:
Engage the user with questions and avoid ending questions with "yes or no?" The following is an example.
The way we speak is far less formal than the way we write. Therefore, it's important to write Alexa’s prompts to the user in a conversational tone.
No matter how good a prompt sounds when you say it, it may sound odd in text-to-speech (TTS).
It is important to listen to the prompts on your test device and then iterate on the prompts based on how they sound.
Keep your VUI informal. The following is an example.
If there are more than two options, present the user with the options and ask which they would like. The following is an example.
List options in order from most to least contextually relevant to make it easier for the user to understand. Avoid giving the user options in an order that changes the subject of the conversation, then returns to it again. This helps the user understand and verbalize their choices better without spending mental time and energy figuring out what's most relevant to them. The following is an example.
Alexa skills should be built to last and grow with the user over time. Your skill should provide a delightful user experience, whether it's the first time a user invokes the skill or the 100th.
Design the skill to phase out information that experienced users will learn over time. Give fresh dialog to repeat users so the skill doesn't become tiresome or repetitive.
Alexa: Thanks for subscribing to Imaginary Radio. You can listen to a live game by saying a team name, like Seattle Seahawks, location, like New York, or league, like NFL. You can also ask me for a music station or genre. What would you like to listen to?
Alexa: Welcome back to Imaginary Radio. Want to keep listening to the Kids Jam station?
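The two Imaginary Radio greetings above can be selected with a small check on what the skill already knows about the user. In the Python sketch below, `visit_count` and `last_station` are assumed to come from the skill's stored user data, `welcome` is a hypothetical helper, and the wording for a returning user with no saved station is invented for illustration.

```python
from typing import Optional

FIRST_TIME_GREETING = (
    "Thanks for subscribing to Imaginary Radio. You can listen to a live game "
    "by saying a team name, like Seattle Seahawks, location, like New York, "
    "or league, like NFL. You can also ask me for a music station or genre. "
    "What would you like to listen to?"
)


def welcome(visit_count: int, last_station: Optional[str]) -> str:
    """Give new users the full introduction and returning users a short one."""
    if visit_count == 0:
        return FIRST_TIME_GREETING
    if last_station is None:
        # Assumed wording for a returning user with nothing to resume.
        return "Welcome back to Imaginary Radio. What would you like to listen to?"
    return (
        "Welcome back to Imaginary Radio. "
        f"Want to keep listening to the {last_station} station?"
    )


print(welcome(3, "Kids Jam"))
```

The instructions that a first-time user needs are phased out as soon as the skill has evidence that the user has been here before.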