Cake Walk

Build an Engaging Alexa Skill

Designing the VUI

How will a user interact with Alexa?

To begin designing the voice user interface (VUI) for a skill, you need to understand how a user will interact with Alexa. A user wakes an Alexa-enabled device with the wake word and then asks a question or makes a request. On Alexa-enabled devices with a screen, a user can also touch the screen to interact with Alexa (if the skill supports the interaction).

Why design a VUI?

Humans have been learning, evolving, and defining language and norms for communication for thousands of years. However, the machines we interact with have had a much shorter time frame to learn how to talk with us.

There are inherent challenges with voice interfaces. The challenges include context switching or ambiguity in the conversation, discovering intent, and being unaware of the user's current state or mood. For a good user experience, you should plan for these challenges when developing your skill.

The following two examples show how things can go wrong if you don't carefully design a VUI for your skill.

In the first example, the user provides all the needed information in a single request, but the skill cannot parse it. This doesn't mean that Alexa is unable to comprehend what the user says; rather, the skill's VUI is not designed to infer information from the natural way a person speaks.

In the second example, Alexa fails to recognize that she already has the answer she needs from context. Again, the VUI design fails to infer information from the context of the situation and rigidly insists on an answer to a specific question. This can be quite frustrating for a user.

The two examples show that it is important to design the VUI to resemble, as closely as possible, a natural conversation between two people. A good VUI dramatically increases the ease of use and user satisfaction for any given skill.

Characteristics of a well-designed VUI

Uses natural forms of communication

When talking with a machine, a user should not be required to learn a new language or remember the rules. A machine should conform to the user's paradigm, not the other way around.

Navigates through information easily

Your skill’s VUI should offer an easy way to cut through layers of information hierarchy by using voice commands to find important information.

Creates an eyes- and hands-free experience

Voice interfaces should allow a user to perform tasks while their eyes and hands are occupied.

Creates a shared experience

Voice experiences let users collaborate, contribute, or play together through natural conversation. For example, a family could play a game together on an Alexa-enabled device.

Design considerations

Designing a good VUI for a skill involves writing natural dialog, engaging the user throughout the skill, and staying true to Alexa's personality. The following are some design considerations to help you design an engaging VUI:

1. Stay close to Alexa's persona

Alexa's personality is friendly, upbeat, and helpful. She's honest about anything blocking her way but also fun, personable, and able to make small talk without being obtrusive or inappropriate.

Try to keep the tone of your skill’s VUI as close to Alexa’s persona as possible. One way to do this is by keeping the VUI natural and conversational.

Slightly vary Alexa's wording for common responses such as "thank you" and "sorry". Engaging the user with questions is also a good technique for a well-designed VUI.
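As a rough illustration of varying responses, the sketch below picks one of several phrasings at random and avoids repeating the one used last. The phrase list and the `pick_variant` helper are hypothetical and are not part of any Alexa SDK.

```python
import random

# Hypothetical phrase variants; the wording is illustrative only.
APOLOGY_VARIANTS = ["Sorry about that.", "My mistake.", "Oops, sorry."]


def pick_variant(variants, last_used=None, rng=random):
    """Return a random phrase, avoiding an immediate repeat when possible."""
    # Drop the phrase used last time; fall back to the full list if that
    # would leave nothing to choose from.
    candidates = [v for v in variants if v != last_used] or list(variants)
    return rng.choice(candidates)


print(pick_variant(APOLOGY_VARIANTS, last_used="My mistake."))
```

In a real skill, the last phrase used would need to be remembered between turns, for example in session state.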

When a user answers incorrectly, Alexa should be helpful by providing the correct answer. The following is an example:

Do:

Alexa: That's not quite right. One more try. What year was the Bill of Rights signed?

User: 1986

Alexa: Shoot. That wasn't it. The correct answer was 1791.

Don't:

Alexa: That's not quite right. One more try. What year was the Bill of Rights signed?

User: 1986

Alexa: That's not correct. Let's move on.

Engage the user with questions, but avoid ending questions with "yes or no?" The following is an example.

Do:

Alexa: Do you want to keep shopping?

Don't:

Alexa: Do you want to keep shopping: Yes or no?

2. Write for the ear, not the eye

The way we speak is far less formal than the way we write. Therefore, it's important to write Alexa’s prompts to the user in a conversational tone.

No matter how good a prompt sounds when you say it, it may sound odd in text-to-speech (TTS).

It is important to listen to the prompts on your test device and then iterate on the prompts based on how they sound.

Keep your VUI informal. The following is an example.

Do:

Alexa: Getting your playlist.

Don't:

Alexa: Acquiring your playlist.

If there are more than two options, present the user with the options and ask which they would like. The following is an example.

Do:

Alexa: I can tell you a story, recite a rhyme, or sing a song. Which would you like?

Don't:

Alexa: Do you want me to tell you a story, recite a rhyme, or sing you a song?
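The "list the options, then ask" pattern can be sketched as a small prompt builder. The `offer_options` helper below is a hypothetical name used only for illustration; it simply joins the options into one sentence and ends with an open question.

```python
def offer_options(options):
    """Build a prompt that lists three or more options, then asks which one."""
    if len(options) < 3:
        raise ValueError("use this pattern only when there are more than two options")
    # Join all but the last option with commas, then add "or" before the last.
    listed = ", ".join(options[:-1]) + ", or " + options[-1]
    return f"I can {listed}. Which would you like?"


prompt = offer_options(["tell you a story", "recite a rhyme", "sing a song"])
print(prompt)
```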

3. Be contextually relevant

List options in order from most to least contextually relevant to make it easier for the user to understand. Avoid giving the user options in an order that changes the subject of the conversation, then returns to it again. This helps the user understand and verbalize their choices better without spending mental time and energy figuring out what's most relevant to them. The following is an example.

Do:

Alexa: That show plays again tomorrow at 9 PM. I can tell you when a new episode is playing, when another show is playing, or you can do something else. Which would you like?

Don't:

Alexa: That show plays again tomorrow at 9 PM. You can find out when another show is playing, find out when a new episode of this show is playing, or do something else. What would you like to do?

4. Be brief

To keep the conversation brief, reduce the number of steps needed to complete a task and simplify messages to their essence wherever possible. The following is an example.

Do:

Alexa: Ready to start the game?

Don't:

Alexa: All right then, are you ready to get started on a new game?

5. Write for engagement to increase retention

Alexa skills should be built to last and grow with the user over time. Your skill should provide a delightful user experience, whether it's the first time a user invokes the skill or the 100th.

Design the skill to phase out information that experienced users will learn over time. Give fresh dialog to repeat users so the skill doesn't become tiresome or repetitive.


Do:

First use:

Alexa: Thanks for subscribing to Imaginary Radio. You can listen to a live game by saying a team name, like Seattle Seahawks, location, like New York, or league, like NFL. You can also ask me for a music station or genre. What would you like to listen to?

Return use:

Alexa: Welcome back to Imaginary Radio. Want to keep listening to the Kids Jam station?

Don't:

First use:

Alexa: Thanks for subscribing to ABC Radio. What do you want to listen to?

Return use:

Alexa: Welcome back. What do you want to listen to?
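One way to phase out onboarding information is to choose the welcome prompt from what the skill already knows about the user. The sketch below assumes the skill persists a visit count and the last station played; `welcome_prompt` and both parameters are hypothetical names, not part of an Alexa SDK.

```python
FIRST_USE_PROMPT = (
    "Thanks for subscribing to Imaginary Radio. You can listen to a live game "
    "by saying a team name, like Seattle Seahawks, location, like New York, "
    "or league, like NFL. You can also ask me for a music station or genre. "
    "What would you like to listen to?"
)


def welcome_prompt(visit_count, last_station=None):
    """Pick a welcome prompt based on how much the skill knows about the user."""
    if visit_count == 0:
        # First use: explain what the skill can do.
        return FIRST_USE_PROMPT
    if last_station:
        # Return use with history: skip the tutorial and offer to resume.
        return (
            "Welcome back to Imaginary Radio. "
            f"Want to keep listening to the {last_station} station?"
        )
    # Return use without history: short greeting, open question.
    return "Welcome back to Imaginary Radio. What would you like to listen to?"


print(welcome_prompt(3, last_station="Kids Jam"))
```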

Components of utterances and intents

To create a VUI for your skill, you need to know the components of a user utterance.

Wake word
The wake word tells Alexa to start listening to your commands.

Launch word
A launch word is a transitional action word that signals Alexa that a skill invocation will likely follow. Sample launch words include tell, ask, open, launch, and use.

Invocation name 
To begin interacting with a skill, a user says the skill's invocation name. For example, to use the Daily Horoscope skill, the user could say, "Alexa, read my daily horoscope."

Utterance
Simply put, an utterance is a user's spoken request. These spoken requests can invoke a skill, provide inputs for a skill, confirm an action for Alexa, and so on. Consider the many ways a user could form their request.

Slot value
Slots are input values provided in a user's spoken request. These values help Alexa figure out the user's intent.

For example, a user planning a trip might give a travel date of Friday. This value fills a slot of the intent, which Alexa passes on to Lambda for processing by the skill code.

Slots can be defined with different types. The travel date slot in this example uses Amazon's built-in AMAZON.DATE type to convert words that indicate dates (such as "today" and "next Friday") into a date format, while both fromCity and toCity use the built-in AMAZON.US_CITY slot type.

If you extended this skill to ask the user what activities they plan to do on the trip, you might add a custom LIST_OF_ACTIVITIES slot type to reference a list of activities such as hiking, shopping, skiing, and so on.

Intent
An intent represents an action that fulfills a user's spoken request. Intents can optionally have arguments called slots.

How to identify slots for an intent

Once you have written a few utterances, note the words or phrases that represent variable information. These will become the intent's slots.
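To make this step concrete, the sketch below marks the variable parts of the trip-planning utterances from this section with curly-brace placeholders and collects the slot names they imply. The `SAMPLES` list and the helper functions are illustrative assumptions; only the slot names come from this section.

```python
import re

# Trip-planning utterances with the variable parts replaced by placeholders.
SAMPLES = [
    "I am going on a trip {TRAVEL_DATE}",
    "I want to visit {TO_CITY}",
    "I want to travel from {FROM_CITY} to {TO_CITY} {TRAVEL_DATE}",
    "I'm {MODE_OF_TRAVEL} to {TO_CITY} to go {ACTIVITIES}",
]


def slots_in(sample):
    """Return the slot names referenced by one sample utterance, in order."""
    return re.findall(r"\{(\w+)\}", sample)


def all_slots(samples):
    """Return every distinct slot name across the samples, in first-seen order."""
    seen = []
    for sample in samples:
        for name in slots_in(sample):
            if name not in seen:
                seen.append(name)
    return seen


print(all_slots(SAMPLES))
```

The resulting list is the set of slots the intent needs to declare.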


Utterance Maps to
"I am going on a trip Friday." TRAVEL_DATE
"I want to visit Portland." TO_CITY
"I want to travel from Seattle to Portland next Friday." FROM_CITY, TO_CITY, and TRAVEL_DATE
"I'm driving to Portland to go hiking." MODE_OF_TRAVEL, TO_CITY, and ACTIVITIES


Optionally, if your skill is complex and has a lot of back-and-forth conversation (multi-turn conversation), create a dialog model for the skill. A dialog model is a structure that identifies the steps for a multi-turn conversation between your skill and the user to collect all the information needed to fulfill each intent. This simplifies the code you need to write to ask the user for information.

Interaction model

Now that you know what the components of a skill are, it is easier to understand what an interaction model is. An interaction model is simply a combination of utterances, intents, and slots that you identify for your skill.

To create an interaction model, define the requests (intents) and the words users say to make them (sample utterances). Your Lambda skill code then determines how your skill handles each intent. You can start by defining the intents and utterances on paper, then iterate on them to cover as many ways as possible that a user might interact with the skill.

Then, go to the Alexa developer console and create the intents, utterances, and slots. The console generates the JSON for your interaction model. You can also write the interaction model in JSON yourself using any JSON tool and then paste it into the developer console.
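For orientation, here is a minimal sketch of what such a JSON interaction model might look like, built as a Python dictionary and serialized with the standard `json` module. The invocation name, intent name, and sample utterances are hypothetical, and the overall layout is an assumption that should be checked against the JSON the developer console actually generates for your skill.

```python
import json

# Hypothetical trip-planning skill. The names below are illustrative and
# are not taken from the course.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "trip planner",
            "intents": [
                {
                    "name": "PlanMyTripIntent",
                    "slots": [
                        {"name": "fromCity", "type": "AMAZON.US_CITY"},
                        {"name": "toCity", "type": "AMAZON.US_CITY"},
                        {"name": "travelDate", "type": "AMAZON.DATE"},
                        {"name": "activity", "type": "LIST_OF_ACTIVITIES"},
                    ],
                    "samples": [
                        "I am going on a trip {travelDate}",
                        "I want to visit {toCity}",
                        "I want to travel from {fromCity} to {toCity} {travelDate}",
                        "I am going to {toCity} to go {activity}",
                    ],
                }
            ],
            # Custom slot type listing the activities a user might mention.
            "types": [
                {
                    "name": "LIST_OF_ACTIVITIES",
                    "values": [
                        {"name": {"value": v}}
                        for v in ("hiking", "shopping", "skiing")
                    ],
                }
            ],
        }
    }
}

model_json = json.dumps(interaction_model, indent=2)
print(model_json)
```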

Voice design

A major part of the experience is designing your skill to mimic human conversation well. Before you write one line of code, think carefully about how your customers will interact with your skill. Skipping this step will result in a poorly designed skill that does not work well for your users.

While it may be tempting to use a flow chart to represent how a conversation may branch, don't! Flow charts are not conversational. They are complicated, impossible to read, and tend to lead to an inferior experience not unlike a phone tree. No one likes calling customer support and diving into a phone tree, so let's avoid that. Instead of flow charts, you should use situational design.

Situational Design

Situational design is a voice-first method to design a voice user interface. You start with a simple dialog, which helps keep the focus on the conversation. Each interaction between your customer and the skill represents a turn. Each turn has a situation that represents the context. If it's the customer's first time interacting with the skill, there is a set of data that is not yet known. For example, if it's the user's first time using Cake Walk, the skill does not know their birthday yet. In this situation, the skill will ask "When were you born?" Once the user gives their birthday, the skill will store that information to use next time. When the user interacts with the skill again, the skill will check to see if it's the user's birthday. If so, it will wish them a happy birthday; otherwise, it will count down the number of days until their birthday.
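The three Cake Walk situations described above can be sketched as a single function that picks a response from what the skill knows. This is a minimal illustration, not the course's implementation: the function names and response wording are assumptions, and so is the choice to treat a February 29 birthday as March 1 in non-leap years.

```python
from datetime import date


def birthday_in(year, month, day):
    """Return the birthday's date in the given year."""
    try:
        return date(year, month, day)
    except ValueError:
        # Assumption: a Feb 29 birthday falls on Mar 1 in non-leap years.
        return date(year, 3, 1)


def days_until_birthday(month, day, today):
    """Count the days from today to the next occurrence of the birthday."""
    upcoming = birthday_in(today.year, month, day)
    if upcoming < today:
        # This year's birthday has passed, so look at next year's.
        upcoming = birthday_in(today.year + 1, month, day)
    return (upcoming - today).days


def birthday_response(birthday, today):
    """Choose Alexa's response for the current situation."""
    if birthday is None:
        # Situation: first use, birthday unknown.
        return "When were you born?"
    days = days_until_birthday(birthday.month, birthday.day, today)
    if days == 0:
        # Situation: birthday known, and it is today.
        return "Happy birthday!"
    # Situation: birthday known, but it is not today.
    unit = "day" if days == 1 else "days"
    return f"There are {days} {unit} until your birthday."


print(birthday_response(date(1990, 6, 15), date(2024, 6, 1)))
```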

With situational design, you start with the conversation and work backwards to your solution. Each interaction between the user and Alexa is treated as a turn. For a first-time Cake Walk user, the situation is that the user's birthday is unknown, so the skill needs to ask for it.

Each turn can be represented as a card that contains the user utterance, the situation, and Alexa's response.

You can combine these cards to form a storyboard, which shows how the user will progress through the skill over time. Storyboards are conversational; flow charts are not.
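If it helps to keep cards in a structured form, a card and a storyboard could be modeled roughly as follows. The `TurnCard` class, the example utterance, and the response wording are hypothetical and only illustrate the three fields a card holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnCard:
    """One turn: what the user says, the situation, and Alexa's reply."""
    utterance: str
    situation: str
    response: str


# A storyboard is an ordered list of cards. The wording is illustrative.
STORYBOARD = [
    TurnCard("Open Cake Walk", "Birthday unknown", "When were you born?"),
    TurnCard(
        "Open Cake Walk",
        "Birthday known, not today",
        "There are 42 days until your birthday.",
    ),
    TurnCard("Open Cake Walk", "Birthday known, today", "Happy birthday!"),
]


def render(storyboard):
    """Format the storyboard as readable text, one block per turn."""
    lines = []
    for number, card in enumerate(storyboard, start=1):
        lines.append(f"Turn {number} [{card.situation}]")
        lines.append(f"  User: {card.utterance}")
        lines.append(f"  Alexa: {card.response}")
    return "\n".join(lines)


print(render(STORYBOARD))
```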

In the next module, you will learn how to create the interaction model for the Cake Walk skill in the Alexa developer console.