Today’s guest blog post is from Maria Spyropoulou, Speech Systems Analyst at Eckoh. Maria helps design the dialogue flow and prompts of their services, and is also heavily involved in intent creation and classification, among other things.
Voice User Interfaces (VUIs) take up more mental resources than Graphical User Interfaces (GUIs), because information is auditory and presented serially, whereas in a GUI information is visual and presented at once. Voice browsing is a lot more complex than web browsing. For this reason, when you are building your Alexa skill, you have to design in a way that reduces cognitive load as much as possible.
John Sweller’s cognitive load theory distinguishes three types of cognitive load: Intrinsic Load, Extraneous Load, and Germane Load. Intrinsic Load is the inherent difficulty of a concept (2 + 2 is objectively easier to process than 2^78-456). Extraneous Load relates to how efficiently information is presented (if you want to explain the idea of a circle, drawing a circle is a much more efficient presentation of the concept than describing in words what a circle is). Germane Load relates to the mental models we already hold. People have been having conversations since childhood, so they have a mental model of what a conversation is supposed to be like (for example, that it has a beginning, a middle, and an end). Moreover, many people have used Alexa skills and have a mental model of what interacting with a voice service should be like. VUI designers can’t reduce the Intrinsic Load, but they can reduce Extraneous Load and Germane Load.
You should aim to present the available options to the user only when they are needed, not a second sooner, and only when it makes sense for the structure of your application. For example, you wouldn’t say:
“Welcome to Happy Groceries. You can add items to your basket, listen to your previous orders, submit a claim and checkout.”
It doesn’t make sense for someone to ‘checkout’ unless they have added products to their basket. Only when the user has added groceries to the basket should you present the option to checkout:
“If you’d like me to go ahead and purchase the items in your basket, say ‘checkout’.”
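If your back end uses the ASK SDK for Python, this just-in-time prompting can be as simple as offering ‘checkout’ only at the point where the basket actually has something in it. The sketch below is only an illustration: the AddItemIntent name, the ‘item’ slot, and the ‘basket’ session attribute are assumptions, not part of any particular skill.

```python
# A minimal sketch using the ASK SDK for Python. The intent name
# ("AddItemIntent"), the "item" slot, and the "basket" session attribute
# are hypothetical; adapt them to your own interaction model.
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_intent_name


class AddItemIntentHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_intent_name("AddItemIntent")(handler_input)

    def handle(self, handler_input):
        session_attr = handler_input.attributes_manager.session_attributes
        basket = session_attr.setdefault("basket", [])

        # Read the grocery item the user asked for and add it to the basket.
        item = handler_input.request_envelope.request.intent.slots["item"].value
        basket.append(item)

        speech = "I've added {} to your basket.".format(item)
        # Only now does 'checkout' make sense, so only now do we offer it.
        speech += (" If you'd like me to go ahead and purchase the items "
                   "in your basket, say 'checkout'.")

        return handler_input.response_builder.speak(speech).ask(
            "What would you like to do next?").response
```

The launch prompt, by contrast, would mention only the actions that make sense for an empty basket, such as adding items or listening to previous orders.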
Sound effects are a great way to set context and create atmosphere. They can be used as metaphors for concepts, helping the user visualize and conceptualize the application’s structure and feel. They are great for games, but can be used for all sorts of things. For example, if you have an application that can book plane tickets, you can use a sound effect in place of speech as a progressive response before your full response:
“Book me a flight to Barcelona this Saturday”
(A jet take-off sound plays while you’re waiting for your flight API data)
“I have 2 flights to Barcelona this Saturday.”
This fills the waiting time in a more creative way than speech (for example, “getting your results…”) and also signals that the skill is looking for flights or preparing some flight information. Keep in mind that audio in progressive responses is limited to 30 seconds. You can find sound effects here and read about them here.
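As a rough sketch of how this could look in an ASK SDK for Python back end, you can send the sound effect through the progressive response (directive) API before returning the full response. The MP3 URL and the helper name below are placeholders, and the snippet assumes the skill was created with CustomSkillBuilder(api_client=DefaultApiClient()) so that the directive service client is available.

```python
# A minimal sketch: play a sound effect as a progressive response while the
# skill fetches flight data. The MP3 URL is a placeholder; host your own clip
# or use one from the Alexa sound library.
from ask_sdk_model.services.directive import (
    Header, SendDirectiveRequest, SpeakDirective)


def send_jet_sound(handler_input):
    request_id = handler_input.request_envelope.request.request_id
    directive_client = handler_input.service_client_factory.get_directive_service()

    # SSML containing only an audio tag; remember the 30-second limit.
    ssml = "<speak><audio src='https://example.com/sounds/jet_takeoff.mp3'/></speak>"

    directive_client.send_directive(
        SendDirectiveRequest(header=Header(request_id=request_id),
                             directive=SpeakDirective(speech=ssml)))
```

You would call a helper like this at the start of your flight-search intent handler, before the (typically slow) API call, and then build the full response as usual.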
Your skill should implement intents to catch universals as defined by the TSSC 2000 (Telephone Speech Standards Committee) and ETSI 2002 (European Telecommunications Standards Institute) standards. Common ones include help, repeat, stop, go back, main menu, and goodbye. For example, if you have built a navigation game, you can create a contextual help message to help the user get unstuck:
“You can return to the forest or unlock the barn. What would you like to do?”
“Help”
“Your goal in this game is to find the hidden treasure. You have collected one key so far. Would you like to return to the forest or unlock the barn?”
You can leverage the built-in AMAZON.HelpIntent and configure your back-end handler in an appropriate way, depending on what you want the outcome to be. More information on universals can be found here.
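As an illustration, a contextual AMAZON.HelpIntent handler in the ASK SDK for Python could read the player’s progress from session attributes and phrase the help message accordingly. The ‘keys_collected’ attribute and the specific wording are assumptions for the navigation-game example above.

```python
# A minimal sketch of a contextual help handler for the navigation game.
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_intent_name


class HelpIntentHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_intent_name("AMAZON.HelpIntent")(handler_input)

    def handle(self, handler_input):
        attrs = handler_input.attributes_manager.session_attributes
        keys = attrs.get("keys_collected", 0)  # hypothetical game state

        speech = ("Your goal in this game is to find the hidden treasure. "
                  "You have collected {} key{} so far. Would you like to "
                  "return to the forest or unlock the barn?").format(
                      keys, "" if keys == 1 else "s")

        return handler_input.response_builder.speak(speech).ask(speech).response
```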
The VUI should resemble an everyday conversation as much as possible, and this includes using conversation-management markers, also called discourse markers. For example, if your skill fails to access the user’s settings, instead of playing this prompt:
“Unable to access account”
You should use more natural language, with plenty of discourse markers, to signal that something went wrong and to present the reason for the failure:
“I’m sorry, but due to technical reasons I can’t access your account right now.”
The purpose of discourse markers is to reduce cognitive load: they are used to introduce a new topic (by the way…), to signal that something has gone wrong (sorry, due to…), to give feedback that the user and the system are on the same page (thanks, okay, great, so this is what you requested…), and to indicate that serial information is coming (first, second, third, here are your options…). You can find more information here.
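In code, this often comes down to what your error and exception handlers say. A rough sketch of a catch-all exception handler in the ASK SDK for Python, apologising and giving a reason rather than replying “Unable to access account”, might look like this (the wording is simply the example prompt above):

```python
# A minimal sketch of a catch-all exception handler that uses natural,
# discourse-marker-rich wording instead of a terse system message.
from ask_sdk_core.dispatch_components import AbstractExceptionHandler


class CatchAllExceptionHandler(AbstractExceptionHandler):
    def can_handle(self, handler_input, exception):
        return True

    def handle(self, handler_input, exception):
        speech = ("I'm sorry, but due to technical reasons I can't access "
                  "your account right now. Please try again later.")
        return handler_input.response_builder.speak(speech).response
```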
Remember that information processing for voice is different from information processing for visuals. In voice, you should put the focal information at the very end of the prompt, as this reduces the cognitive and memory load on the user. When you are designing with APL, keep in mind that with visual interfaces you should put the focal information first, at the very top of the screen.
Something you should avoid is offering a premium version of the skill as soon as the user has opened it for the first time. This will confuse, and could annoy, customers. Ideally, you could suggest premium features at the end of the first session or at the beginning of subsequent sessions. You can use persistent attributes to keep track of first-time and returning users and configure your code accordingly.
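As a sketch of the tracking part, the ASK SDK’s persistent attributes (backed by a persistence adapter such as S3 or DynamoDB, which you must configure on the skill builder) can count visits, so the launch handler only mentions premium features to returning users. The ‘visits’ attribute and the prompts below are illustrative assumptions only.

```python
# A minimal sketch: greet first-time users without a premium pitch, and only
# mention premium features to returning users. Assumes the skill builder was
# created with a persistence adapter.
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_request_type


class LaunchRequestHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_request_type("LaunchRequest")(handler_input)

    def handle(self, handler_input):
        attrs_manager = handler_input.attributes_manager
        persistent = attrs_manager.persistent_attributes
        visits = persistent.get("visits", 0)

        if visits == 0:
            speech = ("Welcome to Happy Groceries. "
                      "What would you like to add to your basket?")
        else:
            speech = ("Welcome back to Happy Groceries. By the way, you can "
                      "ask about our premium features at any time. What would "
                      "you like to add to your basket?")

        # Remember that this user has now visited the skill.
        persistent["visits"] = visits + 1
        attrs_manager.persistent_attributes = persistent
        attrs_manager.save_persistent_attributes()

        return handler_input.response_builder.speak(speech).ask(speech).response
```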