How Alexa Conversations Works


Alexa Conversations uses an AI-driven dialog management model to enable your skill to respond to a wide range of phrases and unexpected conversational flows to achieve a goal. The Alexa Conversations dialog management model includes the following three stages:

  1. Authoring – You provide annotated dialogs that represent the different conversational experiences your skill can support. For details on authoring, which involves creating and annotating dialogs, see Write Dialogs for Alexa Conversations and Best Practices for Authoring Dialogs in Alexa Conversations.
  2. Building – Alexa Conversations uses a dialog simulator to expand annotated dialogs into dialog variants that train the dialog management model. Dialog variants are natural ways in which a dialog can occur.
  3. Runtime – Alexa Conversations evaluates the trained dialog management model to process incoming events and predict actions. For details on the science behind the Alexa Conversations modeling architecture, see Science innovations power Alexa Conversations dialogue management.

This page focuses on the dialog simulator and what happens during runtime. We assume you're familiar with basic Alexa Conversations concepts, such as dialog acts, described in About Alexa Conversations.

Dialog simulator

The dialog simulator generates training data by generalizing your annotated dialogs to cover various ways a user might interact with your skill. For example, a user might say variations of utterances to invoke specific functionality, provide requested information out of order, or change previously provided information.

The dialog simulator generates the training data by expanding your annotated dialogs — including slot types, API definitions, utterance sets, and responses — into tens of thousands of dialog variants, phrasing variations, and uncommon alternatives to create a much wider range of possible dialog paths. This expansion improves the robustness of the dialog management model and enables you to focus on the functionality of your skill instead of on identifying and coding every possible way users might engage with your skill.

(Image: Alexa Conversations dialog expansion.)

The following sections describe the dialog expansion methods:

Utterance variations

When you configure your Alexa Conversations skill, you create utterance sets to group different ways your user might respond to Alexa. For each utterance set, you provide a list of sample utterances. The dialog simulator uses these sample utterances to generate dialog variants.

Example

You provide a dialog to find a movie. Your utterance set contains both "Who directed {movie}?" and "Who's the director of {movie}?" In a variant, the dialog simulator replaces the user utterance with another sample utterance from the utterance set.

Dialog you provide:

User: Who directed Inception?

(Invoke API FindMovie.)
Alexa: Christopher Nolan.

Example variant from the dialog simulator:

User: Who's the director of Inception?

(Invoke API FindMovie.)
Alexa: Christopher Nolan.
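
Conceptually, an utterance set is a named list of sample phrasings that the simulator can substitute for one another. The following Python sketch is purely illustrative; the names (find_movie_utterances, FindMovie) and the data layout are assumptions for this example, not the actual Alexa Conversations build format.

    import random

    # Hypothetical utterance set: alternative phrasings that map to the same API.
    find_movie_utterances = {
        "api": "FindMovie",
        "samples": [
            "Who directed {movie}?",
            "Who's the director of {movie}?",
        ],
    }

    def vary_user_turn(utterance_set, slots):
        """Pick any sample phrasing and fill in the slot values (illustration only)."""
        template = random.choice(utterance_set["samples"])
        return template.format(**slots)

    print(vary_user_turn(find_movie_utterances, {"movie": "Inception"}))
    # Possible output: "Who's the director of Inception?"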

Slot value variations

You provide slots by selecting built-in slot types, extending built-in slot types with additional slot values, or creating custom slot types with values. The dialog simulator randomly samples these slot values to generate dialog variants.

Example

You provide a dialog to recommend a movie. In the variant, the dialog simulator replaces the slot values "crime" and "Quentin Tarantino" with "comedy" and "Guy Ritchie".

Dialog you provide:

User: I'd like to watch a crime movie by Quentin Tarantino.

(Invoke API FindMovie.)
Alexa: How about Reservoir Dogs?

Example variant from the dialog simulator:

User: I'd like to watch a comedy movie by Guy Ritchie.

(Invoke API FindMovie.)
Alexa: How about Snatch?
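
The sampling idea can be sketched as follows. The slot-type catalogs and value lists here are illustrative assumptions, not real built-in or custom type definitions.

    import random

    # Illustrative slot-value catalogs for two hypothetical slot types.
    slot_values = {
        "Genre": ["crime", "comedy", "drama"],
        "Director": ["Quentin Tarantino", "Guy Ritchie", "Greta Gerwig"],
    }

    def vary_slots(template):
        """Replace each slot in the template with a randomly sampled value."""
        return template.format(
            genre=random.choice(slot_values["Genre"]),
            director=random.choice(slot_values["Director"]),
        )

    print(vary_slots("I'd like to watch a {genre} movie by {director}."))
    # Possible output: "I'd like to watch a comedy movie by Guy Ritchie."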

Requesting and informing missing API arguments

You must provide annotated dialogs that demonstrate requesting and informing all arguments required to invoke an API. In doing so, you must create a response with the Request Args dialog act for each individual argument. For example, for a weather API that requires the city and date, you must create a response such as "What city?" and a response such as "What date?"

The dialog simulator automatically expands the dialogs to create variants that request missing API arguments. These dialog variants cover cases where the user doesn't provide all the requested slots in a single turn (underfilling) or gives more information than requested (overfilling).

Example

You provide a dialog to request a director and genre to recommend a movie. Your dialog includes responses for requesting the director and the genre. The dialog simulator generates dialog variants that request missing API arguments for RecommendMovie.

Dialog you provide:

User: Can you recommend a movie?
Alexa: What director and what genre of movie?
User: Tarantino and crime.

(Invoke API RecommendMovie.)
Alexa: I think you might like Reservoir Dogs.

Example variant from the dialog simulator:

User: Can you recommend a movie?
Alexa: What director and what genre of movie?
User: Tarantino. (The user is underfilling by not providing all the requested information.)
Alexa: What genre of movie?
User: Crime.

(Invoke API RecommendMovie.)
Alexa: I think you might like Reservoir Dogs.

Another variant might be as follows.

User: Can you recommend a movie?
Alexa: Which director do you prefer?
User: Tarantino and crime. (The user is overfilling by providing more information than requested.)

(Invoke API RecommendMovie.)
Alexa: I think you might like Reservoir Dogs.
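
The underfilling and overfilling variants come down to tracking which required arguments are still missing after each user turn and requesting only those. A rough sketch of that bookkeeping, with hypothetical argument and prompt names:

    # Required arguments for a hypothetical RecommendMovie API and the prompts
    # you would author with the Request Args dialog act.
    REQUIRED_ARGS = ["director", "genre"]
    PROMPTS = {"director": "Which director do you prefer?", "genre": "What genre of movie?"}

    def next_step(filled):
        """Return the next prompt, or None when RecommendMovie can be invoked."""
        missing = [arg for arg in REQUIRED_ARGS if arg not in filled]
        if missing:
            return PROMPTS[missing[0]]
        return None  # all arguments filled; invoke the API

    # Underfilling: the user answered only with the director.
    print(next_step({"director": "Tarantino"}))                    # -> "What genre of movie?"
    # Overfilling: the user supplied both values even though only one was requested.
    print(next_step({"director": "Tarantino", "genre": "crime"}))  # -> None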

User correction of slot values

You don't need to provide annotated dialogs that demonstrate users correcting slot values. The dialog simulator automatically expands your dialogs to create these variants.

Example

Without you providing a dialog with user corrections, the dialog simulator generates dialog variations such as the following.

Dialog you provide:

(None that provide user corrections.)

Example variant from the dialog simulator:

User: What's a good comedy movie?
Alexa: Directed by who?
User: Sorry, I meant a crime movie. (The user is correcting the genre slot value.)
Alexa: OK, please give me the director.
User: Guy Ritchie.

(Invoke API RecommendMovie.)
Alexa: I think you might like Snatch.
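
At the level of dialog state, a correction is simply the newer value replacing the older one for the same slot. A toy illustration (the slot names are hypothetical):

    # Toy dialog context: later mentions of a slot overwrite earlier ones.
    context = {}
    context["genre"] = "comedy"        # "What's a good comedy movie?"
    context["genre"] = "crime"         # "Sorry, I meant a crime movie."
    context["director"] = "Guy Ritchie"
    print(context)                     # -> {'genre': 'crime', 'director': 'Guy Ritchie'}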

Confirming APIs

You provide dialogs with Confirm API / Affirm dialog acts to indicate that Alexa must confirm the API arguments with the user before invoking the API. In each dialog that invokes the API, the turns that confirm the API arguments must come before the API Success / API Failure turn. You aren't required to provide annotated dialogs with the Confirm API / Deny dialog acts. The dialog simulator automatically generates dialog variants that render the built-in reqmore response if the user denies the confirmation. However, you can provide annotated dialogs with Confirm API / Deny if you want to support alternative dialog flows.

Example

You provide a dialog to recommend and then purchase a movie, and a dialog confirming that the user wants to purchase the movie. The dialog simulator generates dialog variations that confirm the ReserveMovie API before invoking it.

Dialog you provide:

User: I'd like to watch a movie from Guy Ritchie.

(Invoke API RecommendMovie.)
Alexa: How about Snatch?
User: Please purchase the movie Snatch.

(Invoke API ReserveMovie.)
Alexa: The movie Snatch is reserved for you.

You provide another dialog that includes confirmation of the ReserveMovie API, such as the following.

User: I'd like to watch a movie from Guy Ritchie.

(Invoke API RecommendMovie.)
Alexa: How about Snatch?
User: OK.

Alexa: Just to confirm, you want to purchase the movie Snatch?
User: Yes.

(Invoke API ReserveMovie.)
Alexa: The movie Snatch is reserved for you.

Example variant from the dialog simulator:

User: I'd like to watch a movie from Guy Ritchie.

(Invoke API RecommendMovie.)
Alexa: How about Snatch?
User: Please purchase the movie Snatch.
Alexa: Just to confirm, you want to purchase the movie Snatch?
User: No.
Alexa: What else would you like to do? (The dialog simulator is rendering the built-in reqmore response to keep the conversation going.)

Confirming arguments

You provide dialogs with Confirm Args / Affirm dialog acts to indicate that Alexa must confirm the arguments after requesting them. In each dialog that requests the argument, the Request Args turn must be followed by a turn that confirms the argument. If the dialog simulator attempts to generate a dialog variant in which both Confirm API and Confirm Args are applicable before invoking an API, it prioritizes Confirm API for that turn and confirms all required arguments with the user. You aren't required to provide annotated dialogs with the Confirm Args / Deny dialog acts. The dialog simulator automatically generates dialog variants that re-request the argument if the user denies the confirmation. However, you can provide annotated dialogs with Confirm Args / Deny to provide alternative dialog flows.

Example

You provide a dialog to request a director and genre to recommend a movie, and a dialog that confirms the director. The dialog simulator generates dialog variations that confirm the director before proceeding.

Dialog you provide:

User: Can you recommend a movie?
Alexa: What director?
User: Tarantino.
Alexa: What genre?

You provide another dialog that includes the Confirm Args dialog act, such as the following.

User: Can you recommend a movie?
Alexa: What director?
User: Tarantino.
Alexa: You said Tarantino, right?
User: Yes.
Alexa: What genre?

Example variant from the dialog simulator:

User: Can you recommend a movie?
Alexa: What director?
User: Tarantino.
Alexa: You said Tarantino, right?
User: No.
Alexa: Which director? (Alexa is re-requesting the denied argument.)
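
Both confirmation patterns branch the same way: on Affirm the dialog proceeds, and on Deny the simulator falls back to re-requesting the argument (for Confirm Args) or to the built-in reqmore response (for Confirm API). A schematic sketch, with hypothetical prompt strings:

    def after_confirmation(kind, user_affirmed, re_request_prompt=None):
        """Schematic branch for Confirm Args / Confirm API outcomes (illustration only)."""
        if user_affirmed:
            return "proceed"                      # continue toward invoking the API
        if kind == "ConfirmArgs":
            return re_request_prompt              # ask for the denied argument again
        return "What else would you like to do?"  # reqmore-style fallback for Confirm API

    print(after_confirmation("ConfirmArgs", False, "Which director?"))  # -> "Which director?"
    print(after_confirmation("ConfirmApi", False))   # -> "What else would you like to do?"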

Invoking multiple APIs in a single turn

You can provide annotated dialogs with a single Alexa turn with API Success and API Failure dialog acts that invoke multiple APIs. When creating variants, the dialog simulator treats the sequence of these APIs as deterministic — that is, it doesn't change the order.

Example

You provide a dialog that invokes two APIs in a single turn to first get user preferences with the user's default city and then get the weather for that city before rendering the result.

Dialog you provide:

User: What's the weather today?

Alexa: Today in Seattle, it's a high of 70 degrees with a low of 60 degrees.

Example variant from the dialog simulator:

User: What's the weather today?

The dialog simulator invokes the first API to get user preferences: GetUserPrefs() -> userPrefs0

The dialog simulator invokes the second API to get the weather: GetWeather(userPrefs0) -> weatherResult0

Alexa: Today in Seattle, it's a high of 70 degrees with a low of 60 degrees.
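
On the skill side, this pattern usually means the result of the first API becomes an argument to the second. A minimal sketch with hypothetical GetUserPrefs and GetWeather implementations; in a real skill these would be handlers in your skill endpoint:

    def get_user_prefs():
        """Hypothetical first API: look up the user's default city."""
        return {"defaultCity": "Seattle"}

    def get_weather(user_prefs):
        """Hypothetical second API: fetch the forecast for the preferred city."""
        return {"city": user_prefs["defaultCity"], "highTemp": 70, "lowTemp": 60}

    # The simulator keeps this order fixed: GetUserPrefs() -> userPrefs0,
    # then GetWeather(userPrefs0) -> weatherResult0.
    user_prefs0 = get_user_prefs()
    weather_result0 = get_weather(user_prefs0)
    print(f"Today in {weather_result0['city']}, it's a high of {weather_result0['highTemp']} "
          f"degrees with a low of {weather_result0['lowTemp']} degrees.")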

Proactive offers

You can provide annotated dialogs that proactively offer a dialog flow to invoke a new API after the user invokes the original API and receives the result. You can offer a new API by extending a dialog that ends with an API Success / API Failure turn with an Offer Next API turn, which can include requesting arguments on the same turn and/or passing in arguments from a prior API invocation.

You aren't required to complete the dialog after the Offer Next API turn. The dialog simulator automatically completes the dialog when creating dialog variants as long as there is another dialog that invokes the new API. However, you can complete the dialog after the Offer Next API turn to provide alternative dialog flows. Proactive offers are non-deterministic; dialog variants can include proactively offering different APIs and not proactively offering any APIs.

You aren't required to provide annotated dialogs with the Offer Next API / Deny dialog acts. The dialog simulator automatically generates dialog variants that render the built-in reqmore response if the user denies the offer. However, you can provide annotated dialogs with Offer Next API / Deny to provide alternative dialog flows.

Example

You provide a dialog to reserve a table at a restaurant. You extend this dialog with an Offer Next API for an API to book an Uber.

Dialog you provide:

…(previous lines of dialog)…
User: OK, let's reserve that table in the restaurant.

(Invoke API ReserveTable.)
Alexa: I have reserved your table.

You extend this dialog with an Offer Next API such as the following.

…(previous lines of dialog)…
User: OK, let's reserve that table in the restaurant.

(Invoke API ReserveTable.)
Alexa: I have reserved your table. Would you like to book an Uber to the restaurant?

Example variant from the dialog simulator:

…(previous lines of dialog)…
User: OK, let's reserve that table in the restaurant.

(Invoke API ReserveTable.)
Alexa: I have reserved your table.
Alexa: Would you like to book an Uber to the restaurant? (The dialog simulator is proactively offering a new API.)
User: Yes.

The dialog simulator also generates dialog variations for Offer Next API / Deny such as the following.

…(previous lines of dialog)…
User: OK, let's reserve that table in the restaurant.

(Invoke API ReserveTable.)
Alexa: I have reserved your table.
Alexa: Would you like to book an Uber to the restaurant?
User: No.
Alexa: What else would you like to do? (The dialog simulator is rendering the built-in reqmore response to keep the conversation going.)

Contextual carryover

You don't need to provide annotated dialogs that demonstrate contextual carryover. The dialog simulator supports this feature by introducing turns with pronouns (for example, "Please purchase it") when generating dialog variants from annotated dialogs. The carryover occurs at a later runtime stage, argument filling, when Alexa Conversations considers all slots that the user and Alexa mention across the entire dialog for filling API arguments.

Example

You provide a dialog to purchase a movie. The dialog simulator introduces a turn with a pronoun.

Dialog you provide:

User: I'd like to watch a movie from Guy Ritchie.

(Invoke API RecommendMovie.)
Alexa: How about Snatch?
User: Please help me purchase Snatch.

(Invoke API ReserveMovie.)
Alexa: The movie Snatch is reserved for you.

Example variant from the dialog simulator:

User: I'd like to watch a movie from Guy Ritchie.

(Invoke API RecommendMovie.)
Alexa: How about Snatch?
User: Please purchase it.

(Invoke API ReserveMovie.)
Alexa: The movie Snatch is reserved for you.
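
At runtime the pronoun itself isn't matched by string; argument filling looks for an entity of the required type anywhere in the dialog context. A toy illustration of filling a MovieTitle argument when the user says "Please purchase it" (the entity records and types are assumptions for this sketch):

    # Toy context memory: entities gathered over the dialog so far, newest last.
    context_entities = [
        {"type": "DirectorName", "value": "Guy Ritchie"},  # from the user's first turn
        {"type": "MovieTitle", "value": "Snatch"},         # returned by RecommendMovie
    ]

    def fill_argument(required_type):
        """Pick the most recent entity whose type matches the argument (illustration only)."""
        for entity in reversed(context_entities):
            if entity["type"] == required_type:
                return entity["value"]
        return None

    # "Please purchase it" -> the ReserveMovie title argument is filled from context.
    print(fill_argument("MovieTitle"))  # -> "Snatch"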

Runtime

The Alexa Conversations runtime uses several components, including an inference engine, to evaluate the trained dialog management model. The inference engine receives events from the outside world (for example, the user says "Find showtimes for the Star Wars movie"), maintains conversation history for each session, manages dialog context, maintains the dialog state, and orchestrates information across different components within the runtime.

The runtime processes events and predicts the actions that should take place. For example, the action might be to respond to the user, call an AWS Lambda function, perform a calculation, and so on. The runtime either runs the predicted actions implicitly or transforms them to invoke an API.

The following diagram is a conceptual model of the Alexa Conversations inference engine, which hosts a machine-learning-trained dialog management model, processes events, and produces actions.

(Image: Alexa Conversations dialog management model.)

The inference engine uses context memory to manage dialog context and track the state, as well as the following three domain-specific models trained by machine learning:

  • Named entity recognition – This model tags slots in the user utterance.
  • Action prediction – This model predicts the action that should occur next.
  • Argument filling – This model fills action arguments with entities from the context. An entity is a slot the user mentioned or a return value from a previous API. The inference engine performs entity resolution after argument filling is complete.
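
Putting the three models together, one turn of the runtime can be pictured as the following loop. This is a conceptual sketch only; the model objects passed in stand in for the trained components and don't reflect the real implementation.

    def run_turn(utterance, context, models):
        """One turn of the inference engine, conceptually (not the real implementation)."""
        # 1. Named entity recognition tags slots in the utterance and stores them.
        context["entities"].extend(models.ner(utterance))
        while True:
            # 2. Action prediction chooses the next action from the conversation context.
            action_type, action_name = models.predict_action(context)
            if action_type == "System":
                break  # all tasks for this turn have run; wait for the next utterance
            # 3. Argument filling picks entities from context memory for the action.
            args = models.fill_arguments(action_name, context)
            if action_type == "API":
                context["entities"].append(models.invoke_api(action_name, args))
            else:  # "Response"
                models.render_response(action_name, args)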

Context memory

Context memory is short-term memory that tracks user utterances and model results, such as entities (that is, slots that the user mentioned or return values from previous APIs), predicted actions, and Alexa responses. Context memory maintains dialog context for each session and is continuously updated throughout the session.

Named entity recognition

Named entity recognition is the first step after the inference engine receives a verbal utterance (for example, the user says "Find showtimes for the Star Wars movie"). Named entity recognition segments user utterances into words and phrases that correspond to slot types.

Named entity recognition interprets these phrases as slots and stores the slots in context memory. Later, the dialog management model uses these slots to fulfill actions such as invoking APIs or rendering Alexa responses. The debugging capability of the Alexa simulator, which is in the Test tab of the developer console, shows the phrases and slot types of named entity recognition. For details, see Debug an Alexa Conversations Skill Model.

Example

You design your skill to book movie tickets. Your skill has an API definition for FindShowtimes, which has an argument, title, of type MovieTitle.

If a user says "Find showtimes for the Star Wars movie," named entity recognition recognizes "Star Wars", extracts "Star Wars" as a phrase, and labels this phrase as a MovieTitle slot type as follows.

{Find|Other} {showtimes|Other} {for|Other} {the|Other} {Star Wars|MovieTitle} {movie|Other}

Other means that named entity recognition didn't recognize a specific slot type.

At a later runtime stage, the argument filling model fills the MovieTitle argument for API definition FindShowtimes with "Star Wars", which in this case is a user-mentioned slot.
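
In code form, the output of this step is a sequence of phrase/label pairs; everything labeled Other is ignored for slot filling. An illustrative representation of the tagging above:

    # Illustrative NER output for "Find showtimes for the Star Wars movie".
    tagged = [
        ("Find", "Other"), ("showtimes", "Other"), ("for", "Other"),
        ("the", "Other"), ("Star Wars", "MovieTitle"), ("movie", "Other"),
    ]

    # Only the recognized slots are stored in context memory for later argument filling.
    slots = {label: phrase for phrase, label in tagged if label != "Other"}
    print(slots)  # -> {'MovieTitle': 'Star Wars'}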

Action prediction

Next, action prediction processes the current conversation context and predicts the next action type and action name to run. The three action types are as follows:

  • API – Invokes an API in the skill endpoint (for example, perform a ticket purchase transaction).
  • Response – Renders a response to the user (for example, inform of a transaction result or request more information).
  • System – Waits for the next user utterance. This action type is an internal/system action to indicate all tasks have run.

The action name can be an API definition name or a response name. The inference engine might run action prediction multiple times in a single turn until it predicts the System action type. The debugging capability of the Alexa simulator shows the API and response action types. For details, see Debug an Alexa Conversations Skill Model.

Example

You design your skill to book movie tickets. Your skill has an API definition for FindShowtimes, which has an argument, title, of type MovieTitle. A user says "Find showtimes for the Star Wars movie" in the following dialog.

User: Find showtimes for the Star Wars movie.

(Invoke API FindShowtimes.)
Alexa: It is playing at 10:00pm at Downtown Seattle AMC.

Action prediction runs three times on the user utterance. The first run predicts the API action type with name FindShowtimes and invokes the API. The second run predicts the Response action type with name InformMovieShowtimes and renders the response. The third run predicts the System action type, which terminates action prediction, ends the current turn, and waits for the next user utterance.

Argument filling

When action prediction predicts an API action or a Response action, the next step is to determine how to fill the arguments with entities. An entity is a user-mentioned slot or a return value from a previous API. Argument filling uses context memory data to access all available entities. Argument filling supports contextual carryover because it considers slots mentioned by the user and Alexa across the entire dialog. Argument filling then selects the most likely entities of the matching types to fill the arguments, which the inference engine uses when invoking actions.

Example

You design your skill to book movie tickets. Your skill has an API definition for FindShowtimes, which has an argument, title, of type MovieTitle and returns slot type ShowTimeInfo, which has properties time (slot type AMAZON.Time) and theaterName (slot type TheaterName). The user says "Find showtimes for the Star Wars movie" in the following dialog.

User: Find showtimes for the Star Wars movie.

(Invoke API FindShowtimes.)
Alexa: It is playing at 10:00pm at Downtown Seattle AMC.

The inference engine takes the following steps:

  1. Named entity recognition labels "Star Wars" as a MovieTitle slot type and stores the "Star Wars" slot in context memory as an entity title.
  2. Action prediction, on its first run, predicts the API action with name FindShowtimes.
  3. Argument filling uses the title entity, "Star Wars", to fill the title argument of the FindShowtimes API and then invokes the API.
  4. Action prediction, on its second run, predicts the Response action with name InformMovieShowtimes.
  5. Argument filling uses the time entity (a property from the API return) to fill the time argument of the InformMovieShowtimes response and uses the theaterName entity (another property from the API return) to fill the theaterName argument of the InformMovieShowtimes response and then renders the response.
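
The same flow in a compact sketch: the MovieTitle entity fills the API argument, and the properties of the API's return value become entities that fill the response arguments. The names come from the example above; the function body is a placeholder rather than a real implementation.

    def find_showtimes(title):
        """Placeholder for the FindShowtimes API; returns a ShowTimeInfo-like result."""
        return {"time": "10:00pm", "theaterName": "Downtown Seattle AMC"}

    context = {"MovieTitle": "Star Wars"}  # stored by named entity recognition

    # Action prediction run 1: API action FindShowtimes; argument filling
    # supplies the title entity from context.
    showtime_info = find_showtimes(title=context["MovieTitle"])
    context.update(showtime_info)          # return-value properties become entities

    # Action prediction run 2: Response action InformMovieShowtimes; argument
    # filling supplies the time and theaterName entities from the API return.
    print(f"It is playing at {context['time']} at {context['theaterName']}.")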

Entity resolution

If the action prediction instance predicts the API action type, the inference engine performs entity resolution after argument filling is complete. For each entity to fill an API argument, entity resolution searches against build-time entities (or runtime entities in the case of dynamic entities) and resolves phrases into canonical values if there is a match. The inference engine inserts the entity resolution result as a separate payload in the API-invoking request to the skill. For details, see Receiving requests.

For details on entity resolution, see Define Synonyms and IDs for Slot Type Values (Entity Resolution). For details on dynamic entities, see Use Dynamic Entities for Customized Interactions.
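
Conceptually, entity resolution maps the raw phrase to a canonical value (and ID) from your slot type definition when a match exists. A toy illustration with made-up catalog entries; the actual payload shape in the API-invoking request is described in Receiving requests.

    # Toy build-time catalog for a MovieTitle slot type: canonical values with synonyms.
    movie_title_catalog = [
        {"id": "SW_EP4", "value": "Star Wars: A New Hope", "synonyms": ["Star Wars", "A New Hope"]},
    ]

    def resolve(phrase):
        """Return the canonical value and ID for a phrase, if any entry matches."""
        for entry in movie_title_catalog:
            if phrase == entry["value"] or phrase in entry["synonyms"]:
                return {"id": entry["id"], "value": entry["value"]}
        return None  # no match; the skill receives only the raw phrase

    print(resolve("Star Wars"))  # -> {'id': 'SW_EP4', 'value': 'Star Wars: A New Hope'}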

