Batch Test Your Natural Language Understanding (NLU) Model

Use the NLU Evaluation tool in the developer console to batch test the natural language understanding (NLU) model for your Alexa skill.

To evaluate your model, you define a set of utterances mapped to the intents and slots you expect to be sent to your skill. This set of utterances is called an annotation set. Then you start an NLU Evaluation with the annotation set to determine how well your skill's model performs against your expectations. The tool can help you measure the accuracy of your NLU model, and run regression testing to ensure that changes to your model don't degrade the customer experience.

You can use the NLU Evaluation tool with skill models for all locales. You can also access these tools with the Skill Management API (SMAPI) or the ASK Command Line Interface (ASK CLI). For details, see NLU Evaluation Tool API.

Prerequisites

You can use the NLU Evaluation tool once you have defined an interaction model and successfully built it.

The tool doesn't call your endpoint, so you don't need to develop the service for your skill to test your model.

Annotations and annotation sets

To evaluate your model with the NLU Evaluation tool, you create an annotation set. This is a set of utterances mapped to the intents and slots you expect to be sent to your skill for each one. Each utterance with its expected intent and slots is called an annotation.

Each annotation has the following fields:

Utterance
The utterance to test.
  • Don't include the wake word or invocation name. Provide just the utterance as it is used after the user invokes the skill.
  • You can use either written form or spoken form for the utterance. For example, you can use numerals ("5") or write out numbers ("five"). For more examples, see the rules for custom slot type values.
Expected Intent
The intent that the utterance should trigger.
Expected Slot Names
(Optional) The name of the slot that the utterance should fill. You can provide more than one expected slot for an utterance.
Expected Slot Values
Required for each Expected Slot Name. Provides the value you expect the utterance to fill for the specified slot. Click in the Add slot value field to enter the value.
For a multiple-value slot, provide each of the values you expect. Click in the Add slot value field and enter each value separately.
Enter the values in the same order they are in the utterance. For example, for the utterance "I want a pizza with pepperoni, mushrooms, and olives," make sure that you add the values "pepperoni", "mushrooms", and "olives" in that same order. For details about multiple-value slots, see Collect Multiple Values in a Slot.
Reference Timestamp (UTC)
(Optional) A time and date in UTC format to use as the basis for relative date and time values. Use this when the utterance tests the AMAZON.DATE or AMAZON.TIME slots with words that represent relative dates and times, such as "today," "tomorrow," and "now." Provide the full date and time, including milliseconds, for example: 2018-10-25T23:50:02.135Z. For more details, see Create an annotation with relative dates or times.
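Taken together, an annotation can be thought of as a small record. The following Python sketch models one annotation as a dictionary; the field names here are illustrative assumptions, not the exact schema used by the NLU Annotation Evaluation API.

```python
import json

# A minimal sketch of one annotation, modeled as a dictionary.
# Field names are illustrative assumptions; see the NLU Annotation
# Evaluation API reference for the exact schema.
annotation = {
    "utterance": "test the date slot with tomorrow",   # no wake word or invocation name
    "expected_intent": "TestDateSlotIntent",           # intent the utterance should trigger
    "expected_slots": [                                # optional; order matters for multi-value slots
        {"name": "DateSlotExample", "value": "2019-08-22"},
    ],
    "reference_timestamp": "2019-08-21T00:00:00.000Z",  # optional; basis for relative dates/times
}

print(json.dumps(annotation, indent=2))
```

For a multiple-value slot, the `expected_slots` list would carry one entry per value, in the same order the values appear in the utterance.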

Create automated annotation sets

You can manage annotation sets in the developer console from the Build > Custom > Annotation Sets page. Take the following steps to create an annotation set directly in the developer console.

To create an automated annotation set in the developer console

  1. With your Amazon developer credentials, log in to the Alexa developer console.
  2. From the developer console, navigate to the Build tab.
  3. Under the Custom left nav tab, click Annotation Sets to display the NLU Evaluation page.
  4. Under Automated Test Sets, click the Generate Test Set button.
  5. Select the data source for your test set:

    • Interaction Model – Use sample utterances in your skill's interaction model to create the test set.
    • Frequent Utterances – Use utterances frequently spoken to your skill to create the test set.
    • Utterances Recommendation Engine – Generate grammatical variations of sample utterances to create the test set.
  6. Click Generate Test Set and wait for your test set to generate.

  7. Review the values for Expected Intent and Expected Slot in your generated test sets.

    The NLU tool determines Expected Intent based on past usage of your skill and the sample utterances within each intent. As you review your test sets, add utterances to or remove them from the test sets as needed.

  8. In the upper-right corner, click Evaluate Model to run the evaluation.

  9. Review and troubleshoot issues with your skill models. The following list describes the expected pass rate and recommendations for improvements for each data source:

    • Interaction Model – An Interaction Model test set should have a pass rate greater than 95%. One common cause of errors is conflicting utterances across similar intents.
    • Frequent Utterance – A Frequent Utterance test set should have a pass rate over 80%. Because this test set contains the utterances that your skill's actual users are saying to your skill, you can use it to review how your in-development model responds (or will respond when pushed to production) to live customer utterances.
    • Utterances Recommendation Engine – The Utterances Recommendation Engine test set should have a medium pass rate. The utterances in this set are variations of your sample utterances and can anticipate what a user might say to your skill and how your skill should respond. Review the utterances in this test set, and remove utterances that aren't relevant to your skill. After updating your test set, review all utterances that map to AMAZON.FallbackIntent, if enabled, to find possible unsupported use cases.
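As a rough sketch, the pass-rate guidance above can be expressed as a simple check. The thresholds come from the list above; the shape of `results` is a simplified assumption, not the tool's actual export format.

```python
# Illustrative pass-rate check against the guidance above.
# The per-data-source thresholds come from this document; the
# shape of `results` is a simplified assumption.
THRESHOLDS = {
    "interaction_model": 0.95,    # should exceed 95%
    "frequent_utterances": 0.80,  # should exceed 80%
}

def pass_rate(results):
    """Fraction of annotations whose actual intent and slots matched expectations."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["passed"])
    return passed / len(results)

results = [
    {"utterance": "order a pizza", "passed": True},
    {"utterance": "cancel my order", "passed": True},
    {"utterance": "repeat that", "passed": False},
]

rate = pass_rate(results)
print(f"pass rate: {rate:.0%}")
print("meets interaction-model bar:", rate > THRESHOLDS["interaction_model"])
```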

Create annotation sets manually

As an alternative, you can create and edit annotation sets manually. Take the following steps to do so directly in the developer console.

To create an annotation set manually in the developer console

  1. Open your skill in the developer console.
  2. Navigate to Build > Custom > Annotation Sets.
  3. Under User Defined Test Sets, click + Annotation Set.
  4. At the top of the page, enter a name for the annotation set.
  5. Create the annotations.

Edit an annotation set

To edit an annotation set

  1. Open your skill in the developer console.
  2. Navigate to Build > Custom > Annotation Sets.
  3. Find the annotation set to edit, and then click its name or the Edit link.

Create annotations in the developer console

To create annotations in the developer console

For details about the fields for an annotation, see Annotations and annotation sets.

  1. Create or edit an annotation set.
  2. Enter the utterance to test, and then click the plus sign (+) or press Enter.
  3. In the table of utterances, click in the Expected Intent field, and then select the intent the utterance should trigger.
  4. If the utterance should also fill a slot, under Expected Slots, click the plus sign (+), and then enter the slot name and slot value.
  5. If needed, click in the Reference Timestamp field and select the date and time from the date picker.

    This action fills in the selected timestamp in UTC format. See Create an annotation with relative dates or times.

  6. After you have added all the new annotations, click Save Annotation Set.

Upload annotations from a data file

When you upload annotations from a data file, the upload replaces any existing annotations in the annotation set.

To upload annotations from a data file

  1. Create either a JSON or a CSV file with your annotations. Use the same format used for the NLU Annotation Evaluation API.
  2. Create or edit an annotation set.
  3. Click Bulk Edit, select the JSON or CSV file to upload, and then click Submit.
  4. Click Save Annotation Set.
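If you generate the upload file programmatically, a sketch like the following can help. The column names here are illustrative assumptions; match them to the format documented for the NLU Annotation Evaluation API before uploading.

```python
import csv
import io

# Sketch: generate a CSV of annotations for bulk upload.
# The column names are illustrative assumptions; verify them against
# the NLU Annotation Evaluation API documentation before uploading.
rows = [
    {"utterance": "test the date slot with tomorrow",
     "expectedIntent": "TestDateSlotIntent",
     "expectedSlots": "DateSlotExample=2019-08-22",
     "referenceTimestamp": "2019-08-21T00:00:00.000Z"},
    {"utterance": "test the date slot with next Monday",
     "expectedIntent": "TestDateSlotIntent",
     "expectedSlots": "DateSlotExample=2019-08-26",
     "referenceTimestamp": "2019-08-21T00:00:00.000Z"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["utterance", "expectedIntent",
                                         "expectedSlots", "referenceTimestamp"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Remember that uploading replaces any existing annotations in the set, so the file should contain the complete set you want to test.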

Create an annotation with relative dates or times

The built-in AMAZON.DATE and AMAZON.TIME slot types let users specify dates and times relative to the current date. For example, the utterance "today" normally resolves to the current date. The slot value therefore depends on the day you test the utterance.

To test these types of utterances with the NLU Evaluation tool, enter a specific date and time in the Reference Timestamp (UTC) field. Alexa then uses this value instead of the actual current date and time when calculating the date and time slot values.

For example, consider the annotations in the following table.

| Utterance | Expected intent | Expected slot names | Expected slot values |
| --- | --- | --- | --- |
| test the date slot with tomorrow | TestDateSlotIntent | DateSlotExample | 2019-08-22 |
| test the date slot with next Monday | TestDateSlotIntent | DateSlotExample | 2019-08-26 |

Without a Reference Timestamp, these utterances only pass if you run the evaluation on August 21, 2019. Set the Reference Timestamp for each of these to 2019-08-21T00:00:00.000Z. Then, regardless of the actual date and time, the NLU Evaluation tool resolves the slots as though it is midnight on August 21, 2019, so the specified Expected Slot Values match the actual results.

Select the date and time from the calendar picker. This adds the date/time in UTC format: YYYY-MM-DDThh:mm:ss.sTZD, for example: 1997-07-16T19:20:30.45Z.
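The effect of a fixed reference timestamp can be reproduced with ordinary date arithmetic. This sketch resolves "tomorrow" and "next Monday" from the reference date in the example above:

```python
from datetime import date, timedelta

# Resolve relative date words against a fixed reference date,
# mirroring how a Reference Timestamp pins AMAZON.DATE slot values.
reference = date(2019, 8, 21)  # from 2019-08-21T00:00:00.000Z

tomorrow = reference + timedelta(days=1)

# Days until the next Monday (weekday 0), always at least one day ahead.
days_ahead = (0 - reference.weekday()) % 7 or 7
next_monday = reference + timedelta(days=days_ahead)

print(tomorrow.isoformat())     # expected slot value for "tomorrow"
print(next_monday.isoformat())  # expected slot value for "next Monday"
```

Because the reference date is fixed, the resolved values stay stable no matter when the code runs, which is exactly why the annotations above pass regardless of the evaluation date.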

Start an evaluation

After you have at least one annotation set defined for your skill, you can start an evaluation. This evaluates the NLU model built from your skill's interaction model, using the specified annotation set.

For live skills, you can choose whether to run the evaluation against the development version or the live version.

You can run multiple evaluations at the same time.

To start an evaluation

  1. From any page in the Build > Custom > Interaction Model section, in the upper-right corner, click Evaluate Model.

    Evaluate Model is also available on the Annotation Sets page.

  2. Click the NLU Evaluation tab.
  3. From the Stage list, select Development or Live (if applicable).
  4. From the Annotation Source list, select one of your annotation sets.
  5. Click Run an Evaluation.

The evaluation starts and its current status is displayed in the NLU Evaluation Results table.
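If you automate evaluations through SMAPI rather than the console, the request reduces to a stage plus an annotation source. The endpoint path and payload field names below are assumptions sketched for illustration; confirm them against the NLU Evaluation Tool API reference before use.

```python
import json

# Hypothetical sketch of an NLU evaluation request for SMAPI.
# The path and payload field names are assumptions; verify them in
# the NLU Evaluation Tool API reference.
skill_id = "amzn1.ask.skill.EXAMPLE"  # placeholder skill ID
payload = {
    "stage": "development",                          # or "live" for live skills
    "locale": "en-US",                               # locale of the model under test
    "source": {"annotationId": "annotation-set-id"}, # placeholder annotation set ID
}
path = f"/v1/skills/{skill_id}/nluEvaluations"

print("POST", path)
print(json.dumps(payload))
```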

Review the results and update your model

You can review the results of an evaluation on the NLU Evaluation panel, and then closely examine the results for a specific evaluation. The developer console saves all past evaluations for later review.

To review a summary of NLU Evaluation results

  1. Click Evaluate Model, and then select the NLU Evaluation tab.
  2. At the bottom of the panel, review the in-progress and completed evaluations.

Each evaluation in the table displays the following information:

Evaluation ID
Unique ID for the evaluation. After an evaluation is complete, this ID becomes a link that you click to see the full report.
Status
Indicates whether the evaluation is Complete.
Results
Displays the results of the evaluation. Your evaluation has PASSED if all the tests within the annotation set returned the expected intent and slot values. Your evaluation has FAILED if any of the tests within the annotation set failed to return the expected intent and slot values.
Annotation Src
Unique ID for the annotation set used in the evaluation. Click this link to open the annotation set page.
Stage
The skill stage that was tested (Development or Live).
Start Time
The time you started the evaluation.

To get the results for a specific evaluation

  1. To see the results for a given evaluation, open the summary of results.
  2. To open the results, click the Evaluation ID link.
  3. View the results page to see which utterances failed the test.

    For each utterance that failed, the table shows the expected value and the actual value, highlighted in red.

  4. To download the report in JSON format, click Export JSON.
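Once exported, the JSON report is easy to post-process. The report structure below is a simplified assumption; inspect your actual Export JSON output and adjust the field names to match.

```python
import json

# Sketch: list failed utterances from an exported evaluation report.
# The report structure here is a simplified assumption; adjust the
# field names to match your actual Export JSON output.
report_json = """
{
  "results": [
    {"utterance": "order a pizza", "status": "PASSED"},
    {"utterance": "repeat that", "status": "FAILED",
     "expected": {"intent": "RepeatIntent"},
     "actual": {"intent": "AMAZON.FallbackIntent"}}
  ]
}
"""

report = json.loads(report_json)
failures = [r for r in report["results"] if r["status"] == "FAILED"]
for f in failures:
    print(f'{f["utterance"]} -> {f["actual"]["intent"]} '
          f'(expected {f["expected"]["intent"]})')
```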

Update your skill

Use the evaluation results to identify failing utterances. Add these to your interaction model as sample utterances and slot values, and then rebuild. Re-run the evaluation with the same annotation set to see if the changes improved the accuracy.
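One common fix is to add a failing utterance to the relevant intent's sample utterances in the interaction model JSON before rebuilding. The model fragment below is a minimal sketch with placeholder intent and utterance names:

```python
import json

# Sketch: add a failing utterance from an evaluation to an intent's
# sample utterances in the interaction model JSON, before rebuilding.
# Intent and utterance names here are placeholders.
model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "pizza helper",
            "intents": [
                {"name": "OrderPizzaIntent", "samples": ["order a pizza"]},
            ],
        }
    }
}

def add_sample(model, intent_name, utterance):
    """Append an utterance to the named intent's samples if not already present."""
    for intent in model["interactionModel"]["languageModel"]["intents"]:
        if intent["name"] == intent_name:
            if utterance not in intent["samples"]:
                intent["samples"].append(utterance)
            return True
    return False  # intent not found

add_sample(model, "OrderPizzaIntent", "get me a pizza")
print(json.dumps(model["interactionModel"]["languageModel"]["intents"], indent=2))
```

After saving and rebuilding the model, re-running the same annotation set shows whether the change resolved the failure.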

Use the NLU Evaluation tool for regression testing

The NLU Evaluation tool is especially useful for regression testing. After you have an annotation set that passes all the tests, you can re-run the evaluation whenever you make changes to your interaction model to ensure that your changes did not degrade your skill's accuracy.

If you do encounter issues, you can revert your skill to an earlier version of your interaction model. For details, see Use a previous version of the interaction model.