Batch Test Your Natural Language Understanding (NLU) Model

Use the NLU evaluation tool in the developer console to batch test the natural language understanding (NLU) model for your Alexa skill.

To evaluate your model, you define a set of utterances mapped to the intents and slots you expect to be sent to your skill. This is called an annotation set. Then you start an NLU evaluation with the annotation set to determine how well your skill's model performs against your expectations. The tool can help you measure the accuracy of your NLU model and run regression tests to ensure that changes to your model don't degrade the customer experience.

You can use the NLU evaluation tool with skill models for all locales. You can also access the tool with the Skill Management API (SMAPI) or the ASK Command Line Interface (ASK CLI). See NLU Evaluation Tool API.


You can use the NLU evaluation tool once you have defined an interaction model and successfully built it.

The tool does not call an endpoint, so you do not need to develop the service for your skill to test your model.

Annotations and annotation sets

To evaluate your model with the NLU evaluation tool, you create an annotation set. This is a set of utterances mapped to the intents and slots you expect to be sent to your skill for each one. Each utterance with its expected intent and slots is called an annotation.

Each annotation has the following fields:

Utterance
The utterance to test.
  • Do not include the wake word or invocation name. Provide just the utterance as it would be used after the user invokes the skill.
  • You can use either written form or spoken form for the utterance. For example, you can use numerals ("5") or write out numbers ("five"). For more examples, see the rules for custom slot type values.
Expected Intent
The intent that the utterance should trigger.
Expected Slot Names
(Optional) The name of the slot that the utterance should fill. You can provide more than one expected slot for an utterance.
Expected Slot Values
Required for each Expected Slot Name. Provides the value you expect the utterance to fill for the specified slot. Click in the Add slot value field to enter the value.
For a multiple-value slot, provide each of the values you expect. Click in the Add slot value field and enter each value separately.
Enter the values in the same order they are in the utterance. For example, for the utterance "I want a pizza with pepperoni, mushrooms, and olives", make sure that you add the values "pepperoni", "mushrooms", "olives" in that same order. For details about multiple-value slots, see Collect Multiple Values in a Slot.
Reference Timestamp (UTC)
(Optional) A time and date in UTC format to use as the basis for relative date and time values. Use this when the utterance tests the AMAZON.DATE or AMAZON.TIME slots with words that represent relative dates and times such as "today," "tomorrow," and "now." Provide the full date and time, including milliseconds, for example: 2018-10-25T23:50:02.135Z. For more information, see Create an annotation with relative dates or times.
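
The fields above can be pictured as one small record per utterance. The following Python sketch builds such a record as a plain dictionary. The key names (inputs, expected, intent, slots, value, referenceTimestamp) and the helper make_annotation are assumptions made for illustration; check the NLU Evaluation Tool API reference for the authoritative schema.

```python
import json


def make_annotation(utterance, intent, slots=None, reference_timestamp=None):
    """Build one annotation as a plain dict.

    Key names here are assumptions, not a confirmed schema.
    """
    inputs = {"utterance": utterance}
    if reference_timestamp is not None:
        # Optional: only needed for relative dates and times.
        inputs["referenceTimestamp"] = reference_timestamp

    intent_obj = {"name": intent}
    if slots:
        # Each expected slot name requires an expected slot value.
        intent_obj["slots"] = {name: {"value": value} for name, value in slots.items()}

    return {"inputs": inputs, "expected": [{"intent": intent_obj}]}


annotation = make_annotation(
    "test the date slot with tomorrow",
    "TestDateSlotIntent",
    slots={"DateSlotExample": "2019-08-22"},
    reference_timestamp="2019-08-21T00:00:00.000Z",
)
print(json.dumps(annotation, indent=2))
```

Note that the utterance omits the wake word and invocation name, as described above.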

Create and edit annotation sets

You can manage annotation sets in the developer console from the Build > Custom > Annotation Sets page. You can then create and edit annotations directly in the developer console.

Create an annotation set in the developer console

  1. Open your skill in the developer console.
  2. Navigate to Build > Custom > Annotation Sets.
  3. Click + Annotation Set.
  4. At the top of the page, enter a name for the annotation set.
  5. Create the annotations.

Edit an annotation set

  1. Open your skill in the developer console.
  2. Navigate to Build > Custom > Annotation Sets.
  3. Find the annotation set to edit and click its name or the Edit link.

Create the annotations in the developer console

For more about the fields for an annotation, refer back to Annotations and annotation sets.

  1. Create or edit an annotation set.
  2. Enter the utterance to test and click the plus or press enter.
  3. In the table of utterances, click in the Expected Intent field and select the intent the utterance should trigger.
  4. If the utterance should also fill a slot, click the plus under Expected Slots, then enter the slot name and slot value.
  5. If needed, click in the Reference Timestamp field and select the date and time from the date picker. This fills in the selected timestamp in UTC format. See Create an annotation with relative dates or times.
  6. After you have added all the new annotations, click Save Annotation Set.

Upload annotations from a data file

When you upload annotations from a data file, the upload replaces any existing annotations in the annotation set.

  1. Create either a JSON or CSV file with your annotations. Use the same format used for the NLU Annotation Evaluation API.
  2. Create or edit an annotation set.
  3. Click Bulk Edit, select the JSON or CSV file to upload, and click Submit.
  4. Click Save Annotation Set.
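
If you keep your test utterances in a script or spreadsheet, you might generate the JSON file for step 1 programmatically. The sketch below is one possible approach, not a confirmed format: the top-level data wrapper, the key names, the intent and slot names, and the output file name are all assumptions for illustration. Verify the real schema against the API reference before uploading.

```python
import json
import os
import tempfile


def build_annotation_file(rows):
    """Turn (utterance, intent, slots) tuples into an assumed JSON layout."""
    data = []
    for utterance, intent, slots in rows:
        if not utterance.strip():
            raise ValueError("utterance must not be empty")
        intent_obj = {"name": intent}
        if slots:
            intent_obj["slots"] = {n: {"value": v} for n, v in slots.items()}
        data.append(
            {"inputs": {"utterance": utterance}, "expected": [{"intent": intent_obj}]}
        )
    return {"data": data}


# Hypothetical intents and slots for a pizza-ordering skill.
rows = [
    ("order a large pizza", "OrderPizzaIntent", {"size": "large"}),
    ("what is on the menu", "ShowMenuIntent", {}),
]

path = os.path.join(tempfile.gettempdir(), "annotations.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(build_annotation_file(rows), f, indent=2)
print("wrote", path)
```

Because an upload replaces the existing annotations, the generated file should contain the complete set, not only the new entries.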

Create an annotation with relative dates or times

The built-in AMAZON.DATE and AMAZON.TIME slot types let users specify dates and times relative to the current date. For example, the utterance "today" normally resolves to the current date. The slot value therefore depends on the day you test the utterance.

To test these types of utterances with the NLU evaluation tool, enter a specific date and time in the Reference Timestamp (UTC) field. This value is then used instead of the actual current date and time when calculating the date and time slot values.

For example, note the following annotations:

Utterance                           | Expected Intent    | Expected Slot Names | Expected Slot Values
test the date slot with tomorrow    | TestDateSlotIntent | DateSlotExample     | 2019-08-22
test the date slot with next Monday | TestDateSlotIntent | DateSlotExample     | 2019-08-26

Without a Reference Timestamp, these utterances would only pass if you ran the evaluation on August 21, 2019. Set the Reference Timestamp for each of these to 2019-08-21T00:00:00.000Z. Then, regardless of the actual date and time, the NLU evaluation tool resolves the slots as though it were midnight on August 21, 2019, so the specified Expected Slot Values match the actual results.

Select the date and time from the calendar picker. This adds the date/time in UTC format: YYYY-MM-DDThh:mm:ss.sTZD, for example: 1997-07-16T19:20:30.45Z.
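
To see why a fixed reference makes the expected values stable, the following Python sketch reproduces the date arithmetic for the example above. It only illustrates the calculation and is not how the NLU evaluation tool resolves slots. The helper names are hypothetical, and the reading of "next Monday" as the first Monday strictly after the reference date is an assumption.

```python
from datetime import datetime, timedelta, timezone


def to_reference_timestamp(dt):
    """Format an aware datetime as UTC with milliseconds and a Z suffix."""
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def tomorrow(reference):
    """Date one day after the reference, as YYYY-MM-DD."""
    return (reference + timedelta(days=1)).date().isoformat()


def next_monday(reference):
    """First Monday strictly after the reference date (assumed semantics)."""
    # Monday is weekday 0; a Monday reference moves a full week ahead.
    days_ahead = (0 - reference.weekday()) % 7 or 7
    return (reference + timedelta(days=days_ahead)).date().isoformat()


# August 21, 2019 was a Wednesday.
reference = datetime(2019, 8, 21, tzinfo=timezone.utc)
print(to_reference_timestamp(reference))  # 2019-08-21T00:00:00.000Z
print(tomorrow(reference))                # 2019-08-22
print(next_monday(reference))             # 2019-08-26
```

The printed dates match the Expected Slot Values in the example annotations, whatever day the script runs.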

Start an evaluation

Once you have at least one annotation set defined for your skill, you can start an evaluation. This evaluates the natural language understanding (NLU) model built from your skill's interaction model, using the specified annotation set.

For live skills, you can choose whether to run the evaluation against the development version or the live version.

You can run multiple evaluations at the same time.

  1. From any page in the Build > Custom > Interaction Model section, click the Evaluate Model button in the upper-right corner. Evaluate Model is also available on the Annotation Sets page.

  2. Select the NLU Evaluation tab.
  3. From the Stage list, select Development or Live (if applicable).
  4. From the Annotation Source list, select one of your annotation sets.
  5. Click Run an Evaluation.

The evaluation starts and its current status is displayed in the NLU Evaluation Results table. Note that an evaluation may take several minutes. You can close the Evaluate Model panel and do other work on your skill. Check back later to see the results of the test.

Review the results and update your model

You can review the results of an evaluation on the NLU Evaluation panel, then drill down into the results for a specific evaluation. All past evaluations are saved for later review.

Review a summary of NLU evaluation results

Click the Evaluate Model button, then select the NLU Evaluation tab. The table at the bottom of the panel shows each in-progress and completed evaluation.

Each evaluation in the table displays the following information:

Evaluation ID
Unique ID for the evaluation. Once an evaluation is complete, this becomes a link you can click to see the full report.
Status
Indicates whether the evaluation is Complete.
Results
Displays the results of the evaluation. An evaluation is considered PASSED if all the tests within the annotation set returned the expected intent and slot values. An evaluation is considered FAILED if any of the tests within the annotation set failed to return the expected intent and slot values.
Annotation Src
Unique ID for the annotation set used in the evaluation. Click this link to open the annotation set page.
Stage
The skill stage that was tested (Development or Live).
Start Time
The time the evaluation was started.
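
The PASSED and FAILED rule described above is an all-or-nothing check across the annotation set. The Python sketch below expresses that rule over a hypothetical in-memory structure. The field names, the intent and slot names in the sample data, and both helper functions are assumptions for illustration and do not reflect the tool's actual report format.

```python
def annotation_matches(expected, actual):
    """True when both the intent and all slot values match."""
    return (
        expected["intent"] == actual["intent"]
        and expected.get("slots", {}) == actual.get("slots", {})
    )


def evaluation_result(test_cases):
    """Return (status, failing utterances): PASSED only if every test matches."""
    failures = [
        case["utterance"]
        for case in test_cases
        if not annotation_matches(case["expected"], case["actual"])
    ]
    return ("PASSED" if not failures else "FAILED"), failures


test_cases = [
    {
        "utterance": "order a large pizza",
        "expected": {"intent": "OrderPizzaIntent", "slots": {"size": "large"}},
        "actual": {"intent": "OrderPizzaIntent", "slots": {"size": "large"}},
    },
    {
        "utterance": "what is on the menu",
        "expected": {"intent": "ShowMenuIntent"},
        "actual": {"intent": "AMAZON.HelpIntent"},
    },
]

status, failures = evaluation_result(test_cases)
print(status, failures)  # FAILED ['what is on the menu']
```

A single mismatched utterance is enough to mark the whole evaluation as FAILED.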

Get the results for a specific evaluation

To see the results for a given evaluation, open the summary of results. Click the Evaluation ID link to open the results. Use the results page to see which utterances failed the test. For each utterance that failed, the table shows the expected value and the actual value, highlighted in red.

Click the Export JSON button to download the report in JSON format.

Update your skill

Use the evaluation results to identify failing utterances. Add these to your interaction model as sample utterances and slot values, rebuild, then re-run the evaluation with the same annotation set to see if the changes improved the accuracy.

Use the NLU evaluation tool for regression testing

The NLU evaluation tool is especially useful for regression testing. Once you have an annotation set that passes all the tests, you can re-run the evaluation whenever you make changes to your interaction model to ensure that your changes did not degrade your skill's accuracy.
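
One way to reason about a regression run is to compare per-utterance outcomes from a baseline evaluation with those from a run made after a model change. The Python sketch below assumes you have already reduced each report to a mapping from utterance to a pass or fail flag. That mapping, the sample utterances, and the function find_regressions are hypothetical and not part of the tool.

```python
def find_regressions(baseline, current):
    """Utterances that passed in the baseline but do not pass now.

    An utterance missing from the current run is treated as failing.
    """
    return sorted(
        utterance
        for utterance, passed in baseline.items()
        if passed and not current.get(utterance, False)
    )


baseline = {
    "order a large pizza": True,
    "what is on the menu": True,
    "cancel my order": False,
}
current = {
    "order a large pizza": True,
    "what is on the menu": False,
    "cancel my order": True,
}

print(find_regressions(baseline, current))  # ['what is on the menu']
```

Utterances that failed in the baseline are ignored, so the output lists only accuracy lost since the last known-good run.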

If you do encounter issues, you can revert your skill to an earlier version of your interaction model. See Use a previous version of the interaction model.