Batch Test Your Natural Language Understanding Model

Use the Natural Language Understanding (NLU) Evaluation tool in the developer console to batch test the natural language understanding (NLU) model for your Alexa skill.

To evaluate your model, you define a set of utterances mapped to the intents and slots you expect to be sent to your skill. This set of utterances is called an annotation set. Then you start an NLU Evaluation with the annotation set to determine how well your skill's model performs against your expectations. The tool can help you measure the accuracy of your NLU model and make sure that changes to your model don't degrade the accuracy.

You can use the NLU Evaluation tool with skill models for all locales. You can also access these tools with the Skill Management API. To create an annotation set, see NLU Annotation Set REST API Reference. To run an NLU evaluation, see NLU Evaluation REST API Reference.

Prerequisites

After you define and build an interaction model, you can use the NLU Evaluation tool.

The tool doesn't call your endpoint, so you don't need to develop the service for your skill to test your model.

Annotations and annotation sets

To evaluate your model with the NLU Evaluation tool, you create an annotation set. This is a set of utterances mapped to the intents and slots you expect to be sent to your skill for each one. Each utterance with its expected intent and slots is called an annotation.

Note: The maximum number of annotations per annotation set is 10,000.

Each annotation has the following fields:

Utterance

The utterance to test.

Don't include the wake word or invocation name. Provide just the utterance as it is used after the user invokes the skill.
You can use either written form or spoken form for the utterance. For example, you can use numerals ("5") or write out numbers ("five"). For more examples, see the rules for custom slot type values.

Expected Intent

The intent that the utterance should trigger.

Expected Slot Names

(Optional) The name of the slot that the utterance should fill. You can provide more than one expected slot for an utterance.

Expected Slot Values

Required for each Expected Slot Name. Provides the value you expect the utterance to fill for the specified slot. Click in the Add slot value field to enter the value.

For a multiple-value slot, provide each of the values you expect. Click in the Add slot value field and enter each value separately.

Enter the values in the same order they are in the utterance. For example, for the utterance "I want a pizza with pepperoni, mushrooms, and olives," make sure that you add the values "pepperoni", "mushrooms", and "olives" in that same order. For details about multiple-value slots, see Collect Multiple Values in a Slot.

Reference Timestamp (UTC)

(Optional) A date and time in UTC format to use as the basis for relative date and time values. Use this field when the utterance tests the AMAZON.DATE or AMAZON.TIME slots with words that represent relative dates and times, such as "today," "tomorrow," and "now." Provide the full date and time, including milliseconds, for example: 2018-10-25T23:50:02.135Z. For more details, see Create an annotation with relative dates or times.
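
The field rules above can be checked before you enter or upload annotations. The following is a minimal sketch, not an official schema: the Annotation class, its attribute names, and the intent and slot names in the example are hypothetical and exist only for illustration. It checks the 10,000-annotation limit, that every expected slot name has at least one value, and that a reference timestamp includes the full date and time with milliseconds.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_ANNOTATIONS = 10_000  # limit stated for one annotation set
# Full UTC date and time with milliseconds, e.g. 2018-10-25T23:50:02.135Z
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$")


@dataclass
class Annotation:
    """Hypothetical in-memory form of one annotation (not an official schema)."""
    utterance: str
    expected_intent: str
    # Slot name -> expected values, in the order they occur in the utterance.
    expected_slots: Dict[str, List[str]] = field(default_factory=dict)
    reference_timestamp: Optional[str] = None


def validate(annotations: List[Annotation]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if len(annotations) > MAX_ANNOTATIONS:
        problems.append(f"too many annotations: {len(annotations)} > {MAX_ANNOTATIONS}")
    for i, a in enumerate(annotations):
        if not a.utterance.strip():
            problems.append(f"annotation {i}: utterance is empty")
        if not a.expected_intent:
            problems.append(f"annotation {i}: expected intent is missing")
        for name, values in a.expected_slots.items():
            if not values:
                problems.append(f"annotation {i}: slot '{name}' has no expected value")
        ts = a.reference_timestamp
        if ts is not None and not TIMESTAMP_RE.match(ts):
            problems.append(f"annotation {i}: timestamp '{ts}' is not full UTC with milliseconds")
    return problems


pizza = Annotation(
    utterance="I want a pizza with pepperoni, mushrooms, and olives",
    expected_intent="OrderPizzaIntent",  # hypothetical intent name
    expected_slots={"toppings": ["pepperoni", "mushrooms", "olives"]},  # hypothetical slot name
)
bad = Annotation("what is tomorrow", "TestDateSlotIntent",
                 {"DateSlotExample": []}, "2019-08-21")
for problem in validate([pizza, bad]):
    print(problem)
```

The multiple-value slot keeps its values in a list so that the order in the utterance is preserved.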

Tip: If your skill is already live, use the utterances shown in the intent history as a source for your test annotations. This lets you test the accuracy of the real-world utterances your users are speaking. Use the utterances where the Interaction type is MODAL and the Dialog Act column is blank.

Important: The Automated Test Sets tool is deprecated and, effective February 1, 2023, is no longer available on the developer portal. See Deprecated Features.

Create annotation sets manually

You can create and edit annotation sets directly in the developer console.

To create an annotation set manually in the developer console

Open your skill in the developer console.
Navigate to Build > Custom > Annotation Sets.
Under User Defined Test Sets, click + Annotation Set.
At the top of the page, enter a name for the annotation set.
Create the annotations.

Edit an annotation set

To edit an annotation set

Open your skill in the developer console.
Navigate to Build > Custom > Annotation Sets.
Find the annotation set to edit, and then click its name or the Edit link.

Create annotations in the developer console

To create annotations in the developer console

For details about the fields for an annotation, see Annotations and annotation sets.

Create or edit an annotation set.
Enter the utterance to test, and then click the plus sign (+) or press Enter.
In the table of utterances, click in the Expected Intent field, and then select the intent the utterance should trigger.
If the utterance should also fill a slot, under Expected Slots, click the plus sign (+), and then enter the slot name and slot value.
If needed, click in the Reference Timestamp field and select the date and time from the date picker.

This action fills in the selected timestamp in UTC format. See Create an annotation with relative dates or times.
After you have added all the new annotations, click Save Annotation Set.

Upload annotations from a data file

When you upload annotations from a data file, the upload replaces any existing annotations in the annotation set.

To upload annotations from a data file

Create either a JSON or a CSV file with your annotations. Use the same format used for the NLU Annotation Set REST API:
- JSON
- CSV (encoded in UTF-8)
Create or edit an annotation set.
Click Bulk Edit, select the JSON or CSV file to upload, and then click Submit.
Click Save Annotation Set.
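
If you generate the data file with a script, the main thing to get right on the file side is the UTF-8 encoding. The following sketch writes and re-reads a CSV file with Python's standard csv module. The column names are hypothetical placeholders chosen for illustration; take the real header row from the NLU Annotation Set REST API reference.

```python
import csv
import os
import tempfile

# Hypothetical column names, for illustration only. Take the real header row
# from the NLU Annotation Set REST API reference.
COLUMNS = ["utterance", "referenceTimestamp", "intent", "slotName", "slotValue"]

rows = [
    ["test the date slot with tomorrow", "2019-08-21T00:00:00.000Z",
     "TestDateSlotIntent", "DateSlotExample", "2019-08-22"],
    ["test the date slot with next Monday", "2019-08-21T00:00:00.000Z",
     "TestDateSlotIntent", "DateSlotExample", "2019-08-26"],
]


def write_annotations_csv(path, header, data_rows):
    """Write rows as UTF-8 CSV; newline='' lets the csv module control line endings."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(data_rows)


def read_annotations_csv(path):
    """Read the file back as a list of dicts keyed by the header row."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


path = os.path.join(tempfile.mkdtemp(), "annotations.csv")
write_annotations_csv(path, COLUMNS, rows)
loaded = read_annotations_csv(path)
print(len(loaded), loaded[0]["intent"])
```

Because an upload replaces the existing annotations in the set, it helps to treat the generated file as the complete source for that annotation set rather than as an increment.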

Create an annotation with relative dates or times

The built-in AMAZON.DATE and AMAZON.TIME slot types let users specify dates and times relative to the current date. For example, the utterance "today" normally resolves to the current date. The slot value therefore depends on the day you test the utterance.

To test these types of utterances with the NLU Evaluation tool, enter a specific date and time in the Reference Timestamp (UTC) field. Alexa then uses this value instead of the actual current date and time when calculating the date and time slot values.

For example, consider the annotations in the following table.

Utterance	Expected intent	Expected slot names	Expected slot values
test the date slot with tomorrow	TestDateSlotIntent	DateSlotExample	2019-08-22
test the date slot with next Monday	TestDateSlotIntent	DateSlotExample	2019-08-26

Without a Reference Timestamp, these utterances only pass if you run the evaluation on August 21, 2019. Set the Reference Timestamp for each of these to 2019-08-21T00:00:00.000Z. Then, regardless of the actual date and time, the NLU Evaluation tool resolves the slots as though it is midnight on August 21, 2019, so the specified Expected Slot Values match the actual results.
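
To work out the expected slot values for a given Reference Timestamp, you can do the date arithmetic yourself. The following sketch reproduces the two values from the table. It is plain calendar arithmetic under the assumption that "next Monday" means the first Monday strictly after the reference date; it is not a description of how Alexa resolves these slots, and the function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def parse_reference(ts: str) -> datetime:
    """Parse a timestamp such as 2019-08-21T00:00:00.000Z into an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def tomorrow(ref: datetime) -> str:
    """Date one day after the reference date, as YYYY-MM-DD."""
    return (ref + timedelta(days=1)).date().isoformat()


def next_monday(ref: datetime) -> str:
    """First Monday strictly after the reference date, as YYYY-MM-DD."""
    # weekday(): Monday is 0. Always move forward 1 to 7 days, never 0.
    days_ahead = (0 - ref.weekday() - 1) % 7 + 1
    return (ref + timedelta(days=days_ahead)).date().isoformat()


ref = parse_reference("2019-08-21T00:00:00.000Z")
print(tomorrow(ref))     # 2019-08-22
print(next_monday(ref))  # 2019-08-26
```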

Select the date and time from the calendar picker. This adds the date/time in UTC format: YYYY-MM-DDThh:mm:ss.sTZD, for example: 1997-07-16T19:20:30.45Z.
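
If you prepare timestamps outside the console, for example for a data file, you need to convert local times to UTC yourself. The following is a minimal sketch using Python's standard datetime module. The helper name to_reference_timestamp is hypothetical, and the fixed UTC-7 offset is only an example of a local time zone.

```python
from datetime import datetime, timedelta, timezone


def to_reference_timestamp(dt: datetime) -> str:
    """Format an aware datetime as UTC with millisecond precision and a Z suffix."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# 5:00 PM on August 20, 2019 at a fixed UTC-7 offset is midnight UTC on August 21.
utc_minus_7 = timezone(timedelta(hours=-7))
local = datetime(2019, 8, 20, 17, 0, 0, tzinfo=utc_minus_7)
print(to_reference_timestamp(local))  # 2019-08-21T00:00:00.000Z
```

Rejecting naive datetimes avoids silently treating a local time as UTC, which would shift the resolved date for utterances such as "tomorrow."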

Start an evaluation

After you have at least one annotation set defined for your skill, you can start an evaluation. This evaluates the NLU model built from your skill's interaction model, using the specified annotation set.

Note: The evaluation runs against the currently built model, even if another build occurs after the evaluation has started. For example, if you start the evaluation after the quick build is complete, all the tests run against the quick-build version of the model rather than the full-build version. In general, it is best to wait for the full build to complete before you begin an evaluation.

For live skills, you can choose whether to run the evaluation against the development version or the live version.

You can run multiple evaluations at the same time.

To start an evaluation

From any page in the Build > Custom > Interaction Model section, in the upper-right corner, click Evaluate Model.

Evaluate Model is also available on the Annotation Sets page.
Click the NLU Evaluation tab.
From the Stage list, select Development or Live (if applicable).
From the Annotation Source list, select one of your annotation sets.
Click Run an Evaluation.

The evaluation starts and its current status is displayed in the NLU Evaluation Results table.

Note: An evaluation might take several minutes. You can close the Evaluate Model panel and do other work on your skill. Check back later to see the results of the test.

Review the results and update your model

You can review the results of an evaluation on the NLU Evaluation panel, and then closely examine the results for a specific evaluation. The developer console saves all past evaluations for later review.

To review a summary of NLU Evaluation results

Click Evaluate Model, and then select the NLU Evaluation tab.
At the bottom of the panel, review the in-progress and completed evaluations.

Each evaluation in the table displays the following information:

Evaluation ID: Unique ID for the evaluation. After an evaluation is complete, this ID becomes a link that you click to see the full report.
Status: Indicates whether the evaluation is Complete.
Results: Displays the results of the evaluation. Your evaluation has PASSED if all the tests within the annotation set returned the expected intent and slot values. Your evaluation has FAILED if any of the tests within the annotation set failed to return the expected intent and slot values.
Annotation Src: Unique ID for the annotation set used in the evaluation. Click this link to open the annotation set page.
Stage: The skill stage that was tested (Development or Live).
Start Time: The time you started the evaluation.
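
The PASSED and FAILED rule described for the Results column can be sketched as follows. The AnnotationResult record is hypothetical and is not the format of the console's report. This sketch also assumes that only the expected slots are compared and that extra slots in the actual result are ignored; the tool itself might treat that case differently.

```python
from typing import Dict, List, NamedTuple


class AnnotationResult(NamedTuple):
    """Hypothetical record for one annotation's outcome (not the console's report format)."""
    utterance: str
    expected_intent: str
    actual_intent: str
    expected_slots: Dict[str, List[str]]
    actual_slots: Dict[str, List[str]]


def annotation_passed(r: AnnotationResult) -> bool:
    """True when the intent matches and every expected slot has the expected values."""
    if r.expected_intent != r.actual_intent:
        return False
    # Lists compare element by element, so multiple-value slots must match in order.
    return all(r.actual_slots.get(name) == values
               for name, values in r.expected_slots.items())


def evaluation_result(results: List[AnnotationResult]) -> str:
    """PASSED only if every annotation passed; a single failure fails the evaluation."""
    return "PASSED" if all(annotation_passed(r) for r in results) else "FAILED"


ok = AnnotationResult("test the date slot with tomorrow",
                      "TestDateSlotIntent", "TestDateSlotIntent",
                      {"DateSlotExample": ["2019-08-22"]},
                      {"DateSlotExample": ["2019-08-22"]})
wrong_slot = ok._replace(actual_slots={"DateSlotExample": ["2019-08-23"]})
print(evaluation_result([ok]))
print(evaluation_result([ok, wrong_slot]))
```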

To get the results for a specific evaluation

Open the summary of NLU Evaluation results.
Click the Evaluation ID link for the evaluation you want to review.
View the results page to see which utterances failed the test.

For each utterance that failed, the table shows the expected value and the actual value, highlighted in red.
To download the report in JSON format, click Export JSON.

Update your skill

Use the evaluation results to identify failing utterances. Add these to your interaction model as sample utterances and slot values, and then rebuild. Re-run the evaluation with the same annotation set to see if the changes improved the accuracy.

Use the NLU Evaluation tool for regression testing

The NLU Evaluation tool is especially useful for regression testing. After you have an annotation set that passes all the tests, you can re-run the evaluation whenever you make changes to your interaction model to make sure that your changes don't degrade your skill's accuracy.
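
One way to use two evaluation runs for regression testing is to compare them utterance by utterance. The following sketch assumes you have already reduced each run to a mapping from utterance to a pass or fail flag; that mapping, the function name, and the sample utterances are hypothetical and are not part of the tool or its exported report.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Utterances that passed in the baseline run but fail, or are missing, in the current run."""
    return sorted(u for u, ok in baseline.items() if ok and not current.get(u, False))


before = {
    "test the date slot with tomorrow": True,
    "test the date slot with next Monday": True,
    "test the date slot with now": False,
}
after = {
    "test the date slot with tomorrow": True,
    "test the date slot with next Monday": False,
    "test the date slot with now": False,
}
print(find_regressions(before, after))
```

Utterances that already failed in the baseline are left out, so the list shows only the accuracy lost since the last passing run.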

If you do encounter issues, you can revert your skill to an earlier version of your interaction model. For details, see Use a previous version of the interaction model.

Last updated: Nov 28, 2023