Use Different Audio in Your Skill

Alexa isn’t the only voice you can use in your skill. You can give your characters Amazon Polly voices, which come in a range of male and female voices. You can also use your own voiceovers if you have a voice actor for your brand. To raise the production value further, add audio clips and speechcons that make your skill more fun and engaging for customers.

Use other voices

Amazon Polly offers a variety of male and female voices. A different voice is a good choice if you need several voices in a story or if you need a male voice in your skill. Even when you use Polly voices, keep Alexa as the main voice of the interface unless you have a good reason to change it. For more information about how to incorporate Polly voices, see Customer Experience Guidelines for the Use of Amazon Polly Voices.
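As a rough illustration of mixing voices in a story, the sketch below builds an SSML string in which the narration stays in the default Alexa voice and each character line is wrapped in an SSML voice tag. The helper names (polly_line, build_story_ssml) are hypothetical, and the voice names "Matthew" and "Joanna" are assumed to be available for your skill's locale; check the supported list before relying on them.

```python
from xml.sax.saxutils import escape


def polly_line(voice_name: str, text: str) -> str:
    """Wrap one line of dialogue in an SSML <voice> tag (hypothetical helper)."""
    return f'<voice name="{voice_name}">{escape(text)}</voice>'


def build_story_ssml(narration: str, character_lines: list[tuple[str, str]]) -> str:
    """Narration uses the default Alexa voice; characters use Polly voices."""
    parts = [escape(narration)]
    parts += [polly_line(voice, text) for voice, text in character_lines]
    return "<speak>" + " ".join(parts) + "</speak>"


# Voice names are assumptions; confirm they are supported in your locale.
ssml = build_story_ssml(
    "The knight knocked on the castle door.",
    [("Matthew", "Who goes there?"), ("Joanna", "A friend, I promise.")],
)
print(ssml)
```

Keeping the narration outside any voice tag is one way to keep Alexa as the main voice of the interface while the characters sound distinct.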

Add other types of audio to your skill

One way to enhance the value and engagement of a voice experience is to include audio files, such as short sound effects. You can use audio files to signify a specific interaction, to hand off between Alexa's voice and recorded audio, or to replace Alexa's voice entirely with recorded audio. How you include audio in your experience is up to you and depends on the specific use case you design for your customer.

There are some important considerations to keep in mind when you plan to include audio files in your skill:

  • If you want to use audio to ask a question, you must include a separate and appropriate re-prompt. You must associate the re-prompt with the question so that you can play it for a customer when necessary.
  • When it’s the skill’s turn to speak to the customer and you expect the customer to reply, make each turn 90 seconds or less.
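The two considerations above can be sketched as a response builder that plays a recorded question, attaches a spoken re-prompt to it, and keeps the session open for the customer's reply. This is a minimal sketch, not a definitive implementation: the function name, the URL, and the idea of passing the clip duration in by hand are assumptions for illustration. The dictionary follows the standard Alexa response shape (outputSpeech, reprompt, shouldEndSession).

```python
from xml.sax.saxutils import escape, quoteattr

MAX_TURN_SECONDS = 90  # per-turn limit from the guidance above


def audio_question_response(question_audio_url: str,
                            reprompt_text: str,
                            clip_seconds: float) -> dict:
    """Build a response that asks a question with audio and sets a re-prompt.

    clip_seconds is supplied by the caller (an assumption of this sketch);
    the code does not inspect the audio file itself.
    """
    if clip_seconds > MAX_TURN_SECONDS:
        raise ValueError("a turn that expects a reply should be 90 seconds or less")
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                "type": "SSML",
                "ssml": f"<speak><audio src={quoteattr(question_audio_url)}/></speak>",
            },
            # The re-prompt is tied to the question so it can be played
            # if the customer does not answer.
            "reprompt": {
                "outputSpeech": {
                    "type": "SSML",
                    "ssml": f"<speak>{escape(reprompt_text)}</speak>",
                }
            },
            "shouldEndSession": False,
        },
    }


resp = audio_question_response(
    "https://example.com/question.mp3",  # placeholder URL
    "Which door do you pick, left or right?",
    clip_seconds=12,
)
```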

Use short-form audio

Audio clips that are 240 seconds or shorter are considered short-form audio. Short-form audio allows the skill session to remain open, which means that the customer doesn’t have to re-invoke the skill by saying “Alexa” again. Use short-form audio when you expect additional interaction with the customer after playing the audio clip.

  • File type: .mp3
  • Specification: 16000 Hz sample rate with a 48 kbps bitrate
  • Length: Up to 240 seconds

For more information about implementing short-form audio, see SSML Reference.
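As an illustration, short-form clips are typically embedded with the SSML audio tag. The sketch below checks a clip against the file type and length limits listed above before emitting the tag. The helper name and URL are hypothetical, and the duration is passed in by the caller rather than read from the file.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import quoteattr

SHORT_FORM_MAX_SECONDS = 240  # length limit listed above


def short_form_audio_tag(url: str, duration_seconds: float) -> str:
    """Return an SSML <audio> tag after basic short-form checks."""
    if not urlparse(url).path.lower().endswith(".mp3"):
        raise ValueError("short-form audio must be an .mp3 file")
    if duration_seconds > SHORT_FORM_MAX_SECONDS:
        raise ValueError("short-form audio can be up to 240 seconds long")
    return f"<audio src={quoteattr(url)}/>"


# The session stays open, so the clip can be followed by a question.
tag = short_form_audio_tag("https://example.com/clue.mp3", 8)  # placeholder URL
ssml = f"<speak>{tag} What is your guess?</speak>"
print(ssml)
```

The sample rate and bitrate are properties of the encoded file, so this sketch does not verify them; they would need to be checked when the clip is produced.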

Use long-form audio

If you have an audio-based skill such as a podcast, use long-form audio. Audio clips that are longer than 240 seconds are considered long-form audio. When the audio starts playing, the skill closes. The customer can control the audio by making requests without the invocation name, for example by saying, “Alexa, next.” To interact with the skill again, the customer invokes the skill by saying “Alexa” and the invocation name.

Use long-form audio when you expect the user interaction to consist of audio-control requests. Your skill can also add new audio files to the queue for continuous playback, such as with a playlist.

  • File types: .aac .mp4 .mp3 .hls .pls .m3u
  • Specification: Bitrates from 16 kbps to 384 kbps
  • Length: Anything over 240 seconds with no maximum time limit

For more information about implementing long-form audio, see Audio Streaming in Alexa Skills and AudioPlayer Interface Reference.
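To make the playback and queueing behavior more concrete, the sketch below builds AudioPlayer.Play directives as plain dictionaries: one that starts a stream and one that enqueues the next item of a playlist. The function names, URLs, and tokens are placeholders chosen for illustration; the field names (playBehavior, audioItem, stream, token, expectedPreviousToken, offsetInMilliseconds) follow the AudioPlayer interface.

```python
from typing import Optional


def play_directive(url: str, token: str,
                   previous_token: Optional[str] = None,
                   offset_ms: int = 0) -> dict:
    """Build an AudioPlayer.Play directive.

    With no previous_token the stream replaces whatever is playing.
    With a previous_token the item is added to the queue after that stream.
    """
    stream = {"url": url, "token": token, "offsetInMilliseconds": offset_ms}
    behavior = "REPLACE_ALL"
    if previous_token is not None:
        behavior = "ENQUEUE"
        stream["expectedPreviousToken"] = previous_token
    return {
        "type": "AudioPlayer.Play",
        "playBehavior": behavior,
        "audioItem": {"stream": stream},
    }


def start_playback_response(directive: dict) -> dict:
    """Wrap the first Play directive; the skill session closes once audio starts."""
    return {
        "version": "1.0",
        "response": {"directives": [directive], "shouldEndSession": True},
    }


# Placeholder URLs and tokens for a two-episode playlist.
first = play_directive("https://example.com/episode1.mp3", "episode-1")
following = play_directive("https://example.com/episode2.mp3", "episode-2",
                           previous_token="episode-1")
response = start_playback_response(first)
```

A skill would usually send the enqueue directive later, in reply to a playback event from the device, rather than in the same response that starts the first stream.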

Use speechcons

A speechcon is a special word or phrase that is distinctive enough to represent a specific event or convey other information to a customer, such as a word that signals the successful completion of an action. Speechcons can help bring your skill to life while also connecting with the customer in a way that strengthens your conversation. For more information about using speechcons in your skill, see Speechcon Reference.
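In SSML, a speechcon is marked up with a say-as tag whose interpret-as attribute is "interjection". The sketch below shows one way to build that markup. The helper name is hypothetical, and "bravo" is used on the assumption that it appears in the speechcon list for your locale; only words from that list are pronounced expressively.

```python
from xml.sax.saxutils import escape


def speechcon(phrase: str) -> str:
    """Wrap a speechcon phrase in the SSML interjection markup."""
    return f'<say-as interpret-as="interjection">{escape(phrase)}</say-as>'


# Signal the successful completion of an action, then continue normally.
ssml = f"<speak>{speechcon('bravo')} You finished the quiz.</speak>"
print(ssml)
```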

Use audio interstitials

An audio interstitial is a sound file that plays to bridge two parts of the flow, such as an intro, an outro, or another moment of transition. You can use interstitials to cue the customer to a change in soundscape and help paint an audio picture of where they are heading next.
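One possible way to place an interstitial is to put a short clip between the speech that closes one part of the flow and the speech that opens the next. The sketch below assumes the clip is short-form audio embedded with the SSML audio tag; the function name and URL are placeholders.

```python
from xml.sax.saxutils import escape, quoteattr


def with_interstitial(outro_text: str, clip_url: str, intro_text: str) -> str:
    """Bridge two parts of the flow with a transition sound."""
    return (
        "<speak>"
        + escape(outro_text)
        + f" <audio src={quoteattr(clip_url)}/> "
        + escape(intro_text)
        + "</speak>"
    )


ssml = with_interstitial(
    "Leaving the forest.",
    "https://example.com/transition.mp3",  # placeholder URL
    "Welcome to the castle.",
)
print(ssml)
```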