Speech Synthesis Markup Language (SSML) Reference

Introduction

When the service for your skill returns a response to a user’s request, you provide text that the Alexa service converts to speech. Alexa automatically handles normal punctuation, such as pausing after a period, or speaking a sentence ending in a question mark as a question.

However, in some cases you may want additional control over how Alexa generates the speech from the text in your response. For example, you may want a longer pause within the speech, or you may want a string of digits read back as a standard telephone number. The Alexa Skills Kit provides this type of control with Speech Synthesis Markup Language (SSML) support.

SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. The Alexa Skills Kit supports a subset of the tags defined in the SSML specification. The specific tags supported are listed in Supported SSML Tags.

Using SSML in Your Response

To use SSML, construct your output speech using the supported SSML tags. When sending back a response from your service, you must indicate that it is using SSML rather than plain text:

  • When using the Java library, use the SsmlOutputSpeech class. Call the setSsml() method and pass in the output speech marked up with the tags.
  • When not using the Java library, provide the marked-up text in the outputSpeech property, but set the type to SSML instead of PlainText. Use the ssml property instead of text for the marked-up text:
"outputSpeech": {
    "type": "SSML",
    "ssml": "<speak>This output speech uses SSML.</speak>"
}
  • You can use SSML with both the normal output speech response and any re-prompt included in the response.

The SSML you provide must be wrapped within <speak> tags. For example:

<speak>
    Here is a number <w role="ivona:VBD">read</w> 
    as a cardinal number: 
    <say-as interpret-as="cardinal">12345</say-as>. 
    Here is a word spelled out: 
    <say-as interpret-as="spell-out">hello</say-as>. 
</speak>
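For the case where the Java library is not used, a minimal Python sketch of building the outputSpeech object and an optional re-prompt might look like the following. The helper names `ssml_output_speech` and `build_response` are hypothetical, and the surrounding envelope (`version`, `response`, `reprompt`, `shouldEndSession`) is assumed from the general Alexa response format rather than taken from this page.

```python
import json


def ssml_output_speech(marked_up_text):
    """Return an outputSpeech object of type SSML (hypothetical helper).

    The text is wrapped in <speak> tags unless it already starts with one.
    """
    text = marked_up_text.strip()
    if not text.startswith("<speak>"):
        text = "<speak>" + text + "</speak>"
    return {"type": "SSML", "ssml": text}


def build_response(speech, reprompt=None, end_session=False):
    """Assemble a response body; the envelope fields are an assumption."""
    response = {
        "outputSpeech": ssml_output_speech(speech),
        "shouldEndSession": end_session,
    }
    if reprompt is not None:
        # SSML can be used for the re-prompt as well as the main speech.
        response["reprompt"] = {"outputSpeech": ssml_output_speech(reprompt)}
    return {"version": "1.0", "response": response}


if __name__ == "__main__":
    body = build_response(
        'Your code is <say-as interpret-as="digits">12345</say-as>.',
        reprompt="Do you want to hear it again?",
    )
    print(json.dumps(body, indent=2))
```

Wrapping the text in one place keeps the `<speak>` requirement from being forgotten in individual handlers.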

Supported SSML Tags

The Alexa Skills Kit supports the following SSML tags (listed in alphabetical order):

  • audio
  • break
  • p
  • phoneme
  • s
  • say-as
  • speak
  • w

The remaining sections describe each of these tags.

Note that the Alexa service strips out any unsupported SSML tags included in the text you provide.
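Because unsupported tags are stripped silently, it can help to check your markup before sending it. The following is a minimal sketch, assuming the supported set is exactly the tags that have their own sections in this reference; `unsupported_tags` is a hypothetical helper, not part of any Alexa library.

```python
import xml.etree.ElementTree as ET

# Tag names taken from the section headings of this reference.
SUPPORTED_TAGS = {"audio", "break", "p", "phoneme", "s", "say-as", "speak", "w"}


def unsupported_tags(ssml):
    """Return the sorted names of tags in `ssml` that are not in SUPPORTED_TAGS.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    """
    root = ET.fromstring(ssml)
    # iter() walks the root element and all of its descendants.
    return sorted({element.tag for element in root.iter()} - SUPPORTED_TAGS)
```

For example, `unsupported_tags('<speak>Hi <emphasis>there</emphasis></speak>')` reports `emphasis`, which this page does not list as supported.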

audio

The audio tag lets you provide the URL for an MP3 file that the Alexa service can play while rendering a response. You can use this to embed short, pre-recorded audio within your service’s response. For example, you could include sound effects alongside your text-to-speech responses, or provide responses using a voice associated with your brand. For more information, see the “Including Pre-Recorded Audio in your Response” section of Handling Requests Sent by Alexa.

Attribute Possible Values

src

Specifies the URL for the MP3 file. Note the following requirements and limitations:

  • The MP3 must be hosted at an Internet-accessible HTTPS endpoint. HTTPS is required, and the domain hosting the MP3 file must present a valid, trusted SSL certificate. Self-signed certificates cannot be used.
  • The MP3 must not contain any customer-specific or other sensitive information.
  • The MP3 must be a valid MP3 file (MPEG version 2).
  • The audio file cannot be longer than ninety (90) seconds.
  • The bit rate must be 48 kbps. Note that this bit rate gives a good result when used with spoken content, but is generally not a high enough quality for music.
  • The sample rate must be 16000 Hz.

Include the audio tag inside the speak tag of your text-to-speech response. Alexa plays the MP3 at the point where the tag appears in the text. For example:

<speak>
    Welcome to Car-Fu. 
    <audio src="https://carfu.com/audio/carfu-welcome.mp3" /> 
    You can order a ride, or request a fare estimate. 
    Which will it be?
</speak> 

When Alexa renders this response, it sounds like this:

Alexa: Welcome to Car-Fu.
(the specified carfu-welcome.mp3 audio file plays)
Alexa: You can order a ride, or request a fare estimate. Which will it be?

A single response sent by your service can include multiple audio tags according to the following limits:

  • No more than five audio files can be used in a single response.
  • The combined total time for all audio files in a single response cannot be more than ninety (90) seconds.
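The limits above can be checked before a response is sent. The following is a rough sketch only: `check_audio` is a hypothetical helper, and the audio durations are supplied by the caller because this code does not download or inspect the MP3 files (so codec, bit rate, sample rate, and certificate validity are not checked).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

MAX_AUDIO_TAGS = 5        # no more than five audio files per response
MAX_TOTAL_SECONDS = 90    # combined length of all audio in one response


def check_audio(ssml, durations):
    """Return a list of problems found with the audio tags in `ssml`.

    `durations` maps each src URL to its length in seconds (caller-supplied).
    """
    problems = []
    sources = [el.get("src", "") for el in ET.fromstring(ssml).iter("audio")]
    if len(sources) > MAX_AUDIO_TAGS:
        problems.append("more than five audio files in one response")
    for src in sources:
        if urlparse(src).scheme != "https":
            problems.append("not an HTTPS URL: " + src)
    if sum(durations.get(src, 0) for src in sources) > MAX_TOTAL_SECONDS:
        problems.append("combined audio longer than 90 seconds")
    return problems
```

An empty list means none of the checked limits were exceeded; it does not guarantee that Alexa will accept the files.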

Converting Audio Files to an Alexa-Friendly Format

You may need to use converter software to convert your MP3 files to the required codec version (MPEG version 2) and bit rate (48 kbps). One option for this is a command-line tool, FFmpeg. The following command converts the provided <input-file> to an MP3 file that works with the audio tag.

ffmpeg -i <input-file> -ac 2 -codec:a libmp3lame -b:a 48k -ar 16000 <output-file.mp3>
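If you convert files as part of a build script, the same command can be run from Python. This is a sketch that assumes FFmpeg is installed and on the PATH; `ffmpeg_args` and `convert` are hypothetical helper names, and the flags are copied unchanged from the command above.

```python
import subprocess


def ffmpeg_args(input_file, output_file):
    """Build the argument list for the FFmpeg command shown above."""
    return [
        "ffmpeg", "-i", input_file,
        "-ac", "2",                  # two audio channels
        "-codec:a", "libmp3lame",    # encode with the LAME MP3 encoder
        "-b:a", "48k",               # 48 kbps audio bit rate
        "-ar", "16000",              # 16000 Hz sample rate
        output_file,
    ]


def convert(input_file, output_file):
    """Run the conversion; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(ffmpeg_args(input_file, output_file), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.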

Another option is Audacity:

  1. Open the file to convert.
  2. Set the Project Rate in the lower-left corner to 16000.
  3. Click File > Export Audio and change the Save as type to MP3 Files.
  4. Click Options, set the Quality to 48 kbps and the Bit Rate Mode to Constant.

Exporting to MP3 from Audacity requires the LAME library, which can be found at: http://lame.buanzo.org/#lamewindl.

Hosting the Audio Files for Your Skill

The MP3 files you use to provide audio must be hosted on an endpoint that uses HTTPS. The endpoint must provide an SSL certificate signed by an Amazon-approved certificate authority. Many content hosting services provide this. For example, you could host your files on Amazon Simple Storage Service (Amazon S3), an Amazon Web Services offering.

We don’t require that you authenticate the requests for the audio files. Therefore, you must not include any customer-specific or sensitive information in these audio files. For example, building a custom MP3 file in response to a user’s request, and including sensitive information within the audio, is not allowed.

break

Represents a pause in the speech. Set the length of the pause with the strength or time attributes.

Attribute Possible Values

strength

  • none: No pause is output. This can be used to remove a pause that would normally occur (such as after a period).
  • x-weak: No pause is output (same as none).
  • weak: Treat adjacent words as if separated by a single comma (equivalent to medium).
  • medium: Treat adjacent words as if separated by a single comma.
  • strong: Make a sentence break (equivalent to using the <s> tag).
  • x-strong: Make a paragraph break (equivalent to using the <p> tag).

time

Duration of the pause; up to 10 seconds (10s) or 10000 milliseconds (10000ms). Include the unit with the time (s or ms).

The default is medium. This is used if you don’t specify any attributes, or if you provide any unsupported attribute values.

<speak>
    There is a three second pause here <break time="3s"/> 
    then the speech continues.
</speak> 
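A small helper can enforce the attribute values described above when break tags are generated in code. This is a sketch: `break_tag` is a hypothetical name, and for simplicity it accepts only whole-number times such as `3s` or `500ms`.

```python
import re

STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}
MAX_MS = 10000  # pauses may be up to 10 seconds long


def break_tag(time=None, strength=None):
    """Return a <break/> tag after validating the requested pause."""
    if time is not None:
        match = re.fullmatch(r"(\d+)(ms|s)", time)
        if not match:
            raise ValueError("time needs a number and a unit (s or ms): " + time)
        value, unit = int(match.group(1)), match.group(2)
        if value * (1000 if unit == "s" else 1) > MAX_MS:
            raise ValueError("pause longer than 10 seconds: " + time)
        return '<break time="%s"/>' % time
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError("unsupported strength: " + strength)
        return '<break strength="%s"/>' % strength
    # No attributes: the default strength (medium) applies.
    return "<break/>"
```

Rejecting bad values locally is a design choice for this sketch; as noted above, the service itself falls back to medium for unsupported values.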

p

Represents a paragraph. This tag provides extra-strong breaks before and after the tag. This is equivalent to specifying a pause with <break strength="x-strong"/>.

<speak>                                         
    <p>This is the first paragraph. There should be a pause after this text is spoken.</p>       
    <p>This is the second paragraph.</p> 
</speak>                                        

phoneme

Provides a phonemic/phonetic pronunciation for the contained text. For example, people may pronounce words like “pecan” differently.

Attribute Possible Values

alphabet

Set to the phonetic alphabet to use:

  • ipa — The International Phonetic Alphabet (IPA).
  • x-sampa — The Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA).

ph

The phonetic pronunciation to speak.

See Supported Symbols for the list of symbols you can use.

When using this tag, Alexa uses the pronunciation provided in the ph attribute rather than the text contained within the tag. However, you should still provide human-readable text within the tags. In the following example, the word “pecan” shown within the tags is never spoken. Instead, Alexa speaks the text provided in the ph attribute:

<speak>
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. 
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak> 
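When phoneme tags are generated in code, the ph value needs XML escaping. This matters most for X-SAMPA, where the primary stress symbol listed under Supported Symbols is a double quote. The sketch below uses the standard-library `xml.sax.saxutils.escape`; `phoneme_tag` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

ALPHABETS = {"ipa", "x-sampa"}


def phoneme_tag(text, ph, alphabet="ipa"):
    """Return a <phoneme> tag that speaks `ph` in place of `text`."""
    if alphabet not in ALPHABETS:
        raise ValueError("unsupported alphabet: " + alphabet)
    # escape() handles &, < and >; the extra mapping also escapes double
    # quotes so the value is safe inside a double-quoted attribute.
    ph_attr = escape(ph, {'"': "&quot;"})
    return '<phoneme alphabet="%s" ph="%s">%s</phoneme>' % (
        alphabet, ph_attr, escape(text))
```

With this helper, the X-SAMPA form of the second pronunciation above, `"pi.k{n`, is written into the attribute as `&quot;pi.k{n`.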

Supported Symbols

The following tables list the supported symbols for use with the phoneme tag. These symbols provide full coverage for the sounds of US English. Note that many non-English languages require the use of symbols not included in this list, which are not supported. Using symbols not included in this list is discouraged, as it may result in suboptimal speech synthesis.

Consonants

IPA X-SAMPA Description Examples
b b voiced bilabial plosive bed
d d voiced alveolar plosive dig
d͡ʒ dZ voiced postalveolar affricate jump
ð D voiced dental fricative then
f f voiceless labiodental fricative five
g g voiced velar plosive game
h h voiceless glottal fricative house
j j palatal approximant yes
k k voiceless velar plosive cat
l l alveolar lateral approximant lay
m m bilabial nasal mouse
n n alveolar nasal nap
ŋ N velar nasal thing
p p voiceless bilabial plosive speak
ɹ r\ alveolar approximant red
s s voiceless alveolar fricative seem
ʃ S voiceless postalveolar fricative ship
t t voiceless alveolar plosive trap
t͡ʃ tS voiceless postalveolar affricate chart
θ T voiceless dental fricative thin
v v voiced labiodental fricative vest
w w labial-velar approximant west
z z voiced alveolar fricative zero
ʒ Z voiced postalveolar fricative vision

Vowels

IPA X-SAMPA Description Examples
ə @ mid central vowel arena
ɚ @` mid central r-colored vowel reader
æ { near-open front unrounded vowel trap
aɪ aI diphthong price
aʊ aU diphthong mouth
ɑ A long open back unrounded vowel father
eɪ eI diphthong face
ɝ 3` open-mid central unrounded r-colored vowel nurse
ɛ E open-mid front unrounded vowel dress
i i long close front unrounded vowel fleece
ɪ I near-close near-front unrounded vowel kit
oʊ oU diphthong goat
ɔ O long open-mid back rounded vowel thought
ɔɪ OI diphthong choice
u u long close back rounded vowel goose
ʊ U near-close near-back rounded vowel foot
ʌ V open-mid back unrounded vowel strut

Additional symbols

IPA X-SAMPA Description Examples
ˈ " primary stress Alabama
ˌ % secondary stress Alabama
. . syllable boundary A.la.ba.ma
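To catch symbols outside these tables before they reach Alexa, a coarse character-level check can be used. This is only a sketch: `unsupported_ipa_chars` is a hypothetical helper, it covers IPA only, and it checks individual characters rather than whole symbols, so a bare "a" passes even though it appears in the tables only as part of a diphthong.

```python
# Characters that make up the IPA symbols in the tables above.
# a, e and o occur only as the first element of the diphthongs
# (X-SAMPA aI, aU, eI, oU); "\u0361" is the combining tie bar
# used in the affricates.
SUPPORTED_IPA_CHARS = set(
    "bdʒðfghjklmnŋpɹsʃtθvwz"   # consonants
    "əɚæaɪʊɑeɝɛioɔuʌ"          # vowels and diphthong parts
    "ˈˌ."                      # stress marks and syllable boundary
    "\u0361"
)


def unsupported_ipa_chars(ph):
    """Return the characters of `ph` that occur in no supported IPA symbol."""
    return [ch for ch in ph if ch not in SUPPORTED_IPA_CHARS]
```

An empty result means every character appears somewhere in the tables; it does not prove that the sequence forms valid symbols.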

s

Represents a sentence. This tag provides strong breaks before and after the tag.

This is equivalent to:

  • Ending a sentence with a period (.).
  • Specifying a pause with <break strength="strong"/>.

<speak>
    <s>This is a sentence</s>
    <s>There should be a short pause before this second sentence</s> 
    This sentence ends with a period and should have the same pause.
</speak>
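If the sentences come from data rather than a fixed string, they can be wrapped programmatically. The following sketch escapes each sentence and wraps it in an s tag; `speak_sentences` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape


def speak_sentences(sentences):
    """Wrap each sentence in <s> tags and the whole text in <speak> tags."""
    # escape() converts &, < and > so plain text cannot break the markup.
    body = "".join("<s>%s</s>" % escape(sentence) for sentence in sentences)
    return "<speak>%s</speak>" % body
```

Escaping matters here because an unescaped ampersand in the source text would make the SSML malformed.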

say-as

Describes how the text should be interpreted. This lets you provide additional context for the text and eliminate any ambiguity about how Alexa should render it. Indicate how Alexa should interpret the text with the interpret-as attribute.

Attribute Possible Values

interpret-as

  • characters, spell-out: Spell out each letter.
  • cardinal, number: Interpret the value as a cardinal number.
  • ordinal: Interpret the value as an ordinal number.
  • digits: Spell each digit separately.
  • fraction: Interpret the value as a fraction. This works for both common fractions (such as 3/20) and mixed fractions (such as 1+1/2).
  • unit: Interpret a value as a measurement. The value should be either a number or fraction followed by a unit (with no space in between) or just a unit.
  • date: Interpret the value as a date. Specify the format with the format attribute.
  • time: Interpret a value such as 1'21" as duration in minutes and seconds.
  • telephone: Interpret a value as a 7-digit or 10-digit telephone number. This can also handle extensions (for example, 2025551212x345).
  • address: Interpret a value as part of a street address.
  • interjection: (English (US) only) Interpret the value as an interjection. Alexa speaks the text in a more expressive voice. For optimal results, only use the supported interjections and surround each one with a pause. For example: <say-as interpret-as="interjection">Wow.</say-as>.

format

Only used when interpret-as is set to date. Set to one of the following to indicate the format of the date:

  • mdy
  • dmy
  • ymd
  • md
  • dm
  • ym
  • my
  • d
  • m
  • y

Alternatively, if you provide the date in YYYYMMDD format, the format attribute is ignored. You can include question marks (?) for portions of the date to leave out. For instance, Alexa would speak <say-as interpret-as="date">????0922</say-as> as “September 22nd”.

Note that the Alexa service attempts to interpret the provided text correctly based on the text’s formatting even without this tag. For example, if your output speech includes “202-555-1212”, Alexa speaks each individual digit, with a brief pause for each dash. You don’t need to use <say-as interpret-as="telephone"> in this case. However, if you provided the text “2025551212”, but you wanted Alexa to speak it as a phone number, you would need to use <say-as interpret-as="telephone">.

<speak>
    Here is a number spoken as a cardinal number: 
    <say-as interpret-as="cardinal">12345</say-as>.
    Here is the same number with each digit spoken separately:
    <say-as interpret-as="digits">12345</say-as>.
    Here is a word spelled out: <say-as interpret-as="spell-out">hello</say-as>
</speak>
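The attribute rules above can be captured in a small generator. This is a sketch under the assumption that the values listed in this section are the complete set; `say_as` and `partial_date` are hypothetical helper names. `partial_date` produces the YYYYMMDD form, using question marks for the parts to leave out.

```python
from xml.sax.saxutils import escape

INTERPRET_AS = {
    "characters", "spell-out", "cardinal", "number", "ordinal", "digits",
    "fraction", "unit", "date", "time", "telephone", "address", "interjection",
}
DATE_FORMATS = {"mdy", "dmy", "ymd", "md", "dm", "ym", "my", "d", "m", "y"}


def say_as(text, interpret_as, date_format=None):
    """Return a <say-as> tag; `date_format` is only allowed with "date"."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError("unsupported interpret-as value: " + interpret_as)
    attrs = 'interpret-as="%s"' % interpret_as
    if date_format is not None:
        if interpret_as != "date" or date_format not in DATE_FORMATS:
            raise ValueError("format is only valid for dates: " + date_format)
        attrs += ' format="%s"' % date_format
    return "<say-as %s>%s</say-as>" % (attrs, escape(text))


def partial_date(year=None, month=None, day=None):
    """Format a date as YYYYMMDD, with ? for each omitted part."""
    y = "%04d" % year if year is not None else "????"
    m = "%02d" % month if month is not None else "??"
    d = "%02d" % day if day is not None else "??"
    return y + m + d
```

For instance, `say_as(partial_date(month=9, day=22), "date")` yields the `????0922` markup described above.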

speak

This is the root element of an SSML document. When using SSML with the Alexa Skills Kit, surround the text to be spoken with this tag.

<speak>
    This is what Alexa sounds like without any SSML.
</speak>

w

Similar to <say-as>, this tag customizes the pronunciation of words by specifying the word’s part of speech.

Attribute Possible Values

role

Set to one of the following:

  • ivona:VB: Interpret the word as a verb (present simple).
  • ivona:VBD: Interpret the word as a past participle.
  • ivona:NN: Interpret the word as a noun.
  • ivona:SENSE_1: Use the non-default sense of the word. For example, the noun “bass” is pronounced differently depending on meaning. The “default” meaning is the lowest part of the musical range. The alternate sense (which is still a noun) is a freshwater fish. Specifying <speak><w role="ivona:SENSE_1">bass</w></speak> renders the non-default pronunciation (freshwater fish).

<speak>
    The word <say-as interpret-as="characters">read</say-as> may be interpreted 
    as either the present simple form <w role="ivona:VB">read</w>, 
    or the past participle form <w role="ivona:VBD">read</w>.
</speak>
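As with the other tags, the role values can be validated when the markup is generated. A minimal sketch, assuming the four roles listed above are the only ones accepted; `w_tag` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

ROLES = {"ivona:VB", "ivona:VBD", "ivona:NN", "ivona:SENSE_1"}


def w_tag(word, role):
    """Return a <w> tag that marks `word` with the given role."""
    if role not in ROLES:
        raise ValueError("unsupported role: " + role)
    return '<w role="%s">%s</w>' % (role, escape(word))
```

For example, `w_tag("read", "ivona:VBD")` produces the same markup as the last line of the example above.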