Voice Expression Authoring Notes

From Hanson Robotics Wiki

For hand-authored speech and algorithmic decoration of dialog/chatbot system output, authors and developers may use supported tags to mark up the text that is sent to a TTS (text-to-speech) subsystem. A simple markup system has also been implemented that lets key phrases and key words be marked, with automatic translation into tags controlling speech rate, inflection and emphasis. Text with this markup can be used with the esay() command in the performance scripting environment, and it can also include specific SSML tags, which are summarized below and described more fully at the w3.org link. The simple markup supports the following decorations.

()  key phrases are rendered more slowly
 *  key words are given emphasis and rendered more slowly
 #  the pitch is increased for the marked words
 Example: Now (*witches and *wizards), as #you perhaps know, are *people who are *born for the (first time).
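
How the interpreter maps these decorations to tags is not documented here. As a rough illustration only, the following Python sketch expands the three decorations into SSML-style tags. The function name expand_markup and the specific tag choices (rate="slow", pitch="high", a bare emphasis tag) are assumptions made for this example, not the actual implementation.

```python
import re
from xml.sax.saxutils import escape


def expand_markup(text: str) -> str:
    """Expand the (), * and # decorations into SSML-style tags (hypothetical mapping)."""
    # Escape &, < and > first so the plain text cannot break the XML.
    text = escape(text)
    # *word: key word, given emphasis and rendered more slowly.
    text = re.sub(
        r"\*(\w+)",
        r'<prosody rate="slow"><emphasis>\1</emphasis></prosody>',
        text,
    )
    # #word: pitch is increased for the marked word.
    text = re.sub(r"#(\w+)", r'<prosody pitch="high">\1</prosody>', text)
    # (phrase): key phrase, rendered more slowly. Nested parentheses are not handled.
    text = re.sub(r"\(([^()]*)\)", r'<prosody rate="slow">\1</prosody>', text)
    return "<speak>" + text + "</speak>"


if __name__ == "__main__":
    print(expand_markup("Now (*witches and *wizards), as #you perhaps know."))
```

The key words are expanded before the key phrases so that a marked word inside a marked phrase ends up nested inside the phrase's tag.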

A standard acting text recommends that a scored monolog should target 40% key phrases and 12% key words. The interpreter has built-in feedback that reports these ratios.
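
To check a script against those targets by hand, a small Python sketch like the one below can help. It assumes the ratios are counted per word (words inside parentheses, and words prefixed with *, each divided by the total word count). The interpreter's own definition may differ, and the name markup_ratios is hypothetical.

```python
import re


def markup_ratios(text: str) -> dict:
    """Return the fraction of words in key phrases and the fraction marked as key words."""
    # A word is a run of letters/digits/apostrophes, optionally prefixed by * or #.
    words = re.findall(r"[*#]?[\w']+", text)
    total = len(words)
    if total == 0:
        return {"key_phrases": 0.0, "key_words": 0.0}
    # Count the words inside each (non-nested) parenthesized key phrase.
    phrase_words = sum(
        len(re.findall(r"[\w']+", phrase))
        for phrase in re.findall(r"\(([^()]*)\)", text)
    )
    key_words = sum(1 for word in words if word.startswith("*"))
    return {"key_phrases": phrase_words / total, "key_words": key_words / total}


if __name__ == "__main__":
    sample = ("Now (*witches and *wizards), as #you perhaps know, "
              "are *people who are *born for the (first time).")
    print(markup_ratios(sample))
```

Under this word-count assumption the example sentence above comes out at roughly 29% key phrases and 24% key words.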

SSML, the standard set of speech synthesis tags, is defined here: http://www.w3.org/TR/speech-synthesis/

One thing to know is that in contrast to most programming or markup contexts, there really are no guarantees that your tags will be rendered. From the standard documents:

All prosodic attribute values are indicative. If a synthesis processor is unable to accurately render a document as specified (e.g., trying to set the pitch to 1Mhz or the speaking rate to 1,000,000 words per minute), it must make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and may inform the host environment when such limits are exceeded.

To use any specific tags, wrap the response or your scripted text in speak tags. This may not be necessary when passing text directly to our ROS TTS or the CereProc API, but if any other XML parser is upstream you need them.

<speak>  Here is what you want to say </speak>
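
Since an upstream XML parser will reject malformed markup, it can be worth checking the wrapped text before sending it on. The sketch below is one way to do that in Python with the standard library. wrap_speak is a hypothetical helper, not part of the ROS TTS or CereProc APIs.

```python
import xml.etree.ElementTree as ET


def wrap_speak(text: str) -> str:
    """Wrap text in <speak> tags if needed and verify the result is well-formed XML."""
    wrapped = text.strip()
    if not wrapped.startswith("<speak"):
        wrapped = "<speak>" + wrapped + "</speak>"
    # Raises xml.etree.ElementTree.ParseError on unbalanced or malformed tags.
    ET.fromstring(wrapped)
    return wrapped


if __name__ == "__main__":
    print(wrap_speak("Here is what you want to say"))
```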

If you're modifying AIML you have to use escape syntax, or use a preprocessor which substitutes the escaped characters before starting the chatbot parser. Otherwise the content is not parsed and will not be passed through.
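
As an illustration of the escaping step, the Python sketch below converts the angle brackets to XML entities and back. It assumes that "escape syntax" here means standard XML entity escaping (&lt; and &gt;) and that a preprocessor would simply reverse it. Neither detail is confirmed for any particular AIML interpreter.

```python
from xml.sax.saxutils import escape, unescape

ssml = '<speak><voice emotion="cross">You are very rude!</voice></speak>'

# Escaped form, suitable for embedding inside an AIML template.
escaped = escape(ssml)

# What a preprocessor would do before the text reaches the TTS: restore the tags.
restored = unescape(escaped)

if __name__ == "__main__":
    print(escaped)
    print(restored)
```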

I have excerpted the most relevant tags and examples below, including some CereProc proprietary tags. Probably the most useful are the emotion, break, emphasis and prosody tags. A variant tag allows generation of alternate renderings for words or phrases, but I did not hear any effect when it was used inside an emotion tag.

Voice (Emotion): CereVoice custom tag

The high-level CereVoice tags attempt to do the work for you for a limited range of emotions. Valid emotions: "happy", "calm", "cross". It's possible to use this tag for the whole sentence, then use more detailed tags for breaks or pitch changes on particular words.

<speak><voice emotion="cross">You are very rude!</voice></speak>
<speak><voice emotion="cross">You <break time="0.5s"/> are very rude!</voice></speak>

Emphasis tag

This is an obvious assist, but it doesn't seem to make a big audible difference. It may be that the default volume is already at maximum, so you only hear the emphasis if you use an outer prosody tag (described below) to lower the overall volume.

That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis> bank account!
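
To test the idea that emphasis only becomes audible at a lower overall volume, the emphasized word can be nested in a quieter prosody block. The Python sketch below builds such a string. The helper name emphasize is hypothetical, the volume and level labels are those listed in the SSML 1.0 recommendation, and whether a given voice honors them is untested.

```python
from xml.sax.saxutils import escape

VOLUMES = ("silent", "x-soft", "soft", "medium", "loud", "x-loud", "default")
LEVELS = ("strong", "moderate", "none", "reduced")


def emphasize(before: str, word: str, after: str,
              volume: str = "soft", level: str = "strong") -> str:
    """Emphasize one word inside a sentence spoken at a reduced overall volume."""
    if volume not in VOLUMES:
        raise ValueError(f"unknown volume label: {volume}")
    if level not in LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return (
        f'<speak><prosody volume="{volume}">{escape(before)}'
        f'<emphasis level="{level}">{escape(word)}</emphasis>'
        f'{escape(after)}</prosody></speak>'
    )


if __name__ == "__main__":
    print(emphasize("That is a ", "huge", " bank account!"))
```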

Break

The break tag inserts a pause, given either as an exact time specification (time="") or as a strength="" specification.

Valid strengths: "none", "x-weak", "weak", "medium" (default value), "strong", or "x-strong".

 Take a deep breath <break/>
 then continue. 
 Press 1 or wait for the tone. <break time="3s"/>
 I didn't hear you! <break strength="weak"/> Please repeat.
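
When break tags are generated by a script, validating the attribute values up front avoids silently ignored markup. The following Python sketch is one possible helper. break_tag is a hypothetical name, and the time check only accepts the "3s" / "250ms" style shown above.

```python
import re

STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")
# A number followed by "ms" or "s", e.g. "250ms", "3s", "0.5s".
_TIME = re.compile(r"^\d+(\.\d+)?(ms|s)$")


def break_tag(time=None, strength=None) -> str:
    """Build a <break/> tag, checking the optional time and strength values."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"invalid strength: {strength}")
        attrs.append(f'strength="{strength}"')
    if time is not None:
        if not _TIME.match(time):
            raise ValueError(f"invalid time: {time}")
        attrs.append(f'time="{time}"')
    return "<break" + "".join(" " + a for a in attrs) + "/>"


if __name__ == "__main__":
    print("Press 1 or wait for the tone. " + break_tag(time="3s"))
```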

Prosody and attributes

Prosody is the most general control over any parameter for the contained block of text. While all the attributes are optional, you must use at least one (or why bother?). The attributes include pitch, contour (pitch curve at time breakpoints), range (pitch variation) and rate.

From the SSML doc: Prosody Element

The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:

pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.

contour: sets the actual pitch contour for the contained text.

The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.

Notice that you can use a percentage or an absolute frequency; for the pitch one can also use musical offsets in semitones (a semitone is a half step; e.g. pitch="+1st").

 <prosody contour="(0%,+20Hz) (10%,+30Hz) (40%,+10Hz)">  good morning  </prosody>

range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.

rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.

duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Takes precedence over rate. Follows the time value format from the Cascading Style Sheet Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
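
The attribute descriptions above can be exercised with a small builder. The Python sketch below is an illustration only. The helper names prosody and contour are hypothetical, the attribute names are those of the SSML 1.0 prosody element (where the reading-time attribute is called duration), and no claim is made about which of them a given voice actually renders.

```python
from xml.sax.saxutils import escape, quoteattr

ALLOWED = ("pitch", "contour", "range", "rate", "duration", "volume")


def contour(points) -> str:
    """Format (percent, target) pairs as a contour value, e.g. "(0%,+20Hz) (10%,+30Hz)"."""
    return " ".join(f"({pct}%,{target})" for pct, target in points)


def prosody(text: str, **attrs) -> str:
    """Wrap text in a <prosody> tag; at least one known attribute is required."""
    if not attrs:
        raise ValueError("prosody needs at least one attribute")
    unknown = set(attrs) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unknown prosody attributes: {sorted(unknown)}")
    # Emit attributes in a fixed order so the output is predictable.
    rendered = "".join(
        f" {name}={quoteattr(str(attrs[name]))}" for name in ALLOWED if name in attrs
    )
    return f"<prosody{rendered}>{escape(text)}</prosody>"


if __name__ == "__main__":
    curve = contour([(0, "+20Hz"), (10, "+30Hz"), (40, "+10Hz")])
    print(prosody("good morning", contour=curve))
    print(prosody("slow and a semitone higher", rate="slow", pitch="+1st"))
```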

Variant Tags

The variant tag allows the user to request a different version of the synthesis for a particular section of speech. This is a very useful tag that can be used to make sections of speech sound more appropriate. The variant number can be increased to produce further, different versions of the speech. The original version is equivalent to variant 0. For example, to change the version of the word "test" in "This is a test sentence", use:

This is a <usel variant='1'>test</usel> sentence.

The variant tag can be used to produce a bespoke rendering of a particular piece of speech. For example, an often-used speech prompt could be tuned to give a different rendering if desired.
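
When tuning a prompt, it is convenient to generate several candidate renderings and audition them one by one. The Python sketch below produces the marked-up strings for variants 0 to count-1. The helper name variants is hypothetical, and how the strings are sent to the synthesizer is left out.

```python
from xml.sax.saxutils import escape


def variants(before: str, target: str, after: str, count: int = 3) -> list:
    """Return one marked-up sentence per variant number, starting at 0 (the original)."""
    results = []
    for number in range(count):
        results.append(
            f"<speak>{escape(before)}"
            f"<usel variant='{number}'>{escape(target)}</usel>"
            f"{escape(after)}</speak>"
        )
    return results


if __name__ == "__main__":
    for candidate in variants("This is a ", "test", " sentence."):
        print(candidate)
```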

Vocal Gestures

CereProc provides a set of vocal gestures, listed below. UPDATE: D. DeMaris attempted to run through the list and only a few worked (1, 2, 3, 7, 8 and 9, marked with OK below).

Non-speech sounds, such as laughter and coughing, can be inserted into the output speech. The <spurt> tag is used with an audio attribute to select a vocal gesture to include in the synthesis output, for example:

 <spurt audio="g0001_004">cough</spurt>, excuse me, <spurt audio="g0001_018">err</spurt>, hello.

The <spurt> tag cannot be empty; however, the text content of the tag is not read. It is replaced by the gesture.

Gesture list

Gesture IDs take the form g0001_NNN, where NNN is the gesture number below zero-padded to three digits, as in the example above.

1 tut    OK
2 tut tut  OK
3 cough OK
4 cough
5 cough
6 clear throat
7 breath in   OK 
8 sharp intake of breath  OK
9 breath in through teeth  OK 
10 sigh happy
11 sigh sad
12 hmm question
13 hmm yes
14 hmm thinking
15 umm
16 umm
17 err
18 err
19 giggle
20 giggle
21 laugh
22 laugh
23 laugh
24 laugh
25 ah positive
26 ah negative
27 yeah question
28 yeah positive
29 yeah resigned
30 sniff
31 sniff
32 argh
33 argh
34 ugh
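
Building the gesture ID from the list number is easy to get wrong by hand, so a tiny helper can be used. The Python sketch below assumes the g0001_ prefix and three-digit zero padding seen in the example above apply to all 34 entries. The name spurt_tag is hypothetical.

```python
def spurt_tag(number: int, label: str = "gesture") -> str:
    """Build a <spurt> tag for gesture `number` from the list (1 to 34).

    `label` is placeholder text only: the tag must not be empty,
    but its content is replaced by the gesture and not read aloud.
    """
    if not 1 <= number <= 34:
        raise ValueError(f"gesture number out of range: {number}")
    if not label:
        raise ValueError("the spurt tag cannot be empty")
    return f'<spurt audio="g0001_{number:03d}">{label}</spurt>'


if __name__ == "__main__":
    print(spurt_tag(3, "cough") + ", excuse me, " + spurt_tag(7, "breath in") + ", hello.")
```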