SSML Builder Documentation - v1.0.1

    Class VoiceBuilder

    Builder class for creating voice-specific content within an SSML document. Provides a fluent API for adding text, pauses, emphasis, prosody, and other speech synthesis features.

    This class encapsulates all the content that will be spoken by a specific voice, including text, audio, and various speech modification elements.

    // Basic usage
    const builder = new SSMLBuilder({ lang: 'en-US' });
    builder
      .voice('en-US-AvaNeural')
      .text('Hello, world!')
      .break('500ms')
      .emphasis('Important!', 'strong')
      .build();

    // Advanced usage with emotions and prosody
    builder
      .voice('en-US-AvaNeural')
      .expressAs('I am so happy!', { style: 'cheerful' })
      .prosody('Speaking slowly', { rate: 'slow', pitch: 'low' })
      .paragraph(p => p
        .text('This is a paragraph.')
        .sentence(s => s.text('With a sentence.'))
      )
      .build();

    Constructors

    • Creates a new VoiceBuilder instance.

      Parameters

      • options: VoiceOptions

        Voice configuration options

        Configuration options for voice elements.

        Defines the voice to use for speech synthesis and optional effects that can be applied to modify the voice output.

        • Optional effect?: string

          Optional audio effect to apply to the voice.

          Modifies the voice output to simulate different audio environments or transmission methods.

          Available effects:

          • eq_car: Optimized for car speakers
          • eq_telecomhp8k: Telephone quality (8kHz sampling)
          • eq_telecomhp3k: Lower quality telephone (3kHz sampling)
          "eq_telecomhp8k" - Phone call sound
          
          "eq_car" - Car speaker optimization
          
        • name: string

          Voice identifier for text-to-speech synthesis. (Required)

          Must be a valid voice name supported by the speech service. Format typically: language-REGION-NameNeural

          Common voices:

          • English: en-US-AvaNeural, en-US-AndrewNeural, en-US-EmmaNeural
          • Spanish: es-ES-ElviraNeural, es-ES-AlvaroNeural
          • French: fr-FR-DeniseNeural, fr-FR-HenriNeural
          • Multilingual: en-US-AvaMultilingualNeural, en-US-AndrewMultilingualNeural
          "en-US-AvaNeural"
          
          "en-US-AndrewMultilingualNeural"
          
          "es-ES-ElviraNeural"
          
      • Optional parent: SSMLBuilder

        Optional reference to parent SSMLBuilder for chaining

      Returns VoiceBuilder

      const voiceBuilder = new VoiceBuilder({
        name: 'en-US-AvaNeural',
        effect: 'eq_car'
      });

    Methods

    • audio: Inserts an audio file into the speech output. Supports fallback text if the audio is unavailable.

      Parameters

      • src: string

        URL of the audio file (must be a publicly accessible HTTPS URL)

      • Optional fallbackText: string

        Optional text to speak if the audio fails to load

      Returns this

      This VoiceBuilder instance for method chaining

      // With fallback text
      voice.audio(
        'https://example.com/sound.mp3',
        'Sound effect here'
      );

      // Without fallback
      voice.audio('https://example.com/music.mp3');
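
For orientation, a call such as voice.audio(src, fallbackText) corresponds to the standard SSML audio element, with the fallback text carried as the element's content. The sketch below is a hypothetical standalone helper (renderAudio and escapeXml are illustrative names, not the library's own functions), and the library's actual output may differ in detail.

```typescript
// Hypothetical helpers for illustration only; not part of the library.
const XML_ENTITIES: Record<string, string> = {
  '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;',
};

// Single-pass escape, so entities that were just inserted are never re-escaped.
const escapeXml = (s: string): string =>
  s.replace(/[&<>"']/g, ch => XML_ENTITIES[ch] ?? ch);

// Renders an <audio> element; the fallback text becomes the element content.
function renderAudio(src: string, fallbackText?: string): string {
  const content = fallbackText ? escapeXml(fallbackText) : '';
  return `<audio src="${escapeXml(src)}">${content}</audio>`;
}

console.log(renderAudio('https://example.com/sound.mp3', 'Sound effect here'));
```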
    • audioDuration: Controls the total audio duration (Azure Speech Service specific). Speeds up or slows down speech to fit the target duration.

      Parameters

      • value: string

        Target duration (e.g., '10s', '5000ms')

      Returns this

      This VoiceBuilder instance for method chaining

      voice
        .audioDuration('10s')
        .text('This text will be adjusted to last exactly 10 seconds.');
    • break: Adds a pause or break in speech. Specify either the duration or the strength of the pause.

      Parameters

      • Optional options: string | BreakOptions

        Break configuration or duration string

        • string
        • BreakOptions

          Configuration options for break/pause elements.

          Defines pauses in speech either by strength (semantic) or explicit duration. If both are specified, time takes precedence.

          • Optional strength?: BreakStrength

            Semantic strength of the pause.

            Each strength corresponds to a typical pause duration:

            • x-weak: 250ms (very short)
            • weak: 500ms (short, like a comma)
            • medium: 750ms (default, like a period)
            • strong: 1000ms (long, like paragraph break)
            • x-strong: 1250ms (very long, for emphasis)

            Ignored if time is specified.

            "medium"
            
            "strong"
            
          • Optional time?: string

            Explicit duration of the pause.

            Specified in milliseconds (ms) or seconds (s). Valid range: 0-20000ms (20 seconds max). Values above 20000ms are capped at 20000ms.

            Takes precedence over strength if both are specified.

            "500ms" - Half second
            
            "2s" - 2 seconds
            
            "1500ms" - 1.5 seconds
            

      Returns this

      This VoiceBuilder instance for method chaining

      // Using duration string
      voice.break('500ms');
      voice.break('2s');

      // Using strength
      voice.break({ strength: 'medium' });

      // Using explicit time (takes precedence over strength if both are set)
      voice.break({ time: '750ms' });
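
The precedence and range rules described above can be sketched as follows. This is a minimal illustration under stated assumptions: renderBreak, parseDurationMs, and MAX_BREAK_MS are hypothetical names, and the documentation does not say whether the capping at 20000ms happens in the library or in the speech service; here it is done locally for clarity.

```typescript
type BreakStrength = 'x-weak' | 'weak' | 'medium' | 'strong' | 'x-strong';
interface BreakOptions { strength?: BreakStrength; time?: string; }

const MAX_BREAK_MS = 20000; // documented upper bound for a pause

// Parses "500ms" or "2s" into milliseconds; returns undefined if malformed.
function parseDurationMs(value: string): number | undefined {
  const match = /^(\d+(?:\.\d+)?)(ms|s)$/.exec(value.trim());
  if (!match) return undefined;
  const amount = Number(match[1]);
  return match[2] === 's' ? amount * 1000 : amount;
}

// Hypothetical renderer: time wins over strength, and long pauses are capped.
function renderBreak(options?: string | BreakOptions): string {
  const opts: BreakOptions =
    typeof options === 'string' ? { time: options } : options ?? {};
  if (opts.time !== undefined) {
    const ms = parseDurationMs(opts.time);
    if (ms === undefined) throw new Error(`Invalid break time: ${opts.time}`);
    return `<break time="${Math.min(ms, MAX_BREAK_MS)}ms"/>`;
  }
  if (opts.strength !== undefined) return `<break strength="${opts.strength}"/>`;
  return '<break/>';
}

console.log(renderBreak({ strength: 'medium', time: '750ms' }));
```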
    • build: Builds the complete SSML document and returns it as a string. Delegates to the parent SSMLBuilder's build method.

      Returns string

      The complete SSML document as an XML string

      Throws an error if the VoiceBuilder was not created from an SSMLBuilder

      const ssml = new SSMLBuilder({ lang: 'en-US' })
        .voice('en-US-AvaNeural')
        .text('Hello!')
        .build();
      // Returns: <speak version="1.0" ...><voice name="en-US-AvaNeural">Hello!</voice></speak>
    • escapeXml (Protected)

      Escapes special XML characters in text content to ensure valid XML output.

      This method replaces XML special characters with their corresponding entity references to prevent XML parsing errors and potential security issues (XML injection). It should be used whenever inserting user-provided or dynamic text content into XML elements.

      The following characters are escaped:

      • & becomes &amp; (must be escaped first to avoid double-escaping)
      • < becomes &lt; (prevents opening of unintended tags)
      • > becomes &gt; (prevents closing of unintended tags)
      • " becomes &quot; (prevents breaking out of attribute values)
      • ' becomes &apos; (prevents breaking out of attribute values)

      This method is marked as protected so it's only accessible to classes that extend SSMLElement, ensuring proper encapsulation while allowing all element implementations to use this essential functionality.

      Parameters

      • text: string

        The text content to escape

      Returns string

      The text with all special XML characters properly escaped

      // In a render method implementation
      class TextElement extends SSMLElement {
        private text: string = 'Hello & "world" <script>';

        render(): string {
          // Escapes to: Hello &amp; &quot;world&quot; &lt;script&gt;
          return `<text>${this.escapeXml(this.text)}</text>`;
        }
      }

      // Edge cases handled correctly
      this.escapeXml('5 < 10 & 10 > 5');
      // Returns: '5 &lt; 10 &amp; 10 &gt; 5'

      this.escapeXml('She said "Hello"');
      // Returns: 'She said &quot;Hello&quot;'

      this.escapeXml("It's a test");
      // Returns: 'It&apos;s a test'

      // Prevents XML injection
      this.escapeXml('</voice><voice name="evil">');
      // Returns: '&lt;/voice&gt;&lt;voice name=&quot;evil&quot;&gt;'
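
The note that & must be escaped first can be seen in a small standalone sketch. The functions below are hypothetical reimplementations written for illustration; the library's protected method may be implemented differently.

```typescript
// Hypothetical standalone version of the escaping described above.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;') // first, so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Counter-example: escaping "&" last corrupts entities produced earlier.
function escapeAmpersandLast(text: string): string {
  return text.replace(/</g, '&lt;').replace(/&/g, '&amp;');
}

const good = escapeXml('5 < 10 & 10 > 5'); // '5 &lt; 10 &amp; 10 &gt; 5'
const bad = escapeAmpersandLast('5 < 10'); // '5 &amp;lt; 10' (double-escaped)
console.log(good, bad);
```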
    • expressAs: Expresses emotion or speaking style (Azure Speech Service specific). Only available for certain neural voices.

      Parameters

      • text: string

        Text to express with style

      • options: ExpressAsOptions

        Expression configuration

        Configuration options for express-as elements (Azure-specific).

        Controls emotional expression and speaking styles for neural voices that support these features. Allows for nuanced emotional delivery and role-playing scenarios.

        • Optional role?: ExpressAsRole

          Age and gender role for voice modification.

          Simulates different speaker characteristics. Only supported by certain voices.

          "Girl"
          
          "OlderAdultMale"
          
          "YoungAdultFemale"
          

          See the ExpressAsRole type for the full list.

        • style: ExpressAsStyle

          Emotional or speaking style to apply. (Required)

          The available styles depend on the voice being used. Common categories include emotions (cheerful, sad, angry), professional styles (newscast, customerservice), and special effects (whispering, shouting).

          "cheerful"
          
          "newscast-formal"
          
          "whispering"
          

          See the ExpressAsStyle type for the full list.

        • Optional styledegree?: string

          Intensity of the style expression.

          Controls how strongly the style is applied. Range: "0.01" (minimal) to "2" (double intensity).

          "1"
          
          "0.5" - Half intensity
          
          "1.5" - 50% more intense
          
          "2" - Maximum intensity
          

      Returns this

      This VoiceBuilder instance for method chaining

      // Express with emotion
      voice.expressAs('I am so happy to see you!', {
        style: 'cheerful',
        styledegree: '2'
      });

      // Express with role
      voice.expressAs('Once upon a time...', {
        style: 'narration-professional',
        role: 'OlderAdultMale'
      });
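
As a rough illustration of the styledegree range, the sketch below validates the value before emitting Azure's mstts:express-as element. The names isValidStyleDegree and renderExpressAs are hypothetical, the option types are simplified to plain strings, and whether the library itself validates the range is not stated in this documentation.

```typescript
// Simplified stand-in for the library's ExpressAsOptions (assumption).
interface ExpressAsOptions { style: string; styledegree?: string; role?: string; }

// Accepts plain decimal strings between 0.01 and 2 inclusive.
function isValidStyleDegree(value: string): boolean {
  if (!/^\d+(\.\d+)?$/.test(value)) return false;
  const degree = Number(value);
  return degree >= 0.01 && degree <= 2;
}

// Hypothetical renderer; `text` is assumed to be XML-escaped already.
function renderExpressAs(text: string, options: ExpressAsOptions): string {
  if (options.styledegree !== undefined && !isValidStyleDegree(options.styledegree)) {
    throw new RangeError(`styledegree out of range: ${options.styledegree}`);
  }
  let attrs = `style="${options.style}"`;
  if (options.styledegree !== undefined) attrs += ` styledegree="${options.styledegree}"`;
  if (options.role !== undefined) attrs += ` role="${options.role}"`;
  return `<mstts:express-as ${attrs}>${text}</mstts:express-as>`;
}

console.log(renderExpressAs('I am so happy to see you!', { style: 'cheerful', styledegree: '2' }));
```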
    • math: Embeds MathML content for mathematical expressions. The math content will be spoken as mathematical notation.

      Parameters

      • mathML: string

        MathML markup string

      Returns this

      This VoiceBuilder instance for method chaining

      voice.math(`
        <math xmlns="http://www.w3.org/1998/Math/MathML">
          <mrow>
            <mi>a</mi>
            <mo>+</mo>
            <mi>b</mi>
          </mrow>
        </math>
      `);
    • phoneme: Specifies exact phonetic pronunciation using phonetic alphabets. Provides precise control over pronunciation.

      Parameters

      • text: string

        Text to pronounce

      • options: PhonemeOptions

        Phoneme configuration

        Configuration options for phoneme elements.

        Provides exact phonetic pronunciation using standard phonetic alphabets. Essential for proper names, technical terms, or words with ambiguous pronunciation.

        • alphabet: PhonemeAlphabet

          Phonetic alphabet used for transcription. (Required)

          Available alphabets:

          • ipa: International Phonetic Alphabet (universal standard)
          • sapi: Microsoft SAPI phonemes (English-focused)
          • ups: Universal Phone Set (Microsoft's unified system)
          "ipa"
          
          "sapi"
          
        • ph: string

          Phonetic transcription of the word. (Required)

          The exact phonetic representation in the specified alphabet. Must be valid according to the chosen alphabet's rules.

          "ˈʃɛdjuːl" - IPA for "schedule" (British)
          
          "s k eh jh uw l" - SAPI for "schedule" (American)
          

      Returns this

      This VoiceBuilder instance for method chaining

      // IPA pronunciation
      voice.phoneme('tomato', {
      alphabet: 'ipa',
      ph: 'təˈmeɪtoʊ'
      });

      // SAPI pronunciation
      voice.phoneme('read', {
      alphabet: 'sapi',
      ph: 'r eh d'
      });
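
For reference, the standard SSML phoneme element carries the transcription in its ph attribute and keeps the written text as content. The sketch below uses hypothetical helper names (renderPhoneme, escapeXml) and only approximates what the builder might emit.

```typescript
type PhonemeAlphabet = 'ipa' | 'sapi' | 'ups';

const XML_ENTITIES: Record<string, string> = {
  '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;',
};

// Single-pass escape for both attribute values and text content.
const escapeXml = (s: string): string =>
  s.replace(/[&<>"']/g, ch => XML_ENTITIES[ch] ?? ch);

// Hypothetical renderer for a <phoneme> element.
function renderPhoneme(text: string, alphabet: PhonemeAlphabet, ph: string): string {
  return `<phoneme alphabet="${alphabet}" ph="${escapeXml(ph)}">${escapeXml(text)}</phoneme>`;
}

console.log(renderPhoneme('read', 'sapi', 'r eh d'));
```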
    • prosody: Modifies prosody (pitch, rate, volume, contour, range) of speech. Allows fine-grained control over how text is spoken.

      Parameters

      • text: string

        Text to modify

      • options: ProsodyOptions

        Prosody configuration options

        Configuration options for prosody (speech characteristics).

        Controls various aspects of speech delivery including pitch, speaking rate, volume, and intonation contours. Multiple properties can be combined for complex speech modifications.

        • Optional contour?: string

          Pitch contour changes over time.

          Defines how pitch changes during speech using time-position pairs. Format: "(time1,pitch1) (time2,pitch2) ...", with each time given as a percentage and each pitch as a value in Hz or a percentage change.

          "(0%,+5Hz) (50%,+10Hz) (100%,+5Hz)" - Rising, then falling intonation
          
          "(0%,+20Hz) (100%,-10Hz)" - Falling intonation
          
        • Optional pitch?: string

          Pitch adjustment for the speech.

          Can be specified as:

          • Absolute frequency: "200Hz", "150Hz"
          • Relative change: "+2st" (semitones), "+10%", "-5%"
          • Named values: "x-low", "low", "medium", "high", "x-high"
          "high" - High pitch
          
          "+10%" - 10% higher
          
          "200Hz" - Specific frequency
          
          "-2st" - 2 semitones lower
          
        • Optional range?: string

          Pitch range variation.

          Controls the variability of pitch (monotone vs expressive). Can be relative change or named value.

          "x-low" - Very monotone
          
          "high" - Very expressive
          
          "+10%" - 10% more variation
          
        • Optional rate?: string

          Speaking rate/speed.

          Can be specified as:

          • Multiplier: "0.5" (half speed), "2.0" (double speed)
          • Percentage: "+10%", "-20%"
          • Named values: "x-slow", "slow", "medium", "fast", "x-fast"
          "slow" - Slow speech
          
          "1.5" - 50% faster
          
          "+25%" - 25% faster
          
        • Optional volume?: string

          Volume level of the speech.

          Can be specified as:

          • Numeric: "0" to "100" (0=silent, 100=loudest)
          • Percentage: "50%", "80%"
          • Decibels: "+10dB", "-5dB"
          • Named values: "silent", "x-soft", "soft", "medium", "loud", "x-loud"
          "soft" - Quiet speech
          
          "loud" - Loud speech
          
          "50" - 50% volume
          
          "+5dB" - 5 decibels louder
          

      Returns this

      This VoiceBuilder instance for method chaining

      // Slow and quiet speech
      voice.prosody('Speaking slowly and quietly', {
        rate: 'slow',
        volume: 'soft',
        pitch: 'low'
      });

      // Precise numeric values
      voice.prosody('Precise control', {
        rate: '0.8',
        pitch: '+5%',
        volume: '+10dB'
      });
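
The contour format and the way several prosody properties combine can be sketched as below. The helpers buildContour and renderProsody are hypothetical, the attribute order is an arbitrary choice, and the text is assumed to be XML-escaped already; the library's real output may differ.

```typescript
interface ContourPoint { position: number; pitch: string; } // position in percent
interface ProsodyOptions {
  pitch?: string; contour?: string; range?: string; rate?: string; volume?: string;
}

// Builds "(time1,pitch1) (time2,pitch2) ..." from structured points.
function buildContour(points: ContourPoint[]): string {
  return points.map(p => `(${p.position}%,${p.pitch})`).join(' ');
}

const PROSODY_KEYS = ['pitch', 'contour', 'range', 'rate', 'volume'] as const;

// Emits only the attributes that were actually provided.
function renderProsody(text: string, options: ProsodyOptions): string {
  const attrs = PROSODY_KEYS
    .filter(key => options[key] !== undefined)
    .map(key => `${key}="${options[key]}"`)
    .join(' ');
  return attrs ? `<prosody ${attrs}>${text}</prosody>` : text;
}

const falling = buildContour([
  { position: 0, pitch: '+20Hz' },
  { position: 100, pitch: '-10Hz' },
]);
console.log(renderProsody('Precise control', { rate: '0.8', contour: falling }));
```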
    • render (Internal)

      Renders this voice element as an XML string. Internal method used by SSMLBuilder.

      Returns string

      The voice element as an XML string

      // Internal usage
      const xml = voiceBuilder.render();
      // Returns: <voice name="en-US-AvaNeural">content here</voice>
    • sayAs: Controls how text is interpreted and pronounced. Useful for dates, numbers, currency, abbreviations, etc.

      Parameters

      • text: string

        Text to interpret

      • options: SayAsOptions

        Say-as configuration

        Configuration options for say-as elements.

        Controls interpretation and pronunciation of formatted text like dates, numbers, currency, and other specialized content.

        • Optional detail?: string

          Additional detail for interpretation.

          Provides extra context for certain interpretAs types:

          • For currency: ISO currency code (USD, EUR, GBP, etc.)
          • For other types: Additional pronunciation hints
          "USD" - US Dollars
          
          "EUR" - Euros
          
          "JPY" - Japanese Yen
          
        • Optional format?: string

          Format hint for interpretation.

          Provides additional formatting information. Available formats depend on interpretAs value:

          For dates:

          • "mdy": Month-day-year
          • "dmy": Day-month-year
          • "ymd": Year-month-day
          • "md": Month-day
          • "dm": Day-month
          • "ym": Year-month
          • "my": Month-year
          • "d": Day only
          • "m": Month only
          • "y": Year only

          For time:

          • "hms12": 12-hour format with seconds
          • "hms24": 24-hour format with seconds
          "ymd" - For date: 2025-12-31
          
          "hms24" - For time: 14:30:00
          
        • interpretAs: SayAsInterpretAs

          How to interpret the text content. (Required)

          Determines the pronunciation rules applied to the text. Each type has specific formatting requirements.

          "date" - For dates
          
          "cardinal" - For numbers
          
          "telephone" - For phone numbers
          
          "currency" - For money
          

          See the SayAsInterpretAs type for the full list.

      Returns this

      This VoiceBuilder instance for method chaining

      // Date interpretation
      voice.sayAs('2025-08-24', {
        interpretAs: 'date',
        format: 'ymd'
      });

      // Currency
      voice.sayAs('42.50', {
        interpretAs: 'currency',
        detail: 'USD'
      });

      // Spell out
      voice.sayAs('SSML', { interpretAs: 'spell-out' });

      // Phone number
      voice.sayAs('1234567890', { interpretAs: 'telephone' });
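
In SSML, these options map onto the interpret-as, format, and detail attributes of the say-as element. The following sketch shows that mapping with a hypothetical renderSayAs helper and a simplified options type; it assumes the text is already XML-escaped and is not the library's own implementation.

```typescript
// Simplified stand-in for the library's SayAsOptions (assumption).
interface SayAsOptions { interpretAs: string; format?: string; detail?: string; }

// Hypothetical renderer: optional attributes appear only when provided.
function renderSayAs(text: string, options: SayAsOptions): string {
  let attrs = `interpret-as="${options.interpretAs}"`;
  if (options.format !== undefined) attrs += ` format="${options.format}"`;
  if (options.detail !== undefined) attrs += ` detail="${options.detail}"`;
  return `<say-as ${attrs}>${text}</say-as>`;
}

console.log(renderSayAs('2025-08-24', { interpretAs: 'date', format: 'ymd' }));
console.log(renderSayAs('42.50', { interpretAs: 'currency', detail: 'USD' }));
```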
    • silence: Adds silence at specific positions in the speech. More precise than the break element for controlling silence placement.

      Parameters

      • options: SilenceOptions

        Silence configuration

        Configuration options for silence elements.

        Provides precise control over silence placement in speech output, with options for various positions and boundary types.

        • type: SilenceType

          Position and type of silence to add. (Required)

          Determines where silence is inserted:

          • Leading types: Beginning of text
          • Tailing types: End of text
          • Boundary types: Between sentences or at punctuation
          • Exact types: Replace natural silence with specified duration
          "Sentenceboundary"
          
          "Leading-exact"
          
          "Comma-exact"
          
        • value: string

          Duration of the silence. (Required)

          Specified in milliseconds (ms) or seconds (s). Valid range: 0-20000ms (20 seconds max)

          For non-exact types, this is added to natural silence. For exact types, this replaces natural silence.

          "200ms" - 200 milliseconds
          
          "1s" - 1 second
          
          "500ms" - Half second
          

      Returns this

      This VoiceBuilder instance for method chaining

      // Add silence between sentences
      voice.silence({ type: 'Sentenceboundary', value: '500ms' });

      // Add leading silence
      voice.silence({ type: 'Leading', value: '200ms' });

      // Add exact silence at comma
      voice.silence({ type: 'Comma-exact', value: '150ms' });
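
The distinction between additive and exact silence types can be illustrated with a short sketch. Azure's markup for this feature is the mstts:silence element; the helper names below (isExactSilence, renderSilence, describeSilence) are hypothetical, and the type is kept as a plain string rather than the library's SilenceType.

```typescript
// Simplified stand-in for the library's SilenceOptions (assumption).
interface SilenceOptions { type: string; value: string; }

// Exact types replace natural silence; the others add to it.
const isExactSilence = (type: string): boolean => type.endsWith('-exact');

// Hypothetical renderer for the Azure-specific silence element.
function renderSilence(options: SilenceOptions): string {
  return `<mstts:silence type="${options.type}" value="${options.value}"/>`;
}

function describeSilence(options: SilenceOptions): string {
  const mode = isExactSilence(options.type) ? 'replaces' : 'is added to';
  return `${options.value} ${mode} the natural silence (${options.type})`;
}

console.log(renderSilence({ type: 'Sentenceboundary', value: '500ms' }));
console.log(describeSilence({ type: 'Comma-exact', value: '150ms' }));
```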
    • sub: Substitutes text with an alias for pronunciation. Useful for acronyms or text that should be pronounced differently.

      Parameters

      • original: string

        Original text to display

      • alias: string

        How the text should be pronounced

      Returns this

      This VoiceBuilder instance for method chaining

      voice
        .text('The ')
        .sub('W3C', 'World Wide Web Consortium')
        .text(' sets web standards.');
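
In standard SSML, the chain above would correspond to a sub element whose alias attribute holds the spoken form while the written form stays as content. The sketch below uses hypothetical helpers (renderSub, escapeXml) to show the expected shape; the exact string the library emits is an assumption.

```typescript
const XML_ENTITIES: Record<string, string> = {
  '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;',
};

// Single-pass escape for attribute values and text content.
const escapeXml = (s: string): string =>
  s.replace(/[&<>"']/g, ch => XML_ENTITIES[ch] ?? ch);

// Hypothetical renderer: alias is spoken, original is the written form.
function renderSub(original: string, alias: string): string {
  return `<sub alias="${escapeXml(alias)}">${escapeXml(original)}</sub>`;
}

const sentence =
  'The ' + renderSub('W3C', 'World Wide Web Consortium') + ' sets web standards.';
console.log(sentence);
```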
    • text: Adds plain text to be spoken by the voice. Special characters (&, <, >, ", ') are automatically escaped.

      Parameters

      • text: string

        The text to be spoken

      Returns this

      This VoiceBuilder instance for method chaining

      voice.text('Hello, world!');

      // Multiple text segments can be chained
      voice
        .text('First part. ')
        .text('Second part.');
    • ttsEmbedding: Uses a custom speaker profile for voice synthesis (Azure Speech Service specific). Requires a pre-trained speaker profile.

      Parameters

      • speakerProfileId: string

        ID of the speaker profile

      • text: string

        Text to speak with the custom voice

      Returns this

      This VoiceBuilder instance for method chaining

      voice.ttsEmbedding(
        'profile-id-123',
        'This is spoken with a custom voice profile.'
      );
    • voice: Switches to a different voice while maintaining the fluent API chain. Allows multiple voices in the same SSML document.

      Parameters

      • name: string

        Name of the new voice (e.g., 'en-US-AndrewNeural')

      • Optional effect: string

        Optional voice effect for the new voice

      Returns VoiceBuilder

      A new VoiceBuilder instance for the specified voice

      Throws an error if the VoiceBuilder was not created from an SSMLBuilder

      new SSMLBuilder({ lang: 'en-US' })
        .voice('en-US-AvaNeural')
        .text('Hello from Ava!')
        .voice('en-US-AndrewNeural')
        .text('Hello from Andrew!')
        .build();
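
A document with two voices presumably ends up as sibling voice elements under a single speak root. The sketch below shows that structure with a hypothetical renderDocument helper; the speak attributes used here are the standard SSML ones and are an assumption, since this documentation elides the attributes the library actually writes. Segment text is assumed to be XML-escaped already.

```typescript
interface VoiceSegment { name: string; text: string; }

// Hypothetical renderer: one <voice> element per segment, in order.
function renderDocument(lang: string, segments: VoiceSegment[]): string {
  const voices = segments
    .map(s => `<voice name="${s.name}">${s.text}</voice>`)
    .join('');
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xml:lang="${lang}">${voices}</speak>`
  );
}

const doc = renderDocument('en-US', [
  { name: 'en-US-AvaNeural', text: 'Hello from Ava!' },
  { name: 'en-US-AndrewNeural', text: 'Hello from Andrew!' },
]);
console.log(doc);
```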