Transcripts and Commit Strategies

Control when and how transcripts are finalized in real-time streaming.

Why Commits Matter

In real-time transcription, the model continuously refines its understanding as more audio arrives. A word that sounds like "their" might become "there" or "they're" once more context is heard. The commit mechanism lets you decide when to "lock in" the transcript.

Transcript Types

Type	Description
Partial	Interim "best guess" results that update frequently as audio is processed. Use for live feedback (showing text as the user speaks), but don't save these - they may change.
Committed	Final, stable results after a commit occurs. Use these as the source of truth for your application - they won't change.
Committed with Timestamps	Same as committed, but includes word-level timing data for subtitles, karaoke, or lip-sync.

Manual Commit (Default)

You explicitly control when transcript segments finalize.

Python

async with client.speech_to_text.realtime.connect(
    model_id="scribe_v2_realtime",
) as connection:
    # Send audio
    await connection.send({
        "audio_base_64": audio_base_64,
        "sample_rate": 16000,
    })

    # Commit when ready (e.g., pause in speech, end of sentence)
    await connection.commit()

JavaScript

const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
});

// Send audio
connection.send({
  audioBase64: audioBase64,
  sampleRate: 16000,
});

// Commit when ready
connection.commit();

Best Practices

Commit every 20-30 seconds for optimal performance
Commit during silence or logical breaks (end of sentence, speaker change)
Auto-commit at 90 seconds if no manual commit is sent

Providing Context

Send previous text with the first audio chunk to help the model:

await connection.send({
    "audio_base_64": first_chunk,
    "sample_rate": 16000,
    "previous_text": "So as I was saying,"  # Keep under 50 characters
})

This helps with:

Continuing conversations after reconnection
Providing context for better accuracy
Handling sentence fragments

Voice Activity Detection (VAD)

VAD listens for silence and automatically commits when the speaker pauses. This creates natural transcript segments that match how people actually speak - pausing between sentences and thoughts. Recommended for live microphone input.

Configuration

const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,    // Silence duration before commit
    threshold: 0.4,               // Speech detection sensitivity (0-1)
    minSpeechDurationMs: 100,     // Minimum speech length required
    minSilenceDurationMs: 100,    // Minimum silence length required
  },
});

Parameters

Parameter	Description	Default
`silenceThresholdSecs`	Seconds of silence before auto-commit	1.5
`threshold`	Speech detection sensitivity (lower = more sensitive)	0.4
`minSpeechDurationMs`	Ignore speech shorter than this	100
`minSilenceDurationMs`	Ignore silence shorter than this	100

When to Use VAD

Live microphone input
Conversational applications
When natural speech boundaries are preferred
Client-side implementations

When to Use Manual Commit

Processing audio files
Known segment boundaries
Maximum control over timing
Server-side batch processing

Supported Audio Formats

Format	Sample Rate	Notes
PCM 16-bit	16kHz	Recommended, best balance
PCM 16-bit	8kHz - 48kHz	Supported range
μ-law 8-bit	8kHz	Telephony compatibility

3.8 KiB Raw Permalink Blame History