resolutionflow/.agent/skills/speech-to-text/references/realtime-commit-strategies.md
2026-02-15 00:43:41 -05:00

Transcripts and Commit Strategies

Control when and how transcripts are finalized in real-time streaming.

Why Commits Matter

In real-time transcription, the model continuously refines its understanding as more audio arrives. A word that sounds like "their" might become "there" or "they're" once more context is heard. The commit mechanism lets you decide when to "lock in" the transcript.

Transcript Types

| Type | Description |
| --- | --- |
| Partial | Interim "best guess" results that update frequently as audio is processed. Use for live feedback (showing text as the user speaks), but don't save these - they may change. |
| Committed | Final, stable results after a commit occurs. Use these as the source of truth for your application - they won't change. |
| Committed with Timestamps | Same as committed, but includes word-level timing data for subtitles, karaoke, or lip-sync. |
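
The split between partial and committed results can be sketched as a small buffer that overwrites the live text on every partial update and only appends on commit. This is a minimal sketch: the `TranscriptBuffer` class is hypothetical, and the event shape (a dict with `message_type` and `text` keys, and the three type names used below) is an assumption for illustration, not a confirmed wire format.

```python
class TranscriptBuffer:
    """Keeps live (partial) text separate from saved (committed) text."""

    # Assumed event type names; adjust to whatever your client actually emits.
    COMMITTED_TYPES = ("committed_transcript", "committed_transcript_with_timestamps")

    def __init__(self):
        self.partial = ""    # latest interim guess, overwritten on each update
        self.committed = []  # finalized segments, append-only

    def handle(self, event):
        kind = event.get("message_type")
        text = event.get("text", "")
        if kind == "partial_transcript":
            # Partials replace each other; never persist them.
            self.partial = text
        elif kind in self.COMMITTED_TYPES:
            # A commit finalizes the segment and clears the live guess.
            self.committed.append(text)
            self.partial = ""

    def display_text(self):
        """Text to show the user: committed segments plus the current guess."""
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Only `committed` would be written to storage; `display_text()` is meant for on-screen feedback.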

Manual Commit (Default)

You explicitly control when transcript segments finalize.

Python

async with client.speech_to_text.realtime.connect(
    model_id="scribe_v2_realtime",
) as connection:
    # Send audio
    await connection.send({
        "audio_base_64": audio_base_64,
        "sample_rate": 16000,
    })

    # Commit when ready (e.g., pause in speech, end of sentence)
    await connection.commit()

JavaScript

const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
});

// Send audio
connection.send({
  audioBase64: audioBase64,
  sampleRate: 16000,
});

// Commit when ready
connection.commit();

Best Practices

  • Commit every 20-30 seconds for optimal performance
  • Commit during silence or logical breaks (end of sentence, speaker change)
  • If no manual commit is sent, the transcript is auto-committed after 90 seconds
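
One possible way to turn these guidelines into a commit policy: wait for a pause once 20 seconds have passed, and commit regardless at 30 seconds. The `should_commit` function and its `soft_limit` and `hard_limit` parameters are hypothetical names for this sketch; how you detect silence is left to the caller.

```python
def should_commit(seconds_since_commit, in_silence, soft_limit=20.0, hard_limit=30.0):
    """Decide whether to send a manual commit now.

    seconds_since_commit: audio seconds sent since the last commit.
    in_silence: True if the caller currently detects a pause or logical break.
    """
    if seconds_since_commit >= hard_limit:
        # Do not let a segment grow past the upper bound, pause or not.
        return True
    # Inside the 20-30 second window, prefer to commit on a pause.
    return in_silence and seconds_since_commit >= soft_limit
```

Called once per audio chunk, this keeps segments within the recommended window while still preferring natural break points.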

Providing Context

Send the preceding text with the first audio chunk to give the model context:

await connection.send({
    "audio_base_64": first_chunk,
    "sample_rate": 16000,
    "previous_text": "So as I was saying,"  # Keep under 50 characters
})

This helps with:

  • Continuing conversations after reconnection
  • Providing context for better accuracy
  • Handling sentence fragments
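
When reconnecting, the value for `previous_text` can be taken from the tail of what was already committed. A minimal sketch, assuming the 50-character guidance from the comment above; the `tail_context` helper is hypothetical and trims at a word boundary so the context does not start mid-word.

```python
def tail_context(transcript, limit=50):
    """Return the end of `transcript`, shorter than `limit` characters."""
    text = " ".join(transcript.split())  # collapse runs of whitespace
    if len(text) < limit:
        return text
    tail = text[-(limit - 1):]
    # If the cut landed inside a word, drop that leading fragment.
    cut_mid_word = text[-limit] != " " and tail[0] != " "
    if cut_mid_word and " " in tail:
        tail = tail.split(" ", 1)[1]
    return tail.strip()
```

For example, `tail_context("So as I was saying,")` returns the string unchanged because it is already under the limit.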

Voice Activity Detection (VAD)

VAD listens for silence and automatically commits when the speaker pauses. This creates natural transcript segments that match how people actually speak, pausing between sentences and thoughts. Recommended for live microphone input.

Configuration

const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,    // Silence duration before commit
    threshold: 0.4,               // Speech detection sensitivity (0-1)
    minSpeechDurationMs: 100,     // Minimum speech length required
    minSilenceDurationMs: 100,    // Minimum silence length required
  },
});

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| silenceThresholdSecs | Seconds of silence before auto-commit | 1.5 |
| threshold | Speech detection sensitivity (lower = more sensitive) | 0.4 |
| minSpeechDurationMs | Ignore speech shorter than this | 100 |
| minSilenceDurationMs | Ignore silence shorter than this | 100 |
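
To build intuition for how the four parameters interact, here is a toy simulation over per-frame speech probabilities. It is an illustrative model only, not the service's actual VAD algorithm: the `VadSettings` dataclass, the `commit_points` function, and the frame-based input are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class VadSettings:
    silence_threshold_secs: float = 1.5
    threshold: float = 0.4
    min_speech_duration_ms: int = 100
    min_silence_duration_ms: int = 100


def commit_points(probs, frame_ms, settings=None):
    """Return the frame indices at which this toy VAD would commit."""
    s = settings or VadSettings()
    commits = []
    speech_ms = 0       # length of the current speech run
    silence_ms = 0      # length of the current silence run
    has_speech = False  # confirmed speech since the last commit
    for i, p in enumerate(probs):
        if p >= s.threshold:
            speech_ms += frame_ms
            # Blips shorter than the minimum do not count as speech.
            if speech_ms >= s.min_speech_duration_ms:
                has_speech = True
                silence_ms = 0
        else:
            silence_ms += frame_ms
            # Gaps shorter than the minimum do not end a speech run.
            if silence_ms >= s.min_silence_duration_ms:
                speech_ms = 0
            if has_speech and silence_ms >= s.silence_threshold_secs * 1000:
                commits.append(i)
                has_speech = False
                silence_ms = 0
    return commits
```

With 100 ms frames, half a second of speech followed by 1.5 seconds of silence yields one commit at the frame where the silence threshold is reached; raising `threshold` makes quieter frames count as silence.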

When to Use VAD

  • Live microphone input
  • Conversational applications
  • When natural speech boundaries are preferred
  • Client-side implementations

When to Use Manual Commit

  • Processing audio files
  • Known segment boundaries
  • Maximum control over timing
  • Server-side batch processing

Supported Audio Formats

| Format | Sample Rate | Notes |
| --- | --- | --- |
| PCM 16-bit | 16kHz | Recommended, best balance |
| PCM 16-bit | 8kHz - 48kHz | Supported range |
| μ-law 8-bit | 8kHz | Telephony compatibility |
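
As a sketch of preparing a 16-bit PCM chunk for the `send` payload shown earlier, the helper below converts float samples in the range -1.0 to 1.0 into base64-encoded PCM. The `pcm16_chunk` function is hypothetical, and little-endian byte order is an assumption; check which byte order your audio source and the service expect.

```python
import base64
import struct


def pcm16_chunk(samples, sample_rate=16000):
    """Encode float samples (-1.0..1.0) as a base64 PCM 16-bit payload dict."""
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be within 8 kHz to 48 kHz")
    # Clip out-of-range samples, then scale to the signed 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)  # "<h" = little-endian int16 (assumed)
    return {
        "audio_base_64": base64.b64encode(raw).decode("ascii"),
        "sample_rate": sample_rate,
    }
```

The returned dict uses the same keys as the Python `send` example, so it could be passed along as the message body.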