3.8 KiB
3.8 KiB
Transcripts and Commit Strategies
Control when and how transcripts are finalized in real-time streaming.
Why Commits Matter
In real-time transcription, the model continuously refines its understanding as more audio arrives. A word that sounds like "their" might become "there" or "they're" once more context is heard. The commit mechanism lets you decide when to "lock in" the transcript.
Transcript Types
| Type | Description |
|---|---|
| Partial | Interim "best guess" results that update frequently as audio is processed. Use for live feedback (showing text as the user speaks), but don't save these - they may change. |
| Committed | Final, stable results after a commit occurs. Use these as the source of truth for your application - they won't change. |
| Committed with Timestamps | Same as committed, but includes word-level timing data for subtitles, karaoke, or lip-sync. |
Manual Commit (Default)
You explicitly control when transcript segments finalize.
Python
async with client.speech_to_text.realtime.connect(
model_id="scribe_v2_realtime",
) as connection:
# Send audio
await connection.send({
"audio_base_64": audio_base_64,
"sample_rate": 16000,
})
# Commit when ready (e.g., pause in speech, end of sentence)
await connection.commit()
JavaScript
const connection = await client.speechToText.realtime.connect({
modelId: "scribe_v2_realtime",
});
// Send audio
connection.send({
audioBase64: audioBase64,
sampleRate: 16000,
});
// Commit when ready
connection.commit();
Best Practices
- Commit every 20-30 seconds for optimal performance
- Commit during silence or logical breaks (end of sentence, speaker change)
- Auto-commit at 90 seconds if no manual commit is sent
Providing Context
Send previous text with the first audio chunk to help the model:
await connection.send({
"audio_base_64": first_chunk,
"sample_rate": 16000,
"previous_text": "So as I was saying," # Keep under 50 characters
})
This helps with:
- Continuing conversations after reconnection
- Providing context for better accuracy
- Handling sentence fragments
Voice Activity Detection (VAD)
VAD listens for silence and automatically commits when the speaker pauses. This creates natural transcript segments that match how people actually speak - pausing between sentences and thoughts. Recommended for live microphone input.
Configuration
const connection = await client.speechToText.realtime.connect({
modelId: "scribe_v2_realtime",
vad: {
silenceThresholdSecs: 1.5, // Silence duration before commit
threshold: 0.4, // Speech detection sensitivity (0-1)
minSpeechDurationMs: 100, // Minimum speech length required
minSilenceDurationMs: 100, // Minimum silence length required
},
});
Parameters
| Parameter | Description | Default |
|---|---|---|
silenceThresholdSecs |
Seconds of silence before auto-commit | 1.5 |
threshold |
Speech detection sensitivity (lower = more sensitive) | 0.4 |
minSpeechDurationMs |
Ignore speech shorter than this | 100 |
minSilenceDurationMs |
Ignore silence shorter than this | 100 |
When to Use VAD
- Live microphone input
- Conversational applications
- When natural speech boundaries are preferred
- Client-side implementations
When to Use Manual Commit
- Processing audio files
- Known segment boundaries
- Maximum control over timing
- Server-side batch processing
Supported Audio Formats
| Format | Sample Rate | Notes |
|---|---|---|
| PCM 16-bit | 16kHz | Recommended, best balance |
| PCM 16-bit | 8kHz - 48kHz | Supported range |
| μ-law 8-bit | 8kHz | Telephony compatibility |