Major UI overhaul plans; Other random docs
@@ -0,0 +1,124 @@
# Transcripts and Commit Strategies

Control when and how transcripts are finalized in real-time streaming.

## Why Commits Matter

In real-time transcription, the model continuously refines its understanding as more audio arrives. A word that sounds like "their" might become "there" or "they're" once more context is heard. The **commit** mechanism lets you decide when to "lock in" the transcript.

## Transcript Types

| Type | Description |
|------|-------------|
| **Partial** | Interim "best guess" results that update frequently as audio is processed. Use them for live feedback (showing text as the user speaks), but don't persist them, because they may change. |
| **Committed** | Final, stable results produced when a commit occurs. Use these as the source of truth for your application; they won't change. |
| **Committed with Timestamps** | Same as committed, but includes word-level timing data for subtitles, karaoke, or lip-sync. |
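
The split between partial and committed results can be sketched on the client side as follows. This is a minimal illustration, not the SDK's event model: the `type` and `text` keys and the `"partial"` / `"committed"` values are assumed names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Keeps committed segments separate from the latest partial result."""

    committed: list[str] = field(default_factory=list)
    partial: str = ""

    def handle(self, event: dict) -> None:
        # The event shape and type names are assumptions for this sketch.
        kind = event.get("type")
        text = event.get("text", "")
        if kind == "partial":
            # Partials replace each other; they are never appended.
            self.partial = text
        elif kind == "committed":
            # Committed text is final, so the partial it supersedes is dropped.
            self.committed.append(text)
            self.partial = ""

    def display_text(self) -> str:
        """Text to show the user: stable segments plus the live guess."""
        return " ".join(part for part in [*self.committed, self.partial] if part)


buffer = TranscriptBuffer()
for event in [
    {"type": "partial", "text": "I think their"},
    {"type": "partial", "text": "I think there is"},
    {"type": "committed", "text": "I think there is a problem."},
    {"type": "partial", "text": "Let me"},
]:
    buffer.handle(event)

print(buffer.committed)
print(buffer.display_text())
```

Only `buffer.committed` would be written to storage in this design; `display_text()` is for on-screen feedback.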

## Manual Commit (Default)

You explicitly control when transcript segments are finalized.

### Python

```python
# Assumes `client` is an initialized SDK client and `audio_base_64`
# holds a base64-encoded audio chunk.
async with client.speech_to_text.realtime.connect(
    model_id="scribe_v2_realtime",
) as connection:
    # Send audio
    await connection.send({
        "audio_base_64": audio_base_64,
        "sample_rate": 16000,
    })

    # Commit when ready (e.g., pause in speech, end of sentence)
    await connection.commit()
```

### JavaScript

```javascript
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
});

// Send audio
connection.send({
  audioBase64: audioBase64,
  sampleRate: 16000,
});

// Commit when ready
connection.commit();
```

### Best Practices

- **Commit every 20-30 seconds** for optimal performance
- **Commit during silence** or at logical breaks (end of sentence, speaker change)
- **Expect an auto-commit at 90 seconds** if no manual commit has been sent
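
One way to combine these guidelines is a small policy object that the sending loop consults before each chunk. The sketch below is one interpretation, not part of the SDK: `CommitPolicy`, its field names, and the idea of tracking `secs_since_commit` and `in_silence` yourself are all assumptions. It waits for a pause once 20 seconds have elapsed and forces a commit at 30 seconds, which stays well inside the 90-second auto-commit.

```python
from dataclasses import dataclass


@dataclass
class CommitPolicy:
    """Hypothetical helper: decides when a manual commit should be sent."""

    target_secs: float = 20.0  # start looking for a pause from here
    max_secs: float = 30.0     # commit even if the speaker has not paused

    def should_commit(self, secs_since_commit: float, in_silence: bool) -> bool:
        if secs_since_commit >= self.max_secs:
            # Upper bound of the recommended window: commit regardless.
            return True
        # Inside the window, prefer to commit during silence.
        return in_silence and secs_since_commit >= self.target_secs


policy = CommitPolicy()
print(policy.should_commit(12.0, in_silence=True))   # too early
print(policy.should_commit(22.0, in_silence=False))  # in window, still speaking
print(policy.should_commit(22.0, in_silence=True))   # in window, paused
print(policy.should_commit(30.0, in_silence=False))  # window exhausted
```

In a sending loop, a `True` result would be followed by `await connection.commit()` and a reset of the elapsed-time counter.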

### Providing Context

Send previous text with the first audio chunk to give the model context:

```python
# Assumes `connection` is open and `first_chunk` is base64-encoded audio.
await connection.send({
    "audio_base_64": first_chunk,
    "sample_rate": 16000,
    "previous_text": "So as I was saying,",  # Keep under 50 characters
})
```

This helps with:
- Continuing conversations after reconnection
- Providing context for better accuracy
- Handling sentence fragments
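
After a reconnection, the context string usually has to be derived from whatever was committed last. The helper below is a sketch of one way to do that while staying under the 50-character guideline; `build_previous_text` and its `max_chars` parameter are hypothetical names, and trimming at a word boundary is a design choice of this example rather than a documented requirement.

```python
def build_previous_text(committed_text: str, max_chars: int = 49) -> str:
    """Return the tail of committed_text, at most max_chars long, starting on a whole word."""
    # Collapse runs of whitespace so the length check is predictable.
    text = " ".join(committed_text.split())
    if len(text) <= max_chars:
        return text
    tail = text[-max_chars:]
    # If the cut did not land right after a space, the tail starts with a
    # separator or a word fragment; drop everything up to the first space.
    if text[-max_chars - 1] != " ":
        _, _, rest = tail.partition(" ")
        tail = rest or tail
    return tail.strip()


last_committed = (
    "We reviewed the quarterly numbers and then moved on "
    "to the hiring plan for next year"
)
previous_text = build_previous_text(last_committed)
print(previous_text)
print(len(previous_text))
```

The result can be passed as `previous_text` with the first chunk sent on the new connection.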

## Voice Activity Detection (VAD)

VAD listens for silence and automatically commits when the speaker pauses. This creates natural transcript segments that match how people actually speak, pausing between sentences and thoughts. Recommended for live microphone input.

### Configuration

```javascript
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,  // Silence duration before commit
    threshold: 0.4,             // Speech detection sensitivity (0-1)
    minSpeechDurationMs: 100,   // Minimum speech length required
    minSilenceDurationMs: 100,  // Minimum silence length required
  },
});
```

### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `silenceThresholdSecs` | Seconds of silence before auto-commit | 1.5 |
| `threshold` | Speech detection sensitivity, from 0 to 1 (lower = more sensitive) | 0.4 |
| `minSpeechDurationMs` | Ignore speech shorter than this, in milliseconds | 100 |
| `minSilenceDurationMs` | Ignore silence shorter than this, in milliseconds | 100 |
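
To build intuition for how `threshold` and `silenceThresholdSecs` interact, the toy simulation below walks over per-frame speech probabilities and reports where a silence-triggered commit would fire. It is an illustration only and not the service's actual VAD algorithm: the frame-based input, the `find_commit_points` function, and the 500 ms frame size are assumptions, speech is summed rather than required to be contiguous, and `minSilenceDurationMs` is not modeled.

```python
def find_commit_points(
    speech_probs,
    frame_ms=100,
    threshold=0.4,
    silence_threshold_secs=1.5,
    min_speech_duration_ms=100,
):
    """Return the frame indices at which a silence-triggered commit would fire."""
    commits = []
    speech_ms = 0   # speech accumulated since the last commit
    silence_ms = 0  # length of the current silence run
    for index, prob in enumerate(speech_probs):
        if prob >= threshold:
            # Frame counts as speech; any silence run is interrupted.
            speech_ms += frame_ms
            silence_ms = 0
        else:
            silence_ms += frame_ms
            enough_speech = speech_ms >= min_speech_duration_ms
            if enough_speech and silence_ms >= silence_threshold_secs * 1000:
                commits.append(index)
                speech_ms = 0
                silence_ms = 0
    return commits


# Twelve 500 ms frames: speech, a long pause, more speech, another long pause.
probs = [0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1]
print(find_commit_points(probs, frame_ms=500))
print(find_commit_points(probs, frame_ms=500, threshold=0.95))
```

With the default threshold, commits fire at the end of each 1.5-second pause; raising the threshold to 0.95 classifies every frame as silence, so nothing is committed because no speech was detected.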

### When to Use VAD

- Live microphone input
- Conversational applications
- When natural speech boundaries are preferred
- Client-side implementations

### When to Use Manual Commit

- Processing audio files
- Known segment boundaries
- Maximum control over timing
- Server-side batch processing

## Supported Audio Formats

| Format | Sample Rate | Notes |
|--------|-------------|-------|
| PCM 16-bit | 16 kHz | Recommended, best balance |
| PCM 16-bit | 8 kHz to 48 kHz | Supported range |
| μ-law 8-bit | 8 kHz | Telephony compatibility |
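
As a sketch of preparing the recommended format, the snippet below converts floating-point samples to 16-bit PCM and base64-encodes them for the `audio_base_64` field used earlier. It uses only the Python standard library. Little-endian, mono sample layout is an assumption (the table states only "PCM 16-bit"), and `float_to_pcm16_base64` is a hypothetical helper name.

```python
import base64
import math
import struct


def float_to_pcm16_base64(samples):
    """Encode floats in [-1.0, 1.0] as 16-bit PCM, then base64."""
    # Clip first so out-of-range input cannot overflow a signed 16-bit value.
    clipped = [max(-1.0, min(1.0, sample)) for sample in samples]
    ints = [int(round(sample * 32767)) for sample in clipped]
    # "<h" = little-endian signed 16-bit (byte order is an assumption here).
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(pcm).decode("ascii")


SAMPLE_RATE = 16000
# 100 ms of a 440 Hz tone as stand-in audio.
tone = [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 10)
]
audio_base_64 = float_to_pcm16_base64(tone)
message = {"audio_base_64": audio_base_64, "sample_rate": SAMPLE_RATE}
print(len(base64.b64decode(audio_base_64)))
```

At 16 kHz and 2 bytes per sample, each 100 ms chunk carries 3200 bytes of raw audio before base64 encoding.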