Major UI overhaul plans; Other random docs

2026-02-15 00:43:41 -05:00
parent 1b86f66954
commit ef829f06a4
17 changed files with 5533 additions and 553 deletions
--- a/.agent/skills/speech-to-text/references/realtime-commit-strategies.md
+++ b/.agent/skills/speech-to-text/references/realtime-commit-strategies.md
@@ -0,0 +1,124 @@
+# Transcripts and Commit Strategies
+
+Control when and how transcripts are finalized in real-time streaming.
+
+## Why Commits Matter
+
+In real-time transcription, the model continuously refines its output as more audio arrives. A word first transcribed as "their" may be revised to "there" or "they're" once more context is heard. The **commit** mechanism lets you decide when to "lock in" the transcript.
+
+## Transcript Types
+
+| Type | Description |
+|------|-------------|
+| **Partial** | Interim "best guess" results that update frequently as audio is processed. Use for live feedback (showing text as the user speaks), but don't save them, because they may change. |
+| **Committed** | Final, stable results produced after a commit occurs. Use these as the source of truth for your application, because they won't change. |
+| **Committed with Timestamps** | Same as committed, but includes word-level timing data for subtitles, karaoke, or lip-sync. |
+
+## Manual Commit (Default)
+
+You explicitly control when transcript segments are finalized.
+
+### Python
+
+```python
+async with client.speech_to_text.realtime.connect(
+    model_id="scribe_v2_realtime",
+) as connection:
+    # Send one base64-encoded audio chunk along with its sample rate in Hz
+    await connection.send({
+        "audio_base_64": audio_base_64,
+        "sample_rate": 16000,
+    })
+
+    # Commit when ready (e.g., pause in speech, end of sentence)
+    await connection.commit()
+```
+
+### JavaScript
+
+```javascript
+const connection = await client.speechToText.realtime.connect({
+  modelId: "scribe_v2_realtime",
+});
+
+// Send one base64-encoded audio chunk along with its sample rate in Hz
+connection.send({
+  audioBase64: audioBase64,
+  sampleRate: 16000,
+});
+
+// Commit when ready
+connection.commit();
+```
+
+### Best Practices
+
+- **Commit every 20-30 seconds** for optimal performance
+- **Commit during silence** or logical breaks (end of sentence, speaker change)
+- **Auto-commit triggers at 90 seconds** if no manual commit has been sent, so commit before then to control segment boundaries
+
+### Providing Context
+
+Send previous text with the first audio chunk to give the model context from earlier in the conversation:
+
+```python
+await connection.send({
+    "audio_base_64": first_chunk,
+    "sample_rate": 16000,
+    "previous_text": "So as I was saying,"  # Keep under 50 characters
+})
+```
+
+This helps with:
+- Continuing conversations after reconnection
+- Providing context for better accuracy
+- Handling sentence fragments
+
+## Voice Activity Detection (VAD)
+
+VAD listens for silence and automatically commits when the speaker pauses. This creates natural transcript segments that match how people actually speak, with pauses between sentences and thoughts. Recommended for live microphone input.
+
+### Configuration
+
+```javascript
+const connection = await client.speechToText.realtime.connect({
+  modelId: "scribe_v2_realtime",
+  vad: {
+    silenceThresholdSecs: 1.5,    // Seconds of silence before auto-commit
+    threshold: 0.4,               // Speech detection sensitivity (0-1, lower = more sensitive)
+    minSpeechDurationMs: 100,     // Ignore speech shorter than this (ms)
+    minSilenceDurationMs: 100,    // Ignore silence shorter than this (ms)
+  },
+});
+```
+
+### Parameters
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `silenceThresholdSecs` | Seconds of silence before auto-commit | 1.5 |
+| `threshold` | Speech detection sensitivity, 0-1 (lower = more sensitive) | 0.4 |
+| `minSpeechDurationMs` | Ignore speech shorter than this | 100 |
+| `minSilenceDurationMs` | Ignore silence shorter than this | 100 |
+
+### When to Use VAD
+
+- Live microphone input
+- Conversational applications
+- When natural speech boundaries are preferred
+- Client-side implementations
+
+### When to Use Manual Commit
+
+- Processing audio files
+- Known segment boundaries
+- Maximum control over timing
+- Server-side batch processing
+
+## Supported Audio Formats
+
+| Format | Sample Rate | Notes |
+|--------|-------------|-------|
+| PCM 16-bit | 16 kHz | Recommended, best balance |
+| PCM 16-bit | 8 kHz to 48 kHz | Supported range |
+| μ-law 8-bit | 8 kHz | Telephony compatibility |