# Transcription Options
## Request Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file` | file | Yes | Audio or video file to transcribe |
| `model_id` | string | Yes | `scribe_v2` (or legacy `scribe_v1`) for batch transcription |
| `language_code` | string | No | Language hint (ISO 639-1 or ISO 639-3, e.g., `en` or `eng`) |
| `timestamps_granularity` | string | No | `none`, `word`, or `character` (default: `word`) |
| `diarize` | boolean | No | Enable speaker diarization (up to 32 speakers for batch) |
| `num_speakers` | integer | No | Maximum speakers to detect (up to 32 for batch) |
| `diarization_threshold` | number | No | Tune diarization sensitivity when `diarize=true` |
| `keyterms` | array | No | Terms to bias transcription (up to 100) |
| `tag_audio_events` | boolean | No | Detect non-speech sounds (laughter, applause) |
| `entity_detection` | string or array | No | Detect entities (e.g., `pii`, `phi`, `pci`, `offensive_language`) |
| `use_multi_channel` | boolean | No | Split multichannel audio into separate transcripts |
| `cloud_storage_url` | string | No | HTTPS URL to transcribe instead of uploading a file |
| `webhook` | boolean | No | Process async and send result to webhook |
| `webhook_metadata` | string or object | No | Custom metadata included in webhook responses |
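
The limits in the table can be checked locally before a request is sent. The sketch below is illustrative only: `validate_options` is a hypothetical helper, not part of the ElevenLabs SDK, and the lower bound of 1 for `num_speakers` is an assumption rather than a documented rule.

```python
# Hypothetical pre-flight check for the request parameters listed above.
# The limits (32 speakers, 100 keyterms, granularity values) come from the
# table; the helper itself is not part of any SDK.
VALID_GRANULARITIES = ("none", "word", "character")
MAX_SPEAKERS = 32
MAX_KEYTERMS = 100


def validate_options(options: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not options.get("model_id"):
        problems.append("model_id is required")
    # The table gives "word" as the default granularity.
    granularity = options.get("timestamps_granularity", "word")
    if granularity not in VALID_GRANULARITIES:
        problems.append(
            f"timestamps_granularity must be one of {VALID_GRANULARITIES}"
        )
    num_speakers = options.get("num_speakers")
    # Assumption: at least one speaker; the upper bound is from the table.
    if num_speakers is not None and not (1 <= num_speakers <= MAX_SPEAKERS):
        problems.append(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
    if len(options.get("keyterms") or []) > MAX_KEYTERMS:
        problems.append(f"keyterms accepts at most {MAX_KEYTERMS} entries")
    if "diarization_threshold" in options and not options.get("diarize"):
        problems.append("diarization_threshold only applies when diarize is true")
    return problems
```

Calling `validate_options({"model_id": "scribe_v2", "diarize": True})` returns an empty list, while a dict without `model_id` returns a single message naming the missing parameter.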
## Python Example
```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs()

# Open the audio in binary mode and send it for batch transcription
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
        language_code="eng",
        timestamps_granularity="word",
        diarize=True,
        keyterms=["ElevenLabs", "Scribe"],
    )
```
## JavaScript Example
```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
  languageCode: "eng",
  timestampsGranularity: "word",
  diarize: true,
  keyterms: ["ElevenLabs", "Scribe"],
});
```
## cURL Example
```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "file=@audio.mp3" \
  -F "model_id=scribe_v2" \
  -F "language_code=eng" \
  -F "timestamps_granularity=word" \
  -F "diarize=true"
```
## Response Structure
```json
{
  "text": "The complete transcribed text from the audio file.",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {
      "text": "The",
      "start": 0.0,
      "end": 0.15,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.15,
      "end": 0.16,
      "type": "spacing",
      "speaker_id": "speaker_0"
    }
  ]
}
```
## Response Fields
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Full transcription text |
| `language_code` | string | Detected language (ISO 639-1 or ISO 639-3) |
| `language_probability` | float | Confidence in detection (0-1) |
| `words` | array | Word-level timestamps (if requested) |
| `words[].text` | string | The transcribed word or spacing |
| `words[].start` | float | Start time in seconds |
| `words[].end` | float | End time in seconds |
| `words[].type` | string | `word`, `spacing`, or `audio_event` |
| `words[].speaker_id` | string | Speaker identifier (if diarization enabled) |
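
As a rough illustration of how these fields fit together, the sketch below works on a response that has already been parsed into a plain dict, for example from the cURL output. `speaker_talk_time` and `count_by_type` are hypothetical helper names, and the `"unknown"` fallback label is an assumption for responses produced without diarization.

```python
from collections import Counter, defaultdict


def speaker_talk_time(response: dict) -> dict:
    """Sum the duration of `word` entries per speaker, in seconds."""
    totals = defaultdict(float)
    for item in response.get("words", []):
        if item.get("type") != "word":
            continue  # skip spacing and audio_event entries
        # speaker_id is only present when diarization was enabled
        speaker = item.get("speaker_id", "unknown")
        totals[speaker] += item["end"] - item["start"]
    return dict(totals)


def count_by_type(response: dict) -> Counter:
    """Count entries in `words` by their `type` field."""
    return Counter(item.get("type") for item in response.get("words", []))


# Trimmed copy of the sample response shown under "Response Structure"
sample = {
    "text": "The complete transcribed text from the audio file.",
    "words": [
        {"text": "The", "start": 0.0, "end": 0.15,
         "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "start": 0.15, "end": 0.16,
         "type": "spacing", "speaker_id": "speaker_0"},
    ],
}

talk_time = speaker_talk_time(sample)
type_counts = count_by_type(sample)
```

Here `talk_time` holds one entry for `speaker_0` and `type_counts` records one `word` and one `spacing` entry.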
## Supported Languages (90+)
Common languages (ISO 639-3 codes):
| Code | Language | Code | Language |
|------|----------|------|----------|
| `eng` | English | `jpn` | Japanese |
| `spa` | Spanish | `kor` | Korean |
| `fra` | French | `zho` | Mandarin |
| `deu` | German | `ara` | Arabic |
| `ita` | Italian | `hin` | Hindi |
| `por` | Portuguese | `tur` | Turkish |
| `nld` | Dutch | `swe` | Swedish |
| `pol` | Polish | `dan` | Danish |
| `rus` | Russian | `fin` | Finnish |

Additional supported languages include: Afrikaans, Amharic, Armenian, Azerbaijani, Belarusian, Bengali, Bosnian, Bulgarian, Burmese, Cantonese, Catalan, Cebuano, Croatian, Czech, Estonian, Filipino, Georgian, Greek, Gujarati, Hausa, Hebrew, Hungarian, Icelandic, Indonesian, Irish, Javanese, Kannada, Kazakh, Khmer, Kyrgyz, Lao, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Malayalam, Maltese, Māori, Marathi, Mongolian, Nepali, Norwegian, Odia, Pashto, Persian, Punjabi, Romanian, Serbian, Shona, Sindhi, Slovak, Slovenian, Somali, Swahili, Tamil, Tajik, Telugu, Thai, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Wolof, Xhosa, Yoruba, Zulu.
## Format Requirements
**Audio:** MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus

**Video:** MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP

**Limits:**
- Maximum file size: 3 GB
- Maximum duration: 10 hours
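
A local check before uploading can catch unsupported or oversized files early. The following is a minimal sketch, not an official validator: `check_upload` and `check_file` are hypothetical helpers, the extension-to-format mapping is an assumption based on common file suffixes, and treating 3 GB as 3 × 1024³ bytes is also an assumption. Duration is not checked because that would require decoding the media.

```python
from pathlib import Path

# Assumed file suffixes for the audio and video formats listed above.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac",
    ".aiff", ".aif", ".opus",
    ".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".mpeg", ".mpg", ".3gp",
}
# Assumption: the 3 GB limit is interpreted as binary gigabytes.
MAX_FILE_BYTES = 3 * 1024**3


def check_upload(name: str, size_bytes: int) -> list:
    """Return a list of problems for a file name and size; empty means OK."""
    problems = []
    extension = Path(name).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is larger than {MAX_FILE_BYTES} bytes")
    return problems


def check_file(path: str) -> list:
    """Convenience wrapper that reads the size from disk."""
    file_path = Path(path)
    return check_upload(file_path.name, file_path.stat().st_size)
```

For example, `check_upload("audio.mp3", 1_000_000)` returns an empty list, while a `.txt` name or a size above the limit each produce one message.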
## Use Cases
### Subtitle Generation with Speakers
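```python
# `client` is the ElevenLabs client created in the Python example above
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
        timestamps_granularity="word",
        diarize=True,
    )

# Print each word with its speaker label and start time; this is the raw
# material for subtitle cues (spacing and audio events are skipped)
for word in result.words:
    if word.type == "word":
        print(f"[{word.speaker_id}] {word.text} ({word.start:.2f}s)")
```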
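
To turn word-level output into actual SRT cues, one possible approach is to group consecutive words by speaker and cap the number of words per cue. This sketch makes several assumptions: `Word` is a hypothetical stand-in for the SDK's word objects (only the fields from the response table are used), `words_to_srt` and `srt_timestamp` are illustrative names, the limit of 8 words per cue is arbitrary, and joining words with a space suits space-delimited languages only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    """Hypothetical stand-in for one entry of `result.words`."""
    text: str
    start: float
    end: float
    type: str = "word"
    speaker_id: Optional[str] = None


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words: int = 8) -> str:
    """Group words into cues by speaker and render them as SRT text."""
    cues = []  # each cue is [speaker_id, start, end, [word texts]]
    for word in words:
        if word.type != "word":
            continue  # ignore spacing and audio events
        same_speaker = cues and cues[-1][0] == word.speaker_id
        if same_speaker and len(cues[-1][3]) < max_words:
            cues[-1][2] = word.end          # extend the current cue
            cues[-1][3].append(word.text)
        else:
            cues.append([word.speaker_id, word.start, word.end, [word.text]])
    blocks = []
    for index, (speaker, start, end, texts) in enumerate(cues, 1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"[{speaker}] {' '.join(texts)}\n"
        )
    return "\n".join(blocks)
```

With a real response, `words_to_srt(result.words)` should work as long as the word objects expose `text`, `start`, `end`, `type`, and `speaker_id` as attributes, which is how the examples in this document access them.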
### Meeting Transcription with Custom Terms
```python
# `client` is the ElevenLabs client created in the Python example above
with open("meeting.mp3", "rb") as f:
    result = client.speech_to_text.convert(
        file=f,
        model_id="scribe_v2",
        diarize=True,
        keyterms=["Q4 forecast", "revenue target", "ACME Corp"],
    )

# Group by speaker: start a new labelled line whenever the speaker changes
current_speaker = None
for word in result.words:
    if word.type == "word":
        if word.speaker_id != current_speaker:
            current_speaker = word.speaker_id
            print(f"\n[{current_speaker}]:", end=" ")
        # Spacing entries are skipped, so separate words with a space
        print(word.text, end=" ")
print()
```