Major UI overhaul plans; Other random docs
.agent/skills/speech-to-text/references/transcription-options.md
# Transcription Options

## Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file` | file | Yes | Audio or video file to transcribe |
| `model_id` | string | Yes | `scribe_v2` (or legacy `scribe_v1`) for batch transcription |
| `language_code` | string | No | Language hint (ISO 639-1 or ISO 639-3, e.g., `en` or `eng`) |
| `timestamps_granularity` | string | No | `none`, `word`, or `character` (default: `word`) |
| `diarize` | boolean | No | Enable speaker diarization (up to 32 speakers for batch) |
| `num_speakers` | integer | No | Maximum speakers to detect (up to 32 for batch) |
| `diarization_threshold` | number | No | Tune diarization sensitivity when `diarize=true` |
| `keyterms` | array | No | Terms to bias transcription (up to 100) |
| `tag_audio_events` | boolean | No | Detect non-speech sounds (laughter, applause) |
| `entity_detection` | string or array | No | Detect entities (e.g., `pii`, `phi`, `pci`, `offensive_language`) |
| `use_multi_channel` | boolean | No | Split multichannel audio into separate transcripts |
| `cloud_storage_url` | string | No | HTTPS URL to transcribe instead of uploading a file |
| `webhook` | boolean | No | Process async and send result to webhook |
| `webhook_metadata` | string or object | No | Custom metadata included in webhook responses |

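
The limits in the table can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: `check_options` is a hypothetical helper, and treating `cloud_storage_url` as a substitute for `file` is an assumption drawn from the table's wording.

```python
from typing import Any

MAX_SPEAKERS = 32    # batch limit from the table
MAX_KEYTERMS = 100   # limit from the table
GRANULARITIES = {"none", "word", "character"}


def check_options(options: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a dict of request options.

    Hypothetical helper: it only mirrors the limits documented in the table.
    """
    problems = []
    if not options.get("model_id"):
        problems.append("model_id is required")
    # Assumption: cloud_storage_url may stand in for an uploaded file.
    if not options.get("file") and not options.get("cloud_storage_url"):
        problems.append("provide file or cloud_storage_url")
    granularity = options.get("timestamps_granularity", "word")
    if granularity not in GRANULARITIES:
        problems.append(f"unknown timestamps_granularity: {granularity!r}")
    num_speakers = options.get("num_speakers")
    if num_speakers is not None and not 1 <= num_speakers <= MAX_SPEAKERS:
        problems.append(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
    if len(options.get("keyterms", [])) > MAX_KEYTERMS:
        problems.append(f"keyterms accepts at most {MAX_KEYTERMS} entries")
    return problems
```

An empty list means the options passed these local checks; the API may still reject a request for reasons not covered here.
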
## Python Example

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
        language_code="eng",
        timestamps_granularity="word",
        diarize=True,
        keyterms=["ElevenLabs", "Scribe"]
    )
```

## JavaScript Example

```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
  languageCode: "eng",
  timestampsGranularity: "word",
  diarize: true,
  keyterms: ["ElevenLabs", "Scribe"],
});
```

## cURL Example

```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "file=@audio.mp3" \
  -F "model_id=scribe_v2" \
  -F "language_code=eng" \
  -F "timestamps_granularity=word" \
  -F "diarize=true"
```

## Response Structure

```json
{
  "text": "The complete transcribed text from the audio file.",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {
      "text": "The",
      "start": 0.0,
      "end": 0.15,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.15,
      "end": 0.16,
      "type": "spacing",
      "speaker_id": "speaker_0"
    }
  ]
}
```

## Response Fields

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Full transcription text |
| `language_code` | string | Detected language (ISO 639-1 or ISO 639-3) |
| `language_probability` | float | Confidence in detection (0-1) |
| `words` | array | Word-level timestamps (if requested) |
| `words[].text` | string | The transcribed word or spacing |
| `words[].start` | float | Start time in seconds |
| `words[].end` | float | End time in seconds |
| `words[].type` | string | `word`, `spacing`, or `audio_event` |
| `words[].speaker_id` | string | Speaker identifier (if diarization enabled) |

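
Because `words` carries `spacing` entries alongside `word` entries, a transcript can be rebuilt per speaker by concatenating `text` values in order. The sketch below works on a plain dict shaped like the response above; `speaker_turns` is a hypothetical helper and the sample data is invented for illustration.

```python
def speaker_turns(response: dict) -> list[tuple[str, str]]:
    """Collapse `words` into (speaker_id, text) turns.

    Hypothetical helper: consecutive entries with the same speaker are
    joined, spacing entries are kept, audio events are skipped.
    """
    turns: list[tuple[str, str]] = []
    for item in response.get("words", []):
        if item.get("type") == "audio_event":
            continue  # non-speech markers such as laughter
        # speaker_id may be missing when diarization is off
        speaker = item.get("speaker_id") or "unknown"
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + item["text"])
        else:
            turns.append((speaker, item["text"]))
    return [(speaker, text.strip()) for speaker, text in turns]


# Invented sample in the shape of the documented response
sample = {
    "words": [
        {"text": "The", "start": 0.0, "end": 0.15, "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "start": 0.15, "end": 0.16, "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "meeting", "start": 0.16, "end": 0.5, "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "start": 0.5, "end": 0.52, "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "Thanks", "start": 0.9, "end": 1.2, "type": "word", "speaker_id": "speaker_1"},
    ]
}

for speaker, text in speaker_turns(sample):
    print(f"[{speaker}] {text}")
```
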
## Supported Languages (90+)

Common languages (ISO 639-3 codes):

| Code | Language | Code | Language |
|------|----------|------|----------|
| `eng` | English | `jpn` | Japanese |
| `spa` | Spanish | `kor` | Korean |
| `fra` | French | `zho` | Mandarin |
| `deu` | German | `ara` | Arabic |
| `ita` | Italian | `hin` | Hindi |
| `por` | Portuguese | `tur` | Turkish |
| `nld` | Dutch | `swe` | Swedish |
| `pol` | Polish | `dan` | Danish |
| `rus` | Russian | `fin` | Finnish |

Other supported languages: Afrikaans, Amharic, Armenian, Azerbaijani, Belarusian, Bengali, Bosnian, Bulgarian, Burmese, Cantonese, Catalan, Cebuano, Croatian, Czech, Estonian, Filipino, Georgian, Greek, Gujarati, Hausa, Hebrew, Hungarian, Icelandic, Indonesian, Irish, Javanese, Kannada, Kazakh, Khmer, Kyrgyz, Lao, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Malayalam, Maltese, Māori, Marathi, Mongolian, Nepali, Norwegian, Odia, Pashto, Persian, Punjabi, Romanian, Serbian, Shona, Sindhi, Slovak, Slovenian, Somali, Swahili, Tamil, Tajik, Telugu, Thai, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Wolof, Xhosa, Yoruba, Zulu.

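
Since `language_code` accepts both ISO 639-1 and ISO 639-3 forms, it can help to normalize codes to one form before comparing them. This is a small sketch covering only the languages in the table above; `to_iso_639_3` is a hypothetical helper, not an SDK function.

```python
# ISO 639-1 -> ISO 639-3 for the languages in the table above
ISO_639_1_TO_3 = {
    "en": "eng", "es": "spa", "fr": "fra", "de": "deu", "it": "ita",
    "pt": "por", "nl": "nld", "pl": "pol", "ru": "rus", "ja": "jpn",
    "ko": "kor", "zh": "zho", "ar": "ara", "hi": "hin", "tr": "tur",
    "sv": "swe", "da": "dan", "fi": "fin",
}


def to_iso_639_3(code: str) -> str:
    """Return the three-letter form of a language code.

    Three-letter input is passed through unchanged (lowercased); two-letter
    input is looked up in the table, and unknown codes raise ValueError.
    """
    code = code.strip().lower()
    if len(code) == 3:
        return code
    if code in ISO_639_1_TO_3:
        return ISO_639_1_TO_3[code]
    raise ValueError(f"no ISO 639-3 mapping known for {code!r}")
```

Three-letter codes are not validated here, so a typo such as `egn` passes through as-is.
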
## Format Requirements

**Audio:** MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus

**Video:** MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP

**Limits:**
- Maximum file size: 3GB
- Maximum duration: 10 hours

## Use Cases

### Subtitle Generation with Speakers

```python
# Assumes `client` was created as in the Python Example
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
        timestamps_granularity="word",
        diarize=True
    )

# Print each word with its speaker label and start time,
# the raw material for building subtitle cues
for word in result.words:
    if word.type == "word":
        print(f"[{word.speaker_id}] {word.text} ({word.start:.2f}s)")
```

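
To turn word-level timestamps into actual SRT cues, words need to be grouped and the times formatted as `HH:MM:SS,mmm`. The following is one possible sketch: `Word` is a stand-in dataclass mirroring the documented `words[]` fields, and `srt_time`, `to_srt`, and the eight-words-per-cue limit are illustrative choices rather than anything the API prescribes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for one entry of `result.words` (same field names)."""
    text: str
    start: float
    end: float
    type: str = "word"
    speaker_id: str = "speaker_0"


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(words, max_words: int = 8) -> str:
    """Group words into cues; a new cue starts on a speaker change
    or once the current cue holds `max_words` words."""
    cues = []  # each cue: [speaker_id, start, end, [word texts]]
    for w in words:
        if w.type != "word":
            continue  # spacing and audio events carry no subtitle text
        if cues and cues[-1][0] == w.speaker_id and len(cues[-1][3]) < max_words:
            cues[-1][2] = w.end
            cues[-1][3].append(w.text)
        else:
            cues.append([w.speaker_id, w.start, w.end, [w.text]])
    blocks = []
    for index, (speaker, start, end, texts) in enumerate(cues, 1):
        line = " ".join(texts)
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n[{speaker}] {line}")
    return "\n\n".join(blocks) + "\n"


demo = [
    Word("Hello", 0.0, 0.4),
    Word(" ", 0.4, 0.45, type="spacing"),
    Word("there", 0.45, 0.9),
    Word("Hi", 1.0, 1.2, speaker_id="speaker_1"),
]
print(to_srt(demo))
```

Since `to_srt` only reads the `text`, `start`, `end`, `type`, and `speaker_id` attributes, passing `result.words` from the snippet above should work the same way.
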
### Meeting Transcription with Custom Terms

```python
# Assumes `client` was created as in the Python Example
with open("meeting.mp3", "rb") as f:
    result = client.speech_to_text.convert(
        file=f,
        model_id="scribe_v2",
        diarize=True,
        keyterms=["Q4 forecast", "revenue target", "ACME Corp"]
    )

# Group by speaker: start a new labelled line whenever the speaker changes
current_speaker = None
for word in result.words:
    if word.type == "word":
        if word.speaker_id != current_speaker:
            current_speaker = word.speaker_id
            print(f"\n[{current_speaker}]:", end=" ")
        # Spacing entries are filtered out above, so add the separator here
        print(word.text, end=" ")
```