resolutionflow/.agent/skills/speech-to-text/references/transcription-options.md
2026-02-15 00:43:41 -05:00

Transcription Options

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | Yes, unless cloud_storage_url is set | Audio or video file to transcribe |
| model_id | string | Yes | scribe_v2 (or legacy scribe_v1) for batch transcription |
| language_code | string | No | Language hint (ISO 639-1 or ISO 639-3, e.g., en or eng) |
| timestamps_granularity | string | No | none, word, or character (default: word) |
| diarize | boolean | No | Enable speaker diarization (up to 32 speakers for batch) |
| num_speakers | integer | No | Maximum speakers to detect (up to 32 for batch) |
| diarization_threshold | number | No | Tune diarization sensitivity when diarize=true |
| keyterms | array | No | Terms to bias transcription (up to 100) |
| tag_audio_events | boolean | No | Detect non-speech sounds (laughter, applause) |
| entity_detection | string or array | No | Detect entities (e.g., pii, phi, pci, offensive_language) |
| use_multi_channel | boolean | No | Split multichannel audio into separate transcripts |
| cloud_storage_url | string | No | HTTPS URL to transcribe instead of uploading a file |
| webhook | boolean | No | Process asynchronously and send the result to a webhook |
| webhook_metadata | string or object | No | Custom metadata included in webhook responses |
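
The limits in the table can be checked client-side before a request is sent. The following is a minimal sketch, not part of the SDK: `build_transcription_params` is a hypothetical helper, and the lower bound of 1 on `num_speakers` is an assumption the table does not state.

```python
from typing import Any, Dict, Optional, Sequence

MAX_KEYTERMS = 100  # "up to 100" per the table above
MAX_SPEAKERS = 32   # "up to 32 for batch" per the table above
GRANULARITIES = {"none", "word", "character"}


def build_transcription_params(
    *,
    model_id: str = "scribe_v2",
    cloud_storage_url: Optional[str] = None,
    timestamps_granularity: str = "word",
    diarize: bool = False,
    num_speakers: Optional[int] = None,
    keyterms: Sequence[str] = (),
) -> Dict[str, Any]:
    """Hypothetical helper: validate options against the documented limits."""
    if timestamps_granularity not in GRANULARITIES:
        raise ValueError(f"timestamps_granularity must be one of {sorted(GRANULARITIES)}")
    # Assumption: at least one speaker; the table only gives the upper bound.
    if num_speakers is not None and not 1 <= num_speakers <= MAX_SPEAKERS:
        raise ValueError(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms are allowed")
    if cloud_storage_url is not None and not cloud_storage_url.startswith("https://"):
        raise ValueError("cloud_storage_url must be an HTTPS URL")

    params: Dict[str, Any] = {
        "model_id": model_id,
        "timestamps_granularity": timestamps_granularity,
        "diarize": diarize,
    }
    # Optional values are only included when the caller set them.
    if num_speakers is not None:
        params["num_speakers"] = num_speakers
    if keyterms:
        params["keyterms"] = list(keyterms)
    if cloud_storage_url is not None:
        params["cloud_storage_url"] = cloud_storage_url
    return params
```

The returned dictionary is meant to be unpacked into the call, for example `client.speech_to_text.convert(file=f, **params)`.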

Python Example

from elevenlabs.client import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
        language_code="eng",
        timestamps_granularity="word",
        diarize=True,
        keyterms=["ElevenLabs", "Scribe"]
    )

JavaScript Example

import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
  languageCode: "eng",
  timestampsGranularity: "word",
  diarize: true,
  keyterms: ["ElevenLabs", "Scribe"],
});

cURL Example

curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "file=@audio.mp3" \
  -F "model_id=scribe_v2" \
  -F "language_code=eng" \
  -F "timestamps_granularity=word" \
  -F "diarize=true"

Response Structure

{
  "text": "The complete transcribed text from the audio file.",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {
      "text": "The",
      "start": 0.0,
      "end": 0.15,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.15,
      "end": 0.16,
      "type": "spacing",
      "speaker_id": "speaker_0"
    }
  ]
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| text | string | Full transcription text |
| language_code | string | Detected language (ISO 639-1 or ISO 639-3) |
| language_probability | float | Confidence in detection (0-1) |
| words | array | Word-level timestamps (if requested) |
| words[].text | string | The transcribed word or spacing |
| words[].start | float | Start time in seconds |
| words[].end | float | End time in seconds |
| words[].type | string | word, spacing, or audio_event |
| words[].speaker_id | string | Speaker identifier (if diarization enabled) |
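
As a rough illustration of how the `type` field is used, the sketch below filters a raw JSON response down to spoken words and rejoins word and spacing tokens. The payload is a shortened, made-up sample in the documented shape, and the helper names are hypothetical. Whether the rejoined tokens always match the top-level `text` field exactly is not something this document confirms.

```python
import json
from typing import Any, Dict, List

# Made-up sample in the shape described above (not real API output).
SAMPLE = json.loads("""
{
  "text": "Hello world",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {"text": "Hello", "start": 0.0, "end": 0.4, "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.4, "end": 0.45, "type": "spacing", "speaker_id": "speaker_0"},
    {"text": "world", "start": 0.45, "end": 0.9, "type": "word", "speaker_id": "speaker_0"}
  ]
}
""")


def spoken_words(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only tokens whose type is "word"."""
    return [w for w in response.get("words", []) if w.get("type") == "word"]


def join_tokens(response: Dict[str, Any]) -> str:
    """Concatenate word and spacing tokens in order, skipping audio_event tokens."""
    return "".join(
        w["text"]
        for w in response.get("words", [])
        if w.get("type") in ("word", "spacing")
    )


print([w["text"] for w in spoken_words(SAMPLE)])  # ['Hello', 'world']
print(join_tokens(SAMPLE))                        # Hello world
```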

Supported Languages (90+)

Common languages (ISO 639-3 codes):

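| Code | Language | Code | Language |
| --- | --- | --- | --- |
| eng | English | jpn | Japanese |
| spa | Spanish | kor | Korean |
| fra | French | zho | Mandarin |
| deu | German | ara | Arabic |
| ita | Italian | hin | Hindi |
| por | Portuguese | tur | Turkish |
| nld | Dutch | swe | Swedish |
| pol | Polish | dan | Danish |
| rus | Russian | fin | Finnish |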

Other supported languages: Afrikaans, Amharic, Armenian, Azerbaijani, Belarusian, Bengali, Bosnian, Bulgarian, Burmese, Cantonese, Catalan, Cebuano, Croatian, Czech, Estonian, Filipino, Georgian, Greek, Gujarati, Hausa, Hebrew, Hungarian, Icelandic, Indonesian, Irish, Javanese, Kannada, Kazakh, Khmer, Kyrgyz, Lao, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Malayalam, Maltese, Māori, Marathi, Mongolian, Nepali, Norwegian, Odia, Pashto, Persian, Punjabi, Romanian, Serbian, Shona, Sindhi, Slovak, Slovenian, Somali, Swahili, Tamil, Tajik, Telugu, Thai, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Wolof, Xhosa, Yoruba, Zulu.

Format Requirements

Audio: MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus

Video: MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP

Limits:

  • Maximum file size: 3GB
  • Maximum duration: 10 hours
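
A pre-upload check against these requirements might look like the sketch below. `upload_problems` is a hypothetical helper. The mapping from format names to file extensions and the reading of 3GB as 3 GiB are assumptions, and duration is not checked because that would require decoding the media.

```python
from pathlib import Path
from typing import List

# Assumption: extension spellings for the formats listed above.
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".aiff", ".opus"}
VIDEO_EXTS = {".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".webm", ".mpeg", ".3gp"}

# Assumption: "3GB" is read as 3 GiB here.
MAX_BYTES = 3 * 1024**3


def upload_problems(filename: str, size_bytes: int) -> List[str]:
    """Return human-readable problems; an empty list means the file looks acceptable."""
    problems: List[str] = []
    ext = Path(filename).suffix.lower()
    if ext not in AUDIO_EXTS | VIDEO_EXTS:
        problems.append(f"unrecognized extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_BYTES}-byte limit")
    return problems
```

For a file on disk, the size can be supplied with `os.path.getsize(path)`.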

Use Cases

Subtitle Generation with Speakers

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
        timestamps_granularity="word",
        diarize=True
    )

# Print each word with its speaker label and start time.
# This is the raw material for subtitle cues; spacing and audio_event tokens are skipped.
for word in result.words:
    if word.type == "word":
        print(f"[{word.speaker_id}] {word.text} ({word.start:.2f}s)")
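
The loop above prints individual words; an actual SRT file needs numbered cues with `HH:MM:SS,mmm` time ranges. One possible way to get there is sketched below. The `Word` dataclass is a stand-in for the SDK's word objects, `build_srt` and `srt_time` are hypothetical helpers, the 8-word cue length is arbitrary, and joining words with a single space assumes a space-delimited language.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Word:
    """Stand-in for an SDK word object (same attribute names as the response fields)."""
    text: str
    start: float
    end: float
    type: str = "word"
    speaker_id: Optional[str] = None


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(words: Iterable[Word], max_words: int = 8) -> str:
    """Group words into cues; a new cue starts on a speaker change or after max_words."""
    cues: List[List[Word]] = []
    for w in words:
        if w.type != "word":
            continue  # skip spacing and audio_event tokens
        same_speaker = bool(cues) and cues[-1][-1].speaker_id == w.speaker_id
        if same_speaker and len(cues[-1]) < max_words:
            cues[-1].append(w)
        else:
            cues.append([w])

    blocks = []
    for i, cue in enumerate(cues, 1):
        line = " ".join(w.text for w in cue)
        blocks.append(
            f"{i}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n"
            f"[{cue[0].speaker_id}] {line}\n"
        )
    return "\n".join(blocks)
```

With a real response, `build_srt(result.words)` should work as long as the word objects expose the same attributes as `Word`.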

Meeting Transcription with Custom Terms

with open("meeting.mp3", "rb") as f:
    result = client.speech_to_text.convert(
        file=f,
        model_id="scribe_v2",
        diarize=True,
        keyterms=["Q4 forecast", "revenue target", "ACME Corp"]
    )

# Group consecutive words by speaker
current_speaker = None
for word in result.words:
    if word.type == "word":
        if word.speaker_id != current_speaker:
            current_speaker = word.speaker_id
            print(f"\n[{current_speaker}]:", end=" ")
        # Spacing tokens are skipped above, so add the separator here
        print(word.text, end=" ")
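
For meeting notes it can be more convenient to collect the turns as data rather than print them. The sketch below returns speaker turns and a rough per-speaker talk time. Both helpers are hypothetical, talk time is simply the sum of word durations (pauses are ignored), and any object with `type`, `speaker_id`, `text`, `start`, and `end` attributes is assumed to work.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Tuple


def speaker_turns(words: Iterable[Any]) -> List[Tuple[str, str]]:
    """Merge consecutive words from the same speaker into (speaker_id, text) turns."""
    turns: List[Tuple[str, List[str]]] = []
    for w in words:
        if w.type != "word":
            continue
        if turns and turns[-1][0] == w.speaker_id:
            turns[-1][1].append(w.text)
        else:
            turns.append((w.speaker_id, [w.text]))
    return [(speaker, " ".join(texts)) for speaker, texts in turns]


def talk_time(words: Iterable[Any]) -> Dict[str, float]:
    """Sum word durations per speaker, in seconds."""
    totals: Dict[str, float] = defaultdict(float)
    for w in words:
        if w.type == "word":
            totals[w.speaker_id] += w.end - w.start
    return dict(totals)


# Made-up words standing in for result.words
demo = [
    SimpleNamespace(type="word", speaker_id="speaker_0", text="Revenue", start=0.0, end=0.5),
    SimpleNamespace(type="word", speaker_id="speaker_0", text="target", start=0.5, end=1.0),
    SimpleNamespace(type="word", speaker_id="speaker_1", text="Agreed", start=1.0, end=1.25),
]
print(speaker_turns(demo))
print(talk_time(demo))
```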