Transcribe audio to text using OpenAI Whisper. Use when user wants to convert speech to text, transcribe audio files, generate subtitles, or extract text from recordings. Triggers include "speech to text", "STT", "transcribe", "transcription", "subtitles", "captions", "audio to text", "convert audio to text".
Published by rebyteai
Transcribe audio to text using the OpenAI Whisper API.
Requires Rebyte API auth — $AUTH_TOKEN and $API_URL are set up per the agent's system prompt; use them as Bearer token and base URL.
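Use this skill when the user needs to convert speech to text, transcribe audio files, generate subtitles, or extract text from recordings.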
Send audio directly via multipart/form-data, the standard Whisper API format:

    curl -s -X POST "$API_URL/api/data/stt/transcribe" \
      -H "Authorization: Bearer $AUTH_TOKEN" \
      -F "file=@recording.mp3" \
      -F "model=whisper-1" \
      -F "language=en" \
      -F "response_format=json"
Response:

    {
      "success": true,
      "data": {
        "text": "Hello, this is a transcription of the audio recording."
      }
    }
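Because the transcript arrives inside a `success`/`data` envelope, a script may want to check the flag before trusting `.data.text`. The following is a minimal sketch, not the service's documented error handling: it assumes a failed request returns `success` set to something other than `true` (the error shape is not described here), it assumes `jq` is installed, and `extract_text` is a hypothetical helper name.

```shell
# extract_text is a hypothetical helper, not part of the Rebyte API.
# Reads a JSON response on stdin and prints .data.text.
# Exits non-zero unless .success is true and .data.text is a string.
extract_text() {
  jq -er 'if .success == true and (.data.text | type) == "string"
          then .data.text
          else error("transcription failed")
          end'
}

# Example use with a response already stored in $RESULT:
# echo "$RESULT" | extract_text > transcript.txt || echo "request failed" >&2
```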
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `file` | file | Yes | - | Audio file (multipart/form-data) |
| `language` | string | No | auto | ISO-639-1 language code (e.g. "en", "es", "ja"); improves accuracy |
| `prompt` | string | No | - | Optional text to guide transcription style or continue a previous segment |
| `model` | string | No | `whisper-1` | Model to use (currently only `whisper-1`) |
| `response_format` | string | No | `json` | Output format (see below) |
| `temperature` | number | No | 0 | Sampling temperature (0-1). Lower = more deterministic |
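Since only `file` is required, a wrapper can add the optional fields only when the caller supplies them and otherwise let the service use its defaults. This is a rough sketch under stated assumptions: `transcribe_with_options` is a hypothetical name, and `$API_URL` and `$AUTH_TOKEN` are assumed to be set already. The prompt is sent with curl's `--form-string` so that a value beginning with `@` or `<` is not treated as a file reference.

```shell
# transcribe_with_options is a hypothetical wrapper, not part of the Rebyte API.
# Usage: transcribe_with_options FILE [LANGUAGE] [PROMPT]
# The body runs in a subshell so its variables do not leak into the caller.
transcribe_with_options() (
  file=$1
  language=${2:-}
  prompt=${3:-}

  # Start with the required file and the (default) model.
  set -- -F "file=@${file}" -F "model=whisper-1"

  # Add optional fields only when a value was given.
  if [ -n "$language" ]; then
    set -- "$@" -F "language=${language}"
  fi
  if [ -n "$prompt" ]; then
    # --form-string sends the value literally.
    set -- "$@" --form-string "prompt=${prompt}"
  fi

  curl -s -X POST "$API_URL/api/data/stt/transcribe" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    "$@"
)

# Example: Japanese audio with a vocabulary hint.
# transcribe_with_options meeting.m4a ja "Rebyte, Whisper"
```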
| Format | Description | Use Case |
|---|---|---|
| `json` | Simple JSON with `text` field | Default, quick text extraction |
| `verbose_json` | JSON with timestamps, segments, duration | When you need timing information (e.g. per-segment timestamps) |
| `text` | Plain text only | Simple text output |
| `srt` | SubRip subtitle format | Video subtitles |
| `vtt` | WebVTT subtitle format | Web video captions |
Whisper accepts: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac
Max file size: 25 MB
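A quick local check of the extension and size can avoid an upload that would be rejected. The sketch below is illustrative only: `check_audio_file` and `MAX_BYTES` are hypothetical names, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which the text above does not specify.

```shell
# check_audio_file is a hypothetical pre-flight helper.
# Returns 0 when FILE exists, has a listed extension, and fits the size limit.
# Assumption: 25 MB is taken to mean 25 * 1024 * 1024 bytes.
MAX_BYTES=$((25 * 1024 * 1024))

check_audio_file() {
  file=$1
  if [ ! -f "$file" ]; then
    echo "not a file: $file" >&2
    return 1
  fi

  # Compare the extension case-insensitively against the accepted formats.
  ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    mp3|mp4|mpeg|mpga|m4a|wav|webm|ogg|flac) ;;
    *) echo "unsupported extension: $file" >&2; return 1 ;;
  esac

  # Arithmetic expansion strips any padding that wc adds on some systems.
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -gt "$MAX_BYTES" ]; then
    echo "too large ($size bytes): $file" >&2
    return 1
  fi
  return 0
}

# Example: check_audio_file interview.mp3 && echo "ok to upload"
```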
Example: transcribe a recording and save the text.

    # Get auth
    AUTH_TOKEN=$(/home/user/.local/bin/rebyte-auth)
    API_URL=$(python3 -c "import json; print(json.load(open('/home/user/.rebyte.ai/auth.json'))['sandbox']['relay_url'])")

    # Transcribe directly
    RESULT=$(curl -s -X POST "$API_URL/api/data/stt/transcribe" \
      -H "Authorization: Bearer $AUTH_TOKEN" \
      -F "file=@interview.mp3" \
      -F "language=en" \
      -F "response_format=json")

    # Extract text
    echo "$RESULT" | jq -r '.data.text' > transcript.txt
    echo "Transcript saved to transcript.txt"
Example: generate SRT subtitles and burn them into a video.

    # Request the transcript in SRT format
    RESULT=$(curl -s -X POST "$API_URL/api/data/stt/transcribe" \
      -H "Authorization: Bearer $AUTH_TOKEN" \
      -F "file=@video-audio.mp3" \
      -F "response_format=srt")

    # Save SRT file
    echo "$RESULT" | jq -r '.data.text' > subtitles.srt

    # Burn subtitles into video with ffmpeg
    ffmpeg -i video.mp4 -vf subtitles=subtitles.srt output.mp4
- Specify `language` when you know it: this improves accuracy and speed
- Use `verbose_json` when you need timestamps for syncing with video
- Use `srt` or `vtt` format to directly generate subtitle files
- Split long recordings into chunks before uploading: `ffmpeg -i long.mp3 -f segment -segment_time 300 -c copy chunk_%03d.mp3`
- Keep `temperature` at 0 (default) for most accurate results
- The `prompt` parameter helps with domain-specific terms: include key vocabulary the model should recognize
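Once a long recording has been split with the `ffmpeg` segment command, the pieces still have to be transcribed in order and stitched together. One possible sketch follows. `transcribe_one` and `transcribe_chunks` are hypothetical helper names, `$API_URL` and `$AUTH_TOKEN` are assumed to be set, and `jq` is assumed to be installed. Words cut at a chunk boundary may still be transcribed imperfectly.

```shell
# Both helpers are hypothetical; they are not part of the Rebyte API.

# transcribe_one FILE: upload one file and print its transcript.
# curl -f makes HTTP error statuses count as failures.
transcribe_one() {
  response=$(curl -sf -X POST "$API_URL/api/data/stt/transcribe" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    -F "file=@$1" \
    -F "response_format=json") || return 1
  printf '%s\n' "$response" | jq -r '.data.text'
}

# transcribe_chunks FILE...: transcribe files in the order given,
# printing one transcript per line and stopping at the first failure.
transcribe_chunks() {
  for chunk in "$@"; do
    transcribe_one "$chunk" || return 1
  done
}

# Example: the zero-padded names from chunk_%03d.mp3 sort correctly in a glob.
# transcribe_chunks chunk_*.mp3 > transcript.txt
```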