Convert text to speech audio. Picks from a catalog of OpenAI (gpt-audio) and Gemini voices. Supports style/prosody control — natural language directions for Gemini voices, "instructions" field for OpenAI voices. Use when user wants voiceovers, narration, audio for videos, multi-voice dialogue, expressive or whispered speech.
Published by rebyteai
Convert text to speech through the rebyte TTS endpoint. One endpoint, two providers, one voice catalog.
Requires Rebyte API auth — $AUTH_TOKEN and $API_URL are set up per the agent's system prompt.
Every voice has an explicit provider prefix. Pick the voice that matches the tone you want, then compose style per its column below.
OpenAI voices (style via the instructions field)

Uses gpt-audio-mini by default (fast, cheap). Pass model: "gpt-audio" for highest quality. Style is controlled via the instructions field — a natural language directive for delivery.
| Voice | Character | Good for | Example instructions |
|---|---|---|---|
| openai:marin | Female, warm | Podcasts, narration, friendly explainer | Speak conversationally, warm and relaxed. |
| openai:cedar | Male, authoritative | Documentary, serious explainer | Speak slowly with gravitas, like a film trailer. |
| openai:ash | Male, energetic | Promo, ad, hype | Upbeat and energetic, slightly rushed with excitement. |
| openai:coral | Female, professional | Corporate, product demo | Clear and professional, measured pace. |
| openai:ballad | Male, expressive | Storytelling, audiobook | Tell this like a campfire story, taking your time. |
| openai:sage | Female, measured | Meditation, ASMR, calm explainer | Soft and slow, gentle on every consonant. |
| openai:verse | Neutral, versatile | General purpose | Neutral delivery, no strong emotion. |
Additional voices: openai:nova, openai:alloy, openai:echo, openai:fable, openai:onyx, openai:shimmer.
Gemini voices (style via directions in the text)

Backend: gemini-3.1-flash-tts-preview. Style is controlled by writing natural-language directions as part of the text itself. The model interprets the directions and speaks accordingly. Think of it like directing an actor — describe how to deliver the line, then give the line.
| Voice | Character | Good for |
|---|---|---|
| gemini:Kore | Female, warm, firm | Podcasts, narration, interviews |
| gemini:Puck | Male, upbeat, playful | Casual explainer, comedy, chat |
| gemini:Charon | Male, deep, informative | Documentary, news, serious voiceover |
| gemini:Fenrir | Male, excitable | High-energy promo, sports, hype |
| gemini:Aoede | Female, breezy, light | Whispered, intimate, confessional |
| gemini:Leda | Female, youthful | Bright explainer, younger audience |
Additional Gemini voices: gemini:Orus, gemini:Zephyr, gemini:Callirrhoe, gemini:Autonoe, gemini:Enceladus, gemini:Iapetus, gemini:Umbriel, gemini:Algieba, gemini:Despina, gemini:Algenib, gemini:Rasalgethi, gemini:Achernar, gemini:Schedar, gemini:Gacrux, gemini:Sulafat.
Style direction patterns (natural language — these are examples, not an enum):
- Say in a whisper: We have to be quiet here.
- Say excitedly and fast: You are not going to believe what just happened!
- Say sadly, slowly: I don't think this is going to work out.
- Deadpan delivery: Yes. That is how physics works.
- Warm and smiling: Welcome back — great to have you.
- In a British accent: A proper cup of tea, if you please.
- The meeting starts at nine. [whispers] But between you and me, it will run late.

Important: Bracket tags like [whispers] work for short text but fail with some voice+length combos (the API returns an error). Natural language directions like Say cheerfully: are more reliable across all voices.

Example request (OpenAI voice, style via instructions):
curl -X POST "$API_URL/api/data/tts/synthesize" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, this is a sample voiceover.",
"voice": "openai:marin",
"instructions": "Speak conversationally, warm and relaxed.",
"format": "mp3"
}' | jq -r '.audio.base64' | base64 -d > voiceover.mp3
Gemini version (style direction is part of the text):
curl -X POST "$API_URL/api/data/tts/synthesize" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"text": "Say in a soft whisper: I have a secret to tell you.",
"voice": "gemini:Aoede"
}' | jq -r '.audio.base64' | base64 -d > whisper.wav
Gemini can synthesize a two-speaker conversation in a single call. Use this for podcasts, interviews, sketches, or any back-and-forth where the voices need to flow naturally together (Gemini blends handoffs far better than stitching separate clips).
Separate endpoint: POST /api/data/tts/synthesize_dialogue.
Hard limits:
- Exactly 2 speakers (the voices map must have exactly 2 entries).
- Gemini voices only (gemini:<name> for every speaker). OpenAI has no native multi-speaker — use synthesize per-line and concat if you must.

Request:
curl -X POST "$API_URL/api/data/tts/synthesize_dialogue" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"dialogue": [
{"speaker": "Alice", "text": "So what do you think about the new TTS system?"},
{"speaker": "Bob", "text": "I was skeptical, but the inline tags actually work."},
{"speaker": "Alice", "text": "[laughing] That must have been fun to debug."},
{"speaker": "Bob", "text": "[deadpan] A thrill a minute."}
],
"voices": {
"Alice": "gemini:Kore",
"Bob": "gemini:Charon"
}
}' | jq -r '.audio.base64' | base64 -d > dialogue.wav
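For the per-line OpenAI fallback mentioned under the hard limits, building each request body with jq avoids JSON-escaping bugs when dialogue lines contain quotes or newlines. A minimal sketch (the build_payload helper and file names are illustrative, not part of the API); pipe each payload to curl as in the examples above, then concatenate the clips with ffmpeg:

```shell
# build_payload: emit one synthesize request body per dialogue line.
# jq -n --arg handles escaping of quotes/newlines in the text safely.
build_payload() {  # $1 = voice, $2 = line text
  jq -n --arg v "$1" --arg t "$2" '{text: $t, voice: $v, format: "mp3"}'
}

build_payload "openai:marin" 'She said "hello" and left.' > line1.json
build_payload "openai:cedar" "I was skeptical." > line2.json
```

Each file can then be sent with curl -d @line1.json against the synthesize endpoint.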
Parameters:
| Parameter | Type | Required | Notes |
|---|---|---|---|
| dialogue | array of {speaker, text} | yes | In order. Each text supports inline [tags]. |
| voices | object map | yes | Speaker name → gemini:<voice>. Exactly 2 entries. |
| format | string | no | Always returns wav regardless. |
Response:
{
"success": true,
"audio": { "base64": "...", "format": "wav", "mimeType": "audio/wav", "sizeBytes": 754604 },
"input": {
"provider": "gemini",
"mode": "dialogue",
"speakers": { "Alice": "Kore", "Bob": "Charon" },
"lineCount": 4,
"characterCount": 212,
"durationSeconds": 15.72
}
}
Tips:
- Concatenate dialogue clips with ffmpeg -i "concat:a.wav|b.wav" -c copy final.wav.

Parameters (synthesize endpoint):
| Parameter | Type | Required | Default | Notes |
|---|---|---|---|---|
| text | string | yes | — | Max 4096 chars. For Gemini voices, include style directions as part of the text (e.g. "Say cheerfully: ..."). |
| voice | string | no | openai:nova | Use explicit openai: or gemini: prefix. |
| instructions | string | no | — | Style directive for OpenAI voices (sent as system message). For Gemini, write style directions as part of the text instead. |
| model | string | no | gpt-audio-mini | OpenAI only: gpt-audio (best quality) or gpt-audio-mini (faster, cheaper). Ignored for Gemini. |
| format | string | no | mp3 | OpenAI: mp3, wav, opus, aac, flac, pcm. Gemini: always returns wav regardless. |
Response:
{
"success": true,
"audio": { "base64": "...", "format": "mp3", "mimeType": "audio/mpeg", "sizeBytes": 24576 },
"input": {
"provider": "openai" | "gemini",
"voice": "...",
"model": "...",
"characterCount": 35,
"wordCount": 7
}
}
Decode audio.base64 with base64 -d and save to disk.
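A defensive version of that decode step, sketched against the response shape above: check .success before decoding, since on failure the body is more useful as text. The resp variable stands in for captured curl output, and its base64 payload here is just "hello" for illustration:

```shell
# Decode audio only when the request succeeded; log the raw body otherwise.
resp='{"success": true, "audio": {"base64": "aGVsbG8=", "format": "mp3"}}'
if [ "$(printf '%s' "$resp" | jq -r '.success')" = "true" ]; then
  printf '%s' "$resp" | jq -r '.audio.base64' | base64 -d > out.mp3
else
  printf 'TTS request failed: %s\n' "$resp" >&2
fi
```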
| Need | Pick |
|---|---|
| Just read this text, no style | OpenAI gpt-audio-mini with any voice (openai:nova is a safe default) |
| Specific tone, one consistent delivery | OpenAI gpt-audio-mini + instructions — easier to version-control the prompt |
| Style shifts mid-sentence (whispered → excited → calm) | Gemini voice + natural language directions in text |
| Accent switching within one clip | Gemini — describe the accent in the text |
| Multi-speaker dialogue as a single clip | Gemini via synthesize_dialogue (native two-speaker blending; see below) |
| Lowest cost per character | OpenAI gpt-audio-mini |
| Best prosody on a single even read | Either works; test both for your use case |
Long text: the text field is capped at 4096 characters. Split at sentence boundaries, synthesize each chunk, concatenate with ffmpeg:
ffmpeg -i "concat:chunk1.mp3|chunk2.mp3|chunk3.mp3" -c copy final.mp3
Keep the same voice + style across chunks or the cuts will be audible.
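The splitting step above can be sketched in plain shell. This is a minimal sentence splitter that writes one sentence per chunk file; real text needs smarter handling of abbreviations, and chunks should be packed up to the 4096-char limit rather than one sentence each:

```shell
# Write one sentence per chunk file: chunk1.txt, chunk2.txt, ...
text="First sentence. Second sentence. Third sentence."
i=0
printf '%s\n' "$text" | grep -oE '[^.]+\.' | while read -r s; do
  i=$((i+1))
  printf '%s' "$s" > "chunk$i.txt"
done
```

Each chunk file is then sent through the synthesize endpoint with the same voice and style.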
# Replace video audio with generated voiceover
ffmpeg -i video.mp4 -i voiceover.mp3 -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 output.mp4
# Mix voiceover over existing audio at 80% voice volume
ffmpeg -i video.mp4 -i voiceover.mp3 \
-filter_complex "[1:a]volume=0.8[voice];[0:a][voice]amix=inputs=2:duration=first" \
-c:v copy output.mp4
Gemini voices return WAV at 24kHz. Convert to MP3 before mixing into video:
ffmpeg -i whisper.wav -c:a libmp3lame -q:a 2 whisper.mp3
Upload finished audio to the Artifact Store so the user can access it.
Tips:
- Don't pick openai:nova by habit. The voice does more for tone than any instructions string.
- Keep instructions short: two sentences, concrete. "Speak slowly and somberly" beats a paragraph.
- Direct Gemini like an actor: Say cheerfully:, In a hushed whisper:, With mock seriousness:. If it would make sense on a film set, it probably works here.
- If Gemini rejects a style+voice combo, try a different voice or use OpenAI with instructions.