AI / LLM · PRO TIER

Speaches

Speaches is a self-hosted speech-to-text (STT) and text-to-speech (TTS) server with an OpenAI-compatible API. It wraps Whisper (STT) and Piper / Kokoro (TTS) and exposes them as the standard OpenAI endpoints /v1/audio/transcriptions and /v1/audio/speech.

🤖 AI / LLM · Min 2048 MB RAM · Port 8000 (http) · Tier pro
// What it is

A closer look.

Drop-in replacement for the OpenAI Whisper API with no per-transcription fee: it runs on your own VPS or GPU, so the only cost is the hardware.
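
As a rough illustration of the /v1/audio/speech endpoint, the sketch below builds an OpenAI-style text-to-speech request with only the Python standard library. The base URL (localhost on port 8000, no auth) is an assumption, and the model and voice identifiers are left as required arguments because they depend on what is installed on your server.

```python
import json
import urllib.request

# Assumption: a Speaches instance reachable on localhost:8000 without auth.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, model, voice, response_format="mp3",
                         base_url=BASE_URL):
    """Build (but do not send) a POST to the OpenAI-style speech endpoint."""
    payload = {
        "model": model,            # TTS model id installed on the server
        "input": text,             # text to synthesize
        "voice": voice,            # voice id offered by that model
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, model, voice, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, model, voice, **kwargs)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Calling `synthesize("Hello", "hello.mp3", model=..., voice=...)` with identifiers from your own deployment should write the audio file; the official OpenAI SDKs send the same JSON body once their base URL is overridden.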

// Use cases

What it's for.

Concrete scenarios where teams pick Speaches over the SaaS alternative.

Self-hosted transcription

replace OpenAI Whisper API with predictable VPS cost

Voice assistant TTS

synthesize speech for self-hosted Alexa-style apps

Audio content production

bulk transcribe podcasts, meetings, lectures

Real-time streaming STT

live captions, voice control

Multi-language speech

Whisper handles roughly 100 languages out of the box, with accuracy varying by language

// Who it's for

Built for these teams.

If your team profile matches one of these, Speaches is a strong fit out of the box.

Profile A

Indie SaaS founders

building voice features without OpenAI per-minute costs

Profile B

Podcasters

bulk transcribing back catalogs without metered API spend

Profile C

Privacy-bound apps

needing voice processing without cloud upload

Profile D

Voice assistant developers

building self-hosted Alexa/Google Home alternatives

Profile E

AI engineers

integrating voice into LLM agents (Open WebUI, LibreChat)

// Differentiators

Why teams pick Speaches.

When evaluating self-hosted options for this category, here are the dimensions on which Speaches consistently lands above the alternatives.

  • OpenAI-compatible API: official OpenAI SDKs work once their base URL points at Speaches
  • Multiple model sizes: Whisper tiny / base / small / medium / large
  • CPU + GPU support: runs on modest hardware for testing, scales with GPU
  • MIT license: commercial use unrestricted
  • Streaming support: real-time transcription for live audio
  • Active development: frequent releases tracking upstream Whisper
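
To make the API compatibility concrete, here is a minimal sketch of the multipart request that a client sends to /v1/audio/transcriptions, written with the standard library so the wire format is visible. The base URL, the optional bearer key (relevant only if an auth proxy sits in front), and the model identifier are assumptions; substitute the values from your own deployment.

```python
import json
import urllib.request
import uuid


def build_transcription_request(audio_bytes, filename, model,
                                base_url="http://localhost:8000",
                                api_key=None):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    # Two form fields: "model" (plain text) and "file" (the raw audio).
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers=headers,
        method="POST",
    )


def transcribe(path, model, **kwargs):
    """Upload a local audio file and return the transcribed text."""
    with open(path, "rb") as handle:
        audio_bytes = handle.read()
    request = build_transcription_request(audio_bytes, path, model, **kwargs)
    with urllib.request.urlopen(request) as response:
        # The default OpenAI-style JSON response carries a "text" field.
        return json.load(response)["text"]
```

In practice an official OpenAI client with its base URL overridden does this for you; the hand-built version is mainly useful for debugging a proxy or for environments where installing an SDK is not an option.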
// Integrations

Connects to.

The stack you'll plug Speaches into — services, protocols, and adjacent apps in the BluixApps catalog.

OpenAI SDKs
Python, JS, every official OpenAI client works
STT engines
Whisper (multiple sizes), Faster-Whisper (optimized)
TTS engines
Piper (fast), Kokoro (quality)
Audio formats
MP3, WAV, M4A, OGG, FLAC input; WAV / MP3 output
VAD
Voice Activity Detection for streaming
Webhook support
async transcription completion callbacks
HTTP REST
primary API surface
// Adoption & deployment

Notable users & community

  • 5k+ GitHub stars (rapidly growing)
  • Featured in self-hosted voice AI guides
  • Active Discord community
  • Strong adoption in privacy-bound voice applications
  • Frequent releases matching OpenAI API evolution

What we ship

  • Docker compose: Speaches server + model cache volume
  • Image ghcr.io/speaches-ai/speaches:latest, which tracks tagged releases rather than a fixed version
  • HTTPS via Let's Encrypt; API key auth via proxy
  • Whisper-base + Kokoro voices pre-downloaded
  • GPU passthrough optional (significantly faster for large models)
  • OpenAI-compatible /v1/audio/transcriptions + /v1/audio/speech endpoints
  • Stateless service — no backup needed beyond config
// Tips & operations

Run it properly.

Operational guidance from running this in production — what to do before you scale, what to lock down, what surprises people.

// PERFORMANCE
GPU strongly recommended for large models
Whisper large is impractically slow on CPU; medium is acceptable on a modern CPU
// DEPLOYMENT
Pre-download models
first request downloads model; bake into image to avoid stalls
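
If baking models into the image is not practical, one alternative is a warm-up request at deploy time, so that the download stall described above happens before real traffic arrives. This is a sketch under that assumption: it only generates a short silent WAV clip with the standard library, which a startup script could then post to /v1/audio/transcriptions with any OpenAI-compatible client.

```python
import io
import wave


def make_silent_wav(seconds=1.0, sample_rate=16000):
    """Return the bytes of a mono 16-bit PCM WAV file containing silence."""
    frame_count = int(seconds * sample_rate)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * frame_count)
    return buffer.getvalue()
```

The clip content does not matter for warm-up purposes; what matters is that the request names the model you want loaded.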
// OPERATIONS
Audio format conversion
Speaches transcodes via ffmpeg; some formats need explicit re-encoding
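
When an upload is rejected or transcribes poorly, re-encoding it yourself before sending is a reasonable first step. The sketch below builds an ffmpeg command that converts any input to 16 kHz mono 16-bit PCM WAV, the sample rate Whisper works with internally. It assumes ffmpeg is installed on the client machine; the function names are illustrative.

```python
import subprocess


def ffmpeg_to_wav_args(src, dst, sample_rate=16000):
    """Return the ffmpeg argument list for a mono 16-bit PCM WAV re-encode."""
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-i", src,               # input file, any format ffmpeg can read
        "-ac", "1",              # downmix to a single channel
        "-ar", str(sample_rate), # resample
        "-c:a", "pcm_s16le",     # 16-bit little-endian PCM
        dst,
    ]


def reencode(src, dst, sample_rate=16000):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_to_wav_args(src, dst, sample_rate), check=True)
```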
// RELIABILITY
Mind disk usage
Whisper parameter counts: tiny 39M, base 74M, small 244M, medium 769M, large 1550M; on-disk size is roughly 2 bytes per parameter at 16-bit precision
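
Treating the per-model figures above as parameter counts in millions (which is how the Whisper project publishes them), a rough disk estimate is parameters times bytes per weight. The sketch below does that arithmetic; the 2-bytes-per-parameter default assumes 16-bit weights, and quantized or 32-bit variants will differ.

```python
# Published Whisper parameter counts, in millions.
WHISPER_PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def estimated_size_mb(model, bytes_per_param=2):
    """Approximate on-disk size in MB (millions of params x bytes each)."""
    return WHISPER_PARAMS_MILLIONS[model] * bytes_per_param


def total_cache_mb(models, bytes_per_param=2):
    """Approximate cache volume needed to hold several models at once."""
    return sum(estimated_size_mb(name, bytes_per_param) for name in models)
```

For example, keeping base and medium cached side by side comes to about 1.7 GB at 16-bit precision, before counting any TTS voices.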
// SCALING
Streaming adds overhead
VAD and chunking add latency, most noticeably on CPU
// SECURITY
Auth at proxy layer
Speaches has no built-in auth; protect with API key proxy
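
The core of an API-key proxy is a single check on the Authorization header before the request is forwarded. The sketch below shows that check in isolation, assuming the proxy hands you the request headers as a plain dict and that clients send the key as a bearer token, as OpenAI SDKs do. The function name is illustrative and does not come from any particular proxy.

```python
import hmac


def is_authorized(headers, expected_key):
    """Return True when the Authorization header carries the expected key."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(token.strip().encode("utf-8"),
                               expected_key.encode("utf-8"))
```

Requests that fail the check should be rejected with a 401 at the proxy and never reach port 8000.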

Min RAM: 2048 MB
Min disk: 10 GB
Access port: 8000
Protocol: http
BluixApps tier: pro
Docker image: ghcr.io/speaches-ai/speaches:latest-cpu (ports 8000:8000)
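
After the container starts, a quick reachability check on the access port listed above can confirm the service is accepting connections before clients are pointed at it. This is a minimal sketch using only the standard library; the host, timeout, and retry interval are assumptions to tune for your environment.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not accepting connections yet; retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A deploy script might call `wait_for_port("localhost", 8000)` and abort if it returns False. Note that an open port only shows the server process is listening, not that models have finished loading.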

Project resources

Official site: speaches.ai