// official site: github.com ↗

AUDIO & MUSIC · PRO TIER

XTTS-v2 (Coqui)

XTTS-v2 is Coqui AI's multilingual text-to-speech model: 17 languages, voice cloning from a 6-second reference sample, expressive delivery, and streaming output. It is a widely used open-weight TTS model and a common choice for self-hosted speech synthesis projects.

🎵 Audio & music · Min 6144 MB RAM · Port 5002 (http) · Tier pro
// What it is

A closer look.

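The voice equivalent of "open SDXL": high-quality open weights you can run on your own hardware. Note that the XTTS-v2 weights are released under the Coqui Public Model License (CPML), which restricts commercial use, so review the license before building a commercial product on it.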

// Use cases

What it's for.

Concrete scenarios where teams pick XTTS-v2 (Coqui) over the SaaS alternative.

Multilingual TTS

17 languages from one model

Voice cloning

6-second sample → speech in cloned voice

Real-time streaming

chunked audio output, low latency

Cross-lingual generation

English speaker → speak in Spanish/French/Italian

Emotion-aware delivery

natural prosody, not robotic

API server

REST endpoints for programmatic use

// Who it's for

Built for these teams.

If your team profile matches one of these, XTTS-v2 (Coqui) is a strong fit out of the box.

Profile A

Podcast producers

generating multi-language content

Profile B

Game studios

creating character voices

Profile C

Educational platforms

narrating content in multiple languages

Profile D

Marketers

producing demo videos at scale

Profile E

Accessibility teams

auto-narrating articles for screen readers

Profile F

Hosting providers

selling voice synthesis services

// Differentiators

Why teams pick XTTS-v2 (Coqui).

When evaluating self-hosted options for this category, these are the dimensions on which XTTS-v2 (Coqui) tends to compare well against the alternatives.

  • MPL-2.0 code, CPML weights: the Coqui TTS code is MPL-2.0; the XTTS-v2 weights use the Coqui Public Model License, which limits use to non-commercial purposes, so review the terms before any commercial deployment
  • 17 languages: broader coverage than F5-TTS or ChatTTS
  • Voice cloning quality: a usable clone from a 6-second reference sample
  • Streaming server: REST API with chunked audio output
  • Coqui pedigree: speech-tech veterans (formerly the Mozilla DeepSpeech team)
  • Active community: frequent fine-tuned forks for specific languages
// Integrations

Connects to.

The stack you'll plug XTTS-v2 (Coqui) into — services, protocols, and adjacent apps in the BluixApps catalog.

REST API server
/tts_stream endpoint for programmatic use
WebSocket support
for real-time streaming
Speaker library
reference voice samples stored persistently
HuggingFace integration
model versions tracked
Pair with Whisper
speech → text → translate → re-speak in new voice
Pair with LLM
text generation → XTTS narration
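
The Whisper and LLM pairings above are both chains of stages that end in XTTS narration. A minimal sketch of that shape, assuming each stage is wrapped as a plain Python callable; `DubbingPipeline` and its field names are hypothetical and not part of XTTS, Whisper, or the BluixApps catalog:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DubbingPipeline:
    """Chains speech-to-text, translation, and text-to-speech stages.

    Each stage is an injected callable, so any backend can be plugged in.
    """

    transcribe: Callable[[bytes], str]       # e.g. a Whisper wrapper: audio -> text
    translate: Callable[[str, str], str]     # (text, target_language) -> text
    synthesize: Callable[[str, str], bytes]  # e.g. an XTTS client: (text, language) -> audio

    def run(self, audio: bytes, target_language: str) -> bytes:
        text = self.transcribe(audio)
        translated = self.translate(text, target_language)
        return self.synthesize(translated, target_language)


# Stub stages stand in for real services so the wiring can be checked offline.
pipeline = DubbingPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    translate=lambda text, lang: f"[{lang}] {text}",
    synthesize=lambda text, lang: text.encode("utf-8"),
)
result = pipeline.run(b"hello", "es")
```

Keeping the stages as injected callables means the XTTS client can be swapped or mocked without touching the rest of the chain.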
// Adoption & deployment

Notable users & community

  • 33k+ GitHub stars (parent Coqui TTS repo)
  • Coqui AI (founded by ex-Mozilla DeepSpeech team)
  • Widely used as a reference point for open TTS
  • Used in commercial products + academic research
  • Active fine-tuning community on HuggingFace

What we ship

  • Docker (ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121)
  • Persistent volumes: models, output, speakers (reference voices)
  • COQUI_TOS_AGREED=1 + MODEL_NAME pre-set
  • Port 5002 (default XTTS) with Swagger docs at /docs
  • Install report at /root/bluixapps/xtts.txt
  • Acceptable Use Policy noted (no impersonation without consent)
  • Sample API calls for voice cloning + text-to-speech in install report
  • GPU pre-flight check via bluixapps_ensure_nvidia_runtime
  • Backup hook covers speakers + outputs
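
The install report carries the actual sample API calls. As a rough sketch of what a streaming request could look like from Python, assuming the server accepts a JSON body with the fields shown: the field names, the `speaker` dict layout, and the `http://localhost:5002` base URL are all assumptions, so confirm the real schema in the Swagger docs at /docs before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5002"  # assumption: host and port of your deployment


def build_tts_stream_request(text, language, speaker, base_url=BASE_URL, chunk_size=20):
    """Build (but do not send) a POST request for /tts_stream.

    `speaker` is a dict with the conditioning data for one voice. The payload
    keys below are assumptions; check /docs for the schema your server uses.
    """
    payload = {
        "text": text,
        "language": language,
        "speaker_embedding": speaker["speaker_embedding"],
        "gpt_cond_latent": speaker["gpt_cond_latent"],
        "add_wav_header": True,
        "stream_chunk_size": str(chunk_size),
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tts_stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_to_file(request, out_path, read_size=4096):
    """Send the request and write audio bytes to disk as they arrive."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        while True:
            chunk = response.read(read_size)
            if not chunk:
                break
            out.write(chunk)
```

Building the request separately from sending it makes the payload easy to inspect or log before any audio is generated.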
// Tips & operations

Run it properly.

Operational guidance from running this in production — what to lock down, what surprises people.

// PERFORMANCE
Reference voice
6-30 seconds, clean speech, low noise, single speaker
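
The duration part of this guidance can be checked before a sample is stored. A minimal sketch using only the standard library, assuming the reference is an uncompressed WAV file; `check_reference_wav` is a hypothetical helper, and it does not judge noise or the number of speakers, which still need a listen:

```python
import wave

MIN_SECONDS = 6.0   # lower bound from the guidance above
MAX_SECONDS = 30.0  # upper bound from the guidance above


def check_reference_wav(source):
    """Return a list of duration problems found in a WAV reference sample.

    `source` is a file path or a binary file-like object.
    An empty list means the duration is within the recommended range.
    """
    with wave.open(source, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    problems = []
    if duration < MIN_SECONDS:
        problems.append(f"too short: {duration:.1f}s < {MIN_SECONDS:.0f}s")
    if duration > MAX_SECONDS:
        problems.append(f"too long: {duration:.1f}s > {MAX_SECONDS:.0f}s")
    return problems
```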
// COMPATIBILITY
Languages supported
en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi
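
Rejecting unsupported codes on the client side avoids a wasted round trip to the server. A small sketch built from the list above; `validate_language` is a hypothetical helper, and the lowercase normalization is an assumption about how callers will pass codes:

```python
SUPPORTED_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
})


def validate_language(code):
    """Return the normalized language code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"unsupported language {code!r}; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return normalized
```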
// OPERATIONS
VRAM
6 GB of GPU memory is optimal; a CPU-only fallback works with about 4 GB of RAM, at noticeably slower synthesis speed
// RELIABILITY
Streaming mode
chunk size affects latency vs throughput tradeoff
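
One way to tune this tradeoff is to measure it: time to first chunk approximates perceived latency, and total time reflects throughput. A minimal sketch, assuming the audio arrives as an iterable of byte chunks (for example, reads from an HTTP response body); `measure_stream` is a hypothetical helper and the injectable `clock` exists only to make it testable:

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of audio byte chunks and report timing.

    Returns time to first chunk, total time, chunk count, and total bytes.
    """
    start = clock()
    first_chunk_at = None
    count = 0
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()
        count += 1
        total_bytes += len(chunk)
    end = clock()
    return {
        "time_to_first_chunk": None if first_chunk_at is None else first_chunk_at - start,
        "total_time": end - start,
        "chunks": count,
        "bytes": total_bytes,
    }
```

Running it against the same text at a few chunk sizes shows where latency stops improving for your hardware.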
// DEPLOYMENT
Speaker storage
/opt/xtts/speakers/ keeps your reference voices
// SCALING
Production
reverse proxy + auth, rate limiting via gateway
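
In production the rate limit belongs in the gateway itself, but the policy most gateways apply is a token bucket. The sketch below only illustrates that policy under stated assumptions; `TokenBucket` is a hypothetical class, not part of XTTS or any specific gateway, and the injectable `clock` is there for testing:

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Because synthesis cost grows with text length, passing a `cost` proportional to the number of characters may be a fairer limit than counting requests.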
// MAINTENANCE
License and acceptable use
check the CPML terms before commercial use; voice cloning has misuse potential, so obtain the speaker's consent and disclose AI-generated audio
Min RAM: 6144 MB
Min disk: 8 GB
Access port: 5002
Protocol: http
BluixApps tier: pro
