
AUDIO & MUSIC · PRO TIER

XTTS-v2 (Coqui) · pro

XTTS-v2 is Coqui AI's multilingual text-to-speech model: 17 languages, voice cloning from a 6-second sample, expressive, emotion-aware delivery, and streaming output. It is one of the most widely used open TTS models and a common default for self-hosted speech synthesis projects.

Install via WHMCS · Project page: github.com

Category: 🎵 Audio & music · Min RAM: 6144 MB · Port: 5002 (http) · Tier: pro

// What it is

A closer look.


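The voice equivalent of "open SDXL": top-tier open weights you can run on your own hardware, with one important caveat. The XTTS-v2 weights ship under the Coqui Public Model License (CPML), which permits non-commercial use only.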

// Use cases

What it's for.

Concrete scenarios where teams pick XTTS-v2 (Coqui) over a hosted SaaS TTS service.

◆

Multi-lingual TTS

17 languages from one model

◈

Voice cloning

6-second sample → speech in cloned voice

◇

Real-time streaming

chunked audio output, low latency

▣

Cross-lingual generation

English speaker → speak in Spanish/French/Italian

▦

Emotion-aware delivery

natural prosody, not robotic

▩

API server

REST endpoints for programmatic use

// Who it's for

Built for these teams.

If your team profile matches one of these, XTTS-v2 (Coqui) is a strong fit out of the box.

Profile A

Podcast producers

generating multi-language content

Profile B

Game studios

creating character voices

Profile C

Educational platforms

narrating content in multiple languages

Profile D

Marketers

producing demo videos at scale

Profile E

Accessibility teams

auto-narrating articles as audio for readers with visual impairments

Profile F

Hosting providers

selling voice synthesis services

// Differentiators

Why teams pick XTTS-v2 (Coqui).

When evaluating self-hosted options in this category, these are the dimensions on which XTTS-v2 (Coqui) tends to stand out from the alternatives.

✓MPL-2.0 code / CPML weights: open and self-hostable, but the CPML restricts the XTTS-v2 model to non-commercial use
✓17 languages — broader coverage than F5-TTS, ChatTTS
✓Voice cloning quality: convincing clones from a reference sample as short as 6 seconds
✓Streaming server — production-ready API
✓Coqui pedigree — speech-tech veterans (formerly Mozilla DeepSpeech team)
✓Active community — frequent fine-tuned forks for specific languages

// Integrations

Connects to.

The stack you'll plug XTTS-v2 (Coqui) into — services, protocols, and adjacent apps in the BluixApps catalog.

◇

REST API server

/tts_stream endpoint for programmatic use

◈

WebSocket support

for real-time streaming

◆

Speaker library

reference voice samples stored persistently

▣

Hugging Face integration

weights published as coqui/XTTS-v2 on Hugging Face, model versions tracked

▦

Pair with Whisper

speech → text → translate → re-speak in new voice

▩

Pair with LLM

text generation → XTTS narration
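When an LLM writes the narration, long replies are usually split into sentence-sized pieces before synthesis: as far as we know, the Coqui library warns when a single input exceeds a per-language character limit (around 250 characters for English) because audio may be truncated. A minimal sketch that packs sentences into chunks under a character budget; `chunk_for_tts` is a hypothetical helper, not part of any Coqui API, and the 250-character default is an assumption to adjust per language.

```python
import re

# Assumed budget; the real per-language limit depends on the model version.
DEFAULT_MAX_CHARS = 250

# Split on whitespace that follows ., ! or ? (a simple sentence boundary).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def _split_words(sentence: str, max_chars: int) -> list[str]:
    """Break one oversized sentence into word groups that fit max_chars."""
    parts: list[str] = []
    current = ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        # A single word longer than max_chars is kept whole as its own part.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def chunk_for_tts(text: str, max_chars: int = DEFAULT_MAX_CHARS) -> list[str]:
    """Pack sentences from LLM output into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        pieces = [sentence] if len(sentence) <= max_chars else _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # still fits: keep accumulating
            else:
                if current:
                    chunks.append(current)
                current = piece  # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own synthesis request and the audio concatenated in order; keeping whole sentences together avoids prosody breaks mid-sentence.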

// Adoption & deployment

Notable users & community

33k+ GitHub stars (parent Coqui TTS repo)
Coqui AI (founded by ex-Mozilla DeepSpeech team)
Industry-standard for open TTS
Used in commercial products + academic research
Active fine-tuning community on Hugging Face

What we ship

Docker (ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121)
Persistent volumes: models, output, speakers (reference voices)
COQUI_TOS_AGREED=1 (non-interactive acceptance of the CPML terms) + MODEL_NAME pre-set
Port 5002 (the Coqui TTS server's conventional port) with Swagger docs at /docs
Install report at /root/bluixapps/xtts.txt
Acceptable Use Policy noted (no impersonation without consent)
Sample API calls for voice cloning + text-to-speech in install report
GPU pre-flight check via bluixapps_ensure_nvidia_runtime
Backup hook covers speakers + outputs
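Voice cloning plus synthesis is a two-step flow on the streaming server: upload a reference clip once, then reuse the returned latents for every request. A minimal standard-library sketch that builds both requests, assuming the upstream xtts-streaming-server endpoints (`POST /clone_speaker` with a multipart `wav_file` field, `POST /tts_stream` taking `speaker_embedding`, `gpt_cond_latent`, `text`, `language` and `stream_chunk_size`) and the BluixApps port 5002. The field names come from our reading of the upstream test client, so confirm them on the Swagger page at /docs; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the server is reachable on localhost:5002; adjust for your host.
BASE_URL = "http://localhost:5002"


def build_clone_request(wav_bytes: bytes, base_url: str = BASE_URL) -> urllib.request.Request:
    """POST /clone_speaker: upload a reference WAV as the multipart field 'wav_file'."""
    boundary = "xtts-sketch-boundary"  # must not occur inside wav_bytes
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="wav_file"; filename="reference.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/clone_speaker",
        data=head + wav_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def build_tts_stream_request(speaker: dict, text: str, language: str,
                             stream_chunk_size: int = 20,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """POST /tts_stream: reuse the cloned latents to speak `text` in `language`."""
    payload = {
        # Field names assumed from the upstream test client; verify at /docs.
        "speaker_embedding": speaker["speaker_embedding"],
        "gpt_cond_latent": speaker["gpt_cond_latent"],
        "text": text,
        "language": language,
        "stream_chunk_size": str(stream_chunk_size),  # sent as a string upstream
    }
    return urllib.request.Request(
        f"{base_url}/tts_stream",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Nothing is sent until a request is passed to `urllib.request.urlopen`; for example, `json.load(urllib.request.urlopen(build_clone_request(wav)))` would return the latents to feed into `build_tts_stream_request`. Passing an English speaker's latents with `language="es"` is how cross-lingual generation works in this flow.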

// Tips & operations

Run it properly.

Operational guidance from running this in production — what to lock down, what surprises people.

// QUALITY

Reference voice

6-30 seconds, clean speech, low noise, single speaker
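A quick pre-flight for reference clips can at least enforce the 6-30 second window; noise level and single-speaker checks still need a human listen. This sketch uses the standard-library `wave` module, so it reads uncompressed PCM WAV only (convert MP3 or other formats first); `check_reference` is a hypothetical helper.

```python
import io
import wave

# Window from the tip above.
MIN_SECONDS, MAX_SECONDS = 6.0, 30.0


def reference_duration(wav_bytes: bytes) -> float:
    """Duration in seconds of an uncompressed PCM WAV clip."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(wav_bytes: bytes) -> list[str]:
    """Return problems found with a reference clip; [] means the length is in range."""
    seconds = reference_duration(wav_bytes)
    if seconds < MIN_SECONDS:
        return [f"too short: {seconds:.1f}s (need at least {MIN_SECONDS:.0f}s)"]
    if seconds > MAX_SECONDS:
        return [f"too long: {seconds:.1f}s (keep it under {MAX_SECONDS:.0f}s)"]
    return []
```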

// COMPATIBILITY

Languages supported

en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi
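A small guard can normalize user input to one of these 17 codes before it reaches the API, so typos fail fast with a clear message. The aliasing rules below (accepting `EN`, `zh_CN`, a bare `zh`, or region tags such as `pt-BR`) are our own assumptions, not server behavior.

```python
# The 17 language codes listed above, in the same order.
XTTS_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
)


def normalize_language(code: str) -> str:
    """Map user input such as 'EN' or 'zh_CN' to an XTTS code, or raise ValueError."""
    normalized = code.strip().lower().replace("_", "-")
    if normalized in XTTS_LANGUAGES:
        return normalized
    # Hypothetical aliases: a bare "zh" means Simplified Chinese, and region
    # tags such as "pt-br" or "en-us" fall back to their base language.
    if normalized == "zh":
        return "zh-cn"
    base = normalized.split("-")[0]
    if base in XTTS_LANGUAGES:
        return base
    raise ValueError(
        f"unsupported language {code!r}; expected one of {', '.join(XTTS_LANGUAGES)}"
    )
```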

// RESOURCES

VRAM

6 GB of GPU VRAM is optimal; CPU fallback also works (about 4 GB of memory), but expect much slower synthesis

// PERFORMANCE

Streaming mode

chunk size trades latency for throughput: smaller chunks start playback sooner, larger chunks are more efficient overall
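To tune chunk size empirically, time the first audio bytes against the full response. In the upstream server this knob is, as far as we know, the `stream_chunk_size` request field, and its test client notes that smaller values respond faster at some cost in quality. The helper below is hypothetical and works on any file-like stream, such as the object `urllib.request.urlopen` returns for a streaming request. Note that `read(read_size)` on an HTTP response waits for `read_size` bytes, so this measures time to the first 4 KiB rather than to the first server chunk.

```python
import time
from typing import BinaryIO, Callable


def measure_stream(stream: BinaryIO, read_size: int = 4096,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Drain a streamed audio response, timing the first read and the whole transfer."""
    start = clock()
    first_chunk_s = None
    total_bytes = 0
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break  # end of stream
        if first_chunk_s is None:
            first_chunk_s = clock() - start  # time until the first read_size bytes arrived
        total_bytes += len(chunk)
    return {"first_chunk_s": first_chunk_s, "total_s": clock() - start, "bytes": total_bytes}
```

Running it for a few `stream_chunk_size` values on the same text makes the tradeoff visible: the first number is what a listener waits before hearing audio, the second is the cost of the full clip.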

// DEPLOYMENT

Speaker storage

/opt/xtts/speakers/ keeps your reference voices
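One option, an assumption rather than shipped behavior, is to cache each voice's `/clone_speaker` output as JSON next to the reference samples, so a voice is cloned once and reused across requests and restarts. `save_speaker` and `load_speaker` are hypothetical helpers; speaker names are whitelisted so a client-supplied name cannot write outside the speakers directory.

```python
import json
import re
from pathlib import Path

# Assumption: the speakers directory from the tip above; pass another path to test.
SPEAKER_DIR = Path("/opt/xtts/speakers")
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def _speaker_path(name: str, directory: Path) -> Path:
    # Whitelisting characters keeps names like "../x" from escaping the directory.
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError(f"invalid speaker name: {name!r}")
    return directory / f"{name}.json"


def save_speaker(name: str, latents: dict, directory: Path = SPEAKER_DIR) -> Path:
    """Persist /clone_speaker output so each voice is cloned once, then reused."""
    path = _speaker_path(name, directory)
    directory.mkdir(parents=True, exist_ok=True)
    tmp = directory / f"{name}.json.tmp"
    tmp.write_text(json.dumps(latents))
    tmp.replace(path)  # atomic rename on POSIX: readers never see a partial file
    return path


def load_speaker(name: str, directory: Path = SPEAKER_DIR) -> dict:
    """Read cached latents back, ready to pass to a synthesis request."""
    return json.loads(_speaker_path(name, directory).read_text())
```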

// SECURITY

Production

put it behind a reverse proxy with authentication, and enforce rate limits at the gateway
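Most gateways ship rate limiting out of the box (for example nginx's `limit_req` module); if you write your own middleware instead, a per-client token bucket is the usual shape. A minimal sketch; `TokenBucket` is hypothetical and not tied to any particular gateway, and the clock is injectable so the logic can be tested deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-client limiter: refills `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so a client may burst immediately
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False to reject the request."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keep one bucket per API key (for example in a dict keyed by key). Because synthesis time grows with text length, charging `cost` per block of characters rather than per request is one reasonable design choice for a TTS endpoint.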

// COMPLIANCE

License caveat

the CPML limits the XTTS-v2 weights to non-commercial use; voice cloning also has misuse potential, so disclose AI-generated audio

6144 // min ram (MB)

// min disk (GB)

5002 // access port

http // protocol

pro // bluixapps tier


Project resources

Official site: github.com