Cartesia Sonic

Cartesia Sonic 3 is a streaming text-to-speech (TTS) model that converts written text into realistic, expressive speech at low latency, typically starting audio playback in about 90 milliseconds. It serves as the "vocal cords" for AI agents, enabling fluid, natural voice conversations with human-like emotion, tone, and even laughter.

Pricing

Free

$/mo

Get introduced to ultra-low-latency voice AI through core models and your own voice agent
20K credits for models
1 prepaid for agents
Personal use
Discord support

Pro

$/mo

Upgrade for instant voice cloning and to try voice AI in production for commercial use
100K credits for models
5 prepaid for agents
Instant voice cloning
Commercial use

Startup

$/mo

For teams that are starting to use voice AI in production and need shared API keys, pro voice cloning, and multiple agents
1.25M credits for models
49 prepaid for agents
Pro voice cloning
Organizations

Scale

$/mo

For businesses with large-scale use cases that require high concurrency and multiple agents
8M credits for models
299 prepaid for agents
Priority support
High concurrency limits

Enterprise

$/mo

Custom supported models and agents with mission-critical guarantees for uptime, security, and compliance
Custom usage pricing
Custom concurrency
Priority and Enterprise support via Slack
Enterprise-grade security & compliance
Single Sign-On (SSO)
PCI compliance
Custom SLAs
Custom security review
HIPAA compliance

Details

Pricing Tier

Freemium
