Nari Labs: Advanced Open-Source Text-to-Speech
Experience Nari Labs' revolutionary multi-speaker TTS technology, which generates natural, expressive voices and complete multi-character dialogue directly from a single text input
What is Nari Labs?
Team & Background
Nari Labs is a pioneering AI voice technology company founded between late 2024 and early 2025. Operating remotely and based primarily in Korea, the team is dedicated to creating high-quality open-source voice models with minimal resources, making advanced speech technology accessible to researchers and developers worldwide.
Open-Source Mission
With a focus on democratizing voice AI technology, Nari Labs releases their models under the permissive MIT license, which allows commercial use and redistribution. Their flagship technologies aim to provide open-source alternatives to commercial solutions, delivering comparable or superior quality with complete freedom for adaptation and integration.
Nari Labs: Next-Gen TTS
A powerful multi-speaker text-to-speech technology that generates natural, expressive voices and complete multi-character dialogue from a single text input
Model Architecture
Built on a diffusion-decoder framework with a multi-speaker conditional encoder, Nari Labs' technology outputs 24 kHz, dual-channel audio directly. At approximately 1.6B parameters, it balances quality and efficiency for real-time generation.
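As a concrete illustration, loading the model and generating audio might look like the sketch below. The `nari_tts` package, `NariTTS` class, and checkpoint name are hypothetical placeholders rather than the confirmed Nari Labs API; only the 24 kHz, dual-channel output format comes from the description above.

```python
# Hypothetical usage sketch -- package, class, and checkpoint names are
# illustrative placeholders, not the confirmed Nari Labs API.
import soundfile as sf

from nari_tts import NariTTS  # hypothetical package

# Load the ~1.6B-parameter checkpoint (hypothetical identifier).
model = NariTTS.from_pretrained("nari-labs/tts-1.6b")

# Generate 24 kHz, dual-channel audio from plain text.
audio = model.generate("Hello from Nari Labs!")  # ndarray, shape (n_samples, 2)
sf.write("hello.wav", audio, samplerate=24000)
```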
Language Support
Initially optimized for English, Nari Labs' models have been successfully fine-tuned by the community for other languages including Chinese and Japanese, demonstrating their adaptability across linguistic boundaries.
Performance
Capable of real-time or near-real-time inference on consumer-grade GPUs like the RTX 4090, Nari Labs delivers professional-quality voice synthesis with an emotional range comparable to leading commercial solutions.
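A common way to check a real-time claim is the real-time factor (RTF): wall-clock generation time divided by the duration of the generated audio, where an RTF below 1.0 means faster than real time. A minimal benchmark sketch, reusing the hypothetical `NariTTS` API from the example above:

```python
# Benchmark sketch using the hypothetical NariTTS API from the earlier example.
import time

from nari_tts import NariTTS  # hypothetical package

model = NariTTS.from_pretrained("nari-labs/tts-1.6b")

start = time.perf_counter()
audio = model.generate("A short benchmark sentence for timing synthesis.")
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / 24000  # 24 kHz sample rate
rtf = elapsed / audio_seconds       # RTF < 1.0 means faster than real time
print(f"Generated {audio_seconds:.2f}s of audio in {elapsed:.2f}s (RTF {rtf:.2f})")
```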
Key Features & Capabilities
Nari Labs leverages cutting-edge AI technology to deliver exceptional text-to-speech capabilities with unique multi-character dialogue generation.
Multi-Character Dialogue
Generate complete conversations with multiple distinct voices from a single text input, with automatic character switching and appropriate emotional tones.
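For instance, a single tagged script could drive an entire conversation. The `[S1]`/`[S2]` speaker-tag syntax below is an assumed convention for illustration, not documented input format, and `NariTTS` is the same hypothetical placeholder used above.

```python
# Sketch: one text input, multiple voices. The [S1]/[S2] speaker tags are
# an assumed convention, not confirmed syntax.
from nari_tts import NariTTS  # hypothetical package

model = NariTTS.from_pretrained("nari-labs/tts-1.6b")

script = (
    "[S1] Did the new checkpoint finish training? "
    "[S2] It did, and the listening tests look great. "
    "[S1] Perfect, let's publish the demo."
)
audio = model.generate(script)  # voice switches automatically at each tag
```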
Emotional Control
Fine-tune voice generation with explicit emotion tags or prosody embeddings to achieve the perfect tone, from excited and energetic to calm and contemplative.
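Explicit emotion control could be exposed in a couple of ways; both the inline tag and the `emotion` keyword argument below are assumptions for illustration, not confirmed syntax.

```python
# Sketch: two assumed ways to steer emotion -- an inline tag or a keyword
# argument. Neither is confirmed Nari Labs syntax.
from nari_tts import NariTTS  # hypothetical package

model = NariTTS.from_pretrained("nari-labs/tts-1.6b")

# Inline emotion tag embedded in the text (assumed convention).
audio_a = model.generate("<excited> We just hit a major milestone!")

# Emotion passed as a generation parameter (assumed parameter).
audio_b = model.generate("Take a deep breath and relax.", emotion="calm")
```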
Speed Control
Adjust speaking pace to match your needs, whether for natural conversation, rapid information delivery, or dramatic emphasis in storytelling.
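Pace control is often exposed as a simple multiplier on speaking rate; the `speed` parameter below is an assumption in the same hypothetical API.

```python
# Sketch: speaking-rate control via a hypothetical `speed` multiplier
# (1.0 = default pace; the values here are illustrative).
from nari_tts import NariTTS  # hypothetical package

model = NariTTS.from_pretrained("nari-labs/tts-1.6b")

narration = model.generate("Once upon a time...", speed=0.85)         # slower, for drama
briefing = model.generate("Here are today's headlines.", speed=1.25)  # faster delivery
```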
High-Quality Audio
Produces 24 kHz, dual-channel audio with a HiFi-GAN-V2 vocoder and optional SDPA-diffusion for enhanced quality, delivering crisp, natural-sounding speech.
Open-Source Freedom
Released under MIT license, allowing full commercial use and redistribution of model weights, enabling seamless integration into your applications and services.
Community Ecosystem
Benefit from a growing ecosystem of fine-tuned models, integration guides, and creative applications built around Nari Labs by an active developer community.
Technical Details
Explore the advanced technology behind Nari Labs' exceptional performance
Training Data
- Large-scale public dialogue speech datasets (LibriTTS, LibriLight-TTS)
- Custom YouTube data with automatic alignment
- Diverse speaker profiles for multi-voice capabilities
Model Architecture
- Diffusion-decoder framework with multi-speaker conditional encoder
- Speaker and emotion embeddings explicitly concatenated during training
- HiFi-GAN-V2 vocoder with optional SDPA-diffusion for enhanced quality
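The pieces listed above can be sketched as a module skeleton. This is a structural illustration only: it assumes PyTorch, simplified layer sizes, and a plain feed-forward stand-in for the diffusion decoder, and is not the actual Nari Labs implementation.

```python
# Structural sketch of the described pipeline in PyTorch. Layer sizes and
# wiring are illustrative assumptions, not the actual Nari Labs code.
import torch
import torch.nn as nn


class SpeakerConditionalEncoder(nn.Module):
    """Encodes text tokens with speaker/emotion embeddings explicitly concatenated."""

    def __init__(self, vocab=256, dim=512, n_speakers=64, n_emotions=8):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.speaker_emb = nn.Embedding(n_speakers, dim)
        self.emotion_emb = nn.Embedding(n_emotions, dim)
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, tokens, speaker_id, emotion_id):
        t = self.text_emb(tokens)                        # (B, T, D)
        s = self.speaker_emb(speaker_id)[:, None, :].expand_as(t)
        e = self.emotion_emb(emotion_id)[:, None, :].expand_as(t)
        # Explicit concatenation of speaker and emotion conditioning.
        return self.proj(torch.cat([t, s, e], dim=-1))   # (B, T, D)


class DiffusionDecoder(nn.Module):
    """Feed-forward stand-in for the diffusion decoder predicting acoustic features."""

    def __init__(self, dim=512, mel_bins=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, mel_bins)
        )

    def forward(self, cond):
        return self.net(cond)  # (B, T, mel_bins) acoustic features


# A vocoder (HiFi-GAN-V2 in the description above) would then turn the
# acoustic features into a 24 kHz, dual-channel waveform.
enc = SpeakerConditionalEncoder()
dec = DiffusionDecoder()
tokens = torch.randint(0, 256, (1, 32))
cond = enc(tokens, torch.tensor([3]), torch.tensor([1]))
mel = dec(cond)
print(mel.shape)  # torch.Size([1, 32, 80])
```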
Nari Labs Roadmap
Nari Labs' vision for the evolution of their voice AI technology
Model Scaling
Expansion to 4-5B parameters for enhanced quality and capabilities while maintaining efficient inference
Streaming TTS
Real-time streaming capabilities for immediate voice generation as text is input
Multilingual Expansion
Native support for additional languages beyond English, with preserved emotional range and natural prosody
Creative Tools Integration
Enhanced compatibility with video generation tools like Pika and Runway for complete AI-powered storytelling workflows
Frequently Asked Questions
Find answers to commonly asked questions about Nari Labs and their technology
What makes Nari Labs different from other TTS providers?
Nari Labs uniquely combines multi-character dialogue generation, high-quality audio output, and open-source accessibility. Their technology can generate entire conversations with distinct voices from a single text input, a capability typically limited to proprietary commercial solutions.
What hardware requirements are needed to run Nari Labs models?
For real-time or near-real-time inference, a consumer-grade GPU like the RTX 4090 is recommended, though the models can run on less powerful hardware with longer generation times. Memory requirements are moderate thanks to the compact 1.6B-parameter size.
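As a back-of-the-envelope check on that memory claim: 1.6B parameters stored in 16-bit precision occupy roughly 3 GB of VRAM for the weights alone, before activations and framework overhead.

```python
# Rough VRAM estimate for the weights alone (excludes activations,
# attention buffers, and framework overhead).
params = 1.6e9          # ~1.6B parameters
bytes_per_param = 2     # fp16 / bf16
weight_gb = params * bytes_per_param / 1024**3
print(f"~{weight_gb:.1f} GB for weights in 16-bit precision")  # ~3.0 GB
```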
Can I use Nari Labs technology for commercial projects?
Yes, Nari Labs releases their technology under the MIT license, which allows commercial use and redistribution of the model weights. You can integrate it into commercial products and services; the license's only requirement is retaining the copyright and license notice, and further attribution is appreciated.
Does Nari Labs support languages other than English?
While primarily optimized for English, the community has successfully fine-tuned Nari Labs models for other languages including Chinese and Japanese. Nari Labs plans to expand native multilingual support in future versions.
How can I contribute to the Nari Labs ecosystem?
You can contribute by fine-tuning the models for new languages, developing integration tools, reporting issues, or creating demonstrations. The GitHub repository and Hugging Face Space provide starting points for community engagement.
What's next for Nari Labs?
Nari Labs is working on scaling their models to 4-5B parameters, implementing real-time streaming TTS, expanding multilingual support, and enhancing integration with creative tools for complete AI-powered storytelling workflows.