The MOSS-TTS Family is an open-source family of speech and sound generation models designed for high-fidelity, expressive generation in complex real-world scenarios. It comprises five models: MOSS-TTS for high-fidelity speech generation, MOSS-TTSD for expressive multi-speaker dialogue, MOSS-VoiceGenerator for voice design, MOSS-TTS-Realtime for real-time voice agents, and MOSS-SoundEffect for sound effect generation. The family supports 20 languages.
MOSS-TTS-v1.0 outperforms the open-source and closed-source models it was compared against on Speaker Switch Accuracy, Speaker Similarity, and Word Error Rate. On the Seed-TTS-eval benchmark it achieves state-of-the-art results, surpassing all open-source models and rivalling leading closed-source systems.
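Of the three metrics above, Word Error Rate is the simplest to reproduce by hand: it is the word-level edit distance between a reference transcript and the transcript recognized from the generated audio, divided by the number of reference words. The following is a minimal sketch of that definition only. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; the ASR model and text normalization used in the actual evaluation are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which is an assumption, not the benchmark's rule.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("the cat sat", "the dog sat"))
```

Lower is better, and because insertions count as errors the value can exceed 1.0 when the hypothesis is much longer than the reference.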
MOSS-Audio-Tokenizer, built on the Cat (Causal Audio Tokenizer with Transformer) architecture, serves as the unified discrete audio interface for the MOSS-TTS Family. It achieves state-of-the-art reconstruction quality among open-source audio tokenizers, combining a high compression ratio with high-fidelity reconstruction.