GitHub - OpenMOSS/MOSS-TTS: MOSS‑TTS Family is an open‑source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high‑fidelity, high‑expressiveness, and complex real‑world scenarios, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.
The family comprises five models: MOSS-TTS for high-fidelity speech generation, MOSS-TTSD for expressive multi-speaker dialogue, MOSS-VoiceGenerator for voice design, MOSS-TTS-Realtime for real-time voice agents, and MOSS-SoundEffect for sound-effect generation. Together they support 20 languages.
MOSS-TTS-v1.0 outperforms other open-source and closed-source models in Speaker Switch Accuracy, Speaker Similarity, and Word Error Rate. It achieves state-of-the-art results on the Seed-TTS-eval benchmark, surpassing all open-source models and rivalling leading closed-source systems.
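Word Error Rate is the one metric above with a simple closed form: WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, and N is the number of reference words. Lower is better. In TTS evaluation the hypothesis is usually an ASR transcript of the generated audio and the reference is the input text. The sketch below is a generic illustration, not the Seed-TTS-eval scoring code. The function name `word_error_rate` is hypothetical, and it skips the text normalization (casing, punctuation) and character-level scoring for languages such as Chinese that a real evaluation pipeline would apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical helper: WER = (S + D + I) / N via word-level Levenshtein distance.

    No text normalization is applied; inputs are split on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match when cost == 0)
            )
    return dp[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substitution (fox -> dog) plus one insertion (jumps) over 4 reference words.
    print(word_error_rate("the quick brown fox", "the quick brown dog jumps"))  # 0.5
```

Because insertions are counted against N, WER can exceed 1.0 when the transcript is much longer than the reference. This is one reason benchmarks report it alongside speaker similarity rather than on its own.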
MOSS-Audio-Tokenizer, based on the Cat architecture, is a unified discrete audio interface for the MOSS-TTS Family. It achieves state-of-the-art reconstruction quality among open-source audio tokenizers, with extreme compression and high fidelity.