Overview
What is Microsoft MAI-Voice-2?
Microsoft MAI-Voice-2 is a text-to-speech (TTS) model developed by Microsoft AI that generates expressive, natural-sounding, high-fidelity speech for individuals and organizations. It supports voice cloning across 15 languages from minimal audio input. Compared with previous iterations, it offers higher fidelity, broader language coverage, more consistent speaker identity, and a wider emotional range. Its core capabilities are natural and expressive speech synthesis, multilingual support, voice prompting (cloning), granular emotion control, and long-form speech generation.

Launched around June 2, 2026, MAI-Voice-2 is part of Microsoft AI's multimodal MAI family, which also includes models for reasoning (MAI-Thinking-1), image generation (MAI-Image-2.5), and speech-to-text (MAI-Transcribe-1.5). Microsoft states that it is committed to responsible AI development and that it aligns its internal policies and product development with regulatory frameworks such as the EU AI Act.