Review DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation ~ Seed-TTS

Author: Olewave

Uploaded: 2025-04-18

Views: 476

Description:

This work was done by the group of researchers who built Seed-TTS. For the review of Seed-TTS, see:
• Review ByteDance/Tiktok's Seed-TTS: A Fami...

Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
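
The divide-and-conquer flow described above can be illustrated with a minimal sketch. It shows only the control flow: patches are aggregated into one embedding each, a language model reads those embeddings, and a separate generator produces the frames of the next patch. Everything here is an assumption made for illustration: the names `aggregate`, `generate`, `toy_language_model`, and `toy_patch_generator` are hypothetical, mean pooling stands in for whatever aggregation the paper uses, and the toy generator is a placeholder, not a diffusion transformer. The temperature mechanism is not modelled.

```python
from typing import Callable, List

Frame = List[float]   # one continuous speech token (toy: a small vector)
Patch = List[Frame]   # a fixed-size group of consecutive frames


def aggregate(patch: Patch) -> Frame:
    """Collapse a patch into one embedding (assumed stand-in: mean over frames)."""
    dim = len(patch[0])
    return [sum(frame[d] for frame in patch) / len(patch) for d in range(dim)]


def generate(prompt: List[Patch],
             language_model: Callable[[List[Frame]], Frame],
             patch_generator: Callable[[Frame, int], Patch],
             num_new_patches: int) -> List[Patch]:
    """Patch-level autoregressive loop.

    The language model only ever sees one embedding per patch; the frames
    inside the next patch are produced by a separate generator that is
    conditioned on the language model's output.
    """
    patches = list(prompt)
    patch_size = len(prompt[0])
    for _ in range(num_new_patches):
        embeddings = [aggregate(p) for p in patches]   # one vector per patch
        condition = language_model(embeddings)         # summary of the history
        patches.append(patch_generator(condition, patch_size))
    return patches


# Hypothetical stand-ins so the sketch runs; neither is a real model.
def toy_language_model(embeddings: List[Frame]) -> Frame:
    # Echo the latest patch embedding instead of running a transformer.
    return embeddings[-1]


def toy_patch_generator(condition: Frame, patch_size: int) -> Patch:
    # Placeholder for the diffusion transformer: copy the condition into each frame.
    return [list(condition) for _ in range(patch_size)]


prompt = [[[0.0, 1.0], [2.0, 3.0]]]   # one patch holding two 2-dimensional frames
output = generate(prompt, toy_language_model, toy_patch_generator, num_new_patches=2)
```

Under these assumptions the language model runs once per patch rather than once per frame, which is one way to read the abstract's claim about reduced computational demands.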

#DiTAR #diffusionmodels #diffusion #transformers #seedtts #seed-tts #tts #zeroshottts #zeroshoticl #voiceclone #voiceconversion #deepfake #bytedance #tiktok #genai #speechgenai #cmos

Eager to train your own #Whisper or #GPT-4o model but running out of data? We are proud to offer this unique large-scale conversational speech dataset in different languages and topics for #ASR, #TTS, #NLP, and other conversational AI R&D. It has speaker labels and high quality transcriptions. The duration of the dataset depends on the customer's needs and can extend up to 1 million hours. See the description and samples in the following post:
/ olewave-large-scaled-convesational-speech-...
Send an email to info@olewave.com for more details.

