Natural Language Guidance of High-Fidelity Text-to-speech with Synthetic Annotations

CoRR (2024)

Abstract

Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.

Key words

Language Modeling, Natural Language Generation, Spoken Dialogue Systems, Part-of-Speech Tagging, Syntax-based Translation Models
