Natural Language Guidance of High-Fidelity Text-to-speech with Synthetic Annotations
CoRR (2024)
Abstract
Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k-hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.
Key words
Language Modeling, Natural Language Generation, Spoken Dialogue Systems, Part-of-Speech Tagging, Syntax-based Translation Models