Tu Anh Nguyen*, Wei-Ning Hsu*,
Antony D'Avirro*, Bowen Shi*,
Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet,
Gabriel Synnaeve, Michael Hassid, Felix Kreuk,
Yossi Adi+, Emmanuel Dupoux+
Meta AI Research
We introduce Expresso, a high-quality (48kHz) expressive speech dataset that includes both expressively rendered read speech (8 styles, in mono wav format) and improvised dialogues (26 styles, in stereo wav format). The dataset covers 4 speakers (2 male, 2 female) and totals 40 hours (11h read, 30h improvised). Transcriptions of the read speech are also provided. The task of the Expresso Benchmark is to resynthesize the input audio using a low-bitrate discrete code obtained without supervision from text.
Here, we provide illustrative samples of the Expresso dataset and of resynthesis results from baseline systems using discrete HuBERT and Encodec units.
The Expresso dataset contains 8 read styles (including narration) and 26 improvised styles, totaling 11 hours of read speech and 30 hours of improvised speech.
Style | Read (min) | Improvised (min) | Total (h)
---|---|---|---
angry | - | 82 | 1.4
animal | - | 27 | 0.4
animal_directed | - | 32 | 0.5
awe | - | 92 | 1.5
bored | - | 92 | 1.5
calm | - | 93 | 1.6
child | - | 28 | 0.4
child_directed | - | 38 | 0.6
confused | 94 | 66 | 2.7
default | 133 | 158 | 4.9
desire | - | 92 | 1.5
disgusted | - | 118 | 2.0
enunciated | 116 | 62 | 3.0
fast | - | 98 | 1.6
fearful | - | 98 | 1.6
happy | 74 | 92 | 2.8
laughing | 94 | 103 | 3.3
narration | 21 | 76 | 1.6
non_verbal | - | 32 | 0.5
projected | - | 94 | 1.6
sad | 81 | 101 | 3.0
sarcastic | - | 106 | 1.8
singing | - | 4 | 0.1
sleepy | - | 93 | 1.5
sympathetic | - | 100 | 1.7
whisper | 79 | 86 | 2.8
**Total** | 11.5h | 34.4h | 45.9h
The audio was recorded in a professional recording studio with minimal background noise at 48kHz/24bit. Read speech and singing files are in mono wav format; dialogue files are in stereo, with one channel per actor, preserving the original flow of turn-taking.
We also provide transcriptions for the read-speech section of the dataset.
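Because the dialogue recordings keep one actor per stereo channel, each side of a conversation can be recovered by simple channel splitting. Here is a minimal sketch using torchaudio; the file paths are hypothetical examples, not the actual dataset layout:

```python
import torchaudio

# Load a stereo dialogue file (48 kHz, one actor per channel).
# The path below is a hypothetical example of a dialogue recording.
waveform, sample_rate = torchaudio.load("expresso/dialog_angry_ex01.wav")
assert sample_rate == 48000 and waveform.shape[0] == 2

# Split the conversation into one mono track per actor.
actor_a = waveform[0:1, :]  # channel 0
actor_b = waveform[1:2, :]  # channel 1

torchaudio.save("actor_a.wav", actor_a, sample_rate)
torchaudio.save("actor_b.wav", actor_b, sample_rate)
```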
We present here samples from the Expresso dataset, selected to illustrate its diversity of expression.
This section comprises short sentences read by speakers in 7 different styles: default, confused, enunciated, happy, laughing, sad and whisper.
*Audio samples from speakers ex01–ex04 for each of the styles default, confused, enunciated, happy, laughing, sad, and whisper.*
In this section, the speakers were asked to expressively read long-form narrative pieces (in the narration style) or a news article (in the default style). The samples here are limited to two speakers and to 20 seconds.
*Audio samples from speakers ex01 and ex04 in the narration style.*
The dataset also includes a default emphasis substyle, in which the speakers are asked to emphasize different words in the sentence. Emphasized words or spans are enclosed in asterisks, as in the samples below.
- Was this element *always* there?
- Was *this* element always *there?*
- Maybe *you* should change your priorities.
- Maybe you should change your *priorities*.
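Since emphasis is marked with asterisks in the transcriptions, emphasized spans can be extracted with a simple regular expression. A minimal sketch, using one of the sentences above:

```python
import re

text = "Was *this* element always *there?*"

# Emphasized spans are delimited by asterisks in the transcription.
emphasized = re.findall(r"\*([^*]+)\*", text)
print(emphasized)  # ['this', 'there?']

# Plain text with the emphasis markers stripped.
plain = re.sub(r"\*([^*]+)\*", r"\1", text)
print(plain)  # Was this element always there?
```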
We also provide a tiny (4 minutes) set in the singing style, where the speakers sing songs a cappella. The samples here are limited to 20 seconds, and the set is restricted to copyright-free songs.
*Audio samples from speakers ex02 and ex03 in the singing style.*
In this section, the speakers are prompted to improvise a dialogue with another speaker in an imaginary situation (e.g., two drivers arguing about a car accident). To illustrate the diversity of the dataset, we selected two representative conversations for 9 of the available styles. The files are in stereo, with one speaker per channel. The samples here are limited to 30 seconds.
*Two conversation samples for each of the styles angry, animal-animaldir, child-childdir, disgusted, laughing, projected, sad-sympathetic, sarcastic, and nonverbal.*
The Expresso dataset can be downloaded from the following repository.
LICENSE. The Expresso dataset is distributed under the CC BY-NC 4.0 license.
Here, we present resynthesis results for two classes of models: Encodec-based and HuBERT-based. The Encodec-based models are trained directly with a compression objective to encode and decode speech using a single codebook (Encodec 1) or 8 codebooks (Encodec 8). The HuBERT-based models use a pretrained HuBERT encoder: V1, our best model, is trained on a mixture of read and spontaneous corpora; the base model is trained on LibriSpeech. The HuBERT embeddings (layer 12 for V1, layer 9 for base) are discretized with k-means clustering (k=2000 or 500) fitted on either Expresso (E) or the same training set as the HuBERT model (V1 or LS, respectively). Finally, a HiFi-GAN vocoder trained on Expresso, VCTK, and LJ Speech converts the discrete units back to speech. The vocoder is conditioned either on speaker (S) or on both speaker and expression (S+E). The best model is on the left (V1/2000_E/S+E).
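As a rough illustration of this unit-resynthesis pipeline, here is a minimal sketch. It substitutes the off-the-shelf torchaudio HuBERT Base bundle for our checkpoints; the k-means codebook file and the `vocoder` handle are hypothetical stand-ins, not a released API:

```python
import itertools

import joblib  # assumed: a pickled scikit-learn k-means codebook
import torch
import torchaudio

# 1. Load the input audio and resample to the 16 kHz HuBERT input rate.
waveform, sr = torchaudio.load("sample.wav")  # hypothetical path
waveform = waveform.mean(0, keepdim=True)  # downmix to mono
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# 2. Extract continuous features from an intermediate HuBERT layer
#    (layer 9 for the base model; layer 12 for V1).
bundle = torchaudio.pipelines.HUBERT_BASE  # stand-in for our checkpoints
hubert = bundle.get_model().eval()
with torch.inference_mode():
    layers, _ = hubert.extract_features(waveform, num_layers=9)
    feats = layers[-1].squeeze(0).numpy()  # (frames, dim), one frame per 20 ms

# 3. Discretize every frame to its nearest k-means centroid (k=500 or 2000)
#    and collapse runs of consecutive identical units.
kmeans = joblib.load("kmeans_expresso_2000.bin")  # hypothetical codebook file
units = [int(u) for u, _ in itertools.groupby(kmeans.predict(feats))]

# 4. Decode the unit sequence back to a waveform with the unit vocoder,
#    conditioned on a speaker and, optionally, an expression label.
#    `vocoder` is a hypothetical handle to the HiFi-GAN described above.
# audio = vocoder(units, speaker="ex01", expression="whisper")
```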
We present representative samples of the most difficult expressions from the dev set of the Expresso dataset, along with the resynthesized outputs of our encoder/decoder models, in order to illustrate the challenges of the dataset.
We divide the samples into Read and Conversation sections. The singing examples illustrate the fact that HuBERT units do not encode pitch, and many extreme pitch excursions (as in the child or animal styles) result in voiceless renderings. In contrast, Encodec units encode all expressions well.
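Part of this contrast comes down to bitrate. Assuming HuBERT units at a 50 Hz frame rate and the standard 24 kHz Encodec configuration (75 Hz frames, 1024-entry codebooks), a back-of-the-envelope calculation (ignoring the savings from deduplicating repeated units) gives:

$$
\begin{aligned}
\text{HuBERT units } (k{=}2000) &: 50 \times \log_2 2000 \approx 0.55\ \text{kbps} \\
\text{Encodec 1} &: 75 \times \log_2 1024 = 0.75\ \text{kbps} \\
\text{Encodec 8} &: 8 \times 75 \times \log_2 1024 = 6\ \text{kbps}
\end{aligned}
$$

The Encodec 8 code thus has roughly ten times the capacity of the HuBERT units, which helps explain why it preserves pitch and voice detail that the units discard.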
Here, the decoder is conditioned on the same speaker and style as the input speech.
*Read samples: two per style (default, confused, enunciated, happy, laughing, sad, whisper, singing), each presented as the original followed by resyntheses from Encodec 8, Encodec 1, V1/2000_E/S+E, V1/2000_E/S, V1/2000_V1/S+E, V1/2000_V1/S, base/2000_E/S+E, base/2000_E/S, base/500_LS/S+E, and base/500_LS/S.*
*Conversation samples: two per style (sad, angry, animal, fearful, child, fast, laughing, disgusted, projected, nonverbal), with the same original and model columns as above.*
In this section, the vocoder is asked to reconstruct an expressive sample from Expresso, but using a speaker voice that is outside of the Expresso dataset (the LJ voice) and was therefore never recorded with this particular expression.
This tests whether the Expresso styles generalize beyond the small set of 4 speakers to standard TTS voices. Note that the LJ voice can be made to laugh and whisper and, to a certain extent, to change its tone of voice according to the required expressive style.
Whereas HuBERT units are speaker-independent and can be used to resynthesize a message in a target voice that differs from the input voice, Encodec units represent their input holistically, with all of its acoustic characteristics including voice, and therefore cannot perform voice transfer. This is why we only present results with HuBERT models.
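In terms of the sketch above, voice transfer touches only the final vocoding step: the discrete units are kept as-is and the vocoder is simply conditioned on a different target speaker. A minimal sketch (again, `vocoder`, `units`, and the speaker labels are hypothetical stand-ins):

```python
import torch

def resynthesize(vocoder, units: list, speaker: str, expression: str) -> torch.Tensor:
    """Decode HuBERT units with the (hypothetical) unit vocoder.

    The units carry no speaker identity, so changing `speaker` transfers
    the message to a new voice while the content is preserved.
    """
    return vocoder(units, speaker=speaker, expression=expression)

# Same units, two target voices (speaker labels are hypothetical):
# resynthesize(vocoder, units, speaker="ex01", expression="whisper")  # original voice
# resynthesize(vocoder, units, speaker="lj", expression="whisper")    # LJ voice transfer
```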
*Read samples resynthesized in the LJ voice: two per style (default, confused, enunciated, happy, laughing, sad, whisper, singing), each presented as the original followed by the outputs of the eight HuBERT-based models (V1/2000_E/S+E, V1/2000_E/S, V1/2000_V1/S+E, V1/2000_V1/S, base/2000_E/S+E, base/2000_E/S, base/500_LS/S+E, base/500_LS/S).*
*Conversation samples resynthesized in the LJ voice: two per style (sad, angry, animal, fearful, child, fast, laughing, disgusted, projected, nonverbal), with the same columns as above.*