Expresso

A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

[Paper]   [Dataset]   [Code]

Tu Anh Nguyen*, Wei-Ning Hsu*, Antony D'Avirro*, Bowen Shi*,
Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet,
Gabriel Synnaeve, Michael Hassid, Felix Kreuk,
Yossi Adi+, Emmanuel Dupoux+

Meta AI Research

We introduce Expresso, a high-quality (48kHz) expressive speech dataset that includes both expressively rendered read speech (8 styles, in mono wav format) and improvised dialogues (26 styles, in stereo wav format). The dataset includes 4 speakers (2 male, 2 female) and totals about 46 hours (11.5h read, 34.4h improvised). Transcriptions of the read speech are also provided. The task of the Expresso Benchmark is to resynthesize the input audio using a low-bitrate discrete code obtained without supervision from text.

Here, we provide illustrative samples of the Expresso dataset and of resynthesis results from baseline systems using discrete HuBERT and Encodec units.
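
To make "low-bitrate" concrete, here is a back-of-the-envelope bitrate calculation in Python. The frame rates and codebook sizes are the commonly cited values for these model families (50 Hz HuBERT units; 75 Hz Encodec frames with 1024-entry codebooks) and are stated here as assumptions, not as measurements from the benchmark:

    import math

    # bitrate (bits/s) = frame rate (frames/s) * bits per frame, where
    # bits per frame = log2(codebook size), summed over codebooks.
    hubert_bps   = 50 * math.log2(2000)       # 50 Hz units, k=2000  -> ~548 bps
    encodec1_bps = 75 * math.log2(1024)       # 1 codebook at 75 Hz  ->  750 bps
    encodec8_bps = 8 * 75 * math.log2(1024)   # 8 codebooks at 75 Hz -> 6000 bps

    print(f"{hubert_bps:.0f} / {encodec1_bps:.0f} / {encodec8_bps:.0f} bps")

All of these sit two to three orders of magnitude below the raw 48kHz/24-bit recordings (about 1.15 Mbps per channel).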

1. The Expresso Dataset


Data Styles and Statistics


The Expresso dataset contains 8 read styles (including narration) and 26 improvised styles, totaling 11.5 hours of read speech and 34.4 hours of improvised speech.

Style            Read (min)   Improvised (min)   Total (h)
angry            --           82                 1.4
animal           --           27                 0.4
animal_directed  --           32                 0.5
awe              --           92                 1.5
bored            --           92                 1.5
calm             --           93                 1.6
child            --           28                 0.4
child_directed   --           38                 0.6
confused         94           66                 2.7
default          133          158                4.9
desire           --           92                 1.5
disgusted        --           118                2.0
enunciated       116          62                 3.0
fast             --           98                 1.6
fearful          --           98                 1.6
happy            74           92                 2.8
laughing         94           103                3.3
narration        21           76                 1.6
non_verbal       --           32                 0.5
projected        --           94                 1.6
sad              81           101                3.0
sarcastic        --           106                1.8
singing          --           4                  0.1
sleepy           --           93                 1.5
sympathetic      --           100                1.7
whisper          79           86                 2.8
Total            11.5h        34.4h              45.9h

The audio was recorded in a professional recording studio with minimal background noise at 48kHz/24bit. The files for read speech and singing are in mono wav format; the files for the dialogue section are in stereo (one channel per actor), preserving the original flow of turn-taking.
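
As an aside on the stereo files: since each channel corresponds to one actor, per-speaker tracks can be recovered by splitting the channels. A minimal sketch, assuming the soundfile package (the file name is made up for illustration):

    import soundfile as sf

    # Stereo improvised dialogue: one actor per channel, recorded at 48 kHz.
    audio, sr = sf.read("improvised_angry_dialogue.wav")   # shape (num_samples, 2)
    actor_a, actor_b = audio[:, 0], audio[:, 1]            # split the two actors

    sf.write("actor_a.wav", actor_a, sr)
    sf.write("actor_b.wav", actor_b, sr)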

We also provide the transcriptions for the read speech section in the dataset.


Samples from Expresso Dataset


We present here samples from the Expresso dataset, selected to illustrate its diversity of expression.

Read Short Sentences

This section comprises short sentences read by speakers in 7 different styles: default, confused, enunciated, happy, laughing, sad and whisper.

[Audio samples: one recording per style (default, confused, enunciated, happy, laughing, sad, whisper) for each of the speakers ex01, ex02, ex03 and ex04]

Read Long-form

In this section, the speakers were asked to expressively read long-form narrative pieces (in the narration style) or a news article (in the default style). We limit the samples here to two speakers and 20 seconds.

[Audio samples: narration-style long-form readings by speakers ex01 and ex04]

Read Emphasis

The dataset also includes an emphasis substyle of default, in which the speakers are asked to emphasize different words in the sentence. Emphasized words/spans are enclosed in asterisks (see the parsing sketch after the examples).

[Each text below is paired with an audio sample]
Was this element *always* there?
Was *this* element always *there?*
Maybe *you* should change your priorities.
Maybe you should change your *priorities*.
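
As a small aid for working with these transcripts, a minimal parsing sketch follows. It assumes only that emphasis is marked by paired asterisks, as in the examples above; the helper name is ours, not part of the dataset tooling:

    import re

    def emphasized_spans(text: str) -> list[str]:
        # Return every word/span enclosed in paired asterisks.
        return re.findall(r"\*([^*]+)\*", text)

    print(emphasized_spans("Was *this* element always *there?*"))
    # -> ['this', 'there?']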

Singing

We also provide a small (4-minute) set in the singing style, where the speakers were asked to sing songs a cappella. We limit the samples here to 20 seconds, and the set is restricted to copyright-free songs.

[Audio samples: singing by speakers ex02 and ex03]

Improvised Conversations

In this section, the speakers are prompted to improvise a dialogue with another speaker in an imaginary situation (e.g., two drivers arguing about a car accident). To illustrate the diversity of the dataset, we selected two representative conversations for 9 of the available styles. The files are in stereo, with one speaker per channel. We limit the samples here to 30 seconds.

[Audio samples: two conversations each for the styles angry, animal/animal_directed, child/child_directed, disgusted, laughing, projected, sad/sympathetic, sarcastic and non_verbal]

Data Distribution and License


The Expresso dataset can be downloaded via the [Dataset] link above.

LICENSE. The Expresso dataset is distributed under the CC BY-NC 4.0 license.


2. Resynthesis Experiments


Here, we present resynthesis results for two classes of models: Encodec-based and HuBERT-based. The Encodec-based models were trained directly with a compression objective to encode and decode speech using a single codebook (Encodec 1) or 8 codebooks (Encodec 8). The HuBERT-based models use a pretrained HuBERT encoder: V1, our best model, was trained on a mixture of read and spontaneous corpora; the base model was trained on LibriSpeech. The HuBERT embeddings (layer 12 for V1, layer 9 for base) are then discretized with k-means clustering (k=2000 or 500) fitted on either Expresso (E) or the same training set as the HuBERT model (V1 or LS, respectively). Finally, a HiFi-GAN vocoder trained on Expresso, VCTK and LJ Speech converts the discrete units back to speech. The vocoder is conditioned either on the speaker (S) or on both speaker and expression (S+E). Our best configuration is V1/2000_E/S+E, listed first among the HuBERT variants below.
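
For concreteness, here is a minimal sketch of the HuBERT half of this pipeline. It stands in the public facebook/hubert-base-ls960 checkpoint from HuggingFace for our encoders, fits k-means on a single clip purely for illustration (the actual pipeline fits k=500 or 2000 offline on a full corpus), and leaves the HiFi-GAN decoder as a hypothetical call:

    import torch
    import torchaudio
    from sklearn.cluster import KMeans
    from transformers import HubertModel

    wav, sr = torchaudio.load("speech.wav")                  # any input utterance
    wav = wav.mean(0, keepdim=True)                          # downmix to mono if needed
    wav = torchaudio.functional.resample(wav, sr, 16_000)    # HuBERT expects 16 kHz

    model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
    with torch.no_grad():
        hidden = model(wav, output_hidden_states=True).hidden_states
    feats = hidden[9][0]                                     # layer 9 for the base model

    # Illustration only: k is reduced so one short clip has enough frames to fit.
    kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())
    units = kmeans.predict(feats.numpy())                    # discrete unit ids at 50 Hz

    # speech = hifigan(units, speaker_id, style_id)          # hypothetical vocoder call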

We present representative samples from the most difficult expressions in the dev set of the Expresso dataset, along with the resynthesized outputs of our encoder/decoder models, in order to illustrate the challenges of the dataset.

We divide the samples into Read and Conversation sections. The singing examples illustrate the fact that HuBERT units do not encode pitch, and many extreme pitch excursions (as in the child or animal styles) result in voiceless renderings. In contrast, Encodec units encode all expressions well.

Same Speaker Resynthesis

Here, the decoder is conditioned on the same speaker and style as the input speech.

[Audio grid (read styles): for two samples each of default, confused, enunciated, happy, laughing, sad, whisper and singing, the original recording alongside resyntheses from Encodec 8, Encodec 1, V1/2000_E/S+E, V1/2000_E/S, V1/2000_V1/S+E, V1/2000_V1/S, base/2000_E/S+E, base/2000_E/S, base/500_LS/S+E and base/500_LS/S]

[Audio grid (conversation styles): the same models on two samples each of sad, angry, animal, fearful, child, fast, laughing, disgusted, projected and nonverbal]

Out-of-domain Speaker Resynthesis

In this section, the vocoder is asked to reconstruct an expressive sample from Expresso using a speaker voice from outside the Expresso dataset (the LJ Speech voice), which was therefore never recorded with these particular expressions.

This tests whether the Expresso styles generalize beyond the small set of 4 speakers to standard TTS voices. Note that the LJ voice can be made to laugh and whisper and, to a certain extent, to change its tone of voice according to the required expressive style.

Whereas HuBERT units are speaker-independent and can be used to resynthesize a message in a target voice different from the input voice, Encodec units represent their input holistically, with all of its acoustic characteristics including voice, and therefore cannot perform voice transfer. This is why we only present results for HuBERT models here.
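
Schematically, reusing the hypothetical hifigan call from the pipeline sketch above, voice transfer amounts to decoding the same units under a different speaker condition (all identifiers below are illustrative, not a released API):

    # `units` extracted from an Expresso utterance, as in the sketch above.
    in_domain = hifigan(units, speaker_id="ex01", style_id="laughing")  # original voice
    transfer  = hifigan(units, speaker_id="LJ",   style_id="laughing")  # LJ voice "laughs"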

[Audio grid (read styles): for two samples each of default, confused, enunciated, happy, laughing, sad, whisper and singing, the original recording alongside LJ-voice resyntheses from V1/2000_E/S+E, V1/2000_E/S, V1/2000_V1/S+E, V1/2000_V1/S, base/2000_E/S+E, base/2000_E/S, base/500_LS/S+E and base/500_LS/S]

[Audio grid (conversation styles): the same models on two samples each of sad, angry, animal, fearful, child, fast, laughing, disgusted, projected and nonverbal]