Generative Spoken Language Modeling

Generative Spoken Language Modeling from Raw Audio

Kushal Lakhotia*, Evgeny Kharitonov*, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte,
Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Adelrahman Mohamed, Emmanuel Dupoux

In press in Transactions of the Association for Computational Linguistics

[Paper]

We introduce generative spoken language modeling, the task of jointly learning the acoustic and linguistic characteristics of a language from raw audio (without text), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) and validate the proposed metrics with human evaluation. Across unsupervised speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependant way, and that some combinations approach text-based topline systems.

* Equal contribution.

Conditional Examples

In conditional generation, the system encodes a waveform prompt into pseudo-text, which is fed to the language model from which a continuation is sampled. The resulting extended pseudo-text is then fed to the decoder to produce a new waveform. The entire pipeline is trained without supervision or text. Below are samples from our worst and best models, compared to a supervised system trained from text.

	Prompts	Unsupervised (worst)	Unsupervised (best)	Supervised
Encoder	---	LogMel	HuBERT	Characters
# of units	---	100	100	28
0
1

Unconditional Examples

In unconditional generation, a pseudo-text is sampled from the language model and synthetized into a waveform.

	Unsupervised (worst)	Unsupervised (best)	Supervised
Encoder	LogMel	HuBERT	Characters
# of units	100	100	28
0
1

Samples

More samples are provided below, as a function of the dataset used to train the voice, the number of units and the generation task. Four different encoder types can be compared (CPC, HuBERT, LogMel and Wav2Vec2) and a supervised character-based system.

Choose a dataset:

Choose number of units:

Choose a generation task:

Loading......

Sample paged based on HiFi-GAN page.