In press in Transactions of the Association for Computational Linguistics
[Paper]
 
            We introduce generative spoken language modeling, the task of jointly learning the acoustic and linguistic characteristics of a language from raw audio (without text), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) and validate the proposed metrics with human evaluation. Across unsupervised speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependant way, and that some combinations approach text-based topline systems.
* Equal contribution.
In conditional generation, the system encodes a waveform prompt into pseudo-text, which is fed to the language model from which a continuation is sampled. The resulting extended pseudo-text is then fed to the decoder to produce a new waveform. The entire pipeline is trained without supervision or text. Below are samples from our worst and best models, compared to a supervised system trained from text.
| Prompts | Unsupervised (worst) | Unsupervised (best) | Supervised | |
| Encoder | --- | LogMel | HuBERT | Characters | 
|---|---|---|---|---|
| # of units | --- | 100 | 100 | 28 | 
| 0 | ||||
| 1 | 
In unconditional generation, a pseudo-text is sampled from the language model and synthetized into a waveform.
| Unsupervised (worst) | Unsupervised (best) | Supervised | |
| Encoder | LogMel | HuBERT | Characters | 
|---|---|---|---|
| # of units | 100 | 100 | 28 | 
| 0 | |||
| 1 | 
More samples are provided below, as a function of the dataset used to train the voice, the number of units and the generation task. Four different encoder types can be compared (CPC, HuBERT, LogMel and Wav2Vec2) and a supervised character-based system.