Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky,
Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It builds on recent work in unsupervised spoken unit discovery, coupled with a dual-tower transformer architecture with cross-attention, trained on 2,000 hours of two-channel raw conversational audio (the Fisher dataset) without any text or labels. The model generates speech, laughter, and other paralinguistic signals in both channels simultaneously and reproduces naturalistic turn-taking.
In conditional generation, the system encodes a stereo waveform prompt into two parallel streams of discrete units (or pseudo-text), which are fed to the Dialogue Language Model (DLM), a system that attends to both unit streams via cross-attention. The DLM then generates new pseudo-text and feeds it to the decoder to produce a new waveform. The entire pipeline is trained without supervision or text.
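The dual-tower idea described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: multi-head projections, feed-forward layers, normalization, and causal masking are omitted, and all function names are hypothetical. Each channel's stream self-attends, then cross-attends to the other channel's stream.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dual_tower_layer(x_a, x_b):
    """One simplified dual-tower layer: each stream self-attends,
    then cross-attends to the other stream's representations."""
    h_a = attention(x_a, x_a, x_a)          # self-attention, channel A
    h_b = attention(x_b, x_b, x_b)          # self-attention, channel B
    out_a = h_a + attention(h_a, h_b, h_b)  # cross-attention A -> B
    out_b = h_b + attention(h_b, h_a, h_a)  # cross-attention B -> A
    return out_a, out_b

# two parallel streams of unit embeddings, shape (T, d)
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 16))
b = rng.normal(size=(10, 16))
ya, yb = dual_tower_layer(a, b)
print(ya.shape, yb.shape)  # (10, 16) (10, 16)
```

Because the two towers share this cross-attention at every layer, each channel's predictions can depend on what the other speaker is doing, which is what allows overlapping laughter and backchannels to be modeled.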
Below are samples from our best model, compared to the ground truth. Note that the synthesized speakers differ from the original ones, but we deliberately chose speakers of the same gender as in the original speech. You will hear a ding sound marking the end of the prompt.
|ID|Original speech|Synthesized speech|
|Prompt|Ground Truth|dGSLM (Continuation 1)|dGSLM (Continuation 2)|
In unconditional generation, two parallel streams of pseudo-text are sampled from the DLM and synthesized into a waveform.
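The unconditional sampling loop can be sketched as follows. This is a toy stand-in, not the actual system: in the real DLM the per-step next-unit distributions are produced by the dual-tower transformer conditioned on both streams' histories, and a unit-to-speech vocoder turns each sampled stream into one audio channel. Here random logits replace the model's predictions, and all names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_two_streams(num_steps, vocab_size=500, seed=0):
    """Toy unconditional generation: at each step, sample the next
    discrete unit for each of the two channels from a categorical
    distribution (random logits stand in for the DLM's predictions)."""
    rng = np.random.default_rng(seed)
    stream_a, stream_b = [], []
    for _ in range(num_steps):
        p_a = softmax(rng.normal(size=vocab_size))
        p_b = softmax(rng.normal(size=vocab_size))
        stream_a.append(int(rng.choice(vocab_size, p=p_a)))
        stream_b.append(int(rng.choice(vocab_size, p=p_b)))
    # a vocoder would then synthesize each unit stream into a waveform channel
    return stream_a, stream_b

a, b = sample_two_streams(20)
print(len(a), len(b))  # 20 20
```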
Generation 1 (dGSLM)
Generation 2 (dGSLM)