Tu Anh Nguyen (a,1,2), Benjamin Muller (a,1), Bokai Yu (a,1),
Marta R. Costa-jussa (b,1), Maha Elbayad (b,1), Sravya Popuri (b,1),
Paul-Ambroise Duquenne (b,1), Robin Algayres (b,3), Ruslan Mavlyutov (b,1), Itai Gat (b,1),
Gabriel Synnaeve (c,1), Juan Pino (c,1), Benoît Sagot (c,2), Emmanuel Dupoux (c,1,3)
(1) Meta AI, (2) Inria, Paris, (3) EHESS, ENS-PSL, CNRS, Paris
{ntuanh, benjaminmuller, bokai, dpx}@meta.com
(a), (b), (c): Equal contribution as co-first, co-second, and co-last authors, respectively.
We introduce SpiRit-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuing its training on text and speech units. Speech and text sequences are concatenated into a single stream of tokens, and the model is trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SpiRit-LM comes in two versions: a Base version that uses speech semantic units and an Expressive version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SpiRit-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
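The word-level interleaving and modality tags described above can be illustrated with a small sketch. It is not the authors' implementation: the `[Text]` and `[Speech]` tags come from this page, but the `[Hu<id>]` rendering of speech units, the random switching rule, the leading tag on every sequence, and the names `interleave` and `build_prompt` are assumptions made for illustration. Whole words also stand in for subword BPE tokens.

```python
import random
from typing import List, Tuple

TEXT_TAG = "[Text]"      # modality tags named on this page
SPEECH_TAG = "[Speech]"

# One aligned word: its text and the speech unit ids covering it.
# The alignment itself is assumed to be given (hypothetical input format).
AlignedWord = Tuple[str, List[int]]


def interleave(words: List[AlignedWord], switch_prob: float,
               rng: random.Random) -> List[str]:
    """Build a single token stream, switching modality only at word boundaries.

    The switching rule (flip with probability `switch_prob` before each word)
    is an assumption, not the rule used by the authors.
    """
    tokens: List[str] = []
    modality = rng.choice(["text", "speech"])
    for i, (word, units) in enumerate(words):
        switched = i > 0 and rng.random() < switch_prob
        if switched:
            modality = "speech" if modality == "text" else "text"
        if i == 0 or switched:
            # Emit a tag at the start and at every change of modality.
            tokens.append(TEXT_TAG if modality == "text" else SPEECH_TAG)
        if modality == "text":
            tokens.append(word)  # stand-in for the word's BPE subwords
        else:
            tokens.extend(f"[Hu{u}]" for u in units)  # assumed unit format
    return tokens


def build_prompt(prompt_tokens: List[str], prompt_modality: str,
                 target_modality: str) -> List[str]:
    """Prefix the prompt with its tag; append the target tag on a modality change."""
    tag = {"text": TEXT_TAG, "speech": SPEECH_TAG}
    out = [tag[prompt_modality]] + list(prompt_tokens)
    if target_modality != prompt_modality:
        out.append(tag[target_modality])
    return out


if __name__ == "__main__":
    aligned = [("the", [12, 7]), ("cat", [33]), ("sat", [4, 4, 19])]
    print(interleave(aligned, switch_prob=0.5, rng=random.Random(0)))
    print(build_prompt(["one", "two", "three"], "text", "speech"))
```

Under these assumptions, a text prompt that should be continued in speech ends with the `[Speech]` tag, while a same-modality continuation carries no extra tag.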
We prompt the model with either text or speech and generate in either the same or a different modality.
We provide the model with a [Text] or [Speech] tag to signal a change of modality.
The (cherry-picked) samples illustrate the capacity of the model to continue a prompt in a semantically coherent fashion across modalities.
Prompt | Continuation |
---|---|
one two three four five | |
| | f g h i j k l m n o p q r s t u v w x y z why don't we go back to the beginning and start all over again? Hello and welcome back to the improve your English podcast. This is episode 39 and I'm Jack from Jack Sullivan english.com and I'm Luke from Luke's English.com. And if you want to improve your English, then this is the right podcast for you because you're learning new vocabulary. (...) |
The largest country in the world is | |
| | the northwest corner of Wyoming. It is located in the Greater Yellowstone area and is one of the most popular National Parks in the States. It is visited by over three million people each year. Yellowstone was established as the first national park in the United States on March 1st, 1872 the lake is sometimes referred to as the largest high-elevation lake in the world. It is the largest alkaline lake in the world and the second-largest freshwater lake in the United States after Lake Michigan. |
Here's a story about a llama that can speak: | |
We prompt the model with an expressive prompt in either the text or the speech modality. For speech prompts, we use the same prompt content with different speaking styles to express the emotion. The (cherry-picked) examples show that SpiRit-LM-Expressive is able to transfer the expressive style of the prompt onto the continuation, both within a modality and across a change in modality.
Prompt | Continuation |
---|---|
I am absolutely thrilled to be embarking on this new journey, it's going to be an incredible adventure and | |
This is unbelievable! My brand new car, ruined! How can someone be so careless and disrespectful?! Seriously?! | |
I can't believe she's gone. I don't know how to cope without her. The pain of losing her is overwhelming. I feel so lost without her | |
Did you hear that? What is that sound? I'm really scared. It's so dark, and that noise... it sounds so creepy. | |
| | but I will say I did see this movie a time ago. It has nothing to do with it. Okay. Okay, but I will say that you know what? It's the second-best Tom Cruise. Movie I've seen |
| | Yeah, it's a movie where they're running around in the rain and it's very noisy and you have to be able to hear over that. That's not a good strategy. That's why I'm |