Language models trained on large text corpora have made tremendous progress in recent years and are used in a variety of Natural Language Processing (NLP) applications. Independently, recent breakthroughs in representation learning have yielded models able to discover discrete units from raw audio without any labeled data. Connecting these two lines of work opens up the possibility of applying language models directly to audio inputs, sidestepping the need for textual resources or Automatic Speech Recognition (ASR), and opening up a new era of textless NLP. This may seem an unachievable objective, but preschool children provide a proof of principle that it is possible to master a great deal about language from raw sensory inputs and interaction alone, without any text. Inspired by this remarkable learning ability, we introduce our first baby steps in this exciting new area.
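The pipeline above (discover discrete units from raw audio, then train a language model over them) can be sketched conceptually as follows. This is a toy illustration only: the function names, a random codebook standing in for learned self-supervised representations, and a bigram counter standing in for the actual transformer LM are all hypothetical simplifications, not the GSLM implementation.

```python
import numpy as np

def discover_units(audio_features, codebook):
    """Quantize continuous audio features into discrete 'pseudo-text'
    units by nearest-neighbor lookup in a codebook (a stand-in for a
    learned self-supervised quantizer)."""
    # Distance between each feature frame and each codebook entry.
    dists = np.linalg.norm(
        audio_features[:, None, :] - codebook[None, :, :], axis=-1
    )
    return dists.argmin(axis=1)  # one discrete unit per frame

def unit_bigram_counts(units, vocab_size):
    """Collect bigram counts over the discrete units -- a toy stand-in
    for the autoregressive language model trained in practice."""
    counts = np.zeros((vocab_size, vocab_size))
    for prev, nxt in zip(units[:-1], units[1:]):
        counts[prev, nxt] += 1
    return counts

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))    # 4 hypothetical units, 8-dim features
features = rng.normal(size=(50, 8))   # 50 frames of dummy "audio" features
units = discover_units(features, codebook)
lm = unit_bigram_counts(units, vocab_size=4)
```

In the real system, the discrete unit sequence would then be fed to a generative model, and a separate unit-to-speech module would synthesize audio from generated units.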
Generative Spoken Language Modeling from Raw Audio (GSLM): [demo, paper, code].
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations: [demo, paper, code].
Text-Free Prosody-Aware Generative Spoken Language Modeling: [demo, paper].
Despite their growing range of applications, NLP technologies are limited in scope by the availability of massive quantities of text, which exists for only a handful of economically dominant languages. This leaves out the majority of the world's languages, which have few such resources. First, achieving 'textless NLP' would therefore make AI applications more inclusive. Second, even for resource-rich languages, oral language carries many nuances, intonations (irony, anger, uncertainty, etc.) and expressive vocalizations (laughter, yawning, mouth clicks, etc.) that are not captured by text. Modeling language directly from audio has the potential to make AI applications more natural and expressive. Third, while text is still the dominant form of language on the web, a growing amount of audio-based content, such as podcasts, local radio, social audio apps, and online video games, opens up a large vista of audio-first experiences that could be built on top of such content without needing to annotate it to train an ASR system.
This represents the work of a multi-disciplinary team of researchers with expertise in signal processing, speech processing, natural language processing, and psycholinguistics from Facebook AI Research in Paris, Tel Aviv, New York, and Seattle.