Language models trained on large corpora of text have made tremendous progress in recent years and are used in a variety of Natural Language Processing (NLP) applications. Independently, recent breakthroughs in representation learning have yielded models able to discover discrete units from raw audio without any labeled data. Connecting these two lines of work opens up the possibility of applying language models directly to audio inputs, sidestepping the need for textual resources or Automatic Speech Recognition (ASR), and ushering in a new era of textless NLP. This may seem an unachievable objective, but preschool children provide a proof of principle that it is possible to learn a great deal about language from raw sensory inputs and interactions alone, without any text. Inspired by this remarkable learning ability, we introduce our first baby steps in this exciting new area.