UNLOCKING SPEECH:


HOW AI MODELS DECODE SOUNDS LIKE US (AND BATS!)



In our latest episode, we sat down with Marianne de Heer Kloots, a PhD candidate at the Institute for Logic, Language, and Computation. Her research delves into the exciting intersection of linguistics, cognitive science, and artificial intelligence, specifically exploring how AI models process and represent human language. The conversation centered on speech models and the intricate ways AI tackles the complexities of sound. 

From Languages to AI and Back

Marianne's journey into this field is as multifaceted as her research. Initially drawn to linguistics by her interest in language differences, she specialized in language and cognition, exploring psycholinguistics and neurolinguistics. While computational linguistics initially seemed challenging, she was captivated by "making things happen in a computer" and pursued an AI Master's. However, her scientific curiosity remained rooted in understanding human language. This led her into another master's program and several projects that studied human language by examining things that are not human, or not language, including projects on how people learn miniature artificial languages and on the vocalizations of seal pups in a rescue center. Her PhD now brings these threads together by studying AI models using methodologies from the human cognitive and language sciences.

How AI Learns Without Being Taught

At the heart of Marianne's work are AI models, often Large Language Models (LLMs) or similar architectures designed for speech. These models don't "understand" language like humans do, but they learn to represent it mathematically. Imagine text or sound broken down into tiny pieces, each represented as a long list of numbers – a vector – placing it in a complex, multi-dimensional "map". Modern models, like Transformers, use components called "attention heads" to weigh the importance of different parts of the input when creating these representations.
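
To make the "list of numbers" idea concrete, here is a minimal sketch, assuming the Hugging Face transformers and torch packages and the publicly available English "facebook/wav2vec2-base" checkpoint (an illustrative choice, not the models discussed later in this post), of how a speech model turns a snippet of audio into one vector per short time frame:

```python
# A minimal sketch: extract per-frame vectors from a pretrained speech model.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# One second of silent 16 kHz audio stands in for a real recording here.
waveform = np.zeros(16000, dtype=np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# The audio becomes a sequence of vectors: one per ~20 ms frame, 768 numbers each.
print(outputs.last_hidden_state.shape)   # roughly (1, 49, 768) for one second
print(len(outputs.hidden_states))        # one set of vectors per processing layer
```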



Crucially, many of these models learn through "self-supervised learning". Instead of being explicitly told grammatical rules, they are trained on vast amounts of text or audio data and given simple tasks, like predicting the next word or guessing a masked-out section of audio. By learning to perform these tasks well, the models implicitly pick up on the underlying structures and patterns of language. This allows them to be trained on enormous datasets, like huge portions of the internet or many hours of spoken language recordings. 
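
As a toy illustration of the masked-prediction idea, the sketch below hides a random subset of hypothetical audio frames; the tensors and masking rate are made up, and the real Wav2Vec 2.0 objective is a contrastive loss over quantized targets rather than this simplified setup:

```python
# A toy sketch of masked prediction in self-supervised speech training.
import torch

torch.manual_seed(0)
frames = torch.randn(1, 50, 768)      # 50 hypothetical audio frames, 768-dim each
mask = torch.rand(1, 50) < 0.15       # hide roughly 15% of the frames
masked_input = frames.clone()
masked_input[mask] = 0.0              # the model only sees this corrupted version

# During training, the model's output at the masked positions is scored against
# information about the original frames; doing well at this forces it to pick up
# the regularities of speech, with no grammar rules ever spelled out.
print(int(mask.sum().item()), "of 50 frames masked")
```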


Why Speech Models Matter

While text-based AI has exploded, Marianne emphasizes the importance of studying models trained directly on audio. Why? Because humans primarily learn and use language through speech (or sign language), long before we learn to read or write. Text is a computer-centric way of processing language, but speech is how humans naturally do it. Models trained on audio must grapple with the same challenges humans face: mapping continuous sound signals to meaningful units. Furthermore, speech contains far more information than text – intonation, emotion, accent variations, and other subtle nuances of pronunciation (phonetics). 

Lessons from Bat Sounds

Exploring bat song syllable representations in self-supervised audio encoders

Marianne's research took an intriguing turn when she teamed up with bat scientist Mirjam Knörnschild to analyze how AI models, primarily trained on human speech, respond to bat vocalizations. The idea is that the way sounds can be produced is limited by the "instrument" – be it a human vocal tract, a bat's larynx, or a violin. Could there be shared acoustic properties between different vocal systems? 

The Songs of Bats

Using recordings of greater sac-winged bats (slowed down to be audible), they fed these complex "territorial songs," composed of different syllable types, into various AI audio models. They found that models trained specifically on human speech were better at distinguishing between different bat syllable types than models trained on music or general animal sounds. "Distinguishing" here means the models represented the same syllable types with similar vectors, clustering them together in their internal "map," separate from other syllable types. This suggests that learning to process the acoustic structures of human speech provides these models with tools that are surprisingly useful for analyzing bat calls, hinting at underlying similarities in vocal communication across species. 
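
As a rough illustration of what "clustering together in the internal map" can mean in practice, the sketch below uses made-up vectors and syllable labels with a simple cross-validated nearest-centroid check in scikit-learn; it is not the evaluation used in the actual study, only the general flavour of such an analysis:

```python
# If same-type syllables cluster together in a model's vector space, a simple
# nearest-centroid classifier on those vectors should score well.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
# Placeholder stand-ins: one 768-dim model vector per bat syllable token,
# plus a label saying which syllable type each token belongs to.
syllable_vectors = rng.normal(size=(200, 768))
syllable_types = rng.integers(0, 4, size=200)

# Higher cross-validated accuracy = tighter, better-separated syllable clusters.
score = cross_val_score(NearestCentroid(), syllable_vectors, syllable_types, cv=5).mean()
print(f"nearest-centroid accuracy: {score:.2f}")
```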

Peeking Inside the AI Brain: Explainable AI

https://mdhk.net/posters/learning-SSL-poster.pdf

Given the complexity of these AI models, how can we understand what they're truly learning about language? This challenge falls under the umbrella of Explainable AI. For researchers like Marianne, explainability is crucial for building trust and ensuring the AI systems used in real-world applications behave reliably and fairly. We need ways to verify that models used for tasks like speech transcription are focusing on the relevant linguistic information and not relying on sensitive or unwanted details, such as specific speaker identities, which might be encoded alongside the words themselves. Because these models learn implicitly through self-supervision, we don't have explicit control over exactly what they learn, making post-hoc analysis of trained models essential. 
One powerful set of explainable AI techniques involves probing the model's internal states. This means analyzing the vectors (the numerical representations) the model generates at its various processing layers. Researchers can train simple "auxiliary" classifier models that take these internal vectors as input and try to predict specific linguistic features – for example, classifying a sound segment as a particular phoneme or identifying the part of speech (noun, verb, etc.) of a word represented by a vector. Marianne used another probing technique, based on linear discriminant analysis, to map the model's complex, high-dimensional vector space onto a simpler, lower-dimensional space optimized to show clear separation between different linguistic categories. If such simple probes can successfully decode linguistic information from a specific layer's vectors, it suggests the model has learned to represent that linguistic information internally at that stage.
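
As a concrete sketch of probing, the snippet below trains a simple linear classifier and a linear discriminant analysis projection on hypothetical layer vectors with scikit-learn; the data, label set, and probe choices are placeholders rather than the exact setup used in Marianne's experiments:

```python
# Two probing techniques on hypothetical data: a linear classifier that tries to
# read out phoneme identity from one layer's vectors, and an LDA projection into
# a low-dimensional space chosen to separate the categories.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
layer_vectors = rng.normal(size=(1000, 768))      # vectors from one model layer
phoneme_labels = rng.integers(0, 40, size=1000)   # e.g. 40 phoneme categories

X_train, X_test, y_train, y_test = train_test_split(
    layer_vectors, phoneme_labels, test_size=0.2, random_state=0
)

# If this simple probe decodes phonemes well, the layer encodes that information.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# LDA maps the 768-dimensional space onto a few axes chosen to pull the phoneme
# categories apart, which makes the structure easy to visualize.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
projected = lda.transform(X_test)                 # shape: (n_test, 2)
```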

Marianne and her colleagues applied these probing techniques to a specific audio model architecture called Wav2Vec 2.0 (a type of Transformer model) that they trained on a large dataset – 960 hours – of spoken Dutch, including audiobooks and conversations. By feeding Dutch speech recordings into the trained model and extracting the internal vectors associated with specific phonemes and words, they uncovered a fascinating processing hierarchy across the model's 12 layers: 
  • Earliest Layers (closest to the raw audio input): Representations here were best at capturing basic acoustic structure. This makes intuitive sense, as these layers are closest to the incoming sound signal. 
  • Middle Layers (around layers 5 and 6): These layers showed progressively more abstract linguistic encoding. First, representations emerged that clearly distinguished phonemes (the basic sounds like 'p' or 'b'). Slightly later layers developed strong representations of syllables and their structure (e.g., consonant-vowel vs. consonant-vowel-consonant). 
  • Later Middle Layers (around layer 7): Following syllables, word-level information became prominent inside the model's representations, including syntactic features like part of speech (noun, verb) and potentially some aspects of word meaning (semantics). 

Intriguingly, the most distinct encoding of higher-level linguistic features (phonemes, syllables, words) was found in the middle layers, not in the final layer of the model. Why wouldn't the final layer show the most sophisticated understanding? Marianne suggests this might relate back to the model's self-supervised training objective. The Wav2Vec 2.0 model is ultimately trained to reconstruct masked-out parts of the lower-level acoustic signal (specifically, 20-millisecond audio frames). While extracting linguistic abstractions in the middle layers might be a useful intermediate step to achieve this goal, the final layers might need to shift focus back towards representing the fine-grained acoustic details required for the reconstruction task. The linguistic information is still used, but perhaps less explicitly represented right at the end. 
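
A layer-by-layer version of the same probing idea might look like the sketch below, again with made-up vectors standing in for real model activations; with genuine activations, the pattern described above would show up as probe accuracy peaking around the middle layers rather than at the very end:

```python
# Fit the same linear probe on every layer's vectors and see where a label
# (here, phoneme identity) is decoded best.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
num_layers, num_segments, dim = 12, 500, 768
layer_vectors = [rng.normal(size=(num_segments, dim)) for _ in range(num_layers)]
phoneme_labels = rng.integers(0, 10, size=num_segments)

for layer, X in enumerate(layer_vectors, start=1):
    probe = LogisticRegression(max_iter=1000)
    accuracy = cross_val_score(probe, X, phoneme_labels, cv=3).mean()
    print(f"layer {layer:2d}: probe accuracy {accuracy:.2f}")
```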

AI Perception: Echoes of Human Hearing?

Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0




This insight into how models process information layer by layer, coming to emphasize higher-level linguistic or semantic features in deeper layers even as precise acoustic distinctions become less sharply encoded, offers a compelling parallel to human perception. But why would AI models learn to encode higher-level linguistic abstractions at all, if their primary prediction objective is acoustic?

Consider the classic Jimi Hendrix song "Purple Haze". It contains a famous mondegreen: Hendrix sings "Excuse me while I kiss the sky", but many listeners mishear it as "Excuse me while I kiss this guy". Why does this happen? While there are many theories about why mondegreens occur, this one nicely illustrates how our knowledge of a language and the world might influence what we perceive. We may subconsciously know "kissing this guy" to be a more semantically plausible event, or a more probable phrase, than "kissing the sky". We then "hear" the phrase that makes more conventional sense, even if it wasn't what was actually sung. It shows that our perception isn't just a passive reception of sound; it's an active interpretation, shaped by our linguistic and semantic experience.

Do AI models also make use of learned linguistic information to represent speech sounds? As with humans, this ability might be particularly useful when the acoustic input is ambiguous or noisy. Marianne set out to study this using controlled experiments designed to test how linguistic context can affect the perception of individual speech sounds. Instead of relying on song lyrics, she and her colleagues created specific stimuli to first study a simpler phenomenon: how models might integrate phonotactic constraints (knowledge about the possible sequences of speech sounds in a language) when processing speech. Inspired by earlier experiments with human listeners, the stimuli consisted of non-word syllables in which the crucial consonant was acoustically ambiguous, engineered to sit somewhere between an 'l' sound and an 'r' sound. Despite the acoustic ambiguity, human listeners typically perceive either an 'l' or an 'r', rather than something in between.

To study the effect of phonotactic constraints, the experiment manipulated the sound preceding the ambiguous 'l'/'r'. Again following earlier experiments on human listeners, Marianne used preceding consonants that, according to English phonotactics, make one interpretation much more likely than the other. For instance, English words start with 'tr' (like 'tread') but not 'tl', making an 'r' sound more expected after 't'. Conversely, words can start with 'sl' (like 'sled') but not 'sr', making an 'l' sound more likely after 's'.

The experiment consisted of feeding non-word stimuli incorporating these biasing sounds – 't' followed by the ambiguous sound, or 's' followed by it – into the AI audio model (Wav2Vec 2.0) and studying its representations of the ambiguous sound.

By probing the model's internal representations, Marianne found that the AI model, much like humans, was indeed influenced by this preceding context. When the ambiguous sound followed a 't', the model was more likely to represent it internally in a way similar to how it represented an unambiguous 'r' sound. Conversely, a preceding 's' biased the model towards a more 'l'-like representation. This experiment demonstrated that even without being explicitly programmed with English phonotactic rules, the model learned to use surrounding sound context to interpret ambiguous acoustic signals, mirroring a key aspect of human speech perception. Just as importantly, it opens up an exciting area of research in which experiments designed to test human listeners can be adapted to study AI models.
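
One simple way to quantify such a context effect, sketched below with placeholder vectors rather than real model states, is to compare the ambiguous segment's representation in each context against prototype vectors for clear 'r' and clear 'l' sounds, for example with cosine similarity:

```python
# Compare the ambiguous segment's representation in each context against
# prototypes for clear 'r' and clear 'l' sounds (all vectors are placeholders).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
r_prototype = rng.normal(size=768)   # e.g. the mean vector over clear 'r' tokens
l_prototype = rng.normal(size=768)   # e.g. the mean vector over clear 'l' tokens

# Representations of the ambiguous sound after 't' and after 's' (placeholders;
# in practice these would be extracted from the model's hidden states).
ambiguous_after_t = rng.normal(size=768)
ambiguous_after_s = rng.normal(size=768)

for context, vec in [("after 't'", ambiguous_after_t), ("after 's'", ambiguous_after_s)]:
    print(context,
          "| r-similarity:", round(cosine(vec, r_prototype), 3),
          "| l-similarity:", round(cosine(vec, l_prototype), 3))
```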

Future Directions: AI as a Tool for Linguistics

Beyond similarities, AI models offer powerful tools for linguistics research. They can help generate hypotheses about language acquisition – for instance, about the order in which different linguistic structures start to be perceived (rather than produced), something difficult to probe directly in infants. Furthermore, researchers are now directly comparing the internal activity patterns of AI models processing speech to human brain activity (like EEG data) recorded during listening. Early results suggest that the way models integrate context is crucial for aligning with neural activity, reinforcing the importance of context in both artificial and biological speech processing.

Marianne de Heer Kloots's work highlights a dynamic interplay between AI and the study of human language. By building and dissecting AI models that learn from sound, we not only improve technology but also gain novel insights into the fundamental structures of communicative signals in humans, and potentially also in other species.




WE PLAY THE NICEST TUNES WHILE TALKING ABOUT THE COOLEST RESEARCH