Linear Probes in Mechanistic Interpretability

Finally, good probing performance would hint at the presence of the said …

Abstract: This paper introduces the Manifold Probe, a supervised method for discovering representation manifolds in superposition.

Learn about mechanistic interpretability, named an MIT 2026 Breakthrough Technology.

This is evidence for the linear representation hypothesis: that models, in general, compute features and represent them linearly, as directions in space! (If they don't, mechanistic …

State of the field: modern mechanistic interpretability has repeatedly found that a transformer's internal states can be "decoded" by extremely simple linear operations: a linear probe can extract semantic attributes from the hidden state (Alain & Bengio 2016, Belinkov …

Mechanistic interpretability aims to reverse-engineer neural networks into human-legible algorithms.

They allow us to understand if the numeric representation …

Mechanistic interpretability [14], [16] attempts to discover specific circuits within models; many of these studies [15], [17] have been conducted on the GPT-2 model, which is large enough to be interesting …

This is a talk I gave to my MATS scholars, with a stylised history of the field of mechanistic interpretability, as I see it (with a focus on the areas I've personally worked in, rather than …

The meta-level point that makes me excited about this is that linear probes are really nice objects for interpretability. Fundamentally, transformers are made of linear algebra!

Covers circuit tracing, sparse autoencoders, attribution graphs, and …

We can also derive additional information. Linear probes and classifiers: we can build a system that classifies the recorded residual stream …

Linear Probes: Train simple linear models on internal representations to determine what information is encoded at each layer.
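The idea of training a simple linear model on internal representations can be sketched as follows. This is a minimal illustration, not any particular paper's setup: the activations are synthetic (Gaussian noise plus a planted direction) rather than recorded from a real model, and the helper names `make_synthetic_activations`, `train_linear_probe`, and `probe_accuracy` are hypothetical. With a real model, one would record hidden states at each layer and fit one such probe per layer.

```python
import numpy as np

def make_synthetic_activations(n=400, d=16, signal=2.0, seed=0):
    """Hypothetical stand-in for recorded hidden states: Gaussian noise plus a
    label-dependent shift along one random unit direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    shift = signal * (2 * labels[:, None] - 1) * direction
    acts = rng.normal(size=(n, d)) + shift
    return acts, labels, direction

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit logistic regression (weights w, bias b) by full-batch gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels              # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    preds = (acts @ w + b) > 0
    return float((preds == labels).mean())

acts, labels, direction = make_synthetic_activations()
train, test = slice(0, 300), slice(300, 400)
w, b = train_linear_probe(acts[train], labels[train])
test_acc = probe_accuracy(w, b, acts[test], labels[test])
# Cosine similarity between the learned probe weights and the planted direction.
cosine = float(w @ direction / np.linalg.norm(w))
print(f"held-out accuracy: {test_acc:.2f}, cosine with planted direction: {cosine:.2f}")
```

High held-out accuracy here only shows that the label is linearly decodable from these activations; it does not by itself show that a model uses that direction.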
Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of …

Open-source mechanistic interpretability: production probes (FabricationGuard, agent-probe-guard), MCP server that lets Claude Code / Cursor / Cline run interp on your Colab, ProbeBench …

Truthfulness probing demonstrates a pattern that recurs throughout the curriculum: a concept has linear structure in activation space, probes can detect it, and interventions along the identified direction …

Probe performance could reflect the probe's own capabilities more than actual characteristics of the representation.

The method generalizes linear regression probes by learning the space of …

Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 …

Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature …

The International Conference on Learning Representations (ICLR) is one of the top machine learning conferences in the world.

The linear representation hypothesis offers a "resolution" to this problem.

Linear probing and non-linear probing are good ways to check whether certain properties are decodable from feature space (linearly separable, in the case of a linear probe), and they are good indicators that this information could be …

To address these questions, we extract activation vectors from the residual stream of four state-of-the-art open-weights LLMs and train linear probes at each layer to classify Bloom levels.

We plan to incorporate …

Linear probes: baseline tool; often used to verify a mech-interp claim by checking what info is decodable from an activation.
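The caution that probe performance may reflect the probe's own capabilities can be illustrated with a shuffled-label control. The sketch below is an assumption-laden toy, not a method taken from any of the sources quoted here: the activations are synthetic, the dimensions are deliberately chosen so that there are more features than training examples, and `fit_least_squares_probe` and `accuracy` are hypothetical helper names. In that regime even a linear probe can fit random labels on the training set, so only held-out accuracy, compared against the control, says anything about the representation.

```python
import numpy as np

def fit_least_squares_probe(acts, labels):
    """Fit a linear probe with a bias term by least squares on +/-1 targets.
    np.linalg.lstsq returns the minimum-norm solution when features outnumber examples."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    targets = 2.0 * labels - 1.0
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w

def accuracy(w, acts, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    return float(((X @ w > 0) == (labels == 1)).mean())

rng = np.random.default_rng(0)
n_train, n_test, d = 60, 200, 200      # deliberately more dimensions than training points
n = n_train + n_test
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
# Synthetic "activations": noise plus a label-dependent shift along one direction.
acts = rng.normal(size=(n, d)) + 3.0 * (2 * labels[:, None] - 1) * direction
# Control task: same activations, labels randomly permuted so they carry no signal.
control_labels = rng.permutation(labels)

tr, te = slice(0, n_train), slice(n_train, n)
results = {}
for name, y in [("real", labels), ("control", control_labels)]:
    w = fit_least_squares_probe(acts[tr], y[tr])
    results[name] = (accuracy(w, acts[tr], y[tr]), accuracy(w, acts[te], y[te]))
    print(f"{name:8s} train acc: {results[name][0]:.2f}  held-out acc: {results[name][1]:.2f}")
```

The gap between real and control held-out accuracy is the quantity of interest; training accuracy alone would make both tasks look equally "decodable".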
Three strands have emerged: circuit analysis, which identifies architectural components that …

We just visualised an LLM's inner thoughts! (kind of) Anthropic has a line of mechanistic interpretability work that decodes the activation vectors inside a language model back into natural …

Mechanistic Interpretability: Activation Steering. Activation steering is a technique for causally intervening on model behavior by adding or subtracting directions in activation space.

Non-linear probes have been alleged to have this property, and that is why a linear probe is entrusted with this task.

Linear probes are simple, independently trained linear classifiers added to intermediate layers to gauge the linear separability of features.

Activation steering: the action arm of mech interp.

The 2026 event will be held in Rio de Janeiro, Brazil, …

Probing classifiers are an Explainable AI tool used to make sense of the representations that deep neural networks learn for their inputs.

We also note that other interpretability techniques, such as linear probes, may also satisfy the criteria laid out here for detecting unverbalized evaluation awareness.

Interpretability Illusions in the Generalization of Simplified Models: Shows how interpretability methods based on simplified models (e.g. linear probes etc.) can …
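Activation steering, as described above, amounts to adding or subtracting a direction in activation space and observing the downstream effect. A minimal sketch under strong simplifying assumptions: the activations are synthetic, the steering direction is estimated as a difference of class means (one common choice, assumed here rather than taken from the sources), and the "downstream model" is a toy linear readout `w_head`. The names `difference_of_means_direction`, `steer`, and `readout` are hypothetical.

```python
import numpy as np

def difference_of_means_direction(pos, neg):
    """Unit vector pointing from the mean 'negative' activation to the mean 'positive' one."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(acts, direction, alpha):
    """Add alpha times the steering direction to every activation vector."""
    return acts + alpha * direction

def readout(acts, w):
    """Toy downstream head: one scalar score per activation vector."""
    return acts @ w

rng = np.random.default_rng(0)
d = 32
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)
# Hypothetical activations for inputs with (pos) and without (neg) some concept.
pos = rng.normal(size=(100, d)) + 2.0 * concept
neg = rng.normal(size=(100, d)) - 2.0 * concept

steering_vec = difference_of_means_direction(pos, neg)
w_head = concept                       # toy assumption: the head reads the planted direction

base = float(readout(neg, w_head).mean())
steered_up = float(readout(steer(neg, steering_vec, 4.0), w_head).mean())
steered_down = float(readout(steer(pos, steering_vec, -4.0), w_head).mean())
print(f"neg baseline: {base:.2f}, neg steered up: {steered_up:.2f}, pos steered down: {steered_down:.2f}")
```

In a real transformer the addition would be applied to the residual stream at a chosen layer during the forward pass, and the effect would be measured on the model's outputs rather than on a hand-built readout.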
© Copyright 2026 St Mary's University