Can a Neuroscientist Understand an LLM?
Or a microprocessor. Or a brain. Or none?

Some years ago, in “Could a Neuroscientist Understand a Microprocessor?” (a title mimicking the older and even more famous “Can a Biologist Fix a Radio?”), Jonas and Kording explored, in a brilliant and only slightly humorous way, whether and how a neuroscientist could understand a microprocessor.
The paper treated a microprocessor as a proxy for what a (simplified) brain does: taking inputs (keyboard clicks), maintaining some notion of memory, producing outputs (pixels on the screen). The authors used available emulators to run a full simulation of a simple microprocessor, which allowed them to put to work a plethora of techniques from the neuroscientist’s toolkit: they computed single-cell tunings, performed “transistor ablations” (akin to lesions of brain areas or cell types), analysed local-field-potential-like signals, applied dimensionality reduction to population activity, and looked for correlations between transistor states and behavioral outputs.
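As a toy illustration of that toolkit, here is a minimal sketch, with entirely simulated data and made-up names, of two of those analyses: single-unit “tunings” (correlation of each unit with a behavioural variable) and dimensionality reduction of population activity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "recording": 200 time steps from 50 units, plus a 1-D
# behavioural variable (all of this is made up for illustration).
T, N = 200, 50
behaviour = np.sin(np.linspace(0, 4 * np.pi, T))
weights = rng.normal(size=N)
activity = np.outer(behaviour, weights) + 0.5 * rng.normal(size=(T, N))

# "Single-cell tuning": correlation of each unit with the behavioural variable.
tuning = np.array([np.corrcoef(activity[:, i], behaviour)[0, 1] for i in range(N)])

# "Dimensionality reduction": project the population activity on its first
# principal component, which in this simulation tracks the behaviour.
centered = activity - activity.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]
pc_behaviour_corr = abs(float(np.corrcoef(pc1, behaviour)[0, 1]))
```

In this simulation the analyses work by construction; the paper’s point is precisely that on a real microprocessor they return far less interpretable answers.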
As a neuroscientist, I have been of two minds about this take. On the one hand, it resonates with the notion, which I hold dear, of the fundamental importance of levels of understanding and epistemological coarse-graining, and the piece is very thought-provoking in this regard. I do believe that the physical implementation of a computation can, to some degree, be isolated from the description of the computation itself, which is one of the messages of the paper.
On the other hand, I have been struggling with the analogy, this very direct comparison of microprocessors and brains. Not that I doubt that brains compute: I do not think the issue lies in some fundamental flaw of the brain-as-a-computer metaphor. (With all due respect to embedded cognition and the like, I do believe that the brain as an organ is highly specialised for the aggregation, processing, and storage of information, under some loose definition of aggregation, processing, storage, and information.) But the analogy feels a bit strawman-y, given the partial, incomplete, but very suggestive insights that neuroscience has obtained by probing brains. Modest as they might be, they do overshadow the paper’s pessimistic conclusions from analysing the microprocessor. We do have Jennifer Aniston neurons; we do have a long tradition of neurosurgeons evoking all sorts of profound phenomenological reactions by probing the brain with microcurrents. We can selectively erase specific memory traces in the animal brain; we can decode animals’ future trajectories from their neural activity. So, what is wrong with the paper’s argument?
A misleading metaphor #
I see more of what I would call an abstraction segregation issue. By abstraction segregation I mean the engineered independence of levels: each layer of the stack is designed so that its internals are opaque to the layers above and below, connected only through a stable interface. All levels of a microprocessor’s operation have been engineered and optimised from the top down, and, for the most part, independently. While there has been to some degree a co-evolution of chip design and algorithms running on such chips, each layer of the system has been overall developed from the beginning in a very abstract manner. “Let’s encode world entities in numbers — actually, binary numbers”; “Let’s develop smart algorithms to handle logic and arithmetic on those numbers”; “Let’s engineer some hardware to carry voltages signifying those binary numbers to do math with them”; until, one day, we got to “Let’s wrap it all up in user interfaces, so that people can literally drag around files and see pages turning”.
The overarching result is something very compartmentalised into abstraction silos: whoever was working on transistor engineering was hardly thinking about the nitty-gritty of GUI design or videogame logic, and vice versa. At the same time, multiple levels of this hierarchy were actively designed with this siloing in mind: it is generally a useful feature for operating systems to run on different kinds of physical devices, and for programming languages to be operating-system-agnostic. This keeps optimisation and design at each level completely isolated from the others. Sometimes things at one level need to change because of interface changes at the level below, but most of the time changes are implemented while preserving the same interface to the levels above.
This is very different from what happens in nervous systems. Neurons do not talk to each other; they do things to each other. The only way to change the behavior of the system as a whole is to tinker with how neurons wire, how they fire, and how they interact, in a gradual fashion. There is no way to redesign one part of the system from scratch (say, visual sensory processing) while “exposing the same interface”. The nervous system is a hard-to-understand entity that unfolds over time through development and learning, generating new rules and patterns as it expands. Its components must constantly adapt to the chemical and electrical signals they receive, maintaining their own homeostatic balance while listening to the ensemble messages that provide feedback about the performance of the system as a whole within its environment (be it an eggshell or a complicated maze). No change can ever be catastrophically wrong: development and learning cannot afford to completely reorganise one part of the system in a disruptive way. Learning does not refactor.
It seems that, somehow, this process engraves a tighter relationship between environmental variables and the internal state of the system. What in computers would sound utterly pointless (correlating the activity of sets of transistors with the “environment” or the “behavior”) becomes reasonable in brains. You get your Jennifer Aniston neurons, your engrams, your face-detection area, your mating-triggering optogenetic stimulation. This is not thorough understanding: for us neuroscientists, those are little miracles of local order and reproducible behavior in an otherwise chaotic system. Everything in brains is obviously very complicated and escapes reductionist thinking; but you can at least aim at finding a reasonable amount of mutual information between subsystems and the environment, and, interestingly, you find far more of it than you would with the subcomponents of a processor.
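That mutual-information intuition can be made concrete with a small sketch, using a hand-written plug-in estimator and entirely made-up units: one tuned to a binary stimulus, one firing at random.

```python
import numpy as np

rng = np.random.default_rng(1)

def mutual_information(x, y):
    """Plug-in mutual information estimate (in bits) for two discrete sequences."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Binary "stimulus" (is Jennifer Aniston on screen?) and two simulated units:
# one that follows the stimulus 90% of the time, one that fires at random.
stimulus = rng.integers(0, 2, size=5000)
tuned_unit = np.where(rng.random(5000) < 0.9, stimulus, 1 - stimulus)
random_unit = rng.integers(0, 2, size=5000)

mi_tuned = mutual_information(tuned_unit, stimulus)    # high: ~0.53 bits
mi_random = mutual_information(random_unit, stimulus)  # near zero
```

The claim in the text is, in these terms, that units carrying measurable information about the environment are far easier to find in brains (and, as argued below, in neural networks) than among the transistors of a processor.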
Neural networks as continuous learners #
I think one of the aspects that makes neural networks so interesting is that their “learning” imposes constraints similar to those brains experience. The fitness landscape is explored gradually, with enough momentum to escape local minima but enough conservatism to avoid catastrophically losing the performance already gained. The training process is continuous, in the mathematical sense of proceeding without discontinuities; even phase transitions in the loss curve come from the continuous adjustment of model parameters.
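That kind of trajectory can be sketched in a few lines: gradient descent with momentum on a toy regression problem (all numbers hypothetical), where every update is a small, continuous nudge and the accumulated performance is never thrown away.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression problem (hypothetical); the point is the trajectory
# of learning, not the task itself.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)          # parameters start from a blank scaffold
velocity = np.zeros(3)   # momentum term: keeps the search moving
lr, momentum = 0.05, 0.9

losses = []
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    velocity = momentum * velocity - lr * grad
    w = w + velocity                         # small, continuous parameter update
    losses.append(float(np.mean((X @ w - y) ** 2)))
```

The momentum term carries the parameters through shallow regions of the landscape, while the small learning rate keeps each step conservative: no single update can wreck what has already been learned.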
Language models are neural networks that, after unsupervised training on text, display unexpected and increasingly spooky capabilities in terms of generalisation and, maybe, abstract reasoning, under some definition of reasoning. The perceived magic of language models comes also from the almost trivial simplicity of the metric optimised during their learning: they are simply trained to predict the next word (token, in LLM lingo), given a number of preceding words and sentences (the context window). (This is not quite correct, of course: it describes only pretraining. Pardon the simplification; the argument extends to post-training as well.) Initially, the network spits out random garbage; then it slowly “realizes” (bear with this and all the following anthropomorphizations; they are just figures of speech) that there are statistical regularities in how pairs of words appear (“New” almost always precedes “York”; “ice” often precedes “cream”). After seeing enough of those pairs, it starts learning the abstract operation of induction: weights get configured so that, if a word A often appears before a word B in the context, the model becomes more likely to predict B after A in the next instance. Crucially, this happens even for words A and B that the model has never seen during training; this is the basis of the magical, if improperly named, “in-context learning”. As they learn to efficiently predict subsequent words, the models also evolve efficient representations in how they embed tokens in their internal vector space. They organise words (or concepts, to the extent that a concept can be captured by a word and its relationships with other words, without anchors to physical reality) in a multidimensional space that makes operating on them convenient; interestingly, from that space the relationships between words and concepts can be reconstructed, sometimes even with linear probes.
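The induction rule described above can be written out by hand. This sketch is not what the trained network computes internally, just the input-output behaviour an induction circuit converges to; the tokens are made up precisely to show that the rule works on pairs never seen in training.

```python
def induction_predict(context):
    """Predict the next token with the bare induction rule: if the last
    token appeared earlier in the context, return the token that followed
    it there; otherwise make no prediction. A hand-written stand-in for
    the behaviour an induction circuit learns."""
    last = context[-1]
    for i in range(len(context) - 1):
        if context[i] == last:
            return context[i + 1]
    return None

# Works for a pair of invented tokens absent from any "training" data:
ctx = ["glorp", "zibble", "and", "then", "glorp"]
prediction = induction_predict(ctx)  # -> "zibble"
```

The remarkable finding of the interpretability work cited below is that trained transformers implement essentially this rule with a pair of attention heads, without anyone programming it in.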
The whole field investigating this is called mechanistic interpretability. To me, it looks like LLM neuroscience, no more, no less: the conceptual and methodological tools of neuroscience applied to artificial networks. Researchers look at unit tunings, trace “circuits”, crank the activation of selected neurons up or down, and ablate portions of the network. And guess what: they see far more interpretable patterns than you would get (and in fact do get) by looking into and poking a microprocessor! They can find many interpretable units (either directly or through specific lenses) and predictably manipulate their activity in ways that steer the network; they have worked out the basis of an induction algorithm, figured out how networks store more features than they have neurons via overlapping directions (“superposition”), and extracted millions of interpretable features from production models. Most recently, circuit tracing was used to map the internal computational graphs of an LLM, revealing mechanisms for multi-step reasoning and language-independent representations. (2026 update: even the way in which mechanistic interpretability is coming to terms with its epistemological limitations resonates with neuroscientists. “Carving things at their joints”? “Just-so stories”? “What is a feature/representation”? The overlap with neuroscience is almost comical.)
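A cartoon of those interventions, on a hand-wired toy network rather than a real LLM (all weights here are made up): ablating a unit that carries the signal collapses the output, while boosting its activation steers the output up.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-wired toy network (hypothetical weights): hidden unit 0 is,
# by construction, a detector for the first input feature.
W1 = np.array([[5.0, 0.0],
               [0.5, 1.0],
               [0.0, 1.0]])
w2 = np.array([2.0, 0.3, 0.3])

def forward(x, ablate_unit=None, boost_unit=None, boost=0.0):
    h = relu(W1 @ x)
    if ablate_unit is not None:
        h[ablate_unit] = 0.0      # "lesion": silence one unit
    if boost_unit is not None:
        h[boost_unit] += boost    # "steering": crank its activation up
    return float(w2 @ h)

x = np.array([1.0, 0.2])
baseline = forward(x)                           # -> 10.27
ablated = forward(x, ablate_unit=0)             # output collapses
steered = forward(x, boost_unit=0, boost=3.0)   # output rises
```

In a real model the interventions are the same in spirit, just applied to learned directions in activation space rather than to single hand-wired units.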
A similar outcome came from inspecting the guts of image-processing convolutional neural networks, whose activation patterns and processing strategies could be probed to some degree (see, for example, feature visualization, and how features could be combined into interpretable circuits).
How come the epistemological toolkit of neuroscience falls so short when facing a microprocessor, yet seems so effective when applied to these networks? I suspect (please tell me if the argument has been made more rigorously elsewhere) that the answer lies in the gradual, continuous optimisation of these overparameterised systems, which, starting from a scaffold of sorts, are adjusted to mirror the relationships between the entities they operate on. As in brains, there is no epistemological siloing: the only way to steer the behaviour of the whole network toward improved performance is an algorithm (backpropagation) that implicitly imputes the network’s outcome to some of its elements and nudges them in a direction that makes the system as a whole better. There is something beautifully organic in this procedure, and it probably tells us something about how biological (and neuronal) systems operate. (If you mention backpropagation to neuroscientists, they tend to get jumpy: of course the brain does not do backpropagation! Or does it?) This overall smooth process resonates with the myriad learning and homeostatic processes that in brains achieve the same kind of smooth search for efficient “representations”. Even when the system evolves “emergent” algorithms, like the induction circuit, it seems to do so in ways that leave it much more interpretable when we try to infer its function from direct “mutual information” measures between the inside and the outside of the system. (To be clear, we understand a microprocessor much better than an LLM; the stance here is that of knowing nothing about a system and trying to understand it just by watching it operate.)
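That credit assignment can be sketched explicitly with a toy two-layer network (hypothetical task and parameters): backpropagation carries the output error backwards and nudges every weight a little in the direction that improves the system as a whole.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical task: do the two input features share a sign? A linear
# readout cannot solve this, so credit must flow back to the hidden layer.
X = rng.normal(size=(64, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

W1 = rng.normal(scale=0.5, size=(2, 8))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=8)        # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.2
losses = []
for _ in range(2000):
    h = np.tanh(X @ W1)                     # forward pass
    p = sigmoid(h @ W2)
    losses.append(float(np.mean((p - y) ** 2)))
    err = (p - y) * p * (1 - p)             # output error through the sigmoid slope
    gW2 = h.T @ err / len(y)                # blame imputed to output weights
    gh = np.outer(err, W2) * (1 - h ** 2)   # error propagated back through tanh
    gW1 = X.T @ gh / len(y)                 # blame imputed to hidden weights
    W2 -= lr * gW2                          # nudge every element a little
    W1 -= lr * gW1
```

No weight is ever redesigned in isolation: each one drifts slowly, guided only by its share of the global error, which is the organic quality the text is pointing at.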
So, can they or can they not? #
So, can a neuroscientist understand an LLM? Actually, can they understand a brain? I am afraid that, once more, it is important to go back to the vexed question of what understanding means. If by understanding we mean finding internal variables (“representations”) that track environmental and behavioural states and sequences, yes. If we mean occasionally finding empirically robust ways of steering those variables to change behavioural outcomes, definitely. But if we mean giving a thorough mechanistic account of how a given outcome is produced by a set of initial conditions, with the same crisp resolution with which we understand how a key turn switches on an engine (or how electrons are ultimately bounced through transistors as a result of our clicks), surely not.
No matter what our answer is, however, I believe that it should probably be the same for both brains and LLMs.