Can a Neuroscientist Understand an LLM?
Or a microprocessor. Or a brain. Or none?

Some years ago, in Could a Neuroscientist Understand a Microprocessor? 1The title mimics the older and even more famous Can a Biologist Fix a Radio?, Jonas and Kording explored, in a brilliant and only slightly humorous way, the question of whether and how a neuroscientist could understand a microprocessor.
The paper looked at a microprocessor as a proxy for what a (simplified) brain does: taking inputs (keyboard clicks), holding some notion of memory, and producing outputs (pixels on the screen). The authors used available emulators to run a full simulation of a simple microprocessor, which allowed them to deploy a plethora of techniques from the neuroscientist’s toolkit: they computed single-cell tunings, performed “transistor ablations” (akin to lesions of brain areas or cell types), analysed local field potential-like signals, applied dimensionality reduction to population activity, and looked for correlations between transistor states and behavioural outputs.
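The flavour of such a lesion experiment is easy to sketch. Here is a hypothetical toy (not the paper's actual setup): a half-adder wired from NAND gates, where silencing a single "transistor" changes the input-output behaviour in a way a lesion study would hope to detect.

```python
# Toy "transistor ablation" in the spirit of the paper's lesion
# experiments (hypothetical sketch, not their actual methodology):
# a half-adder built from NAND gates, where one gate can be silenced
# and the behavioural output (sum, carry) observed.

def nand(x, y):
    return 1 - (x & y)

def half_adder(a, b, lesion=None):
    """Gate-level half-adder; `lesion` forces that gate's output to 0."""
    def gate(idx, x, y):
        out = nand(x, y)
        return 0 if idx == lesion else out

    g1 = gate(1, a, b)
    g2 = gate(2, a, g1)
    g3 = gate(3, b, g1)
    s = gate(4, g2, g3)   # sum bit  = a XOR b
    c = gate(5, g1, g1)   # carry    = a AND b
    return s, c
```

Lesioning gate 5 kills the carry while leaving the sum intact: the kind of clean, selective deficit a lesion study dreams of, and which the paper found to be the exception rather than the rule.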
As a neuroscientist, I have been of two minds about this take. On the one hand, it resonates with the notion, which I hold dear, of the fundamental importance of levels of understanding and epistemological coarse-graining, and the piece is very thought-provoking in this regard. I do believe that the physical implementation of a computation can, to some degree, be isolated from the description of the computation itself, which is one of the messages of the paper.
On the other hand, I have been struggling with the analogy, this very direct comparison of microprocessors and brains. I do not have fundamental issues with the idea of computing brains. 2With all due respect to embedded cognition and the like, I do believe that the brain as an organ is highly specialised for the aggregation, processing, and storage of information, under some loose definition of aggregation, processing, storage, and information. But the analogy here feels a bit strawman-y, especially given the (incomplete, limited, but) very suggestive insights that neuroscience has obtained by probing brains in exactly those ways. Modest as they might be, those insights do overshadow the paper’s pessimistic conclusions from analysing the microprocessor. We do have Jennifer Aniston neurons; we do have a long tradition of neurosurgeons evoking all sorts of profound phenomenological reactions by probing the brain with microcurrents. We can selectively erase specific memory traces in the animal brain, and we can decode a behaving animal’s future trajectories within some small temporal horizon. So, what is wrong with the paper’s argument?
A misleading metaphor #
I see more of what I would call an abstraction segregation issue. By abstraction segregation I mean the engineered independence of levels: each layer of the stack is designed so that its internals are opaque to the layers above and below, connected only through a stable interface. All levels of a microprocessor’s operation have been engineered and optimised from the top down, and, for the most part, independently. While there has been to some degree a co-evolution of chip design and algorithms running on such chips, each layer of the system has been overall developed from the beginning in a very abstract manner. “Let’s encode world entities in numbers — actually, binary numbers”; “Let’s develop smart algorithms to handle logic and arithmetic on those numbers”; “Let’s engineer some hardware to carry voltages signifying those binary numbers to do math with them”; until, one day, we got to “Let’s wrap it all up in user interfaces, so that people can literally drag around files and see pages turning”.
The result is something very compartmentalised in abstraction silos; whoever was working on transistor engineering was hardly thinking about the nitty-gritty of GUI design or videogame logic, and vice versa. At the same time, multiple levels of this hierarchy were actively designed with this abstraction siloing in mind: it is generally a useful feature for operating systems to run on different kinds of physical devices, and for programming languages to be operating system-agnostic. This keeps optimisation and design for each of those levels completely isolated from the others. Sometimes, things at one level need to change because of some interface changes at the level below, but most of the time changes are implemented while ensuring the same interface to the levels above.
This is very different from what happens in nervous systems. Neurons do not talk to each other; they do things to each other. The only way to change the behaviour of the system as a whole is to tinker with how neurons wire, how they fire, how they interact with each other, in a gradual fashion. There is no way you can redesign a part of the system from scratch (say, visual sensory processing) and “expose the same interface”. The nervous system is a hard-to-understand entity, which unfolds over time during development and learning, generating new rules and patterns as it expands. Its components have to constantly react and adapt to the chemical and electrical signals they receive, maintaining their own homeostatic balance. While they do that, they have to keep listening to the ensemble messages providing feedback about the performance of the system as a whole, within its environment (be it an eggshell or a complicated maze). Changes can never be catastrophically wrong; development and learning cannot afford to completely reorganise one part of the system in a disruptive way. Learning does not refactor.
It seems that somehow, this process engraves a tighter relationship between environmental variables and the internal state of the system. What in computers would sound utterly pointless (correlating the activity of sets of transistors to the “environment” or the “behaviour”), in brains becomes reasonable. You get your Jennifer Aniston neurons, your engrams, your face detection area, your optogenetically triggered mating behaviour. This is not thorough understanding: for us neuroscientists, those are little miracles of local order and reproducible behaviour in an otherwise chaotic system. Everything in brains is obviously very complicated and escapes reductionist thinking; but you can at least aim at finding a reasonable amount of mutual information between subsystems and the environment, and, interestingly, you find it much more often than you would with the subcomponents of a processor.
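What "finding mutual information between subsystems and the environment" means in practice can be sketched with a hypothetical toy (all numbers and names are made up for illustration): a binary unit that tracks a stimulus through some noise, a caricature of a tuned neuron, versus a unit that is very active but indifferent to the stimulus, a caricature of a transistor.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

random.seed(0)
stimulus = [random.randint(0, 1) for _ in range(10_000)]
# "Tuned" unit: follows the stimulus, flipped by noise 10% of the time.
tuned = [s if random.random() > 0.1 else 1 - s for s in stimulus]
# "Transistor-like" unit: busy, but unrelated to the stimulus.
untuned = [random.randint(0, 1) for _ in range(10_000)]
```

The tuned unit carries roughly half a bit of information about the stimulus; the untuned one carries essentially none, however vigorously it switches.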
Neural networks as continuous learners #
I think that one of the aspects that makes neural networks so interesting is that their “learning” imposes constraints similar to those brains experience. The fitness landscape is explored gradually, with enough momentum to escape local minima but enough conservatism to avoid catastrophic loss of the performance already gained. The training process is continuous — in the mathematical sense of proceeding without discontinuities. Even phase transitions in the loss curve come from the continuous adjustment of model parameters.
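That balance between momentum and conservatism can be seen in a minimal, hypothetical sketch (the loss function and numbers below are invented for illustration): on a 1-D landscape with a shallow and a deep minimum, plain gradient descent parks in the first valley it reaches, while the same procedure with momentum coasts over the barrier, without ever making a discontinuous jump.

```python
# Hypothetical 1-D loss: f(x) = (x**2 - 1)**2 + 0.3*x has a shallow
# minimum near x = +0.96 and a deeper one near x = -1.04.

def grad(x):
    return 4 * x * (x**2 - 1) + 0.3  # derivative of f

def descend(x, lr=0.01, beta=0.0, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # beta = 0 -> plain gradient descent
        x += v                       # every step is a small, smooth nudge
    return x

plain = descend(2.0)             # settles in the shallow minimum
heavy = descend(2.0, beta=0.9)   # momentum carries it into the deep one
```

Both trajectories are continuous; the momentum run simply retains enough "velocity" from the initial descent to clear the barrier between the two valleys.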
Language models are neural networks that, after unsupervised training on text, can display unexpected and increasingly spooky capabilities in terms of generalisation, and maybe abstract reasoning, under some definition of reasoning. The perceived magic of language models also comes from the almost trivial simplicity of the optimised metric in their learning: they are simply trained on predicting the next word (token, in LLM lingo), given the preceding words and sentences (context window) 3Not correct of course, this is just pretraining. Pardon the simplification here, the argument extends to post-training as well. . Initially, the network spits out random garbage; then, it slowly realises 4Please, bear with this and all the following anthropomorphizations — here those are just a figure of speech. that there are statistical regularities in the appearance of dyads of words (“New” almost always precedes “York”; “ice” often precedes “cream”). After seeing enough of those dyads, it starts learning the abstract concept of induction. Weights get configured so that, if in the text a word A appears often before word B in the context, it becomes more likely that the model predicts B following A the next time A appears. Crucially, this happens even with words A and B that the model has never seen during training. 5Check this source out for more on the magical — but improperly named — in-context learning. As they learn to efficiently predict subsequent words, the models also evolve efficient representations in how they embed tokens in their internal vector space. They organize words (or concepts, to the extent a concept can be captured just by a word and its relationship with other words, without anchors to physical reality) in a multidimensional space that makes operating on them tractable, and interestingly, from that space the relationships between words and concepts can be reconstructed, sometimes even with linear probes.
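The induction rule described above is simple enough to write out by hand. In a real transformer it is encoded implicitly in attention weights; here is a deliberately naive, explicit sketch (a hypothetical illustration, not how any model actually computes):

```python
# The "induction" rule from the text, as explicit code: if token A was
# followed by token B earlier in the context, predict B the next time
# A appears. Transformers learn this implicitly via attention heads.

def induction_predict(context):
    last = context[-1]
    # walk backwards looking for an earlier occurrence of the last token
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1]   # propose whatever followed it before
    return None                     # never seen before: no opinion

# The rule works even for a made-up word never seen in "training":
tokens = "the wug chased the cat , then the wug".split()
```

Calling `induction_predict(tokens)` proposes "chased", even though "wug" appears nowhere outside this context: the prediction relies on the pattern, not on the identity of the word.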
The field that is investigating this is called mechanistic interpretability. To me, this looks like LLM neuroscience, no more, no less. 6All of mechanistic interpretability looks like neuroscience conceptual and methodological tools applied to artificial networks. They look at unit tunings, “circuit” tracing, they crank up or down the activation of selected neurons, and ablate portions of the network. And guess what, they do see way more interpretable patterns than what you would get — and in fact do get — if you look into and poke a microprocessor! They can find many interpretable units (either directly or through specific lenses) and predictably manipulate their activity in a way that steers the network; they established the basis for an induction algorithm, figured out how networks store more features than they have neurons via overlapping directions (“superposition”), and extracted millions of interpretable features from production models. Most recently, circuit tracing was used to map internal computational graphs of an LLM, revealing mechanisms for multi-step reasoning and language-independent representations. 72026 update: even the way in which mechanistic interpretability is coming to terms with its epistemological limitations resonates a lot with neuroscientists. “Carving things at their joints”? “Just-so stories”? “What is a feature/representation”? The overlap with neuroscience is almost comical.
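The logic of those ablation experiments fits in a few lines. A hypothetical miniature (not a real interpretability result): a hand-wired network computing XOR, whose two hidden units have cleanly interpretable tunings, so that silencing one degrades behaviour in a predictable, input-specific way.

```python
# Miniature ablation experiment (hypothetical toy): a hand-wired XOR
# network whose two hidden units each have a clean, interpretable tuning.

def relu(z):
    return max(0.0, z)

def xor_net(a, b, ablate=None):
    h = [
        relu(a - b),  # unit 0: "a and not b" detector
        relu(b - a),  # unit 1: "b and not a" detector
    ]
    if ablate is not None:
        h[ablate] = 0.0          # the lesion: silence one hidden unit
    return h[0] + h[1]           # output = a XOR b when intact
```

Ablating unit 0 breaks the network only on the inputs that unit was "tuned" to, leaving the rest of the truth table intact: exactly the kind of selective deficit that makes a circuit-level story credible.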
A similar outcome came from inspecting the guts of image-processing convolutional neural networks, with activation patterns and processing strategies that could be probed to some degree. 8For examples, check out feature visualization and how features could be combined to connect them into interpretable circuits.
How come the epistemological toolkit of neuroscience falls so short when facing a microprocessor, and seems so effective when applied to these networks? I guess, and please tell me if the argument has been made more rigorously somewhere else, that the answer lies in the gradual, continuous nature of the optimisation of those overparameterised systems, which, starting from a scaffold of sorts, are then adjusted to mirror the relationships between the entities they operate on. As in brains, there is no abstraction siloing: the only way to steer the behaviour of the whole network toward improved performance is through an algorithm (backpropagation) that implicitly imputes the network outcome to some of its elements and nudges them to change in a direction that improves the system as a whole. I feel that there is something beautifully organic in this procedure, and it probably tells us something about how biological (and neuronal) systems operate. 9And yet, if you talk about backpropagation to neuroscientists, they tend to get jumpy. Of course, the brain does not do backpropagation! Or does it? This overall smooth process resonates with the myriad learning and homeostatic processes that in brains achieve the same kind of gradual search for efficient “representations”. Even when the system evolves “emergent” algorithms, like the induction circuit, it seems to do so in ways that make it much more interpretable when we try to figure out its function from direct “mutual information” measures between the inside and the outside of the system. 10We clearly understand much better a microprocessor than an LLM; remember that we are taking the stance of not knowing anything about a system, and trying to understand it just by watching it while it operates.
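"Imputing the outcome to its elements" is just the chain rule at work. A hypothetical two-weight sketch (not how any production model is trained, but the same principle):

```python
# Credit assignment in miniature (hypothetical sketch): for the
# two-weight "network" y = w2 * (w1 * x) and loss (y - target)**2,
# the chain rule imputes the output error back to each weight and
# nudges both in the direction that improves the whole system.

w1, w2 = 0.5, 0.5
x, target, lr = 1.0, 1.0, 0.1

for _ in range(200):
    h = w1 * x                 # forward pass
    y = w2 * h
    dy = 2 * (y - target)      # dL/dy: how wrong the output is
    g2 = dy * h                # dL/dw2: w2's share of the blame
    g1 = dy * w2 * x           # dL/dw1: w1's share, via the chain rule
    w1 -= lr * g1              # small nudges, never a redesign
    w2 -= lr * g2

y = w2 * (w1 * x)              # after training, y approaches the target
```

No component is ever rewritten from scratch; each weight only receives its locally computed share of the global error and drifts accordingly, which is precisely the "organic" quality the text points at.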
So, can they or can they not? #
So, can a neuroscientist understand an LLM? Actually, can they understand a brain? I am afraid that we arrive once more at the vexed question: what does understanding mean? If by understanding we mean finding internal variables (“representations”) that track environmental and behavioural states and trajectories, yes; if we mean occasionally finding empirically robust ways to steer those variables and change behavioural outcomes, definitely. But if we mean giving a thorough mechanistic view that accounts for how a given outcome is produced by some set of initial conditions, with the same crisp resolution with which we understand how turning a key switches on an engine (or how electrons are ultimately bounced through transistors as a result of our clicks), surely not.
No matter what our answer is, however, I believe it should probably be the same for both brains and LLMs.