On the Biology of a Large Language Model
Interesting research from Anthropic:
The black-box nature of [LLMs] is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose. [...]
In recent years, many research groups have made exciting progress on tools for probing the insides of language models. These methods have uncovered representations of interpretable concepts – “features” – embedded within models’ internal activity. Just as cells form the building blocks of biological systems, we hypothesize that features form the basic units of computation inside models.
However, identifying these building blocks is not sufficient to understand the model; we need to know how they interact. In our companion paper, Circuit Tracing: Revealing Computational Graphs in Language Models, we build on recent work (e.g. ) to introduce a new set of tools for identifying features and mapping connections between them – analogous to neuroscientists producing a “wiring diagram” of the brain. We rely heavily on a tool we call attribution graphs, which allow us to partially trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response. Attribution graphs generate hypotheses about the mechanisms used by the model, which we test and refine through follow-up perturbation experiments.