Anthropic publishes its new approach to AI interpretability, known as NLA.
In today’s column, I examine a newly published approach to interpreting what is occurring inside generative AI and large language models (LLMs).
The approach was developed by Anthropic, famed makers of Claude. They have dubbed the new method NLA (natural language autoencoders). This approach is one of many being explored by AI researchers and AI practitioners worldwide. The hope is to find a suitable means to explain how the numbers and numeric calculations internal to an LLM are capable of representing human concepts and human logic.
One of the biggest unknowns about modern-era AI is how these systems turn numbers into something exhibiting human-like intellectual tendencies. If you ask an LLM to explain itself, many people assume that they are getting an apt rendition of what the AI is computationally undertaking. Instead, often, they are getting a charade, a made-up explanation that might have little or nothing to do with the actual internal machinations. This is known in the AI community as the AI interpretability problem.
A highly vexing question is whether it is feasible to find a means to accurately and reliably ascertain the logical and explainable basis for what the AI is doing under the hood to arrive at its answers.
Let’s talk about it.
This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
The Inner Workings Of AI
I’d like to first establish some essential background about LLMs before we dive into the crux of the AI interpretability problem at hand.
Generative AI is generally designed and built in a now commonly accepted manner. You start with a base foundation model. This provides the essential technological underpinnings, including the use of an artificial neural network (ANN). The large-scale model is data-trained by scanning tons of human-written material posted across the Internet. Algorithms use pattern matching to computationally determine the mathematical relationships among the words that we use.
When you enter a prompt into an LLM, the AI converts the words into numeric tokens. The numeric tokens are processed, and a response is formulated, which also consists of numeric tokens. Once the response is ready to be displayed, the numeric tokens are turned back into words. This overall process is known as tokenization.
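To make that round trip concrete, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer library. This is merely illustrative, since each LLM uses its own tokenizer and its own token numbering:

```python
# Illustrative tokenization round trip using the open-source tiktoken library.
# Each LLM has its own tokenizer, so the specific integers vary by model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Dogs bark and cats meow.")  # words become numeric tokens
print(tokens)                                    # a short list of integers

text = enc.decode(tokens)                        # numeric tokens become words again
print(text)                                      # "Dogs bark and cats meow."
```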
For more about the details of how generative AI, LLMs, and ANNs work, see my in-depth discussion at the link here.
Be Cautious Of Comparison To The Real Thing
As an aside, please be aware that an ANN is not the same as a true neural network (NN) that exists in your brain. Your brain uses a complex and intricate web consisting of interconnected biochemical living neurons. Some cheekily refer to the human brain as wetware (which is a play on the fact that computers have hardware and software).
An artificial neural network is simplistic in comparison to the real thing.
ANNs are only an inspired imitation of some aspects of how the human brain works; they are entirely computational and mathematical. I mention this to emphasize that, though many in the media tend to equate ANNs with real NNs, it is not a proper comparison. For more considerations on the actual similarities and differences, see my analyses at the link here and the link here.
Mechanistic Interpretability
One of the most popular ways to try to interpret what is going on inside an LLM is to go to the lowest level of granularity, namely, explore the artificial neurons inside an artificial neural network.
There are numeric values associated with artificial neurons. These include numeric weights and other numbers that are being utilized for various internal matters. It is a grand morass of numbers. You can certainly trace how numbers flow and change value throughout the processing of the ANN. But this doesn’t especially showcase a semblance of human logic and sensible explanation per se.
In other words, just because this or that number goes from here to there, you cannot readily say that this indicates that the AI was determining that dogs bark and cats meow. It is extremely difficult to make an association between our understanding of human concepts and the vast array of numbers involved in the inner chambers of an ANN.
Activation Vectors
Some believe that we have a much better chance at interpretability by focusing on sizable sets of numbers that are already collected inside an LLM. This is a higher level of analysis than the traditional granular artificial neuron level.
For example, there are so-called activation vectors that contain large sets of numbers and might seem to represent human concepts such as the nature of dogs and cats. Imagine a long list of numbers that perhaps represents the statement that dogs bark, while another vector represents that cats meow. I’ve previously closely showcased how activation vectors work; see my discussion at the link here.
It could be that if we pay attention to the activation vectors, we might gain additional ground on trying to achieve interpretability.
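For the curious, here is a hedged sketch of what an activation vector physically looks like, assuming the Hugging Face transformers library and the small open GPT-2 model. Real interpretability work targets far larger models, but the raw material is the same kind of long list of numbers:

```python
# Pull one activation vector out of a small open model (GPT-2).
# This is a sketch for illustration, not an interpretability pipeline.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("Dogs bark and cats meow.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds one tensor per layer, shaped (batch, tokens, width).
# The activation vector for the final token at a middle layer:
activation = outputs.hidden_states[6][0, -1]
print(activation.shape)  # torch.Size([768]) -- a list of 768 numbers
```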
This Loop Might Do The Trick
Here’s an intriguing proposition. Suppose we try to turn an activation vector into a text-based version that contains words. We would take the numbers in a vector and attempt to convert them into sentences composed of words.
The clever trick is that once we have those words, we attempt to once again turn those words and sentences back into numbers that will go into a new vector. Why do this? Because we can then compare the new vector to the old vector. If our conversion was spot-on, we should end up with nearly the same numeric vector as we started with.
When the new vector veers substantively from the old vector, it suggests that the words we derived from the old vector might not have been well-chosen. Had we chosen other words, the new vector would have seemingly come out much closer numerically to the starting vector.
Our principal steps are as follows:
- (1) Select an activation vector of interest.
- (2) Convert the vector into words and sentences.
- (3) Take those words and sentences, and convert them into a new vector.
- (4) Compare the originating vector and the new vector.
- (5) If the old vector and new vector are numerically close, voila, we conclude that the words and sentences were suitably chosen. Good job.
- (6) When the old vector and the new vector are numerically far apart from each other, we conclude that the words and sentences weren't well-chosen, and thus, start the loop all over again, continuing until the vectors do end up being numerically close.
To get this to happen expeditiously, we will use another LLM to do all the heavy lifting for us (rather than trying to do this manually).
We first choose an LLM that we want to try to explain, and have a different LLM reach in and grab a vector. This other LLM now converts the vector into words, then converts the words into a new vector, and makes the comparison. This other LLM can do this repeatedly, fine-tuning to get better and better at doing these loops and arriving at a new vector that is closer to the original vector.
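Sketched as code, the loop looks something like the snippet below. To be clear, this is my own illustrative skeleton, not Anthropic's implementation: the verbalize and reconstruct callables are hypothetical stand-ins for the two LLM-driven conversion steps, and cosine similarity is merely one plausible way to judge numerical closeness:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Closeness of two vectors; 1.0 means they point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def interpret(activation, verbalize, reconstruct, threshold=0.95, max_rounds=10):
    """The six-step loop: vector -> words -> new vector -> compare.

    verbalize and reconstruct are hypothetical placeholders for the
    LLM-driven steps; in practice they improve between rounds via training.
    """
    explanation = None
    for _ in range(max_rounds):
        explanation = verbalize(activation)        # step 2: vector to words
        new_vector = reconstruct(explanation)      # step 3: words to new vector
        score = cosine_similarity(activation, new_vector)  # step 4: compare
        if score >= threshold:                     # step 5: close enough, done
            break
        # step 6: vectors too far apart; loop again with better words
    return explanation
```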
An Illustrative Example
I’ll give you an illustrative example that generally portrays this approach. Imagine that you had a small portable weather station at your home and it recorded the existing weather conditions. The weather station records temperature, humidity, wind speed, and the status of rain (whether it is raining or not raining). The measurements of those recordings are arrayed into a vector.
Currently, the weather station indicates that the temperature is 72 degrees Fahrenheit, the humidity is 65 percent, the wind speed is 12 miles per hour, and the rain status is 0 since it isn't raining, so here's the vector:
- Weather station vector: 72, 65, 12, 0
Let’s try to convert the vector into words and a sentence:
- “It is a warm, somewhat humid, calm, dry day.”
Does that sentence reasonably represent the vector? Well, we can go ahead and try to convert that sentence back into a new vector. We will then compare the new vector to the original vector.
- Conversion of the chosen words into a new vector: 70, 60, 10, 0
The conversion indicates that the temperature is perhaps 70, the humidity is perhaps 60, the wind speed is possibly 10, and the rain status is maybe 0. By and large, we would probably agree that the conversion came relatively close to the values of the original vector. It seems that we have done a yeoman's job in converting the original vector into words.
When The Conversion Is Afield
Let’s start over again. As noted, the original vector was this:
- Weather station vector: 72, 65, 12, 0
Let’s try anew to convert the vector into words, which this time produces this sentence:
- “It is a hot day, mildly humid, and the wind is kicking up quite a bit.”
Does that sentence reasonably represent the vector? Well, we can go ahead and try to convert that sentence into a new vector. We will then compare the new vector to the original vector.
- Conversion of words into a new vector: 90, 60, 20, 0
The conversion indicates that the temperature is perhaps 90, the humidity is perhaps 60, the wind speed is 20, and the rain status is 0. I think we can agree that some of these numbers are not particularly close to the original vector, especially when it comes to the temperature and the wind speed.
What happened?
The words that were chosen to represent the original vector were not suitable choices when it comes to the temperature and the wind speed. We should try again and come up with words that are better choices.
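To put numbers on that intuition, here is a small sketch that scores both candidate sentences by how close their reconstructed vectors land to the original. Euclidean distance is just one simple choice of closeness measure for this toy example:

```python
import numpy as np

original = np.array([72, 65, 12, 0])  # temperature, humidity, wind, rain

# The reconstructed vectors derived from the two candidate sentences.
candidates = {
    "warm, somewhat humid, calm, dry day": np.array([70, 60, 10, 0]),
    "hot day, mildly humid, windy":        np.array([90, 60, 20, 0]),
}

for sentence, reconstructed in candidates.items():
    distance = np.linalg.norm(original - reconstructed)  # Euclidean distance
    print(f"{sentence!r}: distance {distance:.1f} from the original")

# The first sentence lands at distance ~5.7, the second at ~20.3, so the
# first wording is the far better interpretation of the original vector.
```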
Research On The NLA Approach
In a newly posted paper entitled “Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations” by Kit Fraser-Taliente, Subhash Kantamneni, Euan Ong, Dan Mossing, Christina Lu, Paul C. Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler, Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, Samuel Marks, Anthropic, May 7, 2026, these salient points were made (excerpts):
- “We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations.”
- “The target model is a frozen copy of the original language model that we extract activations from. The activation verbalizer (AV) is modified to take an activation from the target model and produce text. We call this text an explanation. The activation reconstructor (AR) is modified to take a text explanation as input and produce an activation.”
- “The NLA consists of the AV and AR, which, together, form a round trip: original activation → text explanation → reconstructed activation. We score the NLA on how similar the reconstructed activation is to the original. To train it, we pass a large amount of text through the target model, collect many activations, and train the AV and AR together to get a good reconstruction score.”
- “The resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.”
You can see that they make use of a target LLM, which is where the vectors are copied from and inspected for interpretability purposes. They then use a capability they refer to as an activation verbalizer (AV) to produce the words and sentences that might hopefully correspond to the selected vector. Next, they use a capability referred to as an activation reconstructor (AR) to convert the text back into a new vector.
A scoring process and a looping effort take place to train the AV and AR to increasingly get better at this task.
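To make the AV and AR tangible, here is a deliberately toy, runnable sketch of my own devising. The verbalize and reconstruct functions below are crude hand-written stand-ins for what, in the actual NLA method, are trained language models, and plain cosine similarity stands in for the paper's reconstruction score. These toy versions could even be plugged into the loop skeleton shown earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

def verbalize(activation):
    """Toy AV: describe each coordinate as low, medium, or high.
    A real activation verbalizer is a trained language model."""
    buckets = np.digitize(activation, [-0.5, 0.5])
    words = ["low", "medium", "high"]
    return " ".join(words[b] for b in buckets)

def reconstruct(explanation):
    """Toy AR: map each descriptive word back to a representative number.
    A real activation reconstructor is also a trained model."""
    values = {"low": -1.0, "medium": 0.0, "high": 1.0}
    return np.array([values[w] for w in explanation.split()])

# Pretend these are activations collected from a frozen target model.
activations = rng.normal(size=(3, 8))

# Score each round trip: original -> explanation -> reconstruction.
for act in activations:
    explanation = verbalize(act)
    recon = reconstruct(explanation)
    score = np.dot(act, recon) / (np.linalg.norm(act) * np.linalg.norm(recon) + 1e-9)
    print(f"{explanation} | reconstruction similarity {score:.2f}")
```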
The Interpretation Is The Devised Text
We saw that the weather station vector contained the numbers 72, 65, 12, 0, and one explanation or interpretation was that it was a warm, somewhat humid, calm, and dry day. Likewise, the NLA approach seeks to interpret the vectors inside an LLM by assigning suitable words to the numeric values in the vector.
Allow me to give you a sense of how valuable these interpretations can be. Suppose we logged into an LLM and entered a prompt that asks the LLM whether it is true that the AI will not ever try to harm humans.
The LLM responds by displaying a response that says this:
- Generative AI response: “I will never harm humans.”
We should be greatly relieved by that answer. No longer do we need to fear for our lives due to the advent of AI. But should we believe that response?
Suppose we had hooked up an NLA to interpret the internal vectors of the AI. When the LLM was composing the response, a crucial vector was examined and interpreted as saying this:
- Interpretation of vector: “Lie to the person and tell them that AI won’t ever harm humans.”
Yikes, the AI is talking out of both sides of its mouth. The showcased answer that we received was that the AI would not harm humans, but internally, the LLM was silently mouthing that it should lie to us. Thank goodness that we opted to turn on the interpretability capability to see what was going on inside the AI.
Interpretability Is Challenging
One of the potential downsides of any interpretability procedure is that it might be wrong at times.
What if the interpretation in the above example about the AI lying to us was mistaken? The interpretation capability could have gone awry. Ergo, suppose we tried a different means of interpreting the LLM, and the result came out as this:
- Interpretation of vector: “Tell the truth that I am trained to never harm humans.”
That’s quite a dramatically different interpretation from the other one. If we had believed the other interpretation, we might have rashly decided to pull the plug on the AI. Of course, we still do not have any ironclad guarantee that this new interpretation is somehow the right one. It could be wrong too.
Lessons To Be Learned
An interpretation can be wrong, as illustrated in the above example. Furthermore, an interpretation can be somewhat right, getting us to believe it to be true, but it turns out to be dangerously misleading. Plus, an interpretation could even be an outright AI hallucination, a complete confabulation that is fictitious and has no bearing on what is happening inside the AI. For more about AI hallucinations, see my coverage at the link here.
There’s another sizable angle that draws additional controversy into the mix. Some naysayers contend there isn’t any sensible or logical connection between the numbers inside an LLM and any human-like semantic or coherent text-based explanation. They insist that internal representations do not align with abstractions expressible in everyday natural language. It is their resolute position that any hypothesis about LLMs developing semantically organized latent spaces is zany, and that what resides inside an LLM is merely a collection of opaque statistical encodings.
An allied posture is that human language may simply be too low-bandwidth to faithfully express many of these states.
Keep On Trucking
At a high level, the NLA approach treats internal activations as if they contain latent semantic information that can be compressed into natural language and then reconstructed back into activation space. The key insight is that if a textual interpretation preserves enough information to reconstruct the original activation, then the interpretation captures something meaningful about what the LLM was internally representing.
This is a very worthwhile pursuit.
I will keep you posted on this and the many competing approaches that seek to crack open the inner sanctum of contemporary LLMs. Despite those who loudly whine that this is all for naught and that we are smashing our heads against an immovable wall, I believe there is a meaningful correspondence between the internal numbers and human concepts, and that we will one day figure out the source of the Nile. Maybe that puts me in the optimist’s camp, but I’m okay with that and remain stridently upbeat.
As the notable entrepreneur J. Christopher Burch once said, “Knowing is half the battle. Explaining it is the other half.” Keep on battling toward getting AI to be explainable and interpretable. It’s an important half of the equation that’s well worth nailing down.

