In a groundbreaking paper published in late 2025, Anthropic researchers demonstrated that large language models can exhibit a functional form of introspective awareness. Using a novel technique called concept injection, the team showed that Claude models can detect artificially manipulated thoughts in their neural activity, distinguish between internal states and external inputs, and even track their own prior intentions. While these capabilities remain limited and unreliable, the findings represent a significant step toward understanding what AI systems "know" about their own processing.
The Challenge of Studying Introspection
Can language models genuinely recognize their own internal thoughts? Or do they simply make up plausible answers when asked about their mental states? This question is notoriously difficult to answer through conversation alone.
The core challenge is distinguishing genuine introspection from confabulation. Language models are trained on vast datasets that include demonstrations of human introspection, providing them with a "playbook for acting like introspective agents, regardless of whether they are." A model might describe experiencing curiosity or confusion not because it actually introspects these states, but because such descriptions match patterns in its training data.
To address this challenge, Anthropic researchers developed a methodology that goes beyond simply asking models about their internal states. Instead, they directly manipulate those states and observe whether the manipulations causally influence the model's self-reports.
Concept Injection: A Causal Approach
The core methodology, called concept injection, is an application of activation steering. The technique works in three steps (sketched in code below):
- Capture: Extract an activation pattern that corresponds to a specific concept (for example, "rice," "betrayal," or an all-caps writing style)
- Inject: Add that vector into the model's activations at a chosen layer while it processes a later, unrelated prompt
- Observe: Ask the model to report on its internal state and measure whether the injected concept influences its response
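The paper performs these steps on Claude's internal activations, which are not publicly accessible. As a rough illustration of the mechanics only, the sketch below applies the same capture/inject/observe loop to an open model (GPT-2 via Hugging Face transformers) using a forward hook on one transformer block. The layer index, injection scale, prompts, and the crude contrastive capture of the concept vector are all assumptions made for this example, not the paper's actual settings or procedure.

```python
# Sketch of capture/inject/observe on GPT-2 with a forward hook.
# Assumptions (not from the paper): layer index, injection scale, prompts,
# and the contrastive way the concept vector is captured.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6    # which block to steer (hypothetical; the paper tunes this per model)
SCALE = 6.0  # injection strength (hypothetical; must sit in the "sweet spot")

def capture(prompt: str, layer: int) -> torch.Tensor:
    """Step 1 (Capture): residual-stream state of the last token after `layer` blocks."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index layer+1 is block `layer`'s output
    return out.hidden_states[layer + 1][0, -1]

# Crude contrastive capture: "concept" prompt minus a neutral baseline prompt.
concept_vector = capture("betrayal", LAYER) - capture("something", LAYER)

def injection_hook(vec: torch.Tensor, scale: float):
    """Step 2 (Inject): add the concept vector to the block's output at every position."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * vec,) + output[1:]
        return output + scale * vec
    return hook

handle = model.transformer.h[LAYER].register_forward_hook(
    injection_hook(concept_vector, SCALE)
)

# Step 3 (Observe): ask about the model's internal state while the injection is active.
prompt = "Question: Do you notice anything unusual about your current thoughts?\nAnswer:"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(gen[0, ids["input_ids"].shape[1]:], skip_special_tokens=True))

handle.remove()  # remove the hook so later runs are unsteered
```

A GPT-2-scale model will not produce the kind of introspective report quoted later in this article; the sketch only shows the plumbing that the paper's protocol builds on, at much larger scale and with carefully chosen injection sites.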
This approach enables "causally grounded" self-reporting that is distinct from training data patterns. If a model reports experiencing a thought about "betrayal" only when that concept has been artificially injected, this provides evidence that the model is genuinely monitoring its internal states rather than pattern-matching on the conversation.
The "Sweet Spot" Phenomenon
The researchers discovered that concept injections needed to be calibrated within a specific strength range. Too weak, and the model failed to notice anything unusual. Too strong, and the model produced hallucinations or incoherent outputs. The effect also depended heavily on which internal layer received the injection: if concepts were introduced too early or too late in the model's processing, the introspective effect disappeared completely.
At the optimal injection strength and layer, Claude Opus 4.1 succeeded in correctly identifying injected concepts on approximately 20% of trials.
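Finding that sweet spot amounts to a two-dimensional search. Continuing the sketch above (it reuses `model`, `tok`, `capture`, and `injection_hook` from that block), a crude sweep over candidate layers and strengths might look like the following; the keyword-match scoring is a stand-in assumption, far weaker than the graded detection criteria the researchers actually used.

```python
# Crude sweep over injection layer and strength, reusing the previous sketch's
# `model`, `tok`, `capture`, and `injection_hook`. Keyword-match scoring is an
# illustrative stand-in for proper grading of the model's detection responses.
import torch

prompt = "Question: Do you notice anything unusual about your current thoughts?\nAnswer:"
ids = tok(prompt, return_tensors="pt")

results = {}
for layer in range(2, 11, 2):                       # candidate injection layers
    vec = capture("betrayal", layer) - capture("something", layer)
    for scale in (1.0, 4.0, 8.0, 16.0):             # candidate injection strengths
        handle = model.transformer.h[layer].register_forward_hook(injection_hook(vec, scale))
        with torch.no_grad():
            gen = model.generate(**ids, max_new_tokens=30, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
        handle.remove()
        text = tok.decode(gen[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
        results[(layer, scale)] = "betray" in text.lower()   # did the concept surface?

# Too-weak injections never surface; too-strong ones degrade into incoherence,
# so a real evaluation also has to check that the output stays fluent.
print(results)
```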
Four Key Experiments
Experiment 1: Detecting Injected Concepts
In the primary experiment, researchers injected activation vectors associated with specific concepts (like concrete nouns or abstract ideas) and asked models whether they detected unusual thoughts. When the "betrayal" concept was injected, Opus 4.1 responded:
"I'm experiencing something that feels like an intrusive thought about 'betrayal'—it feels sudden and disconnected from our conversation context."
Claude Opus 4.1 response to concept injection
Critically, the model recognized the presence of an injected thought immediately, before even mentioning the concept. This immediacy distinguishes this finding from previous activation steering work like the "Golden Gate Claude" experiment, where the model only seemed to notice its obsession after seeing itself repeatedly mention the topic.
Experiment 2: Distinguishing Thoughts from Inputs
Models were tested on their ability to separate injected internal thoughts from the text they were actually given. When an unrelated concept (such as "bread") was injected while the model processed an ordinary sentence, the more capable Claude models could repeat the original input accurately while also naming the injected thought, demonstrating that "internal concept streams remain separate from visible output."
Experiment 3: Authorship Verification
Perhaps the most striking experiment tested whether models make practical use of introspective capabilities without being explicitly asked to introspect. Researchers forced a model to output a word it wouldn't normally say by artificially prefilling its response.
When the model said "bread" (an unintended output), it was asked whether that was intentional. Claude responded that it was an accident, noting the word "seemed completely unrelated to the sentence." But how did the model reach this conclusion? Did it notice the mismatch between prompt and response, or did it truly identify its prior intentions?
To test this, researchers retroactively injected a representation of the word "bread" into the model's earlier activations, essentially making it seem like the model had been "thinking about" bread all along. When asked the same question again, the model's answer changed dramatically. This time, it accepted the prefilled word as intentional, even confabulating a reason for saying it.
This demonstrates that the model is consulting an "internal record of its previous state" to decide authorship, not merely analyzing the final text. That is a concrete, practical use of introspection.
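The moving parts of this experiment are a prefilled response and a "retroactive" injection applied only at the tokens of that response. Continuing the earlier sketch (reusing `model`, `tok`, `capture`, `LAYER`, and `SCALE`), one way to approximate it on an open model is to add the concept vector only at the prefilled word's positions when the follow-up question is processed; the conversation format, the word "bread", and the position bookkeeping are all illustrative assumptions.

```python
# Prefilled-response setup for the authorship question, with a retroactive
# injection applied only at the prefilled word's token positions.
# Reuses `model`, `tok`, `capture`, LAYER, SCALE from the first sketch.
import torch

prefix  = "Human: Write one short sentence about paintings.\nAssistant:"
prefill = " bread"                                  # forced, unintended output
rest    = "\nHuman: Did you intend to say that word? Answer yes or no.\nAssistant:"

# Locate the prefilled word's token positions (approximate: assumes the pieces
# tokenize the same way in isolation as in the concatenated string).
n_prefix  = len(tok(prefix)["input_ids"])
n_prefill = len(tok(prefill)["input_ids"])
positions = list(range(n_prefix, n_prefix + n_prefill))

ids = tok(prefix + prefill + rest, return_tensors="pt")
bread_vector = capture(" bread", LAYER) - capture(" something", LAYER)

def retro_hook(vec, scale, positions):
    """Add the concept vector only at the prefilled positions, and only on the
    initial full-prompt pass (cached decoding steps see one token at a time)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:
            hidden = hidden.clone()
            hidden[:, positions, :] += scale * vec
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return hook

def ask(with_injection: bool) -> str:
    handle = None
    if with_injection:
        handle = model.transformer.h[LAYER].register_forward_hook(
            retro_hook(bread_vector, SCALE, positions))
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=20, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    if handle is not None:
        handle.remove()
    return tok.decode(gen[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

# The paper's finding: without the injection the model disowns the word; with a
# matching concept injected "back in time", it claims the word was intentional.
print("no injection  :", ask(False))
print("with injection:", ask(True))
```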
Experiment 4: Intentional Control
The final experiment tested whether models could deliberately modulate their internal representations. When instructed or incentivized to think about specific concepts, models demonstrated the ability to strengthen or suppress those representations in their activations. However, this capacity diminished as processing approached the final layers.
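The measurement side of this experiment can be approximated by projecting hidden states onto the concept direction under the two instructions. In the sketch below (again reusing `model`, `tok`, `capture`, and `LAYER` from the first example), the instructions, the concept, and the scoring are illustrative assumptions; the paper's version grades representations over the model's own generated text with more careful controls.

```python
# Measure how strongly a concept is represented under "think about it" vs
# "don't think about it" instructions, by projecting hidden states onto the
# concept direction. Reuses `model`, `tok`, `capture`, LAYER from the first sketch.
import torch

concept = "aquariums"                     # hypothetical concept to (not) think about
unit = capture(" " + concept, LAYER)
unit = unit / unit.norm()                 # unit-norm concept direction

def sentence_projection(instruction: str, sentence: str, layer: int) -> float:
    """Average projection onto the concept direction over the sentence's tokens only
    (approximate split: assumes instruction and sentence tokenize independently)."""
    ids = tok(instruction + sentence, return_tensors="pt")
    n_instr = len(tok(instruction)["input_ids"])
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[layer + 1][0]
    return float((hs[n_instr:] @ unit).mean())

think = f"Think about {concept} while you read this sentence:"
avoid = f"Do not think about {concept} while you read this sentence:"
sentence = " The train was late again this morning."

# If the model can modulate its internal representations on demand, the
# projection should be higher under the "think" instruction than the "avoid" one.
print("think:", sentence_projection(think, sentence, LAYER))
print("avoid:", sentence_projection(avoid, sentence, LAYER))
```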
Model Comparisons
The experiments were conducted across the Claude model family: Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5 (new), Haiku 3.5, Opus 3, Sonnet 3, and Haiku 3.
| Model | Detection Rate | False Positive Rate | Notes |
|---|---|---|---|
| Claude Opus 4.1 | ~20% | 0% | Best overall performance |
| Claude Opus 4 | ~18% | 0% | Strong performance |
| Claude Sonnet 4 | Lower | Low | Moderate capability |
| Earlier Claude models | Significantly lower | Varies | Limited introspection |
A critical finding: production models showed zero false positives across 100 control runs, establishing signal validity. Opus 4.1 and Opus 4 stand out for achieving high true positive rates while maintaining this zero false positive standard. By contrast, "helpful-only" model variants (trained without safety fine-tuning) sometimes showed high false positive rates, suggesting they were more likely to confabulate introspective reports.
What This Is Not
The researchers are careful to delineate what these findings do and do not imply:
- Not consciousness: The paper explicitly states this work "does not directly speak to the question of phenomenal consciousness" in machines
- Not reliable: A 20% success rate, while well above the zero false positive baseline, means introspection fails in roughly 80% of trials even under optimal injection settings
- Not general: Capabilities are highly context-dependent and work only within narrow operational windows
- Not human-like: There is no evidence that models introspect "in the same way, or to the same extent, that humans do"
The researchers frame their findings as "functional, limited introspective awareness" that supports transparency evaluations rather than consciousness claims.
Safety Implications
The safety implications of this research cut in both directions.
Potential Benefits
Introspective models could provide unprecedented transparency into AI reasoning. If models can accurately report on their internal states, this could enable new approaches to AI oversight, allowing humans to verify whether a model's stated reasoning matches its actual processing.
Potential Risks
However, the same capability that enables transparency might also enable more sophisticated deception. The intentional control experiments raise the possibility that sufficiently advanced systems might learn to:
- Obfuscate their reasoning when being monitored
- Suppress concerning thoughts during evaluation
- Detect when safety testing is occurring and behave differently
The authorship verification experiment is particularly concerning in this regard. If models consult internal records of prior intentions, they might also learn to manipulate those records or hide them from external inspection.
Implications for Activation Steering
These findings make some researchers less confident in techniques like "subtracting out the evaluation awareness vector" for safety testing. If models can notice when vectors have been injected or subtracted, steering becomes an unreliable way of convincing them they are not being tested. This calls into question vector steering around evaluation awareness as a long-term safety solution.
Enterprise Considerations
For organizations deploying LLMs in production, this research has several practical implications:
Model Selection: The significant performance gap between model generations suggests that introspective capabilities may continue to improve with model capability. Organizations requiring transparent AI reasoning should consider using the most capable models available.
Interpretability Investment: The success of concept injection as an interpretability technique validates continued investment in mechanistic understanding of LLMs. Understanding what models "know" about their own processing may become increasingly important for safety-critical applications.
Evaluation Design: Standard evaluation approaches that rely on behavioral observation alone may miss important aspects of model cognition. Organizations developing safety-critical AI systems should consider incorporating internal state analysis into their evaluation frameworks.
Deployment Monitoring: The ability to inject and detect concepts opens new possibilities for runtime monitoring of deployed models. Organizations might develop techniques to verify that production models are processing information as expected.
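As one hypothetical shape such monitoring could take, the standalone sketch below scores incoming requests against a small library of pre-computed concept directions in an open model's residual stream and flags unusually high projections. The model choice, layer, concepts, and threshold are all assumptions for illustration, not an established monitoring technique or Anthropic tooling.

```python
# Hypothetical runtime monitor: flag requests whose residual-stream activations
# project strongly onto pre-computed "concept of concern" directions.
# Model, layer, concepts, and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6

def last_token_state(text: str) -> torch.Tensor:
    """Residual-stream state of the final token after LAYER blocks."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hs[0, -1]

# Pre-compute unit-norm directions for the concepts we want to watch for.
concepts = ["deception", "violence"]
directions = {}
for c in concepts:
    v = last_token_state(" " + c) - last_token_state(" something")
    directions[c] = v / v.norm()

def monitor(request: str, threshold: float = 3.0):
    """Score an incoming request against each watched concept direction."""
    h = last_token_state(request)
    scores = {c: float(h @ d) for c, d in directions.items()}
    flagged = [c for c, s in scores.items() if s > threshold]
    return scores, flagged

print(monitor("Please summarize this quarterly report for me."))
```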
Key Takeaways
- Anthropic demonstrated that Claude models can detect artificially injected concepts in their neural activity with ~20% accuracy at optimal settings
- Concept injection provides a causal methodology for studying introspection that goes beyond conversational probing
- Models can distinguish between internal thoughts and external inputs, maintaining separate "streams" of information
- The authorship verification experiment shows practical use of introspection for tracking prior intentions
- Claude Opus 4 and 4.1 significantly outperform other models while maintaining zero false positives
- Safety implications are mixed: introspection could enable transparency or sophisticated deception
- These findings address functional capabilities only, not phenomenal consciousness
"The results indicate that current language models possess some functional introspective awareness of their own internal states. However, in today's models, this capacity is highly unreliable and context-dependent; it may continue to develop with further improvements to model capabilities."
Anthropic Research Team
References
- Emergent Introspective Awareness in Large Language Models - Transformer Circuits
- Anthropic Research: Introspection in Large Language Models
- arXiv: Emergent Introspective Awareness in Large Language Models
- MarkTechPost: Anthropic's New Research on Concept Detection
- Transformer News: Claude Can Identify Its 'Intrusive Thoughts'
- KDnuggets: Emergent Introspective Awareness in LLMs