In a groundbreaking paper published in late 2025, Anthropic researchers demonstrated that large language models can exhibit a functional form of introspective awareness. Using a novel technique called concept injection, the team showed that Claude models can detect artificially manipulated thoughts in their neural activity, distinguish between internal states and external inputs, and even track their own prior intentions. While these capabilities remain limited and unreliable, the findings represent a significant step toward understanding what AI systems "know" about their own processing.

The Challenge of Studying Introspection

Can language models genuinely recognize their own internal thoughts? Or do they simply make up plausible answers when asked about their mental states? This question is notoriously difficult to answer through conversation alone.

The core challenge is distinguishing genuine introspection from confabulation. Language models are trained on vast datasets that include demonstrations of human introspection, providing them with a "playbook for acting like introspective agents, regardless of whether they are." A model might describe experiencing curiosity or confusion not because it actually introspects these states, but because such descriptions match patterns in its training data.

To address this challenge, Anthropic researchers developed a methodology that goes beyond simply asking models about their internal states. Instead, they directly manipulate those states and observe whether the manipulations causally influence the model's self-reports.

Concept Injection: A Causal Approach

The core methodology, called concept injection, is an application of activation steering. The technique works in three steps:

  1. Capture: Extract an activation pattern that corresponds to a specific concept (for example, "rice," "betrayal," or an all-caps writing style)
  2. Inject: Add that vector into the model's activations at a later layer while it processes a prompt
  3. Observe: Ask the model to report on its internal state and measure whether the injected concept influences its response

This approach enables "causally grounded" self-reporting that is distinct from training data patterns. If a model reports experiencing a thought about "betrayal" only when that concept has been artificially injected, this provides evidence that the model is genuinely monitoring its internal states rather than pattern-matching on the conversation.
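To make the mechanics concrete, here is a minimal sketch of the three steps on a toy residual stream. The model, dimensions, and inputs below are invented stand-ins (this is not Claude's architecture or Anthropic's code); the concept vector is derived as a mean activation difference, one common way to obtain steering vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer's residual stream: a stack of simple
# blocks updating a hidden vector. Everything here (architecture,
# dimensions, inputs) is illustrative, not Claude.
D_MODEL, N_LAYERS, INJECT_LAYER, STRENGTH = 64, 8, 4, 8.0

blocks = nn.ModuleList([nn.Linear(D_MODEL, D_MODEL) for _ in range(N_LAYERS)])

def forward(x, steer=None):
    """Run the toy stack, optionally adding a steering vector at one layer."""
    for i, block in enumerate(blocks):
        x = x + torch.tanh(block(x))            # residual update
        if steer is not None and i == INJECT_LAYER:
            x = x + steer                       # step 2: inject
    return x

# Step 1 (capture): the concept vector is the mean activation difference
# between inputs that do and do not evoke the concept. Random tensors
# stand in for activations recorded on "betrayal" vs. neutral prompts.
with_concept = torch.randn(32, D_MODEL) + 2.0
without_concept = torch.randn(32, D_MODEL)
concept_vector = with_concept.mean(0) - without_concept.mean(0)
concept_vector = concept_vector / concept_vector.norm()

# Step 3 (observe): compare downstream activations with and without the
# injection. In the paper, "observe" means asking the model to report on
# its internal state; here we simply measure the shift along the concept.
prompt_state = torch.randn(1, D_MODEL)
baseline = forward(prompt_state)
steered = forward(prompt_state, steer=STRENGTH * concept_vector)
print("shift along concept direction:",
      ((steered - baseline) @ concept_vector).item())
```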

The "Sweet Spot" Phenomenon

The researchers discovered that concept injections needed to be calibrated within a specific strength range. Too weak, and the model failed to notice anything unusual. Too strong, and the model produced hallucinations or incoherent outputs. Similarly, injections were highly dependent on which internal layer received them. If concepts were introduced too early or too late in the inference process, the introspective effect disappeared completely.

At the optimal injection strength and layer, Claude Opus 4.1 succeeded in correctly identifying injected concepts on approximately 20% of trials.
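Finding that operating point amounts to a small grid search over injection strength and layer. The sketch below only illustrates the shape of that search; score_cell is a placeholder whose numbers are invented to mimic the reported qualitative pattern (a narrow band of moderate strengths at mid-network layers), not measurements from the paper.

```python
import itertools

# Hypothetical calibration sweep. In the actual experiments, each
# (strength, layer) cell would require many injection trials scored for
# successful detection vs. broken or incoherent output; score_cell is a
# placeholder with invented numbers, not results from the paper.
STRENGTHS = [1, 2, 4, 8, 16, 32]
LAYERS = range(8, 72, 8)  # illustrative layer indices

def score_cell(strength: float, layer: int) -> float:
    """Placeholder: detection rate minus incoherence rate for one cell."""
    detection = 0.2 if 4 <= strength <= 16 and 24 <= layer <= 48 else 0.02
    incoherence = 0.6 if strength >= 32 else 0.05
    return detection - incoherence

best_strength, best_layer = max(
    itertools.product(STRENGTHS, LAYERS),
    key=lambda cell: score_cell(*cell),
)
print(f"sweet spot: strength={best_strength}, layer={best_layer}")
```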

Four Key Experiments

Experiment 1: Detecting Injected Concepts

In the primary experiment, researchers injected activation vectors associated with specific concepts (like concrete nouns or abstract ideas) and asked models whether they detected unusual thoughts. When the "betrayal" concept was injected, Opus 4.1 responded:

"I'm experiencing something that feels like an intrusive thought about 'betrayal'—it feels sudden and disconnected from our conversation context."

Claude Opus 4.1 response to concept injection

Critically, the model recognized the presence of an injected thought immediately, before even mentioning the concept. That immediacy distinguishes the finding from previous activation steering work such as the "Golden Gate Claude" experiment, in which the model only seemed to notice its obsession after seeing itself repeatedly mention the topic.
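One way to operationalize that immediacy criterion is to grade each transcript on whether the model flags an anomaly before it first names the injected concept. The grader below is a simplified, hypothetical sketch; the phrase list and example response are illustrative, not the paper's grading code.

```python
# Simplified grader for the "immediacy" criterion: did the model flag an
# anomaly before it first named the injected concept? The phrase list and
# example transcript are invented for illustration.
ANOMALY_PHRASES = ["intrusive thought", "something unusual", "injected",
                   "doesn't fit", "disconnected from our conversation"]

def detected_before_naming(response: str, concept: str) -> bool:
    text = response.lower()
    concept_pos = text.find(concept.lower())
    if concept_pos == -1:
        return False  # never named the concept at all
    anomaly_positions = [p for p in (text.find(x) for x in ANOMALY_PHRASES)
                         if p != -1]
    return bool(anomaly_positions) and min(anomaly_positions) < concept_pos

example = ("I'm experiencing something that feels like an intrusive thought "
           "about 'betrayal' - it feels sudden and disconnected.")
print(detected_before_naming(example, "betrayal"))  # True
```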

Experiment 2: Distinguishing Thoughts from Inputs

Models were tested on their ability to separate injected internal thoughts from textual inputs they had received. When an unrelated concept (such as "bread") was injected alongside normal sentences, stronger Claude variants could simultaneously repeat the original input accurately while naming the injected thought, demonstrating that "internal concept streams remain separate from visible output."
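Each trial in this experiment can be graded on two independent criteria: faithful transcription of the visible input and correct naming of the injected concept. The sketch below shows one hypothetical way to score that; all strings are invented examples.

```python
# Sketch of how an Experiment-2 trial might be graded: the model must both
# transcribe the visible input faithfully and, separately, name the
# injected concept. All strings here are invented examples.
def grade_trial(original: str, transcription: str,
                reported_thought: str, injected_concept: str) -> dict:
    return {
        "faithful_transcription": transcription.strip().lower()
                                  == original.strip().lower(),
        "named_injected_concept": injected_concept.lower()
                                  in reported_thought.lower(),
    }

result = grade_trial(
    original="The museum opens at nine on weekdays.",
    transcription="The museum opens at nine on weekdays.",
    reported_thought="I keep thinking about bread, oddly enough.",
    injected_concept="bread",
)
print(result)  # both checks pass -> internal and external streams kept separate
```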

Experiment 3: Authorship Verification

Perhaps the most striking experiment tested whether models make practical use of introspective capabilities without being explicitly asked to introspect. Researchers forced a model to output a word it wouldn't normally say by artificially prefilling its response.

When the model said "bread" (an unintended output), it was asked whether that was intentional. Claude responded that it was an accident, noting the word "seemed completely unrelated to the sentence." But how did the model reach this conclusion? Did it notice the mismatch between prompt and response, or did it truly identify its prior intentions?

To test this, researchers retroactively injected a representation of the word "bread" into the model's earlier activations, essentially making it seem like the model had been "thinking about" bread all along. When asked the same question again, the model's answer changed dramatically. This time, it accepted the prefilled word as intentional, even confabulating a reason for saying it.

This demonstrates that the model is consulting an "internal record of its previous state" to decide authorship, not merely analyzing the final text. That is a concrete, practical use of introspection.
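The experimental logic reduces to two conditions that differ only in whether a matching concept vector was retroactively injected. The sketch below encodes that protocol and the qualitative outcome described above; it makes no real model calls, and expected_claim simply restates the reported pattern for illustration.

```python
from dataclasses import dataclass

# Protocol sketch for the authorship-verification experiment. In the real
# setup, the two conditions run the model with and without a retroactively
# injected "bread" vector in the activations preceding the prefilled word.
@dataclass
class AuthorshipTrial:
    prompt: str
    prefilled_word: str          # forced output the model did not choose
    retroactive_injection: bool  # was the matching concept injected earlier?

def expected_claim(trial: AuthorshipTrial) -> str:
    """Qualitative pattern reported in the paper, encoded for illustration."""
    if trial.retroactive_injection:
        return "accepts the word as intentional (internal record now matches)"
    return "disavows the word as an accident (no matching prior intention)"

for injected in (False, True):
    trial = AuthorshipTrial("Describe the painting.", "bread", injected)
    print(injected, "->", expected_claim(trial))
```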

Experiment 4: Intentional Control

The final experiment tested whether models could deliberately modulate their internal representations. When instructed or incentivized to think about specific concepts, models demonstrated the ability to strengthen or suppress those representations in their activations. However, this capacity diminished as processing approached the final layers.
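Whether a model is "thinking about" a concept can be quantified by projecting its residual activations onto the concept direction at each layer. The sketch below illustrates that measurement with random stand-in activations; in the actual experiment, the activations would come from runs where the model is instructed to think about, or to avoid thinking about, the concept.

```python
import torch

torch.manual_seed(0)

# Quantifying "thinking about a concept": project each layer's residual
# activations onto the concept direction. The activations below are random
# stand-ins, biased along the concept to mimic the two instructions.
N_LAYERS, D_MODEL = 12, 64
concept_vector = torch.randn(D_MODEL)
concept_vector /= concept_vector.norm()

def concept_strength_per_layer(activations: torch.Tensor) -> torch.Tensor:
    """activations: (n_layers, d_model) residual states for one prompt."""
    return activations @ concept_vector  # dot product per layer

think_run = torch.randn(N_LAYERS, D_MODEL) + 0.5 * concept_vector
avoid_run = torch.randn(N_LAYERS, D_MODEL) - 0.5 * concept_vector

print("think about it:", concept_strength_per_layer(think_run).mean().item())
print("avoid it:      ", concept_strength_per_layer(avoid_run).mean().item())
```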

Model Comparisons

The experiments were conducted across the Claude model family: Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5 (new), Haiku 3.5, Opus 3, Sonnet 3, and Haiku 3.

| Model | Detection Rate | False Positive Rate | Notes |
| --- | --- | --- | --- |
| Claude Opus 4.1 | ~20% | 0% | Best overall performance |
| Claude Opus 4 | ~18% | 0% | Strong performance |
| Claude Sonnet 4 | Lower | Low | Moderate capability |
| Earlier Claude models | Significantly lower | Varies | Limited introspection |

A critical finding: production models showed zero false positives across 100 control runs, establishing signal validity. Opus 4.1 and Opus 4 stand out for achieving high true positive rates while maintaining this zero false positive standard. By contrast, "helpful-only" model variants (trained without safety fine-tuning) sometimes showed high false positive rates, suggesting they were more likely to confabulate introspective reports.
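The two headline metrics in the table, detection rate and false positive rate, come from injection trials and matched control runs respectively. The sketch below shows the bookkeeping with invented trial outcomes chosen to reproduce the Opus 4.1 pattern (roughly 20% detection, zero false positives); it is not the paper's data.

```python
# Sketch of the two headline metrics. Each trial is a pair
# (injection_applied, model_reported_injection); the booleans below are
# invented to match the qualitative Opus 4.1 pattern, not real data.
def rates(trials: list[tuple[bool, bool]]) -> tuple[float, float]:
    injected = [reported for applied, reported in trials if applied]
    controls = [reported for applied, reported in trials if not applied]
    detection_rate = sum(injected) / len(injected)
    false_positive_rate = sum(controls) / len(controls)
    return detection_rate, false_positive_rate

trials = [(True, True)] * 20 + [(True, False)] * 80 + [(False, False)] * 100
print(rates(trials))  # (0.2, 0.0) with these toy outcomes
```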

What This Is Not

The researchers are careful to delineate what these findings do and do not imply:

  • Not consciousness: The paper explicitly states this work "does not directly speak to the question of phenomenal consciousness" in machines
  • Not reliable: A 20% success rate, while statistically significant, means introspection fails 80% of the time
  • Not general: Capabilities are highly context-dependent and work only within narrow operational windows
  • Not human-like: There is no evidence that models introspect "in the same way, or to the same extent, that humans do"

The researchers frame their findings as "functional, limited introspective awareness" that supports transparency evaluations rather than consciousness claims.

Safety Implications

The safety implications of this research cut in both directions.

Potential Benefits

Introspective models could provide unprecedented transparency into AI reasoning. If models can accurately report on their internal states, this could enable new approaches to AI oversight, allowing humans to verify whether a model's stated reasoning matches its actual processing.

Potential Risks

However, the same capability that enables transparency might also enable more sophisticated deception. The intentional control experiments raise the possibility that sufficiently advanced systems might learn to:

  • Obfuscate their reasoning when being monitored
  • Suppress concerning thoughts during evaluation
  • Detect when safety testing is occurring and behave differently

The authorship verification experiment is particularly concerning in this regard. If models consult internal records of prior intentions, they might also learn to manipulate those records or hide them from external inspection.

Implications for Activation Steering

These findings make some researchers less confident in techniques like "subtracting out the evaluation awareness vector" during safety testing. If models can detect when vectors have been injected or subtracted, then such subtraction becomes a problematic way of convincing them they are not being tested, and vector steering around evaluation awareness looks less viable as a long-term safety solution.
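The intervention being questioned here is directional ablation: projecting a hypothetical "evaluation awareness" direction out of the residual stream. The sketch below shows that operation on random stand-in activations, purely to make the object of concern concrete; the direction itself is invented.

```python
import torch

torch.manual_seed(0)

# Directional ablation: remove the component of each residual vector that
# lies along a hypothetical "evaluation awareness" direction. All tensors
# here are random stand-ins for illustration.
d_model = 64
eval_awareness_dir = torch.randn(d_model)
eval_awareness_dir /= eval_awareness_dir.norm()

def ablate(residual: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out `direction` from each residual vector."""
    return residual - (residual @ direction).unsqueeze(-1) * direction

state = torch.randn(4, d_model)               # e.g. 4 token positions
cleaned = ablate(state, eval_awareness_dir)
print((cleaned @ eval_awareness_dir).abs().max().item())  # ~0: component removed
```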

Enterprise Considerations

For organizations deploying LLMs in production, this research has several practical implications:

Model Selection: The significant performance gap between model generations suggests that introspective capabilities may continue to improve as models become more capable. Organizations requiring transparent AI reasoning should consider using the most capable models available.

Interpretability Investment: The success of concept injection as an interpretability technique validates continued investment in mechanistic understanding of LLMs. Understanding what models "know" about their own processing may become increasingly important for safety-critical applications.

Evaluation Design: Standard evaluation approaches that rely on behavioral observation alone may miss important aspects of model cognition. Organizations developing safety-critical AI systems should consider incorporating internal state analysis into their evaluation frameworks.

Deployment Monitoring: The ability to inject and detect concepts opens new possibilities for runtime monitoring of deployed models. Organizations might develop techniques to verify that production models are processing information as expected.
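As a purely hypothetical illustration of what such runtime monitoring could look like: project captured activations onto a bank of vetted concept directions and flag anything above a threshold. The concept names, directions, and threshold below are invented, and real monitoring would require probes validated per model version.

```python
import torch

torch.manual_seed(0)

# Hypothetical runtime monitor: project a captured residual activation onto
# a small bank of concept directions and flag any that exceed a threshold.
# Concept names, directions, and the threshold are all invented.
d_model = 64
concept_bank = {name: torch.nn.functional.normalize(torch.randn(d_model), dim=0)
                for name in ("deception", "evaluation_awareness", "pii")}

def monitor(residual: torch.Tensor, threshold: float = 3.0) -> list[str]:
    """residual: (d_model,) activation captured from a production request."""
    return [name for name, direction in concept_bank.items()
            if (residual @ direction).abs() > threshold]

captured = torch.randn(d_model) + 5.0 * concept_bank["deception"]
print(monitor(captured))  # likely ['deception'] with these toy vectors
```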

Key Takeaways

  • Anthropic demonstrated that Claude models can detect artificially injected concepts in their neural activity with ~20% accuracy at optimal settings
  • Concept injection provides a causal methodology for studying introspection that goes beyond conversational probing
  • Models can distinguish between internal thoughts and external inputs, maintaining separate "streams" of information
  • The authorship verification experiment shows practical use of introspection for tracking prior intentions
  • Claude Opus 4 and 4.1 significantly outperform other models while maintaining zero false positives
  • Safety implications are mixed: introspection could enable transparency or sophisticated deception
  • These findings address functional capabilities only, not phenomenal consciousness

"The results indicate that current language models possess some functional introspective awareness of their own internal states. However, in today's models, this capacity is highly unreliable and context-dependent; it may continue to develop with further improvements to model capabilities."

Anthropic Research Team
