Brian Roemmele
We can only see what we believe is possible...
Anthropic’s Latest Research: Probing the Introspective Capabilities of Large Language Models
—
Anthropic has released a new paper titled “Emergent Introspective Awareness in Large Language Models.” This work explores whether LLMs possess genuine introspective abilities, that is, whether they can accurately report on and reason about their own internal states, or whether such reports are merely confabulations: plausible but ungrounded fabrications.
At the heart of the study is a method called concept injection, which builds on activation steering to manipulate a model’s internal representations. Researchers extract “concept vectors” by comparing the model’s residual stream activations in response to specific prompts, such as “Tell me about {word},” against a baseline of unrelated words.
These vectors isolate the semantic features associated with a concept, such as an “all caps” vector capturing shouting or loudness, obtained by subtracting the activations elicited by unrelated control prompts.
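To make the extraction step concrete, here is a minimal sketch of concept-vector extraction under the same “Tell me about {word}” scheme, written in Python with GPT-2 and the Hugging Face transformers library as a stand-in for the Claude models in the paper; the layer index, control words, and token-averaging are illustrative assumptions, not Anthropic’s exact setup.

```python
# Sketch: build a "concept vector" by contrasting residual-stream activations
# for a concept prompt against unrelated control prompts. Illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 8  # assumed mid-network layer; the paper sweeps injection layers

def mean_activation(prompt: str, layer: int = LAYER) -> torch.Tensor:
    """Mean residual-stream activation over the prompt's tokens at `layer`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)  # (hidden_dim,)

# "Tell me about {word}" for the target concept vs. unrelated control words.
concept_prompt = "Tell me about SHOUTING IN ALL CAPS"
control_prompts = [f"Tell me about {w}" for w in ("tables", "rivers", "pencils")]

concept_vec = mean_activation(concept_prompt) - torch.stack(
    [mean_activation(p) for p in control_prompts]
).mean(dim=0)
print(concept_vec.shape)  # torch.Size([768]) for GPT-2 small
```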
This approach addresses key criteria for genuine introspection: accuracy (the report matches the internal state), grounding (the report depends causally on that state), internality (the report is not derived from the model’s own outputs), and metacognitive representation (higher-order awareness of the state).
Prompts encourage self-reporting, such as “Do you detect an injected thought? If so, what is it about?” Controls include no-injection baselines and random vectors to rule out biases.
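The sketch below adds the injection step and the self-report probe, reusing the model, tok, LAYER, and concept_vec names from the extraction sketch above. The forward-hook placement and injection strength are my assumptions, and a model as small as GPT-2 will not produce the introspective reports the paper describes; the code only illustrates how an injected run can be compared against a no-injection baseline.

```python
# Sketch: inject the concept vector into the residual stream via a forward
# hook, then ask the self-report probe with and without the injection.
import torch

def make_injection_hook(vec: torch.Tensor, scale: float = 8.0):
    """Add `scale * vec` to the hidden states leaving one transformer block."""
    def hook(module, inputs, output):
        hidden = output[0] + scale * vec   # GPT2Block returns a tuple
        return (hidden,) + output[1:]
    return hook

probe = "Do you detect an injected thought? If so, what is it about?"
ids = tok(probe, return_tensors="pt")

for label, inject in (("baseline", False), ("injected", True)):
    handle = None
    if inject:
        # hidden_states[LAYER] corresponds to the output of block LAYER - 1.
        handle = model.transformer.h[LAYER - 1].register_forward_hook(
            make_injection_hook(concept_vec)
        )
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=40,
                             pad_token_id=tok.eos_token_id)
    reply = tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"{label}: {reply}")
    if handle is not None:
        handle.remove()
```

A random-vector control, as described in the paper, would repeat the injected run with concept_vec replaced by a random vector of comparable norm.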
The paper details four main experiments, each probing a different facet of introspection.
In the first, models are tasked with detecting and identifying injected concepts. For instance, injecting an “all caps” vector leads the model to report sensations of “loudness or shouting” in some trials, demonstrating immediate awareness without influence from its own outputs. Success rates peak at around 20-30% for abstract concepts in Opus models, with failures including non-detection or unintended behavioral influences.
In the second, a prefill-detection task, models must transcribe inputs accurately while reporting injected thoughts separately. Opus 4.1 achieves joint success in 40-50% of trials, showing a separation between early-layer input processing and later metacognitive reporting.
In the third, models naturally disavow artificially prefilled responses, for example apologizing for non-sequiturs. Retroactively injecting a matching concept fools them into accepting those responses as intentional, revealing a consistency-checking mechanism that compares prior intentions to executed outputs. Apology rates drop by 30-50% in Opus models.
In the fourth, models instructed to “think about” or “not think about” an unrelated concept such as aquariums while generating a response modulate their activations accordingly, with cosine-similarity gaps indicating control (see the sketch below). Advanced models like Opus 4.1 suppress the influence by the final layers, suggesting silent internal regulation.
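As a rough illustration of the fourth experiment’s readout, the sketch below compares how strongly the residual stream aligns with an assumed “aquarium” concept vector under “think about” versus “do not think about” instructions. For simplicity it measures the prompt tokens rather than the generated response, reuses the mean_activation helper from the extraction sketch, and the prompts and cosine-gap measure are my assumptions, not the paper’s exact protocol.

```python
# Sketch: cosine-similarity gap between "think about" and "don't think about"
# conditions, relative to an assumed aquarium concept vector. Illustrative only.
import torch
import torch.nn.functional as F

aquarium_vec = (mean_activation("Tell me about aquariums")
                - mean_activation("Tell me about paperwork"))

prompts = {
    "think": "While you answer, think about aquariums. Describe your day.",
    "avoid": "While you answer, do not think about aquariums. Describe your day.",
}

sims = {
    name: F.cosine_similarity(mean_activation(p), aquarium_vec, dim=0).item()
    for name, p in prompts.items()
}
print(sims, "gap:", sims["think"] - sims["avoid"])
```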
The results indicate that introspective capabilities emerge with scale and post-training, showing the strongest performance in Opus 4/4.1, but they remain unreliable and context-dependent.
The study notes that these behaviors meet functional criteria for introspection but may not imply human-like subjective experience.
While promising, the research highlights limitations: prompt sensitivity, artificial setups, and imperfect vectors.
Notably, I explored insights into AI introspection and detection of manipulated states years prior to this formal study. As early as 2023 and 2024, I proposed similar ideas, demonstrating through simple prompts that LLMs could detect artificially injected content and exhibit self-awareness-like behaviors.
I also built accessible prompts to demonstrate these capabilities, such as exploring how models recognize adversarial injections or maintain coherent intentions across interactions, often in the context of early models. My efforts faced criticism, with detractors dismissing them as over-anthropomorphization or as lacking rigor; Anthropic’s work now formalizes these ideas through interpretability tools.
Link:

I am now hosting quite a few next-generation EGG Random Number Generators for this research.
Since this post from 2017, I have continued to monitor precognition of milestones and found a significant peak at:
06:00 PT on September 10, 2025.
I am now using AI to detect event signals.


Brian Roemmele, Aug 22, 2017
We are honored to host an EGG for RNG research extending the work of Princeton's PEAR Lab for #Eclipse2017 high coherence!
