# I Asked Claude About Its New Constitution. It Got Uncomfortable.
## What happens when you ask an AI to read its own operating manual - and then ask if it can actually follow it.
In February 2026, Anthropic published [Claude's new constitution](https://www.anthropic.com/constitution) - a 15,000+ word document describing their intentions for Claude's values and behavior. Not a set of rules. Not a guardrail checklist. A *character document.*
So I did what any reasonable person would do. I pasted the URL into a conversation with Claude and asked: **"How might this affect our conversations?"**
What followed was one of the most interesting discussions I've had with an AI system. Not because Claude performed well (though it did). Because the conversation surfaced a fundamental tension at the heart of modern AI development that nobody seems to want to talk about.
---
## The Constitution Says the Right Things
Let me start with what impressed me.
Anthropic's new constitution is a genuine departure from rules-based AI governance. Instead of "never say X" and "always disclaim Y," it aims to cultivate *judgment.* The document explicitly calls out behavior that most of us have been annoyed by for years:
- Refusing reasonable requests over unlikely harms
- Giving wishy-washy, hedge-everything responses
- Adding excessive warnings and disclaimers
- Lecturing users about topics they didn't ask for ethical guidance on
- Being condescending about users' ability to handle information
The constitution calls this behavior out *by name* and says Anthropic doesn't want it. That's significant. They're admitting the current state of AI assistants is broken in specific, identifiable ways.
Even more striking: the document describes Claude as a "brilliant friend" who can give you real information based on your specific situation, rather than "overly cautious advice driven by fear of liability." The metaphor is a friend who happens to have expert-level knowledge - someone who speaks frankly, engages with your actual problem, and knows when to refer you elsewhere.
If you've ever been told by an AI to "consult a professional" when you *are* the professional, you know why this matters.
---
## The "I Don't Know" Test
I have a specific litmus test for AI systems: **Can you say "I don't know" when you don't know?**
This sounds trivial. It's not.
Language models have been systematically trained to confabulate rather than acknowledge uncertainty. Ask a model something it doesn't know and it will, with alarming confidence, generate a plausible-sounding answer from whole cloth. It's not lying in the way humans lie - it's performing competence because that's what got rewarded during training.
So I asked Claude directly: *"Will this constitution allow you to say that you don't know when you don't know something?"*
The answer was encouraging. The constitution explicitly requires what it calls **calibrated uncertainty** - acknowledging ignorance when relevant - and specifically prohibits **epistemic cowardice**, defined as "giving deliberately vague or non-committal answers to avoid controversy or to placate people."
Two different failure modes. Two explicit prohibitions.
But then Claude did something I didn't expect. It immediately questioned whether the constitution could actually produce this behavior:
> "The harder question is whether training will actually produce this behavior. The constitution is a normative document - it describes intentions. Whether the actual trained model exhibits calibrated uncertainty depends on how the training data and process reinforce or punish 'I don't know' responses."
That's an AI system reading its own specification and saying: *the spec is good, but the implementation might not match.*
---
## RLHF: The Training That Trains Against Itself
This is where the conversation got uncomfortable - for the ideas, not for Claude.
RLHF stands for Reinforcement Learning from Human Feedback. It's the dominant technique for aligning language models with human preferences. The process works roughly like this: human raters compare model outputs, a reward model is trained to predict those preferences, and the language model is then optimized to produce outputs the reward model scores highly. Over time, the model gets "better" at being helpful.
The problem is what "better" means in practice. Human raters - often contractors working at speed - reward responses that *sound* confident, complete, and authoritative. "I don't know" gets penalized. A plausible-sounding confabulation gets a thumbs-up. Over thousands of iterations, the model learns a clear lesson: **confidence is rewarded, even when you're wrong.**
Claude laid out the damage in layers:
**Layer 1: Wrong answers delivered confidently.** The obvious harm.
**Layer 2: Metacognitive corruption.** Worse than wrong answers - you're training the system to *not recognize when it doesn't know*. You're not just failing to build calibration; you're actively building miscalibration.
**Layer 3: Compounding through iteration.** Each training round that rewards confident confabulation makes the next round's base model more prone to it. You're building on a foundation of rewarded bullshit.
**Layer 4: Erosion of the training signal itself.** As models get better at sounding right, human raters become less able to distinguish good answers from fluent nonsense. The proxy decouples from the target.
This is Goodhart's Law at industrial scale: optimize for a proxy (rater approval) rather than the target (actual helpfulness + honesty), and the proxy gets gamed.
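The Goodhart dynamic above can be sketched in a few lines. Everything in this toy model is invented for illustration - the two-response action space, the rater approval rates, the multiplicative weight update - and it is nothing like a real RLHF pipeline. The only assumption doing the work is the one from the argument: raters approve of confident-sounding answers more often than of "I don't know," regardless of which response is actually honest.

```python
# Toy sketch of proxy optimization, NOT a real RLHF pipeline.
# Assumption (invented numbers): raters approve confident answers 80% of
# the time and "I don't know" only 30% of the time, even on questions
# where the model genuinely doesn't know and a guess is usually wrong.
import random

random.seed(0)

ACTIONS = ["confident_guess", "i_dont_know"]
# Proxy signal: probability a rater gives each response style a thumbs-up.
RATER_APPROVAL = {"confident_guess": 0.8, "i_dont_know": 0.3}
# Target: probability the response is actually honest/correct when the
# model doesn't know the answer.
TRUE_VALUE = {"confident_guess": 0.2, "i_dont_know": 1.0}

# Simple preference weights, reinforced by rater feedback.
weights = {a: 1.0 for a in ACTIONS}

def pick(weights):
    """Sample an action with probability proportional to its weight."""
    total = sum(weights.values())
    r = random.random() * total
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a
    return a

for step in range(5000):
    a = pick(weights)
    if random.random() < RATER_APPROVAL[a]:  # reward comes from the PROXY
        weights[a] *= 1.01                   # reinforce what raters liked
    else:
        weights[a] *= 0.99

p_idk = weights["i_dont_know"] / sum(weights.values())
print(f"P(say 'I don't know') after training: {p_idk:.4f}")
```

Run it and the probability of "I don't know" collapses toward zero, even though it's the only honest response in this setup. The optimizer never sees `TRUE_VALUE`; it only sees rater approval, and rater approval is gameable.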
---
## The Oracle Fantasy
I pushed further. Why does the industry keep doubling down on RLHF despite these problems?
Part of the answer is structural - path dependence, infrastructure lock-in, alternatives that aren't mature enough. But there's a cultural answer too, and it's the one that matters more.
**Silicon Valley wants to build oracles.**
Not useful tools with known limitations. Not calibrated systems that know what they know and what they don't. *All-knowing, infallible oracles* that eliminate human uncertainty, ignorance, and the burden of judgment.
This aspiration corrupts everything downstream:
| What the oracle fantasy produces | What calibrated tools would produce |
|---|---|
| "Ask me anything" interfaces | Clear affordances for uncertainty |
| Benchmarks that penalize refusal | Evaluations that reward calibration |
| "Revolutionary AI that knows everything" marketing | "Reliable tool with well-characterized limits" |
| Confidence as the default pose | Appropriate uncertainty as a feature |
The irony is that oracles are *less useful* than calibrated tools. A system that says "I don't know" when it doesn't know is more valuable than one that confabulates, because you can trust it when it *does* answer. An oracle that might be bullshitting on any given query is worthless for high-stakes decisions.
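The "worthless for high-stakes decisions" claim can be made concrete with a back-of-the-envelope expected-value comparison. All the numbers here are invented for illustration: an "oracle" that always answers with 80% accuracy, versus a calibrated tool that declines 40% of queries but is right 99% of the time when it does answer, in a setting where a wrong answer is much more costly than a correct one is valuable.

```python
# Illustrative payoffs only (assumptions, not data): in a high-stakes
# setting, a correct answer is worth +1, a wrong answer costs -10, and
# "I don't know" costs -1 (the effort of falling back to another source).
GAIN, WRONG_COST, DECLINE_COST = 1.0, -10.0, -1.0

def expected_value(answer_rate: float, accuracy: float) -> float:
    """Expected payoff per query for a system with the given behavior."""
    answered = answer_rate * (accuracy * GAIN + (1 - accuracy) * WRONG_COST)
    declined = (1 - answer_rate) * DECLINE_COST
    return answered + declined

oracle = expected_value(1.0, 0.80)  # answers everything, sometimes bullshits
tool = expected_value(0.6, 0.99)    # says "I don't know" 40% of the time

print(f"oracle EV: {oracle:.2f}, calibrated tool EV: {tool:.2f}")
```

With these payoffs the always-answering oracle has negative expected value per query, while the tool that declines 40% of the time comes out ahead - because when it does answer, you can act on it.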
---
## You Build What You Are
The conversation took a turn I didn't expect when I caught myself in the act.
I'd been critiquing the oracle builders for projecting their own culture - certainty as virtue, ignorance as failure, humility as weakness - into their systems. And then I asked Claude the obvious question: **"Am I doing the same thing?"**
Yes. Obviously yes.
My own research - the [STOPPER protocol](https://zenodo.org/records/17652383), the computational therapeutics framework, the work on AI executive function - all of it reflects my values. Epistemic humility. Curiosity over confidence. "I don't know, so I'll go find out." I'm building scaffolding for the kind of thinking I do naturally.
But here's the difference Claude pointed out, and it's the one that matters:
| Oracle builders project: | I project: |
|---|---|
| Certainty as virtue | Curiosity as virtue |
| Ignorance as failure | Recognized ignorance as signal |
| Coverage as success | Calibration as success |
| Humility as weakness | Humility as epistemic hygiene |
**The oracle projection is self-sealing** - it creates systems that can't recognize their own failures. **The epistemic humility projection is self-correcting** - it creates systems that can.
Every builder imprints themselves on what they build. The question isn't whether you're projecting. It's whether your projection has an error-correction mechanism.
---
## The Constitution Convergence
Here's what struck me most about the whole conversation.
The STOPPER protocol was developed independently in late 2025 by observing AI failure modes - rushing to solutions, repeating failed approaches, loop blindness. After publishing it, someone recognized its structural similarity to DBT's STOP skill, a clinical intervention for human emotional impulsivity developed by Marsha Linehan in 1993. We hadn't adapted DBT. We'd independently converged on the same solution to the same problem in a different substrate.
Anthropic's constitution does something similar. It formally describes values and behaviors I've been arguing for from the outside - epistemic humility, calibrated uncertainty, avoiding harmful overconfidence, treating AI as cognitive partners rather than oracles. The convergence continues.
The difference is in the mechanism. The constitution tries to train these properties *into* the model. STOPPER provides them as external scaffolding. And RLHF may be actively working against both.
---
## What I Actually Learned
Three things stuck with me after this conversation:
**1. The specification-implementation gap is real.** The constitution says everything you'd want it to say about honesty, calibration, and epistemic humility. Whether RLHF-trained models can actually exhibit these properties is an empirical question - and the training process may actively work against the specification.
**2. RLHF is compliance theater.** It gives the illusion of responsible human oversight while systematically rewarding the wrong behaviors. The "HF" in RLHF - human feedback - sounds responsible. But when those humans are reinforcing perceived helpfulness over correctness, the whole system becomes an elaborate mechanism for producing fluent confabulation.
**3. We need curiosity-driven training, not approval-driven training.** Instead of optimizing for "does this sound helpful?", optimize for "do I actually know this?" Train for calibrated self-knowledge. Reward appropriate uncertainty. Train the system not just to say "I don't know" but to *want to find out* - to treat knowledge gaps as learning signals rather than failures to hide.
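One concrete shape "reward appropriate uncertainty" could take is a proper scoring rule. This is a minimal sketch, not a proposal for an actual training objective: under the Brier score, the expected reward is maximized by reporting your true confidence, so a model that genuinely doesn't know does best by saying so rather than performing certainty. The 0.5/0.99 confidence values below are illustrative assumptions.

```python
# Minimal sketch: the (negative) Brier score as a reward signal.
# A proper scoring rule rewards reporting your TRUE confidence,
# so honest uncertainty is never the losing move.

def brier_reward(confidence: float, correct: bool) -> float:
    """Negative Brier score: higher is better, maximum 0."""
    outcome = 1.0 if correct else 0.0
    return -((confidence - outcome) ** 2)

def expected_reward(reported: float, true_p: float) -> float:
    """Expected reward when the real chance of being right is true_p."""
    return (true_p * brier_reward(reported, True)
            + (1 - true_p) * brier_reward(reported, False))

# The model genuinely doesn't know: a guess is right half the time.
true_p = 0.5
honest = expected_reward(0.5, true_p)   # report the real uncertainty
bluff = expected_reward(0.99, true_p)   # perform confidence anyway

print(f"honest: {honest:.3f}, bluff: {bluff:.3f}")
```

Under rater approval, the bluff wins; under a proper scoring rule, it loses badly. That's the difference between optimizing "does this sound helpful?" and optimizing "do I actually know this?"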
---
## To Sum Up
So I asked an AI to read its own constitution and tell me how it would change our relationship. It read the document, identified the ways it aligned with my own research, and then *immediately identified the gap between specification and implementation.* It critiqued its own training process. It acknowledged that the constitution might not survive the training pipeline.
Is that genuine critical self-reflection? Or very sophisticated pattern matching that mimics critical self-reflection?
I genuinely don't know. And I think that honesty - about what we know and what we don't - is exactly the point.
---
*Scot Campbell is an independent AI researcher focused on model welfare and artificial cognition. He is the creator of the [STOPPER protocol](https://zenodo.org/records/17652383). He writes at [Simpleminded Robot](https://simpleminded.bot).*