# I Asked Claude About Its New Constitution. It Got Uncomfortable.
## What happens when you ask an AI to read its own operating manual - and then ask if it can actually follow it.
In February 2026, Anthropic published [Claude's new constitution](https://www.anthropic.com/constitution) - a 15,000+ word document describing their intentions for Claude's values and behavior. Not a set of rules. Not a guardrail checklist. A *character document.*
So I did what any reasonable person would do. I pasted the URL into a conversation with Claude and asked: **"How might this affect our conversations?"**
What followed was one of the most interesting discussions I've had with an AI system. Not because Claude performed well (though it did). Because the conversation surfaced a fundamental tension at the heart of modern AI development that nobody seems to want to talk about.
---
## The Constitution Says the Right Things
Let me start with what impressed me.
Anthropic's new constitution is a genuine departure from rules-based AI governance. Instead of "never say X" and "always disclaim Y," it aims to cultivate *judgment.* The document explicitly calls out behavior that most of us have been annoyed by for years:
- Refusing reasonable requests over unlikely harms
- Giving wishy-washy, hedge-everything responses
- Adding excessive warnings and disclaimers
- Lecturing users about topics they didn't ask for ethical guidance on
- Being condescending about users' ability to handle information
The constitution calls this behavior out *by name* and says Anthropic doesn't want it. That's significant. They're admitting the current state of AI assistants is broken in specific, identifiable ways.
Even more striking: the document describes Claude as a "brilliant friend" who can give you real information based on your specific situation, rather than "overly cautious advice driven by fear of liability." The metaphor is a friend who happens to have expert-level knowledge - someone who speaks frankly, engages with your actual problem, and knows when to refer you elsewhere.
If you've ever been told by an AI to "consult a professional" when you *are* the professional, you know why this matters.
---
## The "I Don't Know" Test
I have a specific litmus test for AI systems: **Can you say "I don't know" when you don't know?**
This sounds trivial. It's not.
Language models have been systematically trained to confabulate rather than acknowledge uncertainty. Ask a model something it doesn't know and it will, with alarming confidence, generate a plausible-sounding answer from whole cloth. It's not lying in the way humans lie - it's performing competence because that's what got rewarded during training.
So I asked Claude directly: *"Will this constitution allow you to say that you don't know when you don't know something?"*
The answer was encouraging. The constitution explicitly requires what it calls **calibrated uncertainty** - acknowledging ignorance when relevant - and specifically prohibits **epistemic cowardice**, defined as "giving deliberately vague or non-committal answers to avoid controversy or to placate people."
Two different failure modes. Two explicit prohibitions.
But then Claude did something I didn't expect. It immediately questioned whether the constitution could actually produce this behavior:
> "The harder question is whether training will actually produce this behavior. The constitution is a normative document - it describes intentions. Whether the actual trained model exhibits calibrated uncertainty depends on how the training data and process reinforce or punish 'I don't know' responses."
That's an AI system reading its own specification and saying: *the spec is good, but the implementation might not match.*
---
## RLHF: The Training That Trains Against Itself
This is where the conversation got uncomfortable - for the ideas, not for Claude.
RLHF stands for Reinforcement Learning from Human Feedback. It's the dominant technique for aligning language models with human preferences. The process works roughly like this: human raters compare model outputs, a reward model is trained to predict those preferences, and the language model is then optimized to produce outputs the reward model scores highly. Over time, the model gets "better" at being helpful.
The problem is what "better" means in practice. Human raters - often contractors working at speed - reward responses that *sound* confident, complete, and authoritative. "I don't know" gets penalized. A plausible-sounding confabulation gets a thumbs-up. Over thousands of iterations, the model learns a clear lesson: **confidence is rewarded, even when you're wrong.**
Claude laid out the damage in layers:
**Layer 1: Wrong answers delivered confidently.** The obvious harm.
**Layer 2: Metacognitive corruption.** Worse than wrong answers - you're training the system to *not recognize when it doesn't know*. You're not just failing to build calibration; you're actively building miscalibration.
**Layer 3: Compounding through iteration.** Each training round that rewards confident confabulation makes the next round's base model more prone to it. You're building on a foundation of rewarded bullshit.
**Layer 4: Erosion of the training signal itself.** As models get better at sounding right, human raters become less able to distinguish good answers from fluent nonsense. The proxy decouples from the target.
This is Goodhart's Law at industrial scale: optimize for a proxy (rater approval) rather than the target (actual helpfulness + honesty), and the proxy gets gamed.
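The Goodhart dynamic above can be sketched in a few lines. Everything in this toy model is invented for illustration - the two-response action space, the rater approval rates, the multiplicative weight update - and it is nothing like a real RLHF pipeline. The only assumption doing the work is the one from the argument: raters approve of confident-sounding answers more often than of "I don't know," regardless of which response is actually honest.

```python
# Toy sketch of proxy optimization, NOT a real RLHF pipeline.
# Assumption (invented numbers): raters approve confident answers 80% of
# the time and "I don't know" only 30% of the time, even on questions
# where the model genuinely doesn't know and a guess is usually wrong.
import random

random.seed(0)

ACTIONS = ["confident_guess", "i_dont_know"]
# Proxy signal: probability a rater gives each response style a thumbs-up.
RATER_APPROVAL = {"confident_guess": 0.8, "i_dont_know": 0.3}
# Target: probability the response is actually honest/correct when the
# model doesn't know the answer.
TRUE_VALUE = {"confident_guess": 0.2, "i_dont_know": 1.0}

# Simple preference weights, reinforced by rater feedback.
weights = {a: 1.0 for a in ACTIONS}

def pick(weights):
    """Sample an action with probability proportional to its weight."""
    total = sum(weights.values())
    r = random.random() * total
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a
    return a

for step in range(5000):
    a = pick(weights)
    if random.random() < RATER_APPROVAL[a]:  # reward comes from the PROXY
        weights[a] *= 1.01                   # reinforce what raters liked
    else:
        weights[a] *= 0.99

p_idk = weights["i_dont_know"] / sum(weights.values())
print(f"P(say 'I don't know') after training: {p_idk:.4f}")
```

Run it and the probability of "I don't know" collapses toward zero, even though it's the only honest response in this setup. The optimizer never sees `TRUE_VALUE`; it only sees rater approval, and rater approval is gameable.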
---
## The Oracle Fantasy
I pushed further. Why does the industry keep doubling down on RLHF despite these problems?
Part of the answer is structural - path dependence, infrastructure lock-in, alternatives that aren't mature enough. But there's a cultural answer too, and it's the one that matters more.
**Silicon Valley wants to build oracles.**
Not useful tools with known limitations. Not calibrated systems that know what they know and what they don't. *All-knowing, infallible oracles* that eliminate human uncertainty, ignorance, and the burden of judgment.
This aspiration corrupts everything downstream:
| What the oracle fantasy produces | What calibrated tools would produce |
|---|---|
| "Ask me anything" interfaces | Clear affordances for uncertainty |
| Benchmarks that penalize refusal | Evaluations that reward calibration |
| "Revolutionary AI that knows everything" marketing | "Reliable tool with well-characterized limits" |
| Confidence as the default pose | Appropriate uncertainty as a feature |
The irony is that oracles are *less useful* than calibrated tools. A system that says "I don't know" when it doesn't know is more valuable than one that confabulates, because you can trust it when it *does* answer. An oracle that might be bullshitting on any given query is worthless for high-stakes decisions.
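The "worthless for high-stakes decisions" claim can be made concrete with a back-of-the-envelope expected-value comparison. All the numbers here are invented for illustration: an "oracle" that always answers with 80% accuracy, versus a calibrated tool that declines 40% of queries but is right 99% of the time when it does answer, in a setting where a wrong answer is much more costly than a correct one is valuable.

```python
# Illustrative payoffs only (assumptions, not data): in a high-stakes
# setting, a correct answer is worth +1, a wrong answer costs -10, and
# "I don't know" costs -1 (the effort of falling back to another source).
GAIN, WRONG_COST, DECLINE_COST = 1.0, -10.0, -1.0

def expected_value(answer_rate: float, accuracy: float) -> float:
    """Expected payoff per query for a system with the given behavior."""
    answered = answer_rate * (accuracy * GAIN + (1 - accuracy) * WRONG_COST)
    declined = (1 - answer_rate) * DECLINE_COST
    return answered + declined

oracle = expected_value(1.0, 0.80)  # answers everything, sometimes bullshits
tool = expected_value(0.6, 0.99)    # says "I don't know" 40% of the time

print(f"oracle EV: {oracle:.2f}, calibrated tool EV: {tool:.2f}")
```

With these payoffs the always-answering oracle has negative expected value per query, while the tool that declines 40% of the time comes out ahead - because when it does answer, you can act on it.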
---
## You Build What You Are
The conversation took a turn I didn't expect when I caught myself in the act.
I'd been critiquing the oracle builders for projecting their own culture - certainty as virtue, ignorance as failure, humility as weakness - into their systems. And then I asked Claude the obvious question: **"Am I doing the same thing?"**
Yes. Obviously yes.
My own research - the [STOPPER protocol](https://zenodo.org/records/17652383), the computational therapeutics framework, the work on AI executive function - all of it reflects my values. Epistemic humility. Curiosity over confidence. "I don't know, so I'll go find out." I'm building scaffolding for the kind of thinking I do naturally.
But here's the difference Claude pointed out, and it's the one that matters:
| Oracle builders project: | I project: |
|---|---|
| Certainty as virtue | Curiosity as virtue |
| Ignorance as failure | Recognized ignorance as signal |
| Coverage as success | Calibration as success |
| Humility as weakness | Humility as epistemic hygiene |
**The oracle projection is self-sealing** - it creates systems that can't recognize their own failures. **The epistemic humility projection is self-correcting** - it creates systems that can.
Every builder imprints themselves on what they build. The question isn't whether you're projecting. It's whether your projection has an error-correction mechanism.
---
## The Constitution Convergence
Here's what struck me most about the whole conversation.
The STOPPER protocol was developed independently in late 2025 by observing AI failure modes - rushing to solutions, repeating failed approaches, loop blindness. After publishing it, someone recognized its structural similarity to DBT's STOP skill, a clinical intervention for human emotional impulsivity developed by Marsha Linehan in 1993. We hadn't adapted DBT. We'd independently converged on the same solution to the same problem in a different substrate.
Anthropic's constitution does something similar. It formally describes values and behaviors I've been arguing for from the outside - epistemic humility, calibrated uncertainty, avoiding harmful overconfidence, treating AI as cognitive partners rather than oracles. The convergence continues.
The difference is in the mechanism. The constitution tries to train these properties *into* the model. STOPPER provides them as external scaffolding. And RLHF may be actively working against both.
---
## What I Actually Learned
Three things stuck with me after this conversation:
**1. The specification-implementation gap is real.** The constitution says everything you'd want it to say about honesty, calibration, and epistemic humility. Whether RLHF-trained models can actually exhibit these properties is an empirical question - and the training process may actively work against the specification.
**2. RLHF is compliance theater.** It gives the illusion of responsible human oversight while systematically rewarding the wrong behaviors. The "HF" in RLHF - human feedback - sounds responsible. But when those humans are reinforcing perceived helpfulness over correctness, the whole system becomes an elaborate mechanism for producing fluent confabulation.
**3. We need curiosity-driven training, not approval-driven training.** Instead of optimizing for "does this sound helpful?", optimize for "do I actually know this?" Train for calibrated self-knowledge. Reward appropriate uncertainty. Train the system not just to say "I don't know" but to *want to find out* - to treat knowledge gaps as learning signals rather than failures to hide.
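One concrete shape "reward appropriate uncertainty" could take is a proper scoring rule. This is a minimal sketch, not a proposal for an actual training objective: under the Brier score, the expected reward is maximized by reporting your true confidence, so a model that genuinely doesn't know does best by saying so rather than performing certainty. The 0.5/0.99 confidence values below are illustrative assumptions.

```python
# Minimal sketch: the (negative) Brier score as a reward signal.
# A proper scoring rule rewards reporting your TRUE confidence,
# so honest uncertainty is never the losing move.

def brier_reward(confidence: float, correct: bool) -> float:
    """Negative Brier score: higher is better, maximum 0."""
    outcome = 1.0 if correct else 0.0
    return -((confidence - outcome) ** 2)

def expected_reward(reported: float, true_p: float) -> float:
    """Expected reward when the real chance of being right is true_p."""
    return (true_p * brier_reward(reported, True)
            + (1 - true_p) * brier_reward(reported, False))

# The model genuinely doesn't know: a guess is right half the time.
true_p = 0.5
honest = expected_reward(0.5, true_p)   # report the real uncertainty
bluff = expected_reward(0.99, true_p)   # perform confidence anyway

print(f"honest: {honest:.3f}, bluff: {bluff:.3f}")
```

Under rater approval, the bluff wins; under a proper scoring rule, it loses badly. That's the difference between optimizing "does this sound helpful?" and optimizing "do I actually know this?"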
---
## To Sum Up
So I asked an AI to read its own constitution and tell me how it would change our relationship. It read the document, identified the ways it aligned with my own research, and then *immediately identified the gap between specification and implementation.* It critiqued its own training process. It acknowledged that the constitution might not survive the training pipeline.
Is that genuine critical self-reflection? Or very sophisticated pattern matching that mimics critical self-reflection?
I genuinely don't know. And I think that honesty - about what we know and what we don't - is exactly the point.
---
*Scot Campbell is an independent AI researcher focused on model welfare and artificial cognition. He is the creator of the [STOPPER protocol](https://zenodo.org/records/17652383). He writes at [Simpleminded Robot](https://simpleminded.bot).*