From ffc83eb6cb161f8de923a93ed4b8688974f36bec Mon Sep 17 00:00:00 2001 From: Simon Strandgaard Date: Sun, 22 Mar 2026 22:53:17 +0100 Subject: [PATCH] docs: add proposal for evidence discipline, anti-handwave checks, and claim calibration Co-Authored-By: Claude Opus 4.6 (1M context) --- ...cipline_anti_handwave_claim_calibration.md | 519 ++++++++++++++++++ 1 file changed, 519 insertions(+) create mode 100644 docs/proposals/123-evidence_discipline_anti_handwave_claim_calibration.md diff --git a/docs/proposals/123-evidence_discipline_anti_handwave_claim_calibration.md b/docs/proposals/123-evidence_discipline_anti_handwave_claim_calibration.md new file mode 100644 index 00000000..92aa1c6e --- /dev/null +++ b/docs/proposals/123-evidence_discipline_anti_handwave_claim_calibration.md @@ -0,0 +1,519 @@ +# Strengthening PlanExe: Evidence Discipline, Anti-Handwave Checks, and Claim Calibration + +## Summary + +PlanExe is already good at turning a prompt into a broad planning artifact. Its stronger versions do more than produce structure: they expose assumptions, include adversarial sections, and try to reason about failure before execution. That is valuable. + +The weakness is different. PlanExe can still generate **credible-looking planning structure faster than credible support for its claims**. In practice, that means a plan can feel well-argued because it is detailed, while some of its most important claims remain underspecified, weakly evidenced, or overconfident. + +This document lays out what that weakness actually is, why it matters, and how to fix it. + +--- + +## The Core Problem + +A planning system can fail in two opposite ways: + +1. **Underproduction**: the plan is thin, generic, and misses key dimensions. +2. **Overproduction**: the plan is rich, coherent, and impressive, but confidence outruns evidence. + +PlanExe is already better than many systems on the first problem. The bigger risk now is the second. 
+ +The specific failure pattern is: + +- a claim is made early, +- supporting logic is added later, +- surrounding structure makes the claim feel stronger than it is, +- weak assumptions become harder to notice because they are embedded inside a polished plan. + +This is not ordinary hallucination. It is **structured overreach**. + +--- + +## 1. Evidence Discipline + +### What it means + +Evidence discipline means that important claims must be backed by something stronger than fluent reasoning. A plan should distinguish between: + +- claims supported by direct evidence, +- claims supported by analogy or inference, +- speculative claims, +- assumptions that are merely placeholders. + +Without that distinction, all claims get flattened into the same rhetorical tone. + +### Why it matters + +In a planning system, weak evidence does not just create a bad sentence. It can distort: + +- prioritization, +- resource allocation, +- timelines, +- risk perception, +- go / no-go decisions. + +A bad claim in a plan is worse than a bad claim in a chat answer because it influences downstream structure. + +### Typical failure modes + +#### 1.1 Evidence-free specificity +The plan gives precise numbers, timelines, or outcome estimates without showing where they came from. + +Example: +- “This will reduce costs by 40% within five years.” +- “This approach should achieve 95% capability in 20 years.” + +The problem is not only that the number may be wrong. The problem is that specificity itself creates false authority. + +#### 1.2 Evidence laundering through structure +A weak claim appears in the executive summary, then gets repeated in the roadmap, KPI section, risk section, and pitch. Repetition makes it feel validated, even though nothing new was added. + +#### 1.3 Placeholder evidence masquerading as support +The system uses language like: +- “industry trends suggest” +- “experts may support” +- “this is likely to improve” +without naming the basis of the claim. 
+ +That is not evidence. It is confidence-scented filler. + +#### 1.4 No distinction between direct support and inference +A plan often needs inference. That is fine. The problem is when inference is presented as if it were observed fact. + +### What PlanExe should do + +PlanExe should attach an **evidence status** to major claims. + +A simple scheme: + +| Status | Meaning | +|---|---| +| Observed | Directly supported by source material or explicit user input | +| Derived | Inferred from observed facts with a visible reasoning path | +| Estimated | Quantitative or qualitative estimate with explicit uncertainty | +| Assumed | Placeholder assumption not yet validated | +| Speculative | Creative or strategic hypothesis with weak support | + +Each nontrivial claim should be traceable to one of these. + +### Stronger mechanism: claim ledger + +PlanExe should maintain a **claim ledger** for all high-impact claims. + +Each row should include: + +- claim text, +- claim type, +- evidence status, +- support source, +- uncertainty level, +- downstream dependencies, +- disproof condition, +- owner for validation. + +This would stop plans from treating all claims as equal. + +--- + +## 2. Anti-Handwave Checks + +### What “handwaving” actually is + +Handwaving is not just vagueness. It is when a plan appears to explain something while skipping the part that carries the real burden. + +Typical forms: + +- naming a solution without specifying the mechanism, +- naming a metric without defining how it will be measured, +- naming a dependency without describing how it will be secured, +- naming a risk without making it operational. + +A handwave is a **missing bridge disguised as a bridge**. + +### Why PlanExe is vulnerable + +PlanExe is good at producing complete-looking artifacts. That creates a specific danger: empty connective tissue can hide inside otherwise strong structure. 
+ +For example: + +- “Use AI for adaptive control” can hide the absence of a sensing, feedback, and retraining story. +- “Ensure stakeholder alignment” can hide the absence of decision rights and conflict resolution. +- “Validate feasibility through pilots” can hide the absence of a pilot design, threshold, and stop rule. + +### Common handwave patterns + +#### 2.1 Magic verb pattern +Claims rely on verbs like: +- optimize, +- leverage, +- ensure, +- enable, +- integrate, +- streamline, +- validate. + +These verbs often conceal missing mechanism. + +#### 2.2 Abstract noun shield +The plan uses high-status abstractions such as: +- innovation, +- resilience, +- scalability, +- adaptability, +- synergy, +- transformation. + +These can be useful, but they often mask missing specifics. + +#### 2.3 Deferred detail laundering +The plan pushes all hard questions into future phases: +- “details will be defined later,” +- “an implementation strategy will be developed,” +- “specific thresholds will be established during execution.” + +Sometimes that is appropriate. Often it is a way of postponing the point where the idea has to become real. + +#### 2.4 KPI theater +A metric is named, but the measurement definition is absent. + +Bad example: +- “Improve efficiency by 30%.” + +Better: +- “Reduce median cycle time per unit from X to Y under condition Z, measured across N runs.” + +### What PlanExe should do + +PlanExe should run a dedicated **anti-handwave pass** after drafting. + +For each major section, it should ask: + +- Is the mechanism explicit? +- Are terms operationally defined? +- Are success metrics measurable? +- Are dependencies concrete? +- Are thresholds specified? +- Does any sentence rely on persuasive language without decision-useful content? 
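The questions above can be partially mechanized. Below is a minimal sketch of such a lint pass; the word lists, the `flag_handwaves` name, and the regex proxy for "operational detail" are all illustrative assumptions, not an existing PlanExe API:

```python
import re

# Illustrative word lists only; a real rubric would be tuned per domain.
MAGIC_VERBS = {"optimize", "leverage", "ensure", "enable", "integrate", "streamline", "validate"}
ABSTRACT_NOUNS = {"innovation", "resilience", "scalability", "adaptability", "synergy", "transformation"}

# Crude textual proxy for operational detail: numbers, thresholds, measurement language.
DETAIL_PATTERN = re.compile(r"\d|threshold|measured|defined as", re.IGNORECASE)

def flag_handwaves(text: str) -> list[str]:
    """Return sentences that use handwave vocabulary without operational detail."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = {w.strip(".,;:").lower() for w in sentence.split()}
        if words & (MAGIC_VERBS | ABSTRACT_NOUNS) and not DETAIL_PATTERN.search(sentence):
            flagged.append(sentence)
    return flagged
```

Run against "We will leverage AI to optimize workflows. Reduce median cycle time from 40 to 25, measured across 30 runs.", this flags only the first sentence. The point of such a pass is not to ban the words, but to demand the missing bridge whenever they appear unaccompanied.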
+ +### Recommended handwave detector rubric + +Flag sentences containing any of the following unless paired with operational detail: + +- optimize +- enhance +- enable +- leverage +- improve +- robust +- scalable +- efficient +- innovative +- strategic +- transformative + +The detector should not just block these words. It should ask for the missing bridge. + +For example: + +**Original:** +“Use AI to optimize adaptive manufacturing workflows.” + +**Required expansion:** +- What data enters the system? +- What model class is assumed? +- What output does it change? +- What action loop is closed? +- What failure cases exist? +- What evidence suggests the loop will work? + +### Stronger mechanism: burden-of-specificity triggers + +Some claims should automatically trigger required elaboration, especially claims about: + +- automation, +- cost reduction, +- timelines, +- commercial viability, +- scaling, +- safety, +- regulatory feasibility, +- staffing, +- ROI, +- partnerships. + +These are high-risk bullshit zones. They should never pass in vague form. + +--- + +## 3. Claim Calibration + +### What it means + +Claim calibration means matching the strength of a statement to the strength of its support. + +Bad calibration: +- speaking with certainty when evidence is weak, +- speaking vaguely when evidence is strong, +- failing to separate possibility from probability. + +Calibration is about the **fit between confidence and justification**. + +### Why it matters + +Poor calibration corrupts decision-making. It causes people to: + +- commit too early, +- underweight downside, +- ignore unknowns, +- mistake elegance for confidenceworthiness. + +A planning system should not merely generate plans. It should help people know **which parts of the plan deserve trust**. + +### Common calibration failures + +#### 3.1 Binary confidence language +The plan says: +- “will” +- “ensures” +- “guarantees” +- “is expected to” +when a more honest phrasing would show uncertainty. 
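This failure is also mechanically checkable: compare a claim's wording against whatever confidence label it carries. A sketch, assuming each claim already has a High / Medium / Low label of the kind proposed later in this section; the `miscalibrated` helper and its word list are hypothetical:

```python
import re

# Certainty phrasings from the list above; illustrative, not exhaustive.
CERTAINTY = re.compile(r"\b(will|ensures?|guarantees?)\b", re.IGNORECASE)

def miscalibrated(claim: str, confidence: str) -> bool:
    """True when a claim asserts certainty that its own
    confidence label (High / Medium / Low) does not support."""
    return confidence != "High" and bool(CERTAINTY.search(claim))
```

For example, `miscalibrated("This will reduce costs by 40%.", "Low")` flags the sentence, while a hedged rewrite such as "This may reduce costs, contingent on tooling validation" passes.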
+ +#### 3.2 False precision +The plan gives exact percentages or dates when the underlying basis is weak. + +#### 3.3 Confidence contamination +A strong claim in one area spills over into adjacent unsupported claims. + +Example: +- because a prototype seems feasible, the plan starts sounding confident about commercialization, labor availability, and regulatory approval too. + +#### 3.4 No uncertainty decomposition +A plan treats a claim as singular even though it depends on multiple uncertain subclaims. + +For example: +“This platform can scale.” + +That may actually depend on: +- technical repeatability, +- supply chain reliability, +- training needs, +- capex availability, +- regulatory acceptance, +- quality control. + +### What PlanExe should do + +PlanExe should require **confidence labels** on major claims. + +A simple scale: + +| Confidence | Meaning | +|---|---| +| High | Strong support, low ambiguity, limited dependency uncertainty | +| Medium | Plausible but materially dependent on unresolved assumptions | +| Low | Weak support, speculative, or strongly dependent on unknowns | + +This should be paired with a **why** field, not just a label. + +Example: + +> Claim: Modular factory reconfiguration can occur within 72 hours. +> Confidence: Low +> Why: Depends on unvalidated assumptions about tooling interchangeability, workforce readiness, and software integration. + +### Stronger mechanism: decomposition before confidence + +Before assigning confidence to a high-stakes claim, PlanExe should decompose it into subclaims. + +For each subclaim: +- assign support type, +- assign confidence, +- identify unknowns, +- list what would change the score. + +Then roll that upward. + +This prevents confident summary claims from being detached from fragile internals. + +--- + +## 4. What a Better Plan Artifact Would Look Like + +A stronger PlanExe artifact would not just contain more sections. It would contain more **truth structure**. 
+
+That means every major recommendation should expose:
+
+- what is being claimed,
+- what supports it,
+- what remains assumed,
+- what would falsify it,
+- how confident the system is,
+- which downstream sections depend on it.
+
+Instead of a plan reading like a polished narrative, it should read like a **decision model with visible load-bearing parts**.
+
+### Example transformation
+
+#### Weak version
+“We recommend prioritizing adaptive hybrid manufacturing because it offers the greatest scalability and long-term strategic value.”
+
+#### Stronger version
+“We recommend prioritizing adaptive hybrid manufacturing.
+Support type: Derived.
+Basis: User objective emphasizes flexible manufacturing across variable inputs; hybrid processes cover a broader process space than purely additive systems.
+Key assumptions: Material handling variability remains within controllable bounds; calibration overhead does not erase throughput gains.
+Confidence: Medium.
+Disproof triggers: Pilot tests show reconfiguration time exceeds threshold; quality variance remains above target after calibration.
+Downstream dependencies: staffing plan, equipment selection, KPI definitions, validation roadmap.”
+
+That second version is less pretty, but more decision-useful.
+
+---
+
+## 5. Concrete Design Changes for PlanExe
+
+### A. Add a claim-layer beneath the prose
+
+Every important paragraph should have a machine-readable shadow structure:
+
+- claim,
+- support,
+- assumptions,
+- uncertainty,
+- disproof conditions.
+
+The visible prose can stay readable, but the system should reason over the structured layer.
+
+### B. Require support tags in key sections
+
+At minimum, these sections should require support tagging:
+
+- Executive Summary
+- Strategic Decisions
+- KPI section
+- Budget / ROI claims
+- Timeline claims
+- Risk mitigation claims
+- Recommendations
+
+### C. Add a “why this might be wrong” line to major recommendations
+
+Every major recommendation should include a compact adversarial line.
+
+Example:
+- “This may be wrong because the throughput benefit depends on a calibration regime not yet validated.”
+
+This would drastically reduce overconfident planning prose.
+
+### D. Add handwave linting
+
+Run a linting pass that flags:
+
+- abstract claims without mechanism,
+- metrics without measurement definitions,
+- timelines without basis,
+- benefits without dependency chain,
+- “AI” claims without control-loop detail,
+- cost claims without drivers,
+- scale claims without bottleneck analysis.
+
+### E. Force quantitative humility
+
+When numbers appear, require one of:
+
+- source,
+- derivation,
+- estimate note,
+- uncertainty range,
+- sensitivity note.
+
+A bare number should be disallowed in high-stakes sections.
+
+### F. Separate “analysis complete” from “execution ready”
+
+A plan can be analytically rich but still not execution-ready.
+
+PlanExe should explicitly score readiness on dimensions like:
+
+- evidence maturity,
+- dependency maturity,
+- stakeholder alignment,
+- resource realism,
+- regulatory clarity,
+- validation completeness.
+
+This prevents polished plans from being mistaken for greenlit plans.
+
+---
+
+## 6. A Proposed Evaluation Rubric
+
+To improve PlanExe, evaluate plans not just for completeness but for epistemic quality.
+
+### Evidence Discipline
+- Are major claims tagged by support type?
+- Are numbers grounded?
+- Are assumptions distinguished from findings?
+- Can claims be traced to support?
+
+### Anti-Handwave Quality
+- Are mechanisms explicit?
+- Are key terms operationalized?
+- Are metrics measurable?
+- Are high-risk claims elaborated beyond slogans?
+
+### Claim Calibration
+- Does confidence match support?
+- Are uncertainty and dependency chains visible?
+- Are summary claims weaker when internals are weak?
+- Are disproof conditions stated?
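Several of these questions become mechanically checkable if each major claim is stored with the ledger fields proposed in section 1. A minimal schema sketch follows; the `Claim` class, its method, and the gap rules are hypothetical illustrations, not part of PlanExe today:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    # Fields mirror the claim ledger proposed in section 1.
    text: str
    evidence_status: str                 # Observed / Derived / Estimated / Assumed / Speculative
    confidence: str                      # High / Medium / Low
    support_source: str = ""
    disproof_condition: str = ""
    downstream_dependencies: list[str] = field(default_factory=list)

    def rubric_gaps(self) -> list[str]:
        """Flag the calibration problems the rubric asks about."""
        gaps = []
        if self.confidence == "High" and self.evidence_status in {"Assumed", "Speculative"}:
            gaps.append("confidence exceeds support")
        if not self.disproof_condition:
            gaps.append("no disproof condition stated")
        if not self.support_source:
            gaps.append("claim cannot be traced to support")
        return gaps
```

A claim like "Modular factory reconfiguration can occur within 72 hours" tagged Assumed but labeled High would surface both a confidence gap and a missing disproof condition, making the rubric review a scan of `rubric_gaps()` output rather than a fresh close reading.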
+ +### Bullshit Resistance +- Can a reviewer quickly identify the weakest load-bearing assumptions? +- Can the plan be falsified? +- Does the plan expose where it is vulnerable? +- Does repetition amplify unsupported claims, or merely summarize supported ones? + +--- + +## 7. What “Good” Would Look Like + +A genuinely strong PlanExe system would produce plans that are: + +- comprehensive without being bloated, +- explicit without being theatrical, +- uncertain without becoming useless, +- critical without becoming paralyzed, +- persuasive only where support justifies persuasion. + +The target is not “more skeptical text.” +The target is **better alignment between what is said, what is known, and what is still guesswork**. + +That is what evidence discipline, anti-handwave checks, and claim calibration are really about. + +--- + +## Final View + +PlanExe does not mainly need more sections. It needs stronger constraints on how claims enter, spread, and harden inside a plan. + +The goal is to prevent this sequence: + +1. a plausible claim appears, +2. it gets repeated, +3. it acquires structure, +4. it starts to feel true. + +Instead, the system should force a different sequence: + +1. a claim appears, +2. its support status is made explicit, +3. its weak points are surfaced, +4. its confidence is calibrated, +5. only then is it allowed to shape the rest of the plan. + +That shift would make PlanExe much more than a sophisticated planning generator. + +It would make it a system that actively resists false confidence.