We systematically quantify a phenomenon where LLMs flip correct zero-shot answers to incorrect ones when prompted to reason step-by-step, and characterize when this cost outweighs the benefit across models and task domains.
Chain-of-thought (CoT) prompting
We challenge this assumption by identifying and systematically quantifying a phenomenon we term the Feynman Trap — named after Richard Feynman’s observation that “the first principle is that you must not fool yourself, and you are the easiest person to fool.” Specifically, we find that LLMs which answer correctly under zero-shot (ZS) conditions frequently produce wrong answers when prompted to reason step-by-step. The reasoning process appears to corrupt correct intuitions.
This is not merely a matter of CoT failing to help on some questions. We document a specific, directional failure mode: answers that were correct without reasoning become incorrect under reasoning. Across four 7B-class models on GSM8K
Our contributions are:
Chain-of-thought prompting and variants. Wei et al.
Reasoning quality and verification. Lightman et al.
Faithfulness and reliability of reasoning. Turpin et al.
Overthinking and reasoning costs. Several concurrent works examine reasoning costs. Chen et al.
Emergent capabilities and scaling. Wei et al.
Models. We evaluate four 7B-class models spanning different architectures and training approaches: Qwen2.5-7B
Datasets. We use two benchmarks:
Protocol. For each model × dataset combination, we collect:
#### [number] for GSM8K).
A C→I flip is defined as a question where the model answers correctly under ZS but incorrectly under CoT. All experiments use greedy decoding (temperature 0) for ZS and CoT, with generation on 4×A100-40GB GPUs.
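The flip definition can be made concrete with a short sketch. It assumes per-question correctness flags are already available for both conditions; the names `PairedCounts` and `count_flips` are hypothetical and not taken from the study's code.

```python
from dataclasses import dataclass


@dataclass
class PairedCounts:
    """Counts of paired ZS/CoT outcomes over the same questions."""
    both_correct: int = 0
    c_to_i: int = 0      # correct under ZS, incorrect under CoT (a flip)
    i_to_c: int = 0      # incorrect under ZS, correct under CoT
    both_wrong: int = 0

    @property
    def zs_correct(self) -> int:
        return self.both_correct + self.c_to_i

    @property
    def raw_flip_rate(self) -> float:
        # Raw flip rate = C->I / (ZS correct); undefined if ZS got nothing right.
        return self.c_to_i / self.zs_correct if self.zs_correct else float("nan")


def count_flips(zs_correct: list[bool], cot_correct: list[bool]) -> PairedCounts:
    """Tally paired outcomes; both lists must be aligned by question."""
    if len(zs_correct) != len(cot_correct):
        raise ValueError("ZS and CoT results must cover the same questions")
    counts = PairedCounts()
    for zs, cot in zip(zs_correct, cot_correct):
        if zs and cot:
            counts.both_correct += 1
        elif zs and not cot:
            counts.c_to_i += 1
        elif cot:
            counts.i_to_c += 1
        else:
            counts.both_wrong += 1
    return counts


# Toy data, not results from the paper.
toy = count_flips([True, True, False, False, True],
                  [True, False, True, False, False])
print(toy.c_to_i, toy.i_to_c, round(toy.raw_flip_rate, 3))
```

Because the denominator is the number of ZS-correct questions rather than the dataset size, a model with few ZS-correct answers can show a very high flip rate from a modest number of flips.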
Statistical rigor. We report McNemar’s paired test
The table below summarizes the core finding across all four models on GSM8K (n=1,319 per model). We report both raw flip rates and corrected rates. Correction is performed via automated suffix-stripping and re-extraction applied to all C→I flips (not sampled): for each flip, we strip known model-appended suffixes (e.g., “You are an AI assistant” for Qwen) and re-extract the numerical answer. If re-extraction recovers the correct answer, the flip is reclassified as an extraction artifact. This procedure is deterministic and applied exhaustively to all flips.
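A minimal sketch of the suffix-stripping and re-extraction step follows. Several details are assumptions rather than the study's rules: the suffix list holds only the one example named above, the extractor prefers a `####` marker and otherwise falls back to the last number in the output, and all function names are hypothetical.

```python
import re

# Only the example suffix named in the text; the real list is not given.
KNOWN_SUFFIXES = ["You are an AI assistant"]
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
MARKED = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def strip_suffixes(output: str) -> str:
    """Cut the output at the first occurrence of any known appended suffix."""
    for marker in KNOWN_SUFFIXES:
        idx = output.find(marker)
        if idx != -1:
            output = output[:idx]
    return output


def extract_answer(output: str):
    """Return the '#### n' answer if present, else the last number, else None."""
    marked = MARKED.search(output)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = NUMBER.findall(output)
    return numbers[-1].replace(",", "") if numbers else None


def is_extraction_artifact(cot_output: str, gold: str) -> bool:
    """True if stripping suffixes and re-extracting recovers the gold answer."""
    answer = extract_answer(strip_suffixes(cot_output))
    return answer is not None and float(answer) == float(gold)


# Hypothetical output: the trailing text contains a distracting number.
raw = "The answer is 18. You are an AI assistant. Follow these 3 rules."
print(extract_answer(raw), is_extraction_artifact(raw, "18"))
```

A flip that passes `is_extraction_artifact` would be reclassified as an artifact. A rule of this kind only catches suffixes it already knows about.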
| Model | ZS (%) | CoT (%) | Δ | Raw C→I | Raw Flip Rate | Corrected C→I | Corrected Flip Rate | McNemar χ² | Cohen’s h |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | 18.3 | 75.4 | +57.1 | 39 | 16.2% | 23 | 9.5% [95% CI: 5.8, 13.9] | 681.5*** | +1.22 |
| Llama-3.1-8B | 14.6 | 23.4 | +8.9 | 131 | 68.2% | 131 | 68.2% [61.5, 74.5] | 35.5*** | +0.23 |
| Llama-2-7B | 4.2 | 15.5 | +11.3 | 46 | 83.6% | 41 | 74.5% [63.6, 85.5] | 90.9*** | +0.40 |
| Mistral-7B | 4.5 | 7.2 | +2.7 | 52 | 88.1% | 52 | 88.1% [79.7, 94.9] | 8.8** | +0.12 |
*** p < 0.001, ** p < 0.01. Raw Flip Rate = C→I / (ZS correct). Corrected Flip Rate excludes extraction artifacts identified by automated re-extraction on all flips; 95% bootstrap CIs in brackets. McNemar tests use raw counts.
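The two test statistics in the table can be sketched as below. The I→C count of 793 for Qwen2.5-7B is not stated in the text; it is inferred here from the reported accuracies and the C→I count, so treat it as an assumption. With a continuity correction, the sketch gives values consistent with the table's χ² and Cohen’s h for that row.

```python
import math


def mcnemar_chi2(b: int, c: int, continuity: bool = True) -> float:
    """McNemar chi-square from the two discordant counts (I->C and C->I)."""
    if b + c == 0:
        return 0.0
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # Edwards continuity correction
    return diff ** 2 / (b + c)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Qwen2.5-7B on GSM8K: C->I = 39 (reported); I->C = 793 (inferred, see above).
c_to_i = 39
i_to_c = 793
print(round(mcnemar_chi2(i_to_c, c_to_i), 1))
print(round(cohens_h(0.754, 0.183), 2))
```

McNemar's test uses only the discordant pairs, which is why it suits a paired ZS-versus-CoT comparison on the same questions.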
CoT is net positive for all four models — the I→C (incorrect→correct) benefit dominates. Yet the C→I cost is substantial: after automated correction for known extraction artifacts (a conservative lower bound on true artifact rates), Mistral still flips 88% of its zero-shot-correct answers when asked to reason, and even Qwen, the strongest model, loses 9.5% of its correct answers to reasoning errors under CoT.
We observe an inverse association between capability and flip rate across these four models (r = −0.98, though we note this correlation is computed over only 4 points and should be treated as an exploratory observation rather than a robust scaling law). The pattern is suggestive: stronger models may have more robust internal representations that survive the perturbation introduced by step-by-step reasoning, while weaker models’ fragile correct answers are more easily overwritten. Confirming this trend would require evaluation across a wider range of model scales.
A natural concern is that zero-shot correct answers at 4–18% accuracy could be lucky guesses. Permutation tests (10K iterations) confirm this is not the case:
| Model | ZS Accuracy | Random Baseline | Ratio | p-value |
|---|---|---|---|---|
| Qwen2.5-7B | 18.3% | 1.60% | 11.4× | < 0.01 |
| Llama-3.1-8B | 14.6% | 1.24% | 11.8× | < 0.01 |
| Mistral-7B | 4.5% | 1.58% | 2.8× | < 0.01 |
| Llama-2-7B | 4.2% | 0.80% | 5.2× | < 0.01 |
All models exceed the 99th percentile of the random distribution, confirming genuine zero-shot mathematical competence. The ZS-correct answers that CoT subsequently flips represent real knowledge, not noise.
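One way to build such a random baseline is sketched below. The null distribution used in the study is not specified, so this version makes an assumption: it shuffles the model's predicted answers across questions and measures how often a shuffled prediction happens to match the gold answer. The function name is hypothetical.

```python
import random


def permutation_baseline(preds, golds, iters=10_000, seed=0):
    """Return (observed accuracy, mean shuffled accuracy, one-sided p-value)."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must be aligned")
    n = len(golds)
    observed = sum(p == g for p, g in zip(preds, golds)) / n
    rng = random.Random(seed)
    shuffled = list(preds)
    null_accuracies = []
    for _ in range(iters):
        rng.shuffle(shuffled)  # break the pairing between prediction and question
        null_accuracies.append(sum(p == g for p, g in zip(shuffled, golds)) / n)
    # Add-one correction keeps the estimated p-value strictly positive.
    at_least = sum(a >= observed for a in null_accuracies)
    p_value = (1 + at_least) / (iters + 1)
    return observed, sum(null_accuracies) / iters, p_value


# Toy data: 50 questions with distinct gold answers, 10 answered correctly.
golds = [str(i) for i in range(50)]
preds = golds[:10] + ["-1"] * 40
obs, baseline, p = permutation_baseline(preds, golds, iters=2000)
print(obs, round(baseline, 4), p < 0.01)
```

The ratio column in the table corresponds to observed accuracy divided by the mean shuffled accuracy.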
The corrected flip rates above are computed via automated re-extraction applied to all flips. To independently validate this automated procedure and characterize the remaining genuine errors, we conduct a human audit of 100 randomly sampled flips (25 per model). The audit reveals two distinct failure modes:
| Category | Count | Proportion |
|---|---|---|
| Extraction artifact | 56 | 56% |
| Genuine reasoning error | 30 | 30% |
| Grabbed number from question | 13 | 13% |
| Near miss (within 5%) | 1 | 1% |
Extraction failures occur when the model reaches the correct answer but the extraction pipeline fails, typically because the model appends system-prompt text or wraps answers in non-standard formatting. These are pipeline artifacts, not reasoning failures. The automated correction procedure (applied to all flips) reduces Qwen’s flip count from 39 to 23 (corrected rate: 9.5%) and Llama-2’s from 46 to 41 (corrected rate: 74.5%). For Llama-3.1 and Mistral, automated re-extraction detected no extraction artifacts.
The human audit finds a higher artifact rate (56%) than the automated procedure (21/268 = 7.8%) for two reasons. First, the audit samples 25 flips per model uniformly, which overweights Qwen: 16 of its 39 flips (41%) are automated artifacts, yet it contributes only 39 of the 268 total flips, against 131 for Llama-3.1. Second, the human audit uses a broader artifact definition: annotators classify any flip where the CoT reasoning arrived at the correct answer as an extraction artifact, catching cases that the automated suffix-stripping rule misses (e.g., formatting variations beyond known suffixes). The automated procedure is conservative by design and gives a lower bound on the artifact rate; the audit gives an upper bound. The true rate of genuine reasoning corruption likely lies between the automated estimate and the audit estimate, so the corrected flip rates we report should be interpreted as upper bounds on genuine reasoning flips, not exact counts.
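The sampling-weight effect can be illustrated with the automated artifact counts alone. This sketch does not use the human audit labels, which are not broken down per model in the text; it only contrasts a pooled rate (each flip weighted equally) with a per-model average (each model weighted equally, as in a 25-per-model sample).

```python
# Raw C->I flips and automated artifact counts (raw minus corrected) per model.
flips = {"Qwen2.5-7B": 39, "Llama-3.1-8B": 131, "Llama-2-7B": 46, "Mistral-7B": 52}
artifacts = {"Qwen2.5-7B": 39 - 23, "Llama-3.1-8B": 0, "Llama-2-7B": 46 - 41, "Mistral-7B": 0}

# Pooled rate: every flip counts once, so large-flip models dominate.
pooled = sum(artifacts.values()) / sum(flips.values())

# Equal-weight rate: every model counts once, mimicking uniform per-model sampling.
equal_weight = sum(artifacts[m] / flips[m] for m in flips) / len(flips)

print(f"pooled: {pooled:.1%}, equal-weight: {equal_weight:.1%}")
```

Equal weighting alone raises the automated artifact rate above the pooled figure, which accounts for part of the gap; the broader human definition of an artifact accounts for the rest.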
Genuine reasoning errors identified in the human audit include arithmetic mistakes (39% of Llama-2’s audited flips), grabbing intermediate values from the problem text (35% of Llama-3.1), and malformed outputs (40% of Mistral). We report corrected flip rates with bootstrap 95% CIs as primary throughout.
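A percentile bootstrap for a flip rate can be sketched as follows. The resampling unit and iteration count used in the study are not stated, so this version assumes resampling over ZS-correct questions; the figure of 241 ZS-correct questions for Qwen is inferred from the reported accuracy and flip rate rather than quoted.

```python
import random


def bootstrap_flip_rate_ci(flips, zs_correct, iters=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for flips / zs_correct, resampling questions."""
    rng = random.Random(seed)
    outcomes = [1] * flips + [0] * (zs_correct - flips)
    rates = sorted(
        sum(rng.choices(outcomes, k=zs_correct)) / zs_correct
        for _ in range(iters)
    )
    lo_idx = max(int((alpha / 2) * iters), 0)
    hi_idx = min(int((1 - alpha / 2) * iters), iters - 1)
    return rates[lo_idx], rates[hi_idx]


# Qwen2.5-7B corrected flips: 23 of an inferred 241 ZS-correct questions.
low, high = bootstrap_flip_rate_ci(23, 241)
print(f"{23 / 241:.1%} [{low:.1%}, {high:.1%}]")
```

Exact bounds depend on the seed and iteration count, so small differences from the table's intervals are expected.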
To make these errors concrete, the following interactive explorer shows real flip examples from our experiments. Each card shows a GSM8K question where the model answered correctly without reasoning but produced the wrong answer after step-by-step thinking. The error in the CoT chain is highlighted in red.
A critical question is whether the Feynman Trap is specific to math or generalizes across domains. We evaluate all four models on CoQA conversational question answering.
| Model | CoQA ZS (%) | CoQA CoT (%) | Δ | C→I Flips | Flip Rate |
|---|---|---|---|---|---|
| Qwen2.5-7B | 61.8 | 56.4 | −5.4 | 51 | 16.5% |
| Llama-2-7B | 57.8 | 54.2 | −3.6 | 55 | 19.0% |
| Llama-3.1-8B | 57.0 | 36.6 | −20.4 | 131 | 46.0% |
| Mistral-7B | 56.0 | 29.2 | −26.8 | 164 | 58.6% |
On the full CoQA validation set (n=500, evaluated via F1 overlap), CoT produces a net accuracy loss for all four models. Models already answer correctly 56–62% of the time via direct pattern matching; step-by-step reasoning degrades performance. The degradation ranges from mild (Qwen, −5.4 percentage points) to severe (Mistral, −26.8 percentage points). Note that this set includes both open-ended and constrained (yes/no) questions; the constrained binary subset below reveals a more nuanced, capability-dependent pattern.
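Token-level F1 overlap is commonly computed as in the sketch below. The normalization (lowercasing, removing punctuation and articles) follows the SQuAD-style convention and is an assumption here, as is the function name; the threshold the study uses to count an answer as correct is not stated, so none is applied.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("in New York", "New York"), 2))
```

Verbose CoT answers lower precision under this metric even when they contain the gold span, which is one reason the binary Yes/No subset is a useful cross-check.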
To eliminate any extraction confound, we evaluate on the 110 yes/no questions from CoQA where answers are constrained to binary choices:
| Model | ZS (%) | CoT (%) | Δ | C→I Flips | Flip Rate |
|---|---|---|---|---|---|
| Qwen2.5-7B | 86.4 | 88.2 | +1.8 | 4 | 4.2% |
| Llama-2-7B | 79.1 | 86.4 | +7.3 | 7 | 8.0% |
| Llama-3.1-8B | 74.5 | 36.4 | −38.1 | 47 | 57.3% |
| Mistral-7B | 70.0 | 30.9 | −39.1 | 56 | 72.7% |
The Yes/No subset reveals a descriptive split: stronger models (Qwen, Llama-2) show small positive trends under CoT (+1.8 and +7.3 percentage points), though neither reaches statistical significance by exact sign test (p=0.75 and p=0.13 respectively, given the small discordant counts). In contrast, weaker models (Llama-3.1, Mistral) lose 38–39 percentage points. This nuances the cross-domain finding: the Feynman Trap on text QA appears capability-dependent rather than universal. The fact that weaker models lose dramatically on binary questions with zero extraction ambiguity confirms the Feynman Trap is a genuine reasoning phenomenon for these models, not an evaluation artifact.
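The exact sign test on discordant pairs can be sketched as below. The I→C counts (6 for Qwen, 15 for Llama-2) are not listed in the table; they are inferred from the reported accuracies on 110 questions and the C→I counts, so treat them as assumptions. With those counts, the two-sided p-values round to the figures quoted above.

```python
from math import comb


def exact_sign_test(b: int, c: int) -> float:
    """Two-sided exact binomial (sign) test on discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# (C->I reported, I->C inferred) for the two stronger models on Yes/No CoQA.
print(round(exact_sign_test(4, 6), 2))   # Qwen2.5-7B
print(round(exact_sign_test(7, 15), 2))  # Llama-2-7B
```

With only 10 and 22 discordant pairs, the test has little power, which is why the small positive deltas are described as trends.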
We use progressive CoT checkpointing — extracting answers at token positions [16, 32, 48, 64, 96, 128, 192, 256, 384, 512] — with a committed-flip criterion defined as follows: for each C→I flip, we identify the earliest checkpoint t at which the extracted answer matches the model’s final (incorrect) CoT answer, and this match persists at all subsequent checkpoints. This requires not just a one-time match but sustained commitment. Checkpoints below 32 tokens are excluded from commitment analysis due to high extraction noise (validated by checkpoint-to-final agreement rates of only 2–10% at ≤16 tokens).
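The committed-flip criterion can be sketched as a backward scan over checkpoints. This assumes each flip comes with a mapping from checkpoint position to the answer extracted there, and that checkpoints beyond the end of a shorter chain are simply absent; the function name is hypothetical.

```python
CHECKPOINTS = [16, 32, 48, 64, 96, 128, 192, 256, 384, 512]
MIN_CHECKPOINT = 32  # earlier checkpoints are excluded as too noisy


def committed_flip_checkpoint(answers_by_checkpoint: dict, final_answer: str):
    """Earliest eligible checkpoint whose answer equals the final answer
    and stays equal at every later available checkpoint; None if no such point."""
    eligible = [t for t in CHECKPOINTS
                if t >= MIN_CHECKPOINT and t in answers_by_checkpoint]
    committed = None
    # Walk backwards: the run of matches ending at the last checkpoint
    # is exactly the set of checkpoints with sustained commitment.
    for t in reversed(eligible):
        if answers_by_checkpoint[t] == final_answer:
            committed = t
        else:
            break
    return committed


# Toy trace: "20" appears at 48 tokens but is abandoned at 64, so it does not count.
trace = {16: "5", 32: "12", 48: "20", 64: "12", 96: "20", 128: "20", 192: "20"}
print(committed_flip_checkpoint(trace, "20"))
```

Scanning from the end makes the persistence requirement explicit: a transient early match is ignored because a later mismatch breaks the run.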
| Model | Median Committed Flip | ≤64 tokens | ≤128 tokens | Mean Chain Length |
|---|---|---|---|---|
| Llama-3.1-8B | 64 tokens | 52% | 69% | 554 tokens |
| Qwen2.5-7B | 128 tokens | 13% | 51% | 197 tokens |
| Mistral-7B | 128 tokens | 42% | 69% | 700 tokens |
| Llama-2-7B | 145 tokens | 15% | 48% | 197 tokens |
Models commit to incorrect answers at median 64–145 tokens, well before reasoning completes. For Llama-3.1, 52% of committed flips occur within the first 64 tokens of a 554-token chain. For these flipped examples, the remaining reasoning serves to elaborate on an already-committed incorrect path rather than recovering the correct answer.
A key question is whether flips are primarily associated with the reasoning process itself or with CoT-specific formatting and extraction artifacts. While a fully controlled causal experiment would require holding output format and length constant while varying only reasoning depth — which is difficult with autoregressive LLMs — we approximate this by testing four prompt conditions on GSM8K that vary in reasoning structure:
| Model | ZS (%) | CoT (%) | Brief (%) | Verify (%) | CoT Flips | Brief Flips |
|---|---|---|---|---|---|---|
| Qwen2.5-7B | 18.3 | 75.4 | 71.4 | 67.0 | 39 | 35 |
| Llama-3.1-8B | 14.6 | 23.4 | 29.5 | 13.9 | 131 | 106 |
| Llama-2-7B | 4.2 | 15.5 | 19.3 | 9.9 | 46 | 38 |
| Mistral-7B | 4.5 | 7.2 | 13.4 | 0.9 | 52 | 33 |
Three findings emerge:
Brief reasoning causes flips too (33–106 vs. CoT’s 39–131). Since the Brief condition uses no structured format or extraction markers (e.g., ####), this suggests the flips are not solely attributable to the structured CoT format or extraction pipeline. However, we cannot fully rule out that verbosity, prompt wording, or output length contribute independently.
Brief outperforms CoT for weaker models: relative to full CoT, brief prompting gains 6.1 percentage points for Llama-3.1, 3.8 for Llama-2, and 6.2 for Mistral. The rigid step-by-step format adds overhead that hurts weaker models more than it helps. Only Qwen (the strongest) benefits from full CoT structure.
Verify (anchoring) degrades overall accuracy. Providing the ZS answer and asking to verify introduces new errors (Mistral drops to 0.9%). Models cannot reliably use their own zero-shot answers as verification anchors.
Self-consistency
| Model | CoT (%) | SC@5 (%) | Δ | CoT Flips | SC Flips | Rescues | Rescue Rate [95% CI] |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | 75.4 | 77.6 | +2.1 | 39 | 25 | 26 | 66.7% [51.0, 80.0] |
| Llama-3.1-8B | 23.4 | 32.4 | +8.9 | 131 | 101 | 56 | 42.8% [34.4, 51.1] |
| Mistral-7B | 7.2 | 10.6 | +3.4 | 52 | 45 | 9 | 17.3% [8.2, 29.4] |
| Llama-2-7B | 15.5 | 15.8 | +0.4 | 46 | 45 | 4 | 8.7% [2.4, 19.6] |
Rescue Rate 95% CIs computed via binomial proportion confidence intervals (Clopper-Pearson exact method).
SC@5 rescue rate shows a positive association with model capability (r ≈ 0.95 over 4 models; exploratory). Qwen recovers 67% of its flips [95% CI: 51–80%] — its correct intuition is robust enough to survive in the majority of 5 sampled reasoning paths. Llama-2, the model with the lowest SC rescue rate, recovers only 9% [95% CI: 2–20%] — the same incorrect reasoning recurs systematically across samples. (Note: Mistral is the weakest model by CoT accuracy but has a higher rescue rate than Llama-2, suggesting rescue rate is not strictly monotonic with capability.)
SC struggles on text QA. We ran SC@5 on CoQA for two models (Qwen and Llama-2): SC@5 accuracy dropped to 9.2% and 10.6% respectively (vs. CoT 56.4% and 54.2%), because majority vote fragments across diverse text phrasings of the same answer. With 5 samples producing 5 different surface-form answers (e.g., “New York”, “in New York”, “New York City”), no single phrasing achieves a majority. SC’s rescue mechanism requires a single canonical correct answer (as in math) to aggregate votes effectively.
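The fragmentation effect can be reproduced with a plain majority vote over exact answer strings. This sketch assumes a strict-majority rule (more than half the samples); the study's exact aggregation rule is not stated. The last two text phrasings are hypothetical additions to the three examples given above.

```python
from collections import Counter


def majority_vote(answers: list[str]):
    """Return the answer given by more than half the samples, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes > len(answers) / 2 else None


# Math: numeric answers share one canonical form, so votes aggregate.
math_samples = ["18", "18", "17", "18", "20"]

# Text QA: five phrasings of the same answer split the vote five ways.
text_samples = ["New York", "in New York", "New York City",
                "NYC", "New York, NY"]  # last two are hypothetical

print(majority_vote(math_samples), majority_vote(text_samples))
```

All five text samples refer to the same answer, yet exact-match voting treats them as five different candidates.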
Our results suggest that CoT prompting introduces a reasoning perturbation that can override correct direct-retrieval answers. The model’s zero-shot answer reflects a holistic pattern-match over the input; CoT forces a serial decomposition that creates opportunities for error propagation. Once an early step goes wrong, the remaining chain follows the erroneous path — consistent with our temporal analysis showing early commitment to errors.
The counterfactual experiments are consistent with this interpretation: the Brief condition (“think briefly”) — which lacks structured step-by-step formatting — produces comparable flip rates to full CoT, suggesting the flips are not solely due to format-specific artifacts, though the reasoning process likely contributes. We acknowledge that these prompt controls do not constitute a fully controlled causal experiment, as they also vary verbosity, output length, and prompt wording. A stronger causal identification would require interventions that hold these factors constant while varying only reasoning depth.
The interactive dashboard below lets readers explore the cost-benefit tradeoff for any model × task combination in our study, comparing all prompt strategies side-by-side.
Our cross-domain results suggest a nuanced prescription:
Model scale and scaling claims. We study four 7B-class models. The observed inverse association between capability and flip rate (r = −0.98) is computed over only 4 data points and should be treated as an exploratory observation, not a robust scaling law. Confirming this trend requires evaluation across more model sizes. Similarly, the SC rescue rate correlation (r ≈ 0.95) is exploratory.
Two task families. GSM8K and CoQA represent structured math and conversational text QA. Other task types (code generation, multi-step planning, scientific reasoning) may show different patterns.
Cross-domain nuance. The claim that “CoT hurts text QA” holds for all four models on the full CoQA set (F1 evaluation), but on the Yes/No subset the stronger models (Qwen, Llama-2) show small positive trends under CoT (not statistically significant). The cross-domain finding is therefore evaluation-protocol-dependent and capability-dependent.
Extraction artifacts. Despite corrected flip rates and human audit (n=100), some residual extraction noise may remain. We foreground corrected rates and use the Yes/No CoQA subset (zero ambiguity) as an independent validation.
Prompt sensitivity. Our CoT prompt (“Let’s think step by step”) is one of many possible formulations. Different prompt wordings or few-shot examples could yield different flip rates. Our counterfactual controls vary the reasoning instruction but do not constitute a fully controlled experiment — they also vary verbosity and output format.
Counterfactual controls. The “Brief” and “Verify” conditions are suggestive but not clean causal manipulations. They show that format-free reasoning also causes flips, which is consistent with reasoning being the driver, but cannot fully isolate reasoning from correlated factors (prompt length, verbosity, attention distribution).
We identify and quantify the Feynman Trap: a systematic phenomenon where chain-of-thought prompting corrupts correct zero-shot answers in LLMs. Across four 7B-class models and two task domains, we find:
These findings challenge the assumption that reasoning is uniformly beneficial and suggest that the choice to use CoT should be task- and model-dependent. We hope this work encourages the community to evaluate CoT not only by its aggregate gains but also by its per-example costs.