DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems
Overview: DAGGER reformulates math word problems as executable computational graphs. Each node represents an operation or value, with an explicit `distractor: true/false` annotation that flags irrelevant information.
Abstract
Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. We introduce DistractMath-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information.
Evaluating seven models (3B-12B parameters), we observe substantial performance degradation under distractors: standard models drop up to 41 points, while reasoning-specialized models decline 14-20 points despite consuming 5x more tokens.
We propose DAGGER (Distractor-Aware Graph Generation for Executable Reasoning), which reformulates mathematical problem solving as executable computational graph generation with dedicated modeling of distractor nodes. Fine-tuning Gemma-3 models with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) achieves weighted accuracy comparable to reasoning models on the augmented benchmarks while using 89% fewer tokens. Importantly, this robustness emerges without explicit training on distractor-augmented examples.
The Problem: Distractors Break Mathematical Reasoning
We identify three types of distractors that cause significant performance degradation in LLMs:
RED: Related Entity Distractor
Numerical information about the same object type but attached to a different entity.
"তার বোন বুধবার ১২ জন ছেলেমেয়ের সঙ্গে লুকোচুরি খেলেছিল।"
"Her sister played hide-and-seek with 12 children on Wednesday."
OAD: Orthogonal Attribute Distractor
Properties along a different dimension than the queried attribute.
"সোমবার খেলতে ১ ঘণ্টা সময় লেগেছিল।"
"It took 1 hour to play on Monday." (when the question asks about a count)
NEED: Null-Effect Event Distractor
Actions with zero net impact (e.g., planned but never executed).
"রাজু ১০০০ টি দিতে রাজি হল, কিন্তু পরে আর দিল না।"
"Raju agreed to give 1000, but later didn't."
Performance Degradation Under Distractors
Standard CoT models drop up to 41 accuracy points; even reasoning-specialized models drop 14-20 points despite consuming 5x more tokens.
Our Solution: Computational Graph Generation
Why Graphs?
Instead of free-form Chain-of-Thought reasoning, DAGGER generates structured computational graphs in which a dedicated `distractor` field marks irrelevant information so it never enters the computation path.
Graph Example
```json
{
  "nodes": [
    {"id": "n1", "op": "const", "val": 100, "distractor": false, "label": "মিনার কলম"},
    {"id": "n2", "op": "const", "val": 50, "distractor": true, "label": "রাজুর কলম"},
    {"id": "n3", "op": "const", "val": 5, "distractor": false, "label": "কলমের দাম"},
    {"id": "total", "op": "mul", "args": ["n1", "n3"], "distractor": false},
    {"id": "final_result", "op": "identity", "args": ["total"], "distractor": false}
  ]
}
```
Node n2 (Raju's pens, value 50) is marked as a distractor and excluded from the computation path; executing the remaining nodes yields 100 × 5 = 500.
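
To make the format concrete, here is a minimal executor sketch, assuming nodes arrive in topological order as in the example; the `execute_graph` name and the small op vocabulary are illustrative, not the paper's reference interpreter.

```python
import json

# Illustrative op vocabulary; the full op set may differ.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "identity": lambda a: a,
}

def execute_graph(graph_json: str):
    """Evaluate a DAGGER-style graph, skipping nodes marked as distractors."""
    values = {}
    for node in json.loads(graph_json)["nodes"]:
        if node.get("distractor"):
            continue  # distractor nodes never enter the computation path
        if node["op"] == "const":
            values[node["id"]] = node["val"]
        else:
            # A live node referencing a distractor raises KeyError,
            # which flags the graph as malformed.
            args = [values[a] for a in node["args"]]
            values[node["id"]] = OPS[node["op"]](*args)
    return values["final_result"]
```

On the graph above this returns 500 (100 pens at a price of 5 each), with n2 never contributing.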
Training Pipeline: SFT + GRPO
Stage 1: Supervised Fine-Tuning (SFT)
- Train on 3,000 verified graph examples
- LoRA rank 64, 4 epochs
- Establishes graph generation capability (see the training sketch below)
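
A minimal sketch of this stage, assuming the Hugging Face trl, peft, and datasets libraries; the base-model id, data file, and lora_alpha are illustrative assumptions, while the rank-64 LoRA and 4 epochs match the setup above.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical file holding the 3,000 verified problem -> graph examples.
graph_dataset = load_dataset("json", data_files="dagger_graphs.jsonl", split="train")

# LoRA adapter at rank 64, as described above; lora_alpha is an assumption.
peft_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="google/gemma-3-12b-it",  # base model id is illustrative
    args=SFTConfig(num_train_epochs=4, output_dir="dagger-sft"),
    train_dataset=graph_dataset,
    peft_config=peft_config,
)
trainer.train()
```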
Stage 2: Group Relative Policy Optimization (GRPO)
- 8 generations per prompt
- Multi-component reward (see the sketch after this list):
- Format: +0.5 for valid JSON
- Execution: +0.5 for successful run
- Accuracy: +1.0 for correct answer
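
This reward decomposes into independent callables, which is how TRL's GRPOTrainer consumes reward functions. The sketch below is one plausible wiring, not the paper's exact implementation: it reuses the `execute_graph` helper sketched earlier, and `prompt_dataset` (prompts with a gold `answer` column) is a hypothetical stand-in for the training set.

```python
import json

from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    """+0.5 per completion that parses as valid JSON."""
    scores = []
    for c in completions:
        try:
            json.loads(c)
            scores.append(0.5)
        except json.JSONDecodeError:
            scores.append(0.0)
    return scores

def execution_reward(completions, **kwargs):
    """+0.5 per completion whose graph executes without error."""
    scores = []
    for c in completions:
        try:
            execute_graph(c)  # helper sketched in the graph section
            scores.append(0.5)
        except Exception:
            scores.append(0.0)
    return scores

def accuracy_reward(completions, answer, **kwargs):
    """+1.0 per completion whose executed result matches the gold answer."""
    scores = []
    for c, gold in zip(completions, answer):
        try:
            scores.append(1.0 if execute_graph(c) == gold else 0.0)
        except Exception:
            scores.append(0.0)
    return scores

trainer = GRPOTrainer(
    model="dagger-sft",  # start from the Stage 1 (SFT) checkpoint
    reward_funcs=[format_reward, execution_reward, accuracy_reward],
    args=GRPOConfig(num_generations=8, output_dir="dagger-grpo"),
    train_dataset=prompt_dataset,  # hypothetical prompts + `answer` column
)
trainer.train()
```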
Key Finding: SFT initialization is critical for smaller models: adding SFT before GRPO improves the 4B model's weighted average by about 18 points (47.3 vs. 29.0 for GRPO alone).
Results
Main Comparison
DAGGER achieves comparable accuracy to reasoning models while using 89% fewer tokens:
| Model | MGSM | MSVAMP | MGSM (+D) | MSVAMP (+D) | Avg. Tokens |
|---|---|---|---|---|---|
| Qwen3-8B (reasoning) | 88.0 | 81.1 | 70.5 | 66.9 | 3,128 |
| Gemma-3-12B (CoT) | 76.8 | 72.3 | 54.3 | 48.7 | 599 |
| **DAGGER-12B (ours)** | 78.4 | 78.8 | 64.0 | 66.8 | 359 |
(+D) = with distractors
Model Zoo
All models available on HuggingFace:
| Model | Training | MGSM | MSVAMP | MGSM (+D) | MSVAMP (+D) | Weighted Avg |
|---|---|---|---|---|---|---|
| **dagger-12B_SFT_GRPO** (best) | SFT → GRPO | 78.4 | 78.8 | 64.0 | 66.8 | 69.4 |
| dagger-12B_SFT | SFT only | 70.0 | 76.8 | 56.8 | 65.4 | 66.7 |
| dagger-12B_GRPO | Base → GRPO | 67.6 | 75.0 | 48.4 | 59.6 | 58.5 |
| dagger-4B_SFT_GRPO | SFT → GRPO | 54.8 | 70.3 | 31.4 | 42.9 | 47.3 |
| dagger-4B_SFT | SFT only | 40.4 | 65.0 | 25.1 | 42.4 | 44.3 |
| dagger-4B_GRPO | Base → GRPO | 29.2 | 57.1 | 13.1 | 29.3 | 29.0 |
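
A standard transformers snippet is enough to try a checkpoint; the repo id below is a placeholder for the actual HuggingFace path of, e.g., dagger-12B_SFT_GRPO, and the prompt format is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<org>/dagger-12B_SFT_GRPO"  # placeholder HuggingFace path
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

# Illustrative prompt; the model emits a JSON computational graph.
prompt = "Solve by generating a computational graph: <Bangla problem text>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```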
Datasets
DistractMath-BN (Evaluation)
A distractor-augmented Bangla benchmark, built on MGSM and MSVAMP, for evaluating robustness to irrelevant context.
DAGGER Training Data
Verified computational-graph annotations (3,000 examples) used for SFT.
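
If the datasets are published on HuggingFace alongside the models, they can be pulled with the datasets library; the repo ids and splits below are placeholders, not confirmed paths.

```python
from datasets import load_dataset

# Placeholder repo ids; substitute the actual HuggingFace dataset paths.
distractmath_bn = load_dataset("<org>/DistractMath-BN", split="test")
dagger_train = load_dataset("<org>/dagger-training-data", split="train")
```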
BibTeX
To be updated.