DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems

ÂšUniversity of California, Riverside  ²University of Maryland, Baltimore County  ÂŗOracle Health AI
arXiv Preprint

Key Results

  • 89% fewer tokens vs. reasoning models
  • 69.4% weighted average accuracy
  • 12-14 point accuracy drop under distractors (vs. 18-41 points for CoT)
  • 3,685 distractor-augmented problems
DAGGER Overview: Instead of Chain-of-Thought reasoning, DAGGER generates executable computational graphs with explicit distractor node modeling.

Overview: DAGGER reformulates math word problems as executable computational graphs. Each node represents an operation or value, with explicit distractor: true/false annotations to identify irrelevant information.

Abstract

Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. We introduce DistractMath-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information.

Evaluating seven models (3B-12B parameters), we observe substantial performance degradation under distractors: standard models drop up to 41 points, while reasoning-specialized models decline 14-20 points despite consuming 5x more tokens.

We propose DAGGER (Distractor-Aware Graph Generation for Executable Reasoning), which reformulates mathematical problem solving as executable computational graph generation with dedicated modeling of distractor nodes. Fine-tuning Gemma-3 models with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) achieves weighted accuracy comparable to dedicated reasoning models on the augmented benchmarks while using 89% fewer tokens. Notably, this robustness emerges without explicit training on distractor-augmented examples.

The Problem: Distractors Break Mathematical Reasoning

We identify three types of distractors that cause significant performance degradation in LLMs:

RED: Related Entity Distractor

Numerical information about the same object type but a different entity.

"Her sister played hide-and-seek with 12 children on Wednesday." (translated from Bangla)

OAD: Orthogonal Attribute Distractor

Properties along a dimension different from the queried attribute.

"It took 1 hour to play on Monday." (translated from Bangla; the question asks about a count)

NEED: Null-Effect Event Distractor

Actions with zero net impact (e.g., planned but never executed).

"Raju agreed to give 1000, but later didn't." (translated from Bangla)

Performance Degradation Under Distractors

Standard CoT models lose up to 41 accuracy points. Even reasoning models, despite consuming 5x more tokens, drop 14-20 points.

Worst-case CoT drop: 41 points.

Our Solution: Computational Graph Generation

Why Graphs?

Instead of free-form Chain-of-Thought reasoning, DAGGER generates structured computational graphs:

1. Structured Output: JSON graphs are deterministically executable, leaving no ambiguity in reasoning steps.
2. Explicit Distractor Modeling: Each node has a distractor field to mark irrelevant information.
3. Verifiable Execution: Graphs can be executed to verify correctness, enabling RL with execution-based rewards (see the executor sketch after the example below).
4. Token Efficiency: The compact graph representation uses 89% fewer tokens than verbose reasoning chains.

Graph Example

{
  "nodes": [
    {"id": "n1", "op": "const", "val": 100,
     "distractor": false, "label": "Mina's pens"},
    {"id": "n2", "op": "const", "val": 50,
     "distractor": true, "label": "Raju's pens"},
    {"id": "n3", "op": "const", "val": 5,
     "distractor": false, "label": "price per pen"},
    {"id": "total", "op": "mul",
     "args": ["n1", "n3"], "distractor": false},
    {"id": "final_result", "op": "identity",
     "args": ["total"], "distractor": false}
  ]
}

Node n2 (Raju's pens) is marked as a distractor and excluded from the computation path.
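
To make the execution semantics concrete, here is a minimal executor sketch in Python. Only const, mul, and identity appear in the example above; the remaining arithmetic ops and the final_result convention are assumptions about the full op set, not the paper's exact implementation.

import json

# Minimal sketch of a DAGGER-style graph executor. Only const, mul, and
# identity appear in the example above; add/sub/div are assumed extras.
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def execute_graph(graph_json: str) -> float:
    nodes = {n["id"]: n for n in json.loads(graph_json)["nodes"]}
    cache = {}

    def eval_node(node_id):
        if node_id in cache:
            return cache[node_id]
        node = nodes[node_id]
        if node["op"] == "const":
            value = node["val"]
        elif node["op"] == "identity":
            value = eval_node(node["args"][0])
        else:
            left, right = (eval_node(a) for a in node["args"])
            value = BINARY_OPS[node["op"]](left, right)
        cache[node_id] = value
        return value

    # Evaluation starts from final_result, so nodes marked
    # "distractor": true are simply never visited.
    return eval_node("final_result")

On the graph above, this returns 100 × 5 = 500; the distractor node n2 never enters the evaluation.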

Training Pipeline: SFT + GRPO

Stage 1: Supervised Fine-Tuning (SFT)

  • Train on 3,000 verified graph examples
  • LoRA rank 64, 4 epochs (a config sketch follows this list)
  • Establishes graph generation capability
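
A minimal LoRA configuration sketch using the PEFT library; only the rank (64) comes from the list above, while lora_alpha, dropout, and target_modules are illustrative assumptions.

from peft import LoraConfig

# Only r=64 is stated above; alpha, dropout, and target_modules
# are illustrative assumptions, not the paper's exact settings.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)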

Stage 2: Group Relative Policy Optimization (GRPO)

  • 8 generations per prompt
  • Multi-component reward (sketched below):
    • Format: +0.5 for valid JSON
    • Execution: +0.5 for a successful run
    • Accuracy: +1.0 for the correct answer
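
A Python sketch of this reward, reusing the execute_graph helper sketched earlier; the numeric tolerance for answer matching is an assumption.

import json

def grpo_reward(completion: str, gold_answer: float) -> float:
    # Components from the list above: +0.5 format, +0.5 execution,
    # +1.0 accuracy (max 2.0). The tolerance below is an assumption.
    reward = 0.0
    try:
        json.loads(completion)
        reward += 0.5  # format: valid JSON
    except json.JSONDecodeError:
        return reward
    try:
        result = execute_graph(completion)  # executor sketched earlier
        reward += 0.5  # execution: graph runs end to end
    except Exception:
        return reward
    if abs(result - gold_answer) < 1e-6:
        reward += 1.0  # accuracy: matches the gold answer
    return reward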

Key Finding: SFT initialization is critical for smaller models: adding SFT before GRPO improves the 4B model's weighted average by 18 points (47.3 vs. 29.0).

Results

Main Comparison

DAGGER achieves comparable accuracy to reasoning models while using 89% fewer tokens:

Model                  MGSM   MSVAMP   MGSM (+D)   MSVAMP (+D)   Tokens
Qwen3-8B (Reasoning)   88.0   81.1     70.5        66.9           3,128
Gemma 3-12B (CoT)      76.8   72.3     54.3        48.7             599
DAGGER-12B (Ours)      78.4   78.8     64.0        66.8             359

(+D) = with distractors

Model Zoo

All models available on HuggingFace:

Model                 Training            MGSM   MSVAMP   MGSM (+D)   MSVAMP (+D)   Weighted Avg
dagger-12B_SFT_GRPO   SFT → GRPO (best)   78.4   78.8     64.0        66.8          69.4
dagger-12B_SFT        SFT only            70.0   76.8     56.8        65.4          66.7
dagger-12B_GRPO       Base → GRPO         67.6   75.0     48.4        59.6          58.5
dagger-4B_SFT_GRPO    SFT → GRPO          54.8   70.3     31.4        42.9          47.3
dagger-4B_SFT         SFT only            40.4   65.0     25.1        42.4          44.3
dagger-4B_GRPO        Base → GRPO         29.2   57.1     13.1        29.3          29.0
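
Loading and prompting one of these checkpoints with standard transformers calls might look like the sketch below; the HuggingFace organization prefix is not given on this page, so the repo id is hypothetical.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id: the HF organization is not stated on this page.
repo_id = "your-org/dagger-12B_SFT_GRPO"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

prompt = "Solve by generating a computational graph as JSON:\nMina has 100 pens..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)  # DAGGER outputs average ~359 tokens
graph_json = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
answer = execute_graph(graph_json)  # executor sketched earlier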

Datasets

DistractMath-BN (Evaluation)

Distractor-augmented benchmark for evaluating robustness

  • 3,685 total examples
  • 3 distractor types (RED, OAD, NEED)
  • 2 source datasets (MGSM, MSVAMP)

DAGGER Training Data

Verified computational graph annotations for training

  • 3,000 training examples
  • 481 validation examples
  • 2 configs (SFT/GRPO)
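
A loading sketch with the datasets library; the repo ids below are hypothetical since this page does not state the HF paths, while the SFT/GRPO config names come from the list above.

from datasets import load_dataset

# Hypothetical repo ids: the page names the datasets but not their HF paths.
benchmark = load_dataset("your-org/DistractMath-BN", split="test")    # 3,685 examples
sft_data = load_dataset("your-org/dagger-training-data", "SFT")       # 3,000 train / 481 val
grpo_data = load_dataset("your-org/dagger-training-data", "GRPO")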

BibTeX

To be updated.