DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems

ÂšUniversity of California, Riverside  ²University of Maryland, Baltimore County  ÂŗOracle Health AI
arXiv Preprint

Key Results

  • 89% fewer tokens vs. reasoning models
  • 69.4% weighted average accuracy
  • 12-14 point accuracy drop under distractors (vs. 18-41 points for CoT)
  • 3,685 distractor-augmented problems
DAGGER Overview: Instead of Chain-of-Thought reasoning, DAGGER generates executable computational graphs with explicit distractor node modeling.

Overview: DAGGER reformulates math word problems as executable computational graphs. Each node represents an operation or value, with explicit distractor: true/false annotations to identify irrelevant information.

Abstract

Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. We introduce DistractMath-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information.

Evaluating seven models (3B-12B parameters), we observe substantial performance degradation under distractors: standard models drop up to 41 points, while reasoning-specialized models decline 14-20 points despite consuming 5x more tokens.

We propose DAGGER (Distractor-Aware Graph Generation for Executable Reasoning), which reformulates mathematical problem solving as executable computational graph generation with dedicated modeling of distractor nodes. Fine-tuning Gemma-3 models with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) achieves weighted accuracy comparable to dedicated reasoning models on the augmented benchmarks while using 89% fewer tokens. Notably, this robustness emerges without explicit training on distractor-augmented examples.

The Problem: Distractors Break Mathematical Reasoning

We identify three types of distractors that cause significant performance degradation in LLMs:

RED: Related Entity Distractor

Numerical information about the same object type but a different entity.

"Her sister played hide-and-seek with 12 children on Wednesday." (translated from Bangla)

OAD: Orthogonal Attribute Distractor

Properties along a dimension different from the queried attribute.

"It took 1 hour to play on Monday." (translated from Bangla; the question asks about a count)

NEED: Null-Effect Event Distractor

Actions with zero net impact (e.g., planned but never executed).

"Raju agreed to give 1000, but later didn't." (translated from Bangla)

Performance Degradation Under Distractors

Standard CoT models lose up to 41 accuracy points. Even reasoning models, despite consuming 5x more tokens, drop 14-20 points.

Worst-case CoT drop: 41 points.

Our Solution: Computational Graph Generation

Why Graphs?

Instead of free-form Chain-of-Thought reasoning, DAGGER generates structured computational graphs:

1. Structured Output: JSON graphs are deterministically executable, leaving no ambiguity in reasoning steps.
2. Explicit Distractor Modeling: Each node has a distractor field to mark irrelevant information.
3. Verifiable Execution: Graphs can be executed to verify correctness, enabling RL with execution-based rewards (see the executor sketch after the example below).
4. Token Efficiency: The compact graph representation uses 89% fewer tokens than verbose reasoning chains.

Graph Example

{
  "nodes": [
    {"id": "n1", "op": "const", "val": 100,
     "distractor": false, "label": "Mina's pens"},
    {"id": "n2", "op": "const", "val": 50,
     "distractor": true, "label": "Raju's pens"},
    {"id": "n3", "op": "const", "val": 5,
     "distractor": false, "label": "price per pen"},
    {"id": "total", "op": "mul",
     "args": ["n1", "n3"], "distractor": false},
    {"id": "final_result", "op": "identity",
     "args": ["total"], "distractor": false}
  ]
}

Node n2 (Raju's pens) is marked as a distractor and excluded from the computation path.
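
To make the execution semantics concrete, here is a minimal executor sketch in Python. Only const, mul, and identity appear in the example above; the remaining arithmetic ops and the final_result convention are assumptions about the full op set, not the paper's exact implementation.

import json

# Minimal sketch of a DAGGER-style graph executor. Only const, mul, and
# identity appear in the example above; add/sub/div are assumed extras.
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def execute_graph(graph_json: str) -> float:
    nodes = {n["id"]: n for n in json.loads(graph_json)["nodes"]}
    cache = {}

    def eval_node(node_id):
        if node_id in cache:
            return cache[node_id]
        node = nodes[node_id]
        if node["op"] == "const":
            value = node["val"]
        elif node["op"] == "identity":
            value = eval_node(node["args"][0])
        else:
            left, right = (eval_node(a) for a in node["args"])
            value = BINARY_OPS[node["op"]](left, right)
        cache[node_id] = value
        return value

    # Evaluation starts from final_result, so nodes marked
    # "distractor": true are simply never visited.
    return eval_node("final_result")

On the graph above, this returns 100 × 5 = 500; the distractor node n2 never enters the evaluation.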

Training Pipeline: SFT + GRPO

Stage 1: Supervised Fine-Tuning (SFT)

  • Train on 3,000 verified graph examples
  • LoRA rank 64, 4 epochs (a config sketch follows this list)
  • Establishes graph generation capability
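
A minimal LoRA configuration sketch using the PEFT library; only the rank (64) comes from the list above, while lora_alpha, dropout, and target_modules are illustrative assumptions.

from peft import LoraConfig

# Only r=64 is stated above; alpha, dropout, and target_modules
# are illustrative assumptions, not the paper's exact settings.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)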

Stage 2: Group Relative Policy Optimization (GRPO)

  • 8 generations per prompt
  • Multi-component reward (sketched below):
    • Format: +0.5 for valid JSON
    • Execution: +0.5 for a successful run
    • Accuracy: +1.0 for the correct answer
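
A Python sketch of this reward, reusing the execute_graph helper sketched earlier; the numeric tolerance for answer matching is an assumption.

import json

def grpo_reward(completion: str, gold_answer: float) -> float:
    # Components from the list above: +0.5 format, +0.5 execution,
    # +1.0 accuracy (max 2.0). The tolerance below is an assumption.
    reward = 0.0
    try:
        json.loads(completion)
        reward += 0.5  # format: valid JSON
    except json.JSONDecodeError:
        return reward
    try:
        result = execute_graph(completion)  # executor sketched earlier
        reward += 0.5  # execution: graph runs end to end
    except Exception:
        return reward
    if abs(result - gold_answer) < 1e-6:
        reward += 1.0  # accuracy: matches the gold answer
    return reward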

Key Finding: SFT initialization is critical for smaller models: adding SFT before GRPO improves the 4B model's weighted average by 18 points (47.3 vs. 29.0).

Results

Main Comparison

DAGGER achieves comparable accuracy to reasoning models while using 89% fewer tokens:

Model                  MGSM   MSVAMP   MGSM (+D)   MSVAMP (+D)   Tokens
Qwen3-8B (Reasoning)   88.0   81.1     70.5        66.9           3,128
Gemma 3-12B (CoT)      76.8   72.3     54.3        48.7             599
DAGGER-12B (Ours)      78.4   78.8     64.0        66.8             359

(+D) = with distractors

Model Zoo

All models available on HuggingFace:

Model                 Training            MGSM   MSVAMP   MGSM (+D)   MSVAMP (+D)   Weighted Avg
dagger-12B_SFT_GRPO   SFT → GRPO (best)   78.4   78.8     64.0        66.8          69.4
dagger-12B_SFT        SFT only            70.0   76.8     56.8        65.4          66.7
dagger-12B_GRPO       Base → GRPO         67.6   75.0     48.4        59.6          58.5
dagger-4B_SFT_GRPO    SFT → GRPO          54.8   70.3     31.4        42.9          47.3
dagger-4B_SFT         SFT only            40.4   65.0     25.1        42.4          44.3
dagger-4B_GRPO        Base → GRPO         29.2   57.1     13.1        29.3          29.0
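
Loading and prompting one of these checkpoints with standard transformers calls might look like the sketch below; the HuggingFace organization prefix is not given on this page, so the repo id is hypothetical.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id: the HF organization is not stated on this page.
repo_id = "your-org/dagger-12B_SFT_GRPO"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

prompt = "Solve by generating a computational graph as JSON:\nMina has 100 pens..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)  # DAGGER outputs average ~359 tokens
graph_json = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
answer = execute_graph(graph_json)  # executor sketched earlier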

Datasets

DistractMath-BN (Evaluation)

Distractor-augmented benchmark for evaluating robustness

  • 3,685 total examples
  • 3 distractor types (RED, OAD, NEED)
  • 2 source datasets (MGSM, MSVAMP)

DAGGER Training Data

Verified computational graph annotations for training

  • 3,000 training examples
  • 481 validation examples
  • 2 configs (SFT/GRPO)
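
A loading sketch with the datasets library; the repo ids below are hypothetical since this page does not state the HF paths, while the SFT/GRPO config names come from the list above.

from datasets import load_dataset

# Hypothetical repo ids: the page names the datasets but not their HF paths.
benchmark = load_dataset("your-org/DistractMath-BN", split="test")    # 3,685 examples
sft_data = load_dataset("your-org/dagger-training-data", "SFT")       # 3,000 train / 481 val
grpo_data = load_dataset("your-org/dagger-training-data", "GRPO")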

BibTeX

To be updated.