DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

University of Maryland, Baltimore County

TL;DR: We treat claim decomposition as an RL policy optimization problem with seven multi-faceted rewards that define what makes a sub-question useful, informative, and diverse — closing the accuracy–traceability gap in fact verification with only 5k training claims.

Key Results

  • 86.3 Avg Balanced Accuracy on 9 In-Domain Benchmarks
  • 11 Fact-Verification Benchmarks Evaluated
  • 96.5% Less Training Data than the 155k Raw Claim Pool
  • 7B Policy Matching 32B & Frontier Models

DecomposeRL overview: Three reward axes filter candidate sub-questions — useful (necessity), informative (atomicity, answerability, correctness), and diverse (non-redundant) — composing the surviving questions into an auditable verdict.

What makes a sub-question useful, informative, and diverse? DecomposeRL filters candidate sub-questions along three reward axes — useful (its answer can change the verdict), informative (atomic, answerable, and grounded), diverse (non-redundant) — and composes the surviving questions into an auditable verdict.

Abstract

Fact verification is split between end-to-end classifiers, which are accurate but opaque, and decomposition-based methods, which produce inspectable traces but lag on harder benchmarks. We argue that this gap between accuracy and traceability exists because the decomposer is never trained to optimize what makes a decomposition useful.

We propose DecomposeRL, which closes the gap by treating the decomposer as a reinforcement-learning policy trained against a reward ensemble that defines decomposition quality. DecomposeRL adapts GRPO with a multi-faceted reward stack, complemented by a semi-supervised training mode that scores unlabeled claims against per-prompt majority-vote pseudo-labels. To avoid prohibitively long GRPO training, a multi-stage curation pipeline distills existing fact-verification corpora into a small, learning-signal-dense subset.

Experiments on 11 fact-verification benchmarks spanning biomedical, political, scientific, and general-domain claims show that DecomposeRL outperforms strong prompted and fine-tuned baselines while producing structured, traceable verification outputs, using only about 5k training claims compared with 14k for the strongest fine-tuned baseline.

Key Contributions

  1. Multi-faceted reward ensemble with two complementary formulations: per-question quality via a leave-one-out necessity matrix, and question-set-level contribution via a joint multiplicative reward.
  2. Semi-supervised training via a self-consistency reward that replaces gold verdicts with per-prompt majority-vote pseudo-labels, enabling training when annotated data is scarce.
  3. Data curation pipeline that distills 155k raw claims into a 96.5% smaller, learning-signal-dense training set of 5,464 claims.
  4. State-of-the-art results across 11 benchmarks: a 7B RL-trained policy matching or exceeding 32B and frontier models while producing inspectable verification traces.
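
The self-consistency idea in contribution 2 can be illustrated with a minimal sketch. It assumes the policy samples several rollouts per claim and that each rollout ends in a verdict string. The function names, the verdict strings, and the choice to give zero reward on a tied vote are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def majority_vote_label(verdicts):
    """Return the most common verdict among one prompt's rollouts, or None."""
    ranked = Counter(verdicts).most_common()
    if not ranked:
        return None
    # Assumption: a tied vote yields no usable pseudo-label.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def self_consistency_rewards(verdicts):
    """Reward each rollout with 1.0 if it agrees with the pseudo-label."""
    pseudo_label = majority_vote_label(verdicts)
    if pseudo_label is None:
        return [0.0] * len(verdicts)
    return [1.0 if v == pseudo_label else 0.0 for v in verdicts]

# Four rollouts for one unlabeled claim (verdict strings are hypothetical).
rollout_verdicts = ["supported", "supported", "refuted", "supported"]
rewards = self_consistency_rewards(rollout_verdicts)
```

Because the pseudo-label is computed per prompt, no gold verdict is needed for that claim.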

Our Approach

Data Curation Pipeline

A multi-stage funnel that distills 155k raw claims from 14 corpora into 5,464 learning-signal-dense training claims:

  1. Source Aggregation: Pool 14 public fact-verification corpora into a single (claim, evidence, label) set
  2. Rule & NER Filtering: Remove structurally unsuitable rows (too short/long evidence, <2 named entities)
  3. Difficulty Filtering: Keep only claims in the informative difficulty band (0.3 ≤ confidence ≤ 0.8), dropping trivial and likely mislabeled claims
  4. Dedup & Decontamination: MinHash + semantic dedup, three-pronged decontamination against all 11 test sets
  5. Diversity Selection: Submodular Facility-Location maximization with label-balanced, source-proportional budgets
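
Two of these stages can be sketched compactly: the difficulty band from step 3 and a greedy maximizer for the facility-location objective from step 5. This is a hedged illustration only. It assumes non-negative pairwise similarities, uses the standard greedy algorithm for monotone submodular maximization, and omits the label-balanced, source-proportional budgets. All function names are hypothetical.

```python
def in_difficulty_band(confidence, low=0.3, high=0.8):
    """Keep claims whose confidence lies in the band 0.3 <= c <= 0.8."""
    return low <= confidence <= high

def facility_location_greedy(sim, budget):
    """Greedily select indices S maximizing sum_i max_{j in S} sim[i][j].

    `sim` is a square list-of-lists of non-negative similarities.
    """
    n = len(sim)
    selected = []
    coverage = [0.0] * n  # best similarity of each item to the selected set
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding candidate j to the selected set.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]
    return selected

# Toy example: items 0 and 1 form one cluster, items 2 and 3 another.
toy_sim = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
picked = facility_location_greedy(toy_sim, budget=2)
```

With a budget of two, the greedy pass picks one item from each cluster. Selection is meant to spread the training set across distinct regions of the claim space in this way.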

Reward Stack (7 Rewards)

Multi-faceted rewards that capture what makes a decomposition useful:

Programmatic Anchors (no judge calls):

  • R_fmt (Format): well-formed XML, Q→A alternation, valid verdict
  • R_qc (Question count): triangular kernel on ratio to silver decomposition
  • R_div (Diversity): MMR penalty over question embeddings
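
As a rough sketch of how the question-count and diversity anchors could be computed without a judge, the snippet below uses a triangular kernel centered on a ratio of 1 and an MMR-style redundancy penalty. The kernel width, the penalty form (mean of each question's maximum cosine similarity to earlier questions), and the function names are assumptions for illustration; the source does not give these details.

```python
import math

def question_count_reward(n_pred, n_silver, width=1.0):
    """Triangular kernel on the ratio n_pred / n_silver, peaking at ratio 1."""
    if n_silver <= 0:
        return 0.0
    ratio = n_pred / n_silver
    return max(0.0, 1.0 - abs(ratio - 1.0) / width)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def diversity_reward(embeddings):
    """One minus the mean redundancy of each question w.r.t. earlier ones."""
    if len(embeddings) < 2:
        return 1.0
    redundancy = [
        max(cosine(embeddings[i], embeddings[j]) for j in range(i))
        for i in range(1, len(embeddings))
    ]
    return max(0.0, 1.0 - sum(redundancy) / len(redundancy))
```

Under these assumptions, a decomposition with as many questions as the silver one scores 1.0 on count, and two identical question embeddings score 0.0 on diversity.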

Per-Question Quality (judge-based, multiplicative):

  • R_ans (Answerability): is the question answerable from evidence?
  • R_atom (Atomicity): 5-criterion binary checklist
  • R_corr (Correctness): is the answer faithful to the document?
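
A multiplicative combination means that a question failing any one check contributes nothing. The sketch below assumes binary judge outputs for answerability and correctness, scores atomicity as the fraction of the five checklist items passed, and averages over questions. Those aggregation choices and all names here are assumptions, not the paper's specification.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QuestionJudgement:
    answerable: bool            # judge: answerable from the evidence?
    atomic_checks: List[bool]   # judge: five binary atomicity criteria
    correct: bool               # judge: answer faithful to the document?

def atomicity_score(checks):
    """Fraction of checklist criteria passed (assumed aggregation)."""
    return sum(checks) / len(checks) if checks else 0.0

def question_quality(j):
    """Product of the three per-question signals."""
    return float(j.answerable) * atomicity_score(j.atomic_checks) * float(j.correct)

def mean_quality(judgements):
    """Average per-question quality over one decomposition."""
    if not judgements:
        return 0.0
    return sum(question_quality(j) for j in judgements) / len(judgements)

good = QuestionJudgement(True, [True, True, True, True, False], True)
unanswerable = QuestionJudgement(False, [True] * 5, True)
```

Here `good` scores 0.8 and `unanswerable` scores 0.0, so a perfectly atomic question still earns nothing if the evidence cannot answer it.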

Set-Level Signals (outcome-grounded):

  • R_ver (Verification): does the policy's verdict match the gold label?
  • R_cov (Coverage): can the answers alone reconstruct the gold verdict?
  • R_nec (Necessity): leave-one-out: does removing a question change the verdict?
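
The leave-one-out test behind the necessity signal can be sketched as follows. The sketch assumes access to some verifier that maps a list of (question, answer) pairs to a verdict. `toy_verifier` is a hypothetical stand-in for that component, and scoring necessity as the fraction of necessary questions is likewise an assumption.

```python
def necessity_flags(qa_pairs, predict_verdict):
    """For each question, does dropping it change the predicted verdict?"""
    full_verdict = predict_verdict(qa_pairs)
    flags = []
    for i in range(len(qa_pairs)):
        reduced = qa_pairs[:i] + qa_pairs[i + 1:]
        flags.append(predict_verdict(reduced) != full_verdict)
    return flags

def necessity_reward(qa_pairs, predict_verdict):
    """Fraction of questions whose removal flips the verdict (assumed form)."""
    flags = necessity_flags(qa_pairs, predict_verdict)
    return sum(flags) / len(flags) if flags else 0.0

def toy_verifier(pairs):
    """Hypothetical verifier: needs both q1 and q2 to support the claim."""
    asked = {question for question, _ in pairs}
    return "supported" if {"q1", "q2"} <= asked else "not enough info"

pairs = [("q1", "yes"), ("q2", "yes"), ("q3", "yes")]
flags = necessity_flags(pairs, toy_verifier)
```

In this toy case `q3` is flagged as unnecessary because the verdict is unchanged without it, which is the kind of redundancy the reward is meant to discourage.
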

The DecomposeRL reward stack and semi-supervised training. (A) Seven rewards across three categories feed GRPO. (B) A supervision rate splits claims into labeled and unlabeled paths, enabling semi-supervised training with self-consistency pseudo-labels.
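
The two mechanics in this caption can be sketched under stated assumptions: routing claims by a supervision rate, and the group-relative normalization that GRPO applies to the rewards of rollouts sampled from the same prompt. The random routing with a fixed seed and the function names are illustrative. The normalization follows the standard GRPO form (subtract the group mean, divide by the group standard deviation).

```python
import random
import statistics

def split_by_supervision_rate(claim_ids, rate, seed=0):
    """Send each claim to the labeled path with probability `rate`."""
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for claim_id in claim_ids:
        if rng.random() < rate:
            labeled.append(claim_id)    # scored against the gold verdict
        else:
            unlabeled.append(claim_id)  # scored against a pseudo-label
    return labeled, unlabeled

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

labeled, unlabeled = split_by_supervision_rate(range(10), rate=0.5)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
```

A group in which every rollout earns the same reward yields all-zero advantages, so such prompts contribute no gradient signal.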

Results

All 7B Methods Comparison

Balanced accuracy (%) of all 7B baselines and DecomposeRL across 9 in-domain datasets plus 2 out-of-domain benchmarks:

| Method | FEVER | ClaimDec. | HoVer | FEVEROUS | WiCE | Ex-FEVER | PubHealth | PubMedCl. | FoolMe2x | Micro | Avg | CoverBench | LLMAggreFact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DecomposeRL-7B (Ours) | 74.1 | 98.6 | 76.4 | 93.1 | 86.5 | 87.6 | 87.5 | 85.5 | 87.7 | 84.4 | 86.3 | 62.5 | 77.0 |
| Simple (7B) | 72.7 | 94.9 | 71.0 | 93.5 | 83.2 | 82.7 | 84.2 | 84.1 | 86.6 | 82.0 | 83.7 | 52.5 | 74.9 |
| CoT (7B) | 70.0 | 95.5 | 70.9 | 92.2 | 85.6 | 83.8 | 83.8 | 83.2 | 85.0 | 81.8 | 83.3 | 59.7 | 77.2 |
| DecomP (7B) | 65.5 | 95.3 | 69.0 | 91.9 | 85.0 | 78.0 | 85.7 | 82.5 | 84.1 | 79.3 | 81.9 | 55.3 | 76.2 |
| HiSS (7B) | 67.7 | 92.8 | 70.2 | 92.7 | 83.6 | 82.4 | 79.2 | 77.0 | 84.5 | 80.7 | 81.1 | 58.3 | 75.7 |
| MiniCheck | 69.9 | 77.5 | 73.8 | 89.2 | 87.2 | 82.9 | 76.3 | 83.0 | 84.5 | 81.9 | 80.5 | 54.6 | 80.3 |
| Self-Ask (7B) | 66.5 | 92.7 | 66.9 | 91.9 | 82.5 | 71.7 | 84.2 | 82.6 | 82.8 | 76.7 | 80.2 | 56.9 | 77.1 |
| FOLK (7B) | 65.0 | 90.8 | 68.2 | 91.0 | 83.6 | 80.2 | 80.5 | 77.8 | 83.1 | 79.0 | 80.0 | 53.8 | 75.6 |
| QACheck (7B) | 65.4 | 97.3 | 59.1 | 92.7 | 83.0 | 65.4 | 91.0 | 78.0 | 81.6 | 73.1 | 79.3 | 52.8 | 68.9 |
| Chen-2024 (7B) | 65.4 | 91.1 | 65.3 | 87.9 | 79.6 | 73.3 | 83.3 | 79.2 | 82.3 | 75.7 | 78.6 | 56.8 | 70.2 |
| ProgramFC (7B) | 60.5 | 92.9 | 65.9 | 88.2 | 85.4 | 74.6 | 77.4 | 74.3 | 76.9 | 75.2 | 77.3 | 53.1 | 73.5 |
| ClaimDecomp (7B) | 65.2 | 78.9 | 63.5 | 85.5 | 79.2 | 71.6 | 76.0 | 77.6 | 79.4 | 73.3 | 75.2 | 52.1 | 71.6 |

DecomposeRL-7B leads on 8 of the 11 in-domain columns (the nine datasets plus Micro and Avg) and on CoverBench (OOD).
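
The scores above are balanced accuracy. A minimal sketch of the standard definition (the mean of per-class recall) is given below, on the assumption that the benchmarks use this definition. It shows why the metric is stricter than plain accuracy on label-imbalanced test sets. The function name and toy labels are illustrative.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    if not totals:
        return 0.0
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# A classifier that always predicts class 1 on a 4:1 imbalanced set.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
score = balanced_accuracy(y_true, y_pred)
```

On this toy set plain accuracy would be 0.8, while balanced accuracy is 0.5 because recall on the minority class is zero.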

DecomposeRL (7B) vs. Best at Every Scale

Best baseline at each model scale (3B / 7B / 14B / 32B) plus the best frontier system (GPT-4.1-mini):

| Method | FEVER | ClaimDec. | HoVer | FEVEROUS | WiCE | Ex-FEVER | PubHealth | PubMedCl. | FoolMe2x | Micro | Avg | CoverBench | LLMAggreFact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Best @ 3B (Simple) | 71.5 | 94.0 | 63.7 | 89.0 | 73.7 | 82.0 | 81.2 | 78.7 | 79.3 | 77.6 | 79.2 | 51.3 | 74.0 |
| Best @ 7B (Simple) | 72.7 | 94.9 | 71.0 | 93.5 | 83.2 | 82.7 | 84.2 | 84.1 | 86.6 | 82.0 | 83.7 | 52.5 | 74.9 |
| Best @ 14B (DecomP) | 71.1 | 100.0 | 75.0 | 90.9 | 89.0 | 83.4 | 86.7 | 85.3 | 88.3 | 83.1 | 85.5 | 61.3 | 79.3 |
| Best @ 32B (DecomP) | 68.6 | 100.0 | 76.2 | 93.2 | 91.3 | 85.1 | 86.8 | 87.4 | 90.3 | 84.7 | 86.5 | 64.2 | 79.4 |
| Best Frontier (GPT-4.1-mini) | 70.9 | 100.0 | 76.7 | 93.5 | 87.2 | 88.3 | 86.4 | 87.1 | 91.1 | 85.8 | 86.8 | 68.6 | 78.9 |
| DecomposeRL-7B (Ours) | 74.1 | 98.6 | 76.4 | 93.1 | 86.5 | 87.6 | 87.5 | 85.5 | 87.7 | 84.4 | 86.3 | 62.5 | 77.0 |

DecomposeRL (7B) beats the best 7B baseline by 2.6 Avg points (86.3 vs. 83.7) and is competitive with the best 32B (86.5) and frontier (86.8) systems.

All 7B Methods (Plot)


Balanced accuracy of all 7B methods across 9 in-domain benchmarks. DecomposeRL has the highest average among prompted and fine-tuned 7B baselines and leads on six of the nine benchmarks; FEVEROUS, WiCE, and PubHealth are the exceptions.

Model Scaling

Performance scaling comparison across model sizes: 3B, 7B, 14B, 32B, and frontier

DecomposeRL at 7B closes the gap with significantly larger models and frontier systems.

Resources

Model

The trained DecomposeRL-7B policy is available on HuggingFace:

  • 7B Parameters
  • GRPO Training
  • 5.4k Training Claims

🤗 Download Model

Dataset & Collection

The curated training set and full collection of artifacts:

  • 14 Source Corpora
  • 5,464 Curated Claims
  • 11 Test Benchmarks

BibTeX


@article{dipta2025decomposerl,
  title={DecomposeRL: Learning to Ask Useful, Informative, and Diverse
         Questions for Semi-Supervised, Traceable Claim Verification},
  author={Shubhashis Roy Dipta and Ankur Padia and Francis Ferraro},
  year={2025},
  url={https://arxiv.org/},
}