গণিতLLM
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

¹University of Maryland, Baltimore County  ²University of North Carolina at Charlotte

Key Results

  • +8 accuracy on Bn-MGSM
  • 88% Bengali reasoning tokens
  • 79% fewer generated words
  • 3.8x faster convergence

Overview: Current models reason in English even for Bengali questions (left). Our solution combines the GANIT dataset with SFT and Curriculum-GRPO (center) to achieve native Bengali reasoning with 88% Bengali tokens, 79% fewer words, and +8 accuracy (right).

Abstract

We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, Ganit), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings.

To address this, we construct GANIT, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning.

On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.

The Problem: English Reasoning for Bengali Questions

Current LLMs reason in English even when asked Bengali math questions, reducing interpretability for native speakers.

Bengali Question:
দুই সংখ্যার যোগফল ১২ এবং গুনফল ৩২। সংখ্যা দুইটি কি? (The sum of two numbers is 12 and their product is 32. What are the two numbers?)
Base Model Reasoning (English):
Let the two numbers be x and y.
Then, x + y = 12 and xy = 32
From these conditions, we form the quadratic equation:
t² - 12t + 32 = 0
Solving this equation gives: t = 4 and t = 8
Answer: ৪ এবং ৮ (4 and 8)

Only 14% Bengali tokens

Bengali Question:
দুই সংখ্যার যোগফল ১২ এবং গুনফল ৩২। সংখ্যা দুইটি কি? (The sum of two numbers is 12 and their product is 32. What are the two numbers?)
GanitLLM Reasoning (Bengali):
ধরি সংখ্যা দুটি x এবং y (Let the two numbers be x and y)
x + y = ১২
xy = ৩২
দ্বিঘাত সমীকরণ পাই: (We obtain the quadratic equation:)
t² - ১২t + ৩২ = 0
সুতরাং, t = ৪ এবং t = ৮ (Therefore, t = 4 and t = 8)
Answer: ৪ এবং ৮ (4 and 8)

88% Bengali tokens

Our Approach

1. GANIT Dataset

A rigorously processed, difficulty-aware Bengali math dataset:

  • Quality Screening: Manual evaluation, >95% accuracy threshold
  • Rule-based Filtering: Numerical solutions, >99% Bengali text
  • Deduplication: Fuzzy matching + MinHash
  • Decontamination: Against MGSM & MSVAMP
  • Difficulty Tagging: Using pass@32 from Qwen3-32B (see the sketch below the split table)
| Split     | Examples | Purpose                 |
|-----------|----------|-------------------------|
| GanitSFT  | 11,023   | Instruction tuning      |
| GanitRLVR | 7,328    | RL training (balanced)  |
| GanitDEV  | 776      | Evaluation              |
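
The difficulty tag for each problem comes from the pass@32 of Qwen3-32B, as noted in the list above. Below is a minimal sketch of that tagging step, assuming a `sample_fn` that queries the evaluator model; the bucket thresholds follow the Difficulty Distribution table further down, and the answer check is a simplified placeholder rather than the paper's exact verifier.

```python
import re

# Map Bengali digits to Western digits so numeric answers can be compared.
BN_TO_EN = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

def last_number(text: str) -> str | None:
    """Extract the last number in a string, normalizing Bengali digits."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.translate(BN_TO_EN))
    return nums[-1] if nums else None

def is_correct(prediction: str, gold: str) -> bool:
    # Simplified numeric check: compare the prediction's final number with the gold answer.
    return last_number(prediction) is not None and last_number(prediction) == last_number(gold)

def difficulty_from_pass_rate(pass_rate: float) -> str:
    """Bucket a per-problem pass rate (thresholds from the Difficulty Distribution table)."""
    if pass_rate > 0.75:
        return "easy"
    if pass_rate >= 0.50:
        return "medium"
    if pass_rate >= 0.25:
        return "hard"
    return "olympiad"

def tag_problem(problem: dict, sample_fn, k: int = 32) -> dict:
    """sample_fn(question) -> one sampled solution from the evaluator model (e.g. Qwen3-32B)."""
    passes = sum(is_correct(sample_fn(problem["question"]), problem["answer"]) for _ in range(k))
    problem["pass_rate"] = passes / k
    problem["difficulty"] = difficulty_from_pass_rate(problem["pass_rate"])
    return problem
```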

2. Curriculum-GRPO

A novel training recipe to tackle the cold-start problem:

  1. SFT Stage: Train on GanitSFT to ground reasoning in Bengali (not English)
  2. Curriculum Sampling: Order training data by difficulty (easy → hard) to avoid zero-reward updates
  3. Multi-component Rewards (see the sketch below):
     • Format: Validates <think> and <answer> tags
     • Correctness: +2.0 for a correct answer in Bengali numerals, +1.0 for a correct answer in English numerals
     • Bengali: Ensures >80% Bengali reasoning tokens
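
A minimal sketch of how the curriculum ordering and the three reward components might be combined. The tag regexes, the Bengali-script heuristic, and the format/Bengali reward weights are illustrative assumptions; only the +2.0/+1.0 correctness values come from the list above.

```python
import re

BN_TO_EN = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")
DIFFICULTY_ORDER = {"easy": 0, "medium": 1, "hard": 2, "olympiad": 3}

def curriculum_order(dataset: list[dict]) -> list[dict]:
    """Easy-to-hard ordering for curriculum sampling, so early GRPO batches still earn reward."""
    return sorted(dataset, key=lambda ex: DIFFICULTY_ORDER[ex["difficulty"]])

def format_reward(completion: str) -> float:
    """Small bonus for well-formed <think>...</think><answer>...</answer> outputs (weight assumed)."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL)
    return 0.5 if ok else 0.0

def correctness_reward(completion: str, gold: str) -> float:
    """+2.0 for a correct answer in Bengali numerals, +1.0 if correct but in English numerals."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    answer = m.group(1).strip()
    if answer.translate(BN_TO_EN) != gold.translate(BN_TO_EN).strip():
        return 0.0
    return 2.0 if any("\u09e6" <= ch <= "\u09ef" for ch in answer) else 1.0

def bengali_reward(completion: str, min_share: float = 0.8) -> float:
    """Bonus when >80% of alphabetic reasoning characters fall in the Bengali block U+0980-U+09FF."""
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    text = m.group(1) if m else completion
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    share = sum("\u0980" <= ch <= "\u09ff" for ch in letters) / len(letters)
    return 1.0 if share >= min_share else 0.0

def total_reward(completion: str, gold: str) -> float:
    return format_reward(completion) + correctness_reward(completion, gold) + bengali_reward(completion)
```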

Figure: GANIT dataset construction pipeline. Starting from ~1.5M Bengali math problems, we apply multi-stage quality filtration, verification, deduplication, and decontamination.
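
The deduplication and decontamination stages rely on fuzzy matching and MinHash. Below is a minimal sketch of MinHash-based near-duplicate removal using the datasketch library; the library choice, similarity threshold, and character-shingle size are assumptions, not the pipeline's verified configuration.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from character 5-gram shingles (script-agnostic, works for Bengali)."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(text) - 4)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

def deduplicate(problems: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only problems that are not near-duplicates of an earlier problem."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(problems):
        sig = minhash(text)
        if lsh.query(sig):  # near-duplicate of something already indexed
            continue
        lsh.insert(f"doc-{i}", sig)
        kept.append(text)
    return kept

# Decontamination can reuse the same index: insert the Bn-MGSM / Bn-MSVAMP test
# questions first, then drop any training problem whose signature matches them.
```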

Results

Main Results

GanitLLM enables smaller models to match or exceed larger counterparts while reasoning natively in Bengali:

| Model                | Bn-MGSM ↑ | Bn-MSVAMP ↑ | Words ↓ | Bengali % ↑ |
|----------------------|-----------|-------------|---------|-------------|
| GPT-4.1              | 89.20     | 82.30       | 200     | 88.16%      |
| GPT-4.1-mini         | 87.20     | 78.60       | 232     | 88.18%      |
| Qwen3-32B            | 85.60     | 76.10       | 712     | 21.08%      |
| Qwen3-14B            | 83.60     | 75.80       | 767     | 17.87%      |
| Qwen3-8B             | 69.20     | 52.60       | 977     | 19.26%      |
| Qwen3-4B             | 69.20     | 70.50       | 943     | 14.79%      |
| GanitLLM-4B (Ours)   | 76.80     | 76.40       | 193     | 88.71%      |
| Qwen3-1.7B           | 15.20     | 14.10       | 1124    | 19.64%      |
| GanitLLM-1.7B (Ours) | 52.80     | 66.80       | 210     | 87.80%      |
| Qwen3-0.6B           | 8.40      | 12.20       | 1265    | 12.43%      |
| GanitLLM-0.6B (Ours) | 28.40     | 52.40       | 248     | 88.70%      |

Ablation: Why Multi-stage Training?

SFT grounds language, GRPO improves accuracy. Both are necessary:

| Configuration      | Bn-MGSM | Bn-MSVAMP | Words | Bengali % |
|--------------------|---------|-----------|-------|-----------|
| Qwen3-4B (base)    | 69.20   | 70.50     | 943   | 14.79%    |
| + SFT only         | 74.00   | 74.60     | 184   | 86.65%    |
| + CGRPO only       | 82.40   | 78.50     | 844   | 14.94%    |
| SFT + CGRPO (Ours) | 76.80   | 76.40     | 193   | 88.71%    |

CGRPO alone achieves the highest accuracy but only 14.94% Bengali reasoning tokens. Our multi-stage approach balances accuracy with interpretability.

Models & Dataset

Pre-trained Models

All models available on HuggingFace:

| Model                   | Params | Training    |
|-------------------------|--------|-------------|
| GanitLLM-4B_SFT_CGRPO   | 4B     | SFT + CGRPO |
| GanitLLM-4B_SFT_GRPO    | 4B     | SFT + GRPO  |
| GanitLLM-1.7B_SFT_CGRPO | 1.7B   | SFT + CGRPO |
| GanitLLM-1.7B_SFT_GRPO  | 1.7B   | SFT + GRPO  |
| GanitLLM-0.6B_SFT_CGRPO | 0.6B   | SFT + CGRPO |
| GanitLLM-0.6B_SFT_GRPO  | 0.6B   | SFT + GRPO  |
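
A minimal loading sketch with the transformers library; the repository id below is a placeholder, since the exact HuggingFace organization is not listed on this page.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/GanitLLM-4B_SFT_CGRPO"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask a Bengali math question and decode only the newly generated tokens.
question = "দুই সংখ্যার যোগফল ১২ এবং গুনফল ৩২। সংখ্যা দুইটি কি?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```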

GANIT Dataset

Difficulty-aware Bengali math dataset:

  • 19K+ total examples
  • 4 difficulty levels
  • 3 splits
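
A loading sketch with the datasets library; the dataset repo id and split names are placeholders inferred from the split table above, not confirmed HuggingFace paths.

```python
from datasets import load_dataset

# Placeholder repo id; the actual HuggingFace path is not specified on this page.
ganit_sft = load_dataset("<org>/GANIT", split="GanitSFT")
ganit_rlvr = load_dataset("<org>/GANIT", split="GanitRLVR")
ganit_dev = load_dataset("<org>/GANIT", split="GanitDEV")

print(ganit_dev[0])  # expected fields such as question, answer, and difficulty
```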

Difficulty Distribution

| Difficulty | Criteria       | Share of GanitDEV |
|------------|----------------|-------------------|
| Easy       | >75% correct   | 28.7%             |
| Medium     | 50-75% correct | 26.0%             |
| Hard       | 25-50% correct | 24.3%             |
| Olympiad   | <25% correct   | 21.3%             |

BibTeX

To be updated.