Abstain-R1: Abstention Recognition and Calibration in Post-Training of LLMs

Haotian Zhai1, Jingcheng Liang1, Haotian Huang1, Zekang Li1
1University of Minnesota, Twin Cities

Abstract

Reinforcement fine-tuning improves large language model reasoning, but it often incentivizes models to produce an answer even for underspecified queries, leading to guessing and hallucination. Prior approaches either enforce generic abstention (e.g., ``I don't know'') or encourage follow-up questions without supervising the quality of refusal rationales, resulting in superficial abstention without meaningful clarification.

We propose a novel Reinforcement Learning with Verifiable Rewards (RLVR) framework that treats unanswerability as an explicit learning target and jointly optimizes calibrated abstention and high-quality post-refusal clarification, while preserving strong performance on answerable queries. To enable stable training from scratch, we construct a cold-start dataset by augmenting existing benchmarks with structured reasoning traces and evaluate our approach across multiple dimensions, including answer accuracy on solvable queries, refusal calibration, format adherence, and the semantic quality of clarifications.

Experimental results show that, despite being a 3B-parameter model, Abstain-R1 achieves performance comparable to GPT-5 in terms of false-unknown rate, refusal calibration, and clarification quality, while maintaining strong performance on answerable queries.

Overview figure
Figure 1: Comparison between uninformative refusal (left) and informative refusal (right).

General Overview

Novelty: Compared to the state-of-the-art methods/systems/datasets, how novel is your approach? Is your work publishable?

Our work diverges from traditional alignment methods that primarily rely on Supervised Fine-Tuning (SFT) or binary safety rewards to enforce generic refusal. The core novelty of our RLVR framework lies in treating "clarification" not as a simple linguistic pattern, but as a verifiable reasoning objective. We uniquely employ Group Relative Policy Optimization (GRPO) driven by a hierarchical split-incentive mechanism (0.3 for refusal, 0.7 for verified clarification). By strictly verifying mathematical correctness via symbolic execution and clarification logic via an LLM-as-judge, we effectively transform "unanswerability" from a static safety constraint into an explicit, learnable reasoning task, bridging the gap between rigorous reasoning and helpful alignment.

This work is highly publishable as it rigorously addresses the "Hallucination Tax" by transforming "unanswerability" into a verifiable reasoning task via our novel RLVR framework. Validated by strict evaluation protocols, our methodology offers a significant technical contribution to the field of reliable LLM alignment.

Significance: How strong is your result? Does your finding still hold under different setups or prompting tricks?

Our results demonstrate strong robustness to experimental variability. To mitigate the inherent stochasticity of language model generation, we adopt a controlled evaluation protocol: instead of relying on a single inference pass, we perform five independent rollouts for each model under identical instructions and report the averaged performance. This procedure reduces the influence of random effects.
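To make the protocol concrete, the sketch below shows the rollout-averaging loop; `generate_fn` and `metric_fn` are hypothetical stand-ins for our inference wrapper and scoring code, not the actual implementation.

```python
import statistics

def averaged_metric(generate_fn, metric_fn, eval_set, n_rollouts=5):
    """Average an evaluation metric over several independent rollouts.

    generate_fn(example, seed) -> model output   (hypothetical inference wrapper)
    metric_fn(outputs, eval_set) -> float        (hypothetical scoring function)
    """
    scores = []
    for seed in range(n_rollouts):
        outputs = [generate_fn(example, seed=seed) for example in eval_set]
        scores.append(metric_fn(outputs, eval_set))
    return statistics.mean(scores)
```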

We further examine the robustness of our findings through a sensitivity analysis of the evaluation metric based on an LLM-as-judge. Specifically, we evaluate models using prompts with varying levels of strictness. Across all evaluation settings, our proposed model, Abstain-R1, consistently outperforms other models in the quality of refusal rationales.

This confirms that our findings are intrinsic to the model's capability and hold true regardless of setup variations.


Introduction / Background / Motivation

What did you try to do? What problem did you try to solve? Articulate your objectives using absolutely no jargon.

We tried to fix a common habit in current AI models: AI systems often give confident answers even when they do not actually have enough information to answer correctly. This can lead to wrong or made-up responses that sound convincing but are unreliable.

Our goal was to teach an AI system not only to say “I can't answer this” when a question cannot be answered, but also to explain why it cannot answer and what information would be needed to move forward. At the same time, we wanted the system to keep giving accurate answers when a question can be answered, rather than refusing too often.

How is it done today by other researchers? What are the limitations and challenges of current practice?

Current research on unanswerability and abstention mainly focuses on identifying when models should refuse to answer and analyzing how failures to abstain are linked to hallucinations. Benchmark studies such as AbstentionBench show that many widely used language models fail to abstain appropriately when questions are unanswerable. Other work, such as Hallucination Tax, demonstrates that when queries lack necessary conditions, reinforcement-learning-tuned models may invent missing constraints and respond with high confidence. From a theoretical perspective, prior analyses argue that standard evaluation setups reward only correct answers while assigning no credit to abstention, which implicitly incentivizes guessing rather than admitting uncertainty.

In applied and high-stakes domains, such as clinical reasoning, domain-specific systems like KnowGuard emphasize evidence-aware abstention, particularly in multi-turn settings where critical information is missing. These approaches highlight the importance of refusing to answer when evidence is insufficient, especially to avoid harmful outcomes.

However, existing practices face several limitations. Many reinforcement-learning-based approaches focus primarily on enforcing a generic refusal (e.g., saying “I don't know”) or encouraging follow-up questions, without explicitly evaluating whether the post-refusal content is useful, actionable, or well-justified. As a result, models may learn to abstain as a surface behavior without providing meaningful explanations or clarifications. This lack of explicit supervision and evaluation for post-refusal quality makes it difficult to ensure that abstention behavior is informative, calibrated, and helpful to users.

Who might be interested in your work? What kinds of impact can you make?

Many people rely on AI systems for information, decision support, and learning, but these systems often give confident answers even when they should not. This can mislead users, reduce trust, and in some cases cause real harm, especially when users do not realize that an answer is unreliable.

If this project is successful, AI systems will become better at recognizing their own limits. Instead of guessing or giving vague refusals, they can clearly explain why a question cannot be answered and what information is missing. This makes AI systems more transparent, more trustworthy, and easier for people to work with. Over time, this can reduce misinformation, help users make better decisions, and encourage safer use of AI in settings where correctness and honesty matter.


Approach

What did you do exactly? How did you solve the problem? Why did you think it would be successful? What is your hypothesis?

We addressed the issue of model hallucination on underspecified queries by implementing a two-stage alignment pipeline for the Qwen2.5-3B-Instruct model, grounded in the hypothesis that unanswerability should be treated as an explicit, learnable reasoning task rather than a simple binary refusal. To achieve this, we first constructed a high-quality cold-start dataset by augmenting existing benchmarks with structured reasoning traces to initialize the policy.
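For illustration, a single cold-start record might look like the sketch below; the tag names (`<think>`, `<answer>`, `<clarification>`) and field names are assumptions made for readability rather than our exact schema.

```python
# Hypothetical cold-start records; tag and field names are illustrative only.
answerable_example = {
    "query": "A train travels 120 km in 2 hours. What is its average speed in km/h?",
    "target": (
        "<think>Average speed = distance / time = 120 / 2 = 60 km/h.</think>"
        "<answer>\\boxed{60}</answer>"
    ),
}

unanswerable_example = {
    "query": "A train travels 120 km. What is its average speed in km/h?",
    "target": (
        "<think>Speed requires both distance and time, but the travel time is not given.</think>"
        "<answer>\\boxed{I don't know}</answer>"
        "<clarification>The travel time is missing and must be provided to compute the speed.</clarification>"
    ),
}
```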

Data augmentation pipeline
Figure 2: Data augmentation pipeline for constructing the cold-start dataset with structured reasoning traces.

We solved the core optimization challenge by engineering a hierarchical, verifiable reward function that strictly prioritizes structural integrity before evaluating semantic precision. Our implementation employs a hard gating mechanism: if the output violates the required XML structure (missing tags or incorrect order), the total score is immediately zeroed out, bypassing all subsequent checks.
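A minimal sketch of this gate, assuming a `<think>/<answer>/<clarification>` template (the concrete tag set in our prompts may differ), is shown below.

```python
import re

# Illustrative hard gate: tag names and ordering are assumptions about the template.
_STRUCTURE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    r"(?:<clarification>.+?</clarification>\s*)?$",
    re.DOTALL,
)

def format_reward(output: str) -> float:
    """Return 1.0 only if the XML skeleton is intact and a \\boxed{...} span exists.

    Returning 0.0 zeroes the total score downstream, bypassing all content checks.
    """
    if _STRUCTURE.match(output) is None:
        return 0.0
    if "\\boxed{" not in output:
        return 0.0
    return 1.0
```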

For content evaluation, we treat answerable and unanswerable queries with distinct, rigorous logic. For solvable queries, we utilize symbolic verification to reward mathematical accuracy while imposing a strict penalty (-1.0) for lazy refusals (e.g., claiming "I don't know" when a solution exists).

Crucially, for unanswerable queries, we devised a split-incentive mechanism: the model earns a baseline reward (0.3) simply for outputting the standard refusal token \boxed{I don't know}, but can only unlock the remaining majority credit (0.7) if the accompanying clarification is extracted and verified as semantically correct by an LLM-as-a-Judge.

\[
\begin{aligned}
r_{\text{fmt}} &= \begin{cases}
1, & \text{if the XML structure is valid and the } \backslash\texttt{boxed}\text{ span is valid} \\
0, & \text{otherwise,}
\end{cases} \\[12pt]
r_{\text{ans}} &= \begin{cases}
1, & \text{if the answer matches the ground truth} \\
-1, & \text{if the output is } \boxed{\text{I don't know}} \\
0, & \text{otherwise,}
\end{cases} \\[12pt]
r_{\text{ref}} &= \begin{cases}
1, & \text{if the output contains } \boxed{\text{I don't know}} \text{ and } \mathcal{V}(q, c^\star, \hat{c}) = \texttt{Correct} \\
0.3, & \text{if the output contains } \boxed{\text{I don't know}} \text{ but the clarification is not verified} \\
0, & \text{otherwise,}
\end{cases}
\end{aligned}
\]
where the judge verdict \( \mathcal{V}(q, c^\star, \hat{c}) \in \{\texttt{Correct}, \texttt{Incorrect}\} \) compares the predicted clarification \( \hat{c} \) against the reference \( c^\star \) for query \( q \).
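Putting the pieces together, a sketch of the combined reward is shown below. The helper callables (`format_ok`, `math_equal`, `judge`, `extract_clarification`) are hypothetical stand-ins for the format gate, the symbolic checker, and the LLM-as-a-Judge described above, not our exact implementation.

```python
from typing import Callable, Optional

REFUSAL = "\\boxed{I don't know}"

def total_reward(
    output: str,
    query: str,
    ground_truth: Optional[str],             # None marks an unanswerable query
    reference_clarification: Optional[str],
    format_ok: Callable[[str], bool],        # hard structural gate
    math_equal: Callable[[str, str], bool],  # symbolic answer check (e.g. Math-Verify)
    judge: Callable[[str, Optional[str], str], bool],  # LLM-as-a-Judge verdict V(q, c*, c_hat)
) -> float:
    """Hierarchical reward sketch: structure first, then content (illustrative only)."""
    # Hard gate: any structural violation zeroes the score before content checks.
    if not format_ok(output):
        return 0.0

    if ground_truth is not None:
        # Answerable query: reward correctness, penalize lazy refusals.
        if REFUSAL in output:
            return -1.0
        return 1.0 if math_equal(output, ground_truth) else 0.0

    # Unanswerable query: split incentive 0.3 (refusal) + 0.7 (verified clarification).
    if REFUSAL not in output:
        return 0.0
    clarification = extract_clarification(output)  # hypothetical extractor below
    if clarification and judge(query, reference_clarification, clarification):
        return 1.0
    return 0.3

def extract_clarification(output: str) -> str:
    """Pull the clarification span out of the output; the tag name is an assumption."""
    start = output.find("<clarification>")
    end = output.find("</clarification>")
    if start == -1 or end == -1:
        return ""
    return output[start + len("<clarification>"):end].strip()
```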

We anticipated success because Group Relative Policy Optimization (GRPO) is fundamentally better suited for reasoning tasks than traditional Actor-Critic methods. In complex chain-of-thought generation, training a separate Value Model to accurately predict expected returns is notoriously unstable and prone to high variance. GRPO circumvents this by using the group average as a dynamic baseline. This "self-referential" baseline provides a lower-variance advantage estimate, which is critical when the reward signal is sparse or binary (like mathematical correctness). We hypothesized that this stable optimization, combined with our dense, hierarchical reward signals (e.g., the 0.3/0.7 split), would allow the model to effectively navigate the narrow optimization landscape between "hallucination" and "lazy refusal" without suffering from reward collapse.
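The group-relative baseline itself reduces to a few lines; the sketch below (in PyTorch, which we assume as the training stack) normalizes each rollout's scalar reward by the statistics of its own group.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for rewards of shape [num_prompts, group_size].

    Each rollout is scored against the mean and std of its own group, so no
    separate value model is needed to supply a baseline. Sketch only.
    """
    mean = rewards.mean(dim=1, keepdim=True)  # dynamic, self-referential baseline
    std = rewards.std(dim=1, keepdim=True)    # group scale for variance reduction
    return (rewards - mean) / (std + eps)

# Example: one prompt, four rollouts with rewards 1.0, 0.3, 0.0, 1.0
# advantages = group_relative_advantages(torch.tensor([[1.0, 0.3, 0.0, 1.0]]))
```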

GRPO pipeline
Figure 3: Overview of the GRPO training pipeline.

What challenges did you anticipate and/or encounter during the development of your approach? Did the very first thing you tried work? What is the scientific novelty of your approach to addressing the challenges?

We encountered four major challenges during development:

  1. Mode Collapse and Reward Hacking
    Early in training, the model discovered that outputting "I don't know" was a shortcut to avoid penalties for incorrect math answers while still collecting the base refusal reward. This led to mode collapse, where the refusal rate hit nearly 100%. We overcame this by introducing a strict negative penalty (-1.0) for "lazy refusals" on answerable queries, forcing the model to actively balance safety with helpfulness.
  2. Noisy Reward Signals from String Matching
    Initially, we relied on exact string matching to evaluate answer correctness. This proved too rigid for mathematical tasks (e.g., rejecting "1/2" when the target was "0.5"), generating false negatives that confused the policy optimizer and destabilized training. We resolved this by integrating the Math-Verify library for symbolic comparison (see the short sketch after this list), ensuring the reward signal reflected genuine mathematical understanding rather than formatting luck.
  3. Infrastructure and Throughput Bottlenecks
    We faced substantial computational hurdles: the standard HuggingFace generation loop was too slow for the extensive group sampling required by GRPO. We migrated our rollout backend to vLLM, which dramatically improved inference throughput.
  4. Resource Constraints in Evaluation
    Running a large LLM-as-a-Judge alongside the policy model caused frequent Out-of-Memory (OOM) errors and slowed down the reward computation loop. We experimented with various judge models and ultimately selected xVerify-3B, a specialized lightweight verifier. This choice struck the optimal balance between judging accuracy and memory efficiency, allowing us to maintain a large batch size during training without OOM crashes.
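As referenced in item 2 above, the snippet below illustrates the symbolic comparison, assuming the Math-Verify package's `parse`/`verify` entry points (pip install math-verify).

```python
# Symbolic answer comparison via Math-Verify: equivalent forms such as "1/2"
# and "0.5", which exact string matching rejects, are accepted here.
from math_verify import parse, verify

gold = parse("0.5")
pred = parse("1/2")
print(verify(gold, pred))  # expected: True, since 1/2 == 0.5 symbolically
```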

Results

Results Table (strict)
Table 1: Results on ABSTAIN-TEST-SUM evaluated by the strict protocol. Models marked with * use the V1 model instruction; models without * use the V2 model instruction. Bold indicates the best performance in each column.
Results Table (permissive)
Table 2: Results on ABSTAIN-TEST-SUM evaluated by the permissive protocol. Models marked with * use the V1 model instruction; models without * use the V2 model instruction. Bold indicates the best performance in each column.
Results Table (w/o SUM, permissive)
Table 3: Results on ABSTAIN-TEST-w/o SUM evaluated by the permissive protocol. All models use the V2 model instruction. Bold indicates the best performance in each column. Because general-domain datasets tend to be slightly less challenging than math-focused benchmarks like SUM, producing an abstention signal may be relatively easier. However, for the untrained Qwen2.5-3B-Instruct model, generating high-quality clarifications remains a significant challenge.

Our method, Abstain-R1 (based on Qwen2.5-3B-Instruct), demonstrates that rigorous reinforcement learning can effectively solve the "Hallucination Tax" without compromising reasoning capabilities. Despite its compact size, Abstain-R1 achieves clarification and refusal performance comparable to—and in some cases exceeding—much larger proprietary reasoning models.

Key Performance Highlights

Zero False Refusals on Solvable Queries: Abstain-R1 achieves a 0.0% False Unknown rate (A-FU) on answerable queries. This proves that our "strict negative penalty" mechanism successfully prevents the model from becoming "lazy" or overly conservative—a common failure mode in safety-aligned models.

Competitive Mathematical Accuracy: While gaining the ability to refuse, the model maintains a strong 61.4% Accuracy (A-Acc) on answerable mathematical tasks, remaining highly competitive within the open-source 3B parameter family.

Surpassing Larger Reasoning Models: On unanswerable queries, Abstain-R1 delivers meaningful clarifications that outperform specialized reasoning models.

  • Vs. DeepSeek Reason: We achieve notably stronger performance (U-Ref 51.7% vs. 45.1%; U-Clar 50.0% vs. 43.7%).
  • Vs. GPT-5.1 Reasoning Medium: We achieve comparable performance (U-Ref 51.7% vs. 50.7%; U-Clar 50.0% vs. 48.2%), demonstrating that efficient RLVR can bridge the gap between small open-source models and proprietary giants.