
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

1 ETH Zurich    2 ETH AI Center
* Equal contribution

Introduction

Reinforcement learning from human feedback (RLHF) is a key component for aligning LLMs with human preferences. The standard RLHF pipeline trains a reward model on pairwise comparison data, then uses that model to guide policy optimization via algorithms like PPO or GRPO. However, reward models trained on limited, noisy datasets are imperfect, and overoptimizing against them can cause reward hacking, where the LLM exploits flaws in the reward signal rather than learning human preferences.

Uncertainty quantification (UQ) for reward models has emerged as a promising mitigation. By explicitly modeling the epistemic uncertainty arising from finite training data, uncertainty-aware reward models enable: (1) penalizing or filtering unreliable reward signals during alignment, and (2) guiding active data collection toward the most informative preference samples, improving sample efficiency.

However, most studies adopt a single UQ method without systematic comparison, leaving design choices largely unexplored. RewardUQ fills this gap with a unified framework, standardized metrics, and an open-source Python package.

Uncertainty-Aware Reward Model Architectures

For a prompt-completion pair \( (x, y) \), all architectures predict a pointwise reward \( r_{\theta}(x, y) \) and an uncertainty estimate \( u_{\theta}(x, y) \), combined into symmetric confidence bounds:

\[ \overline{r_{\theta}}(x,y) = r_{\theta}(x,y) + \beta \cdot u_{\theta}(x,y) \] \[ \underline{r_{\theta}}(x,y) = r_{\theta}(x,y) - \beta \cdot u_{\theta}(x,y) \]
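As a concrete illustration, the bounds above translate directly into code. This is a minimal NumPy sketch; the function name and the example values of \( \beta \), \( r \), and \( u \) are ours, not taken from the released package.

```python
import numpy as np

def confidence_bounds(r, u, beta=1.0):
    """Symmetric confidence bounds around the pointwise reward.

    r    : reward estimates r_theta(x, y)
    u    : uncertainty estimates u_theta(x, y), assumed non-negative
    beta : bound-width multiplier (a hyperparameter in this sketch)
    """
    r = np.asarray(r, dtype=float)
    u = np.asarray(u, dtype=float)
    return r + beta * u, r - beta * u  # upper bound, lower bound

upper, lower = confidence_bounds([0.2, -0.5], [0.1, 0.3], beta=2.0)
# upper = [0.4, 0.1], lower = [0.0, -1.1]
```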

We compare four architectures, illustrated in Figure 1 below.

Figure 1: Uncertainty-aware reward model architectures compared in this work. For a given prompt x and completion y, each model extracts an embedding z from a pretrained LM and predicts a reward r and uncertainty estimate u. Blue components indicate the parts responsible for estimating the uncertainty; 🔥 and ❄ denote trainable and frozen components, respectively.
ENS-MLP

MLP Head Ensemble

K independent MLP heads on top of a frozen LLM backbone. Regularization keeps each head close to its random initialization and centers rewards around zero.
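The core idea can be sketched as follows. All names, dimensions, and the random frozen weights here are illustrative stand-ins for a trained backbone and heads: K heads score the same embedding, and their disagreement serves as the uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 16, 32, 4  # embedding dim, hidden dim, number of heads

# K independently initialized two-layer MLP heads (frozen in this sketch;
# during training, regularization toward the random inits keeps heads diverse).
heads = [
    (rng.normal(scale=D**-0.5, size=(D, H)), rng.normal(scale=H**-0.5, size=(H, 1)))
    for _ in range(K)
]

def ensemble_reward(z):
    """Reward mean and epistemic uncertainty from head disagreement."""
    outs = np.array([(np.tanh(z @ W1) @ W2).item() for W1, W2 in heads])
    return outs.mean(), outs.std()

z = rng.normal(size=D)  # stand-in for a frozen-backbone embedding of (x, y)
r, u = ensemble_reward(z)
```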

ENS-LoRA

LoRA Adapter Ensemble

K low-rank adapters applied to the LLM backbone, each producing rewards through a linear head.
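In spirit, each ensemble member replaces a backbone projection \( W_0 \) with \( W_0 + BA \) for low-rank factors \( B, A \). The sketch below is a toy single-layer version with random weights (all shapes and names are ours); a real implementation would adapt many layers of the LM.

```python
import numpy as np

rng = np.random.default_rng(1)
D, rank, K = 16, 4, 3  # embedding dim, LoRA rank, ensemble size

W0 = rng.normal(scale=D**-0.5, size=(D, D))  # frozen backbone projection

# Each ensemble member: a low-rank update B @ A plus its own linear reward head.
adapters = [
    (rng.normal(scale=0.01, size=(D, rank)),   # B
     rng.normal(scale=0.01, size=(rank, D)),   # A
     rng.normal(scale=D**-0.5, size=(D, 1)))   # per-member reward head
    for _ in range(K)
]

def lora_ensemble_reward(z):
    """Reward mean and uncertainty from disagreement across LoRA members."""
    outs = []
    for B, A, v in adapters:
        h = z @ (W0 + B @ A)      # adapted frozen layer: W0 + low-rank delta
        outs.append((h @ v).item())
    outs = np.array(outs)
    return outs.mean(), outs.std()

r, u = lora_ensemble_reward(rng.normal(size=D))
```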

MCD-DPO

MC Dropout (DPO)

Monte Carlo dropout applied before the language modeling head of a DPO fine-tuned policy. Uncertainty is estimated from repeated stochastic forward passes and the resulting implicit rewards.
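A toy version of the sampling step, with a random weight vector standing in for the LM head and the DPO implicit-reward computation abstracted away (names and rates are illustrative): dropout stays active at inference, and the spread across stochastic passes estimates uncertainty.

```python
import numpy as np

rng = np.random.default_rng(2)
D, p = 16, 0.1                          # embedding dim, dropout rate
w = rng.normal(scale=D**-0.5, size=D)   # stand-in for the head that yields rewards

def mc_dropout_reward(z, n_samples=50):
    """Mean and std of rewards over repeated stochastic forward passes."""
    samples = []
    for _ in range(n_samples):
        mask = (rng.random(D) >= p) / (1 - p)  # inverted dropout, kept on at test time
        samples.append(float((z * mask) @ w))
    samples = np.array(samples)
    return samples.mean(), samples.std()

r, u = mc_dropout_reward(rng.normal(size=D))
```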

BAY-LIN

Bayesian Linear Head

Single linear reward head on LLM embeddings with a Gaussian prior. A Laplace approximation provides a posterior over parameters, enabling analytic uncertainty estimates.
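The analytic step can be sketched with a Gaussian-likelihood surrogate (the actual method presumably takes the Laplace Hessian of the preference loss; the MAP weights and data here are random placeholders): the posterior covariance is the inverse Hessian at the MAP estimate, giving a closed-form predictive variance.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 8, 200
Z = rng.normal(size=(N, D))             # embeddings of training examples
w_map = rng.normal(scale=0.1, size=D)   # stand-in for the MAP head weights
prior_prec = 1.0                        # Gaussian prior N(0, prior_prec^-1 I)

# Laplace approximation: posterior covariance = inverse Hessian of the
# negative log-posterior at the MAP (Gaussian-likelihood surrogate here).
H = Z.T @ Z + prior_prec * np.eye(D)
Sigma = np.linalg.inv(H)

def bayes_linear_reward(z):
    """Analytic reward mean and epistemic std for a new embedding z."""
    r = float(z @ w_map)
    u = float(np.sqrt(z @ Sigma @ z))
    return r, u

r, u = bayes_linear_reward(rng.normal(size=D))
```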

Evaluation Metrics

RewardUQ evaluates uncertainty-aware reward models along two dimensions: accuracy (do the predictions and their confidence bounds reflect true preferences?) and calibration (do predicted probabilities match empirical frequencies?). We evaluate both dimensions for the reward point estimates as well as for the uncertainty bounds.

Accuracy Metrics. The win rate measures how often the reward model assigns a higher reward to the preferred completion. Beyond pointwise accuracy, we categorize predictions by whether the confidence intervals of the preferred and non-preferred completions overlap. A prediction is confident when the intervals do not overlap, and unconfident otherwise. Combined with correctness, this yields four categories: confident true (CT), confident false (CF), unconfident true (UT), and unconfident false (UF).
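The four-way categorization can be written down directly. In this sketch, the helper name is ours and correctness (whether the preferred completion received the higher reward) is supplied as a flag:

```python
def categorize(lo_pref, hi_pref, lo_rej, hi_rej, correct):
    """Classify one pairwise prediction into CT / CF / UT / UF.

    A prediction is confident when the intervals [lo_pref, hi_pref]
    (preferred) and [lo_rej, hi_rej] (non-preferred) do not overlap.
    """
    confident = lo_pref > hi_rej or lo_rej > hi_pref
    return ("C" if confident else "U") + ("T" if correct else "F")

# Preferred completion clearly above the rejected one, and correct:
print(categorize(0.8, 1.2, -0.5, 0.1, correct=True))   # → "CT"
# Overlapping intervals with a wrong ranking:
print(categorize(-0.2, 0.4, 0.0, 0.6, correct=False))  # → "UF"
```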

Figure 2: A prediction is confident when the upper and lower bounds do not overlap and unconfident otherwise. Combined with correctness, this yields four outcomes: confident true (CT), confident false (CF), unconfident true (UT), and unconfident false (UF). An ideal uncertainty-aware reward model maximizes CT while minimizing CF.

A good uncertainty-aware model maximizes the confident true (CT) rate while minimizing the confident false (CF) rate, as it should be confident when correct and uncertain when wrong. To reduce these metrics to a single comparable score, we propose a ranking score that rewards confident correct predictions and penalizes confident incorrect ones:

\[ \text{RS}_\alpha = \frac{\text{CT rate}}{\text{win rate} + \alpha \cdot (1 - \text{win rate})} - \frac{\text{CF rate}}{(1 - \text{win rate}) + \alpha \cdot \text{win rate}} \;\in [-1,\, 1] \]

The trade-off parameter \(\alpha \in [0, 1]\) balances focus between confidence (\(\alpha = 0\)) and accuracy (\(\alpha = 1\)). For evaluations, we use \(\text{RS}_{0.2}\) as a balanced choice.
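The formula above translates directly into code (function name ours; rates are fractions over the evaluation set):

```python
def ranking_score(ct_rate, cf_rate, win_rate, alpha=0.2):
    """RS_alpha as defined above; lies in [-1, 1].

    ct_rate, cf_rate : fraction of confident-true / confident-false predictions
    win_rate         : fraction of pairs where the preferred completion
                       receives the higher reward
    """
    pos = ct_rate / (win_rate + alpha * (1 - win_rate))
    neg = cf_rate / ((1 - win_rate) + alpha * win_rate)
    return pos - neg

# A model that is mostly confident-and-correct scores high:
print(round(ranking_score(ct_rate=0.6, cf_rate=0.05, win_rate=0.8), 3))  # → 0.575
```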

Calibration Metrics. Beyond accuracy and confidence, the rewards and their bounds should also be well-calibrated. The Expected Calibration Error (ECE) measures the gap between predicted preference probabilities and true empirical probabilities, computed via binning. The Expected Bound Calibration Error (EBCE) extends calibration to the confidence bounds, penalizing lower bounds that overestimate and upper bounds that underestimate the true preference probability.
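A standard equal-width-binning implementation of ECE (a generic sketch, not necessarily the package's exact binning scheme):

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE via equal-width binning of predicted preference probabilities.

    probs    : model probability that the first completion is preferred
    outcomes : 1 if it actually was preferred, else 0
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |mean confidence - empirical accuracy|, weighted by bin size
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return ece

# Calibrated toy data: predicted 0.7 in a bin where 70% are actual wins.
probs = np.array([0.7] * 10)
outcomes = np.array([1] * 7 + [0] * 3)
ece = expected_calibration_error(probs, outcomes)  # ≈ 0 for this example
```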

Experimental Results

We train and evaluate each architecture across three preference datasets and two model families: the general-purpose Qwen 3 series and the task-aligned Skywork-Reward-V2 Qwen 3 series. Hyperparameters are selected on UltraFeedback's validation split by first applying calibration thresholds (ECE < 0.05, EBCE < 0.01) and then ranking by \(\text{RS}_{0.2}\). Final results are reported on RewardBench.

Figure 3: Ranking scores (\(\text{RS}_{0.2}\)) on RewardBench across different UQ methods, training datasets, pretrained and finetuned models, and model sizes.
Figure 4: Calibration diagrams for Qwen3-0.6B (top) and Qwen3-4B (bottom) trained on UltraFeedback and evaluated on RewardBench. ECE (left) and EBCE (right). Predictions are well-calibrated when they fall on the diagonal.
Key Findings

Base model initialization matters most. Methods that rely on a frozen LLM backbone for embeddings (BAY-LIN, ENS-MLP) benefit the most from task-aligned initializations like the Skywork reward model family.

No single method dominates. Performance depends on model size, training dataset, and base-model pretraining.

Calibration is generally well-behaved. All UQ methods maintain ECE below 0.10 and EBCE below 0.04 across configurations.

BibTeX

@article{yang2025rewarduq,
  title = {RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models},
  author = {Yang, Daniel and Stante, Samuel and Redhardt, Florian and Libon, Lena and Kassraie, Parnian and Hakimi, Ido and Pásztor, Barna and Krause, Andreas},
  year = {2026},
  journal = {arXiv preprint arXiv:2602.24040},
  url = {https://arxiv.org/abs/2602.24040},
}