ICML 2026 Spotlight

Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

University of Illinois Urbana-Champaign
The model capability continuum connects model-strong tasks to prior-leaning objectives and model-weak tasks to NLL.

The right SFT objective depends on how much useful prior knowledge the base model already has.

Abstract

Supervised fine-tuning (SFT) is the standard approach for post-training large language models, yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm where models already encode task-relevant priors and supervision can be long and noisy.

We systematically study probability-based objectives and characterize when and why different objectives succeed or fail. Across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a model-capability continuum: near the model-strong end, prior-leaning objectives that downweight low-probability tokens consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails.

Why look beyond NLL?

Standard SFT with -log p gives the largest gradient to tokens the model currently assigns low probability. That is natural when training from scratch: low probability usually means the model has not learned the concept yet.

Post-training is different. A pretrained LLM may already know much of the target task, while long chain-of-thought demonstrations can contain noisy, redundant, or locally unreliable tokens. In those cases, blindly forcing the model to imitate every low-probability token can damage an otherwise useful prior.

Prior-Averse

-log p

Emphasizes low-probability tokens and broadly corrects weak or misaligned priors.

Prior-Leaning

-p, -p^10, thresholded variants

Emphasizes already plausible tokens and protects reliable model priors from noisy supervision.
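The two families above can be written as per-token losses on the probability p that the model assigns to the target token. The following is a minimal sketch in plain Python; the function names are illustrative and are not taken from the paper's code.

```python
import math


def nll(p: float) -> float:
    """Prior-averse: -log p grows without bound as p approaches 0."""
    return -math.log(p)


def neg_p(p: float) -> float:
    """Prior-leaning: -p stays bounded in [-1, 0]."""
    return -p


def neg_p_pow(p: float, k: int = 10) -> float:
    """Prior-leaning: -p^k; k=10 gives the -p^10 objective."""
    return -(p ** k)


# Compare the loss values at a low, medium, and high target probability.
for p in (0.01, 0.5, 0.99):
    print(f"p={p:<5} -log p={nll(p):7.3f}  -p={neg_p(p):7.3f}  -p^10={neg_p_pow(p):10.3e}")
```

The loss on a low-probability token is large under -log p and close to zero under -p and -p^10, which is why the latter two put little pressure on tokens the model finds implausible.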

Probability-based objectives as gradient shape

Logit-gradient weights induced by different probability-based objectives.

We study objectives of the form $\mathcal{L}_f(p_\theta) = \mathbb{E}\left[f(p_\theta(\tilde{y} \mid x))\right]$. The key object is not just the loss value, but how the objective distributes logit-gradient weight over the model's current probability p. NLL is prior-averse because it keeps strong pressure on low-probability tokens. Objectives such as -p are prior-leaning because they shift training signal toward tokens the model already considers plausible.
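To make the gradient shape concrete, assume p is the softmax probability of the target token y computed from logits z, and write p_k for the softmax probability of token k (so p = p_y). This setup and the symbol w_f are assumptions introduced for illustration. The chain rule then gives:

```latex
\begin{aligned}
p &= \frac{e^{z_y}}{\sum_j e^{z_j}}, \qquad
\frac{\partial p}{\partial z_k} = p\,\bigl(\mathbb{1}[k=y] - p_k\bigr), \\
\frac{\partial f(p)}{\partial z_k} &= -\,w_f(p)\,\bigl(\mathbb{1}[k=y] - p_k\bigr),
\qquad w_f(p) := -f'(p)\,p, \\
f(p) = -\log p &\;\Rightarrow\; w_f(p) = 1, \\
f(p) = -p &\;\Rightarrow\; w_f(p) = p, \\
f(p) = -p^{10} &\;\Rightarrow\; w_f(p) = 10\,p^{10}.
\end{aligned}
```

Under these assumptions the gradient magnitude on the target logit is w_f(p)(1 - p). For -log p this is largest when p is small, while for -p and -p^10 it vanishes as p approaches 0, so low-probability tokens contribute little training signal.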

The model-capability continuum

Model-Strong

Math and code-like settings where pretraining already provides useful priors.

Prior-leaning objectives win.

Model-Intermediate

Domains such as medical reasoning where models have partial but unstable knowledge.

No fixed objective dominates.

Model-Weak

Novel symbolic or low-coverage tasks where the base model has little reliable prior.

NLL remains the strongest objective.

The continuum is not a task-name taxonomy. It can be diagnosed from pretraining coverage and from how much probability the base model assigns to the training targets. The same objective can help in one regime and hurt in another.
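One way to measure target-token confidence is to score the training targets with the base model and summarize the resulting probabilities. The sketch below assumes the per-token log-probabilities have already been computed; the input format, the function name, and the 0.1 cutoff are hypothetical choices, not values from the paper.

```python
import math
from statistics import mean, median


def target_confidence(token_logprobs: list[float]) -> dict[str, float]:
    """Summarize how much probability a base model puts on training targets.

    token_logprobs holds log p(y_t | x, y_<t) for each target token, scored
    by the base model before fine-tuning (hypothetical input format).
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    return {
        "mean_prob": mean(probs),
        "median_prob": median(probs),
        # Share of tokens the model finds unlikely; 0.1 is an arbitrary cutoff.
        "frac_low_prob": sum(p < 0.1 for p in probs) / len(probs),
    }


# Toy log-probabilities for illustration only, not real measurements.
strong_like = [math.log(p) for p in (0.95, 0.9, 0.8, 0.99, 0.05)]
weak_like = [math.log(p) for p in (0.02, 0.2, 0.3, 0.01, 0.05)]
print(target_confidence(strong_like))
print(target_confidence(weak_like))
```

A high mean with a small low-probability tail would suggest the model-strong end; a low mean with many low-probability tokens would suggest the model-weak end.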

Empirical validation across objectives and regimes

Average performance comparison across probability-based objectives and percentile thresholds.

Experiments across 8 backbones, 27 benchmarks, and 7 domains show the same pattern: when the base model's prior is reliable, objectives that downweight low-probability tokens can substantially improve generalization. When the model lacks relevant prior knowledge, NLL's strong correction signal is necessary.

Model-strong and model-weak settings show opposite likelihood and performance preferences.

The mechanism is visible in likelihood estimation and thresholding analyses: low-probability tokens can be noise against a strong prior, but become crucial corrective signals when the prior is weak.
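A thresholding analysis of this kind can be approximated by keeping or dropping tokens according to where their probability falls relative to a percentile. The sketch below is an assumption about how such a mask could be built, not the paper's exact procedure; `percentile_threshold`, `percentile_mask`, and `masked_mean` are hypothetical helpers.

```python
import math


def percentile_threshold(probs: list[float], q: float) -> float:
    """Return the q-th percentile (0-100) of probs using the nearest-rank rule."""
    if not probs:
        raise ValueError("probs must be non-empty")
    ordered = sorted(probs)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_mask(probs: list[float], q: float, keep: str = "high") -> list[float]:
    """Return 1.0 for tokens kept in the loss and 0.0 for tokens dropped."""
    tau = percentile_threshold(probs, q)
    if keep == "high":
        return [1.0 if p > tau else 0.0 for p in probs]
    return [1.0 if p <= tau else 0.0 for p in probs]


def masked_mean(losses: list[float], mask: list[float]) -> float:
    """Average the losses over kept tokens; return 0.0 if nothing is kept."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(losses, mask)) / kept


# Toy target-token probabilities for illustration only.
probs = [0.9, 0.02, 0.7, 0.95, 0.1]
nll_losses = [-math.log(p) for p in probs]
high_mask = percentile_mask(probs, q=40, keep="high")
low_mask = percentile_mask(probs, q=40, keep="low")
print(masked_mean(nll_losses, high_mask), masked_mean(nll_losses, low_mask))
```

Training on only the high-probability side or only the low-probability side, and comparing the results, is one way to test whether the low-probability tokens act as noise or as corrective signal for a given model and task.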

A practical rollout for SFT objective selection

A practical objective selection rollout for supervised fine-tuning.

The takeaway is not to replace NLL with a new universal default. Instead, choose the objective according to model capability: estimate the base model's target-token confidence, locate the task on the continuum, and sweep prior-leaning or prior-averse objectives accordingly.
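The selection step can be sketched as a small lookup from a confidence estimate to a list of objectives worth sweeping. Everything specific here is an assumption for illustration: the function name, the use of mean target-token probability as the score, and both cutoffs are placeholders, not values reported in the paper.

```python
def candidate_objectives(
    mean_target_prob: float,
    strong_cutoff: float = 0.7,  # placeholder, tune per setup
    weak_cutoff: float = 0.3,    # placeholder, tune per setup
) -> list[str]:
    """Map base-model confidence on training targets to objectives to sweep."""
    if not 0.0 <= mean_target_prob <= 1.0:
        raise ValueError("mean_target_prob must lie in [0, 1]")
    if mean_target_prob >= strong_cutoff:
        # Model-strong: favor prior-leaning objectives.
        return ["-p", "-p^10", "thresholded"]
    if mean_target_prob <= weak_cutoff:
        # Model-weak: keep the prior-averse objective.
        return ["-log p"]
    # Model-intermediate: no fixed winner, so sweep both families.
    return ["-log p", "-p", "-p^10", "thresholded"]


for score in (0.85, 0.5, 0.1):
    print(score, candidate_objectives(score))
```

The returned list is a starting point for a sweep and should be validated on held-out data rather than treated as a final choice.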

Poster

ICML 2026 poster preview for Beyond Log Likelihood.

BibTeX

@article{li2025beyond,
  title={Beyond log likelihood: Probability-based objectives for supervised fine-tuning across the model capability continuum},
  author={Li, Gaotang and Qiu, Ruizhong and Chen, Xiusi and Ji, Heng and Tong, Hanghang},
  journal={arXiv preprint arXiv:2510.00526},
  year={2025}
}