Computation and Language 115
☆ Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
Recent video generation models can produce high-fidelity, temporally coherent
videos, indicating that they may encode substantial world knowledge. Beyond
realistic synthesis, they also exhibit emerging behaviors indicative of visual
perception, modeling, and manipulation. Yet, an important question still
remains: Are video models ready to serve as zero-shot reasoners in challenging
visual reasoning scenarios? In this work, we conduct an empirical study to
comprehensively investigate this question, focusing on the leading and popular
Veo-3. We evaluate its reasoning behavior across 12 dimensions, including
spatial, geometric, physical, temporal, and embodied logic, systematically
characterizing both its strengths and failure modes. To standardize this study,
we curate the evaluation data into MME-CoF, a compact benchmark that enables
in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our
findings reveal that while current video models demonstrate promising reasoning
patterns on short-horizon spatial coherence, fine-grained grounding, and
locally consistent dynamics, they remain limited in long-horizon causal
reasoning, strict geometric constraints, and abstract logic. Overall, they are
not yet reliable as standalone zero-shot reasoners, but exhibit encouraging
signs as complementary visual engines alongside dedicated reasoning models.
Project page: https://video-cof.github.io
comment: Project Page: https://video-cof.github.io
☆ Gistify! Codebase-Level Understanding via Runtime Execution
Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan, Lucas Caccia
As coding agents are increasingly deployed in large codebases, the need to
automatically design challenging, codebase-level evaluations becomes central. We
propose Gistify, a task where a coding LLM must create a single, minimal,
self-contained file that can reproduce a specific functionality of a codebase.
The coding LLM is given full access to a codebase along with a specific
entrypoint (e.g., a python command), and the generated file must replicate the
output of the same command run under the full codebase, while containing only
the essential components necessary to execute the provided command. Success on
Gistify requires structural understanding of the codebase, accurate modeling
of its execution flow, and the ability to produce potentially
large code patches. Our findings show that current state-of-the-art models
struggle to reliably solve Gistify tasks, especially ones with long execution
traces.
☆ Defeating the Training-Inference Mismatch via FP16
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often
suffers from instability due to the numerical mismatch between the training and
inference policies. While prior work has attempted to mitigate this issue
through algorithmic corrections or engineering alignments, we show that its
root cause lies in the floating point precision itself. The widely adopted
BF16, despite its large dynamic range, introduces large rounding errors that
break the consistency between training and inference. In this work, we
demonstrate that simply reverting to \textbf{FP16} effectively eliminates this
mismatch. The change is simple, fully supported by modern frameworks with only
a few lines of code change, and requires no modification to the model
architecture or learning algorithm. Our results suggest that using FP16
uniformly yields more stable optimization, faster convergence, and stronger
performance across diverse tasks, algorithms and frameworks. We hope these
findings motivate a broader reconsideration of precision trade-offs in RL
fine-tuning.
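As a concrete illustration of how small the change can be, here is a minimal PyTorch sketch (not the authors' code) that switches the forward/backward pass from BF16 to FP16; it assumes a Hugging Face-style model whose forward() returns an object with a .loss field, and adds the gradient scaler that FP16's narrower dynamic range typically requires.

    # Minimal sketch (assumption: Hugging Face-style model interface), showing the
    # precision switch described above rather than any specific training stack.
    import torch

    scaler = torch.cuda.amp.GradScaler()  # only needed for FP16

    def training_step(model, optimizer, batch, use_fp16=True):
        dtype = torch.float16 if use_fp16 else torch.bfloat16
        with torch.autocast(device_type="cuda", dtype=dtype):
            loss = model(**batch).loss            # reduced-precision forward pass
        if use_fp16:
            scaler.scale(loss).backward()         # scale to keep FP16 gradients finite
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return loss.detach()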
☆ Remote Labor Index: Measuring AI Automation of Remote Work
Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernandez Cardona, Annette Diamond, Summer Yue, Alexandr Wang, Bing Liu, Ernesto Hernandez, Dan Hendrycks
AIs have made rapid progress on research-oriented benchmarks of knowledge and
reasoning, but it remains unclear how these gains translate into economic value
and automation. To measure this, we introduce the Remote Labor Index (RLI), a
broadly multi-sector benchmark comprising real-world, economically valuable
projects designed to evaluate end-to-end agent performance in practical
settings. AI agents perform near the floor on RLI, with the highest-performing
agent achieving an automation rate of 2.5%. These results help ground
discussions of AI automation in empirical evidence, setting a common basis for
tracking AI impacts and enabling stakeholders to proactively navigate AI-driven
labor automation.
comment: Website: https://www.remotelabor.ai
☆ AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, Shuang Zhou
We present AMO-Bench, an Advanced Mathematical reasoning benchmark with
Olympiad-level or even higher difficulty, comprising 50 human-crafted problems.
Existing benchmarks have widely leveraged high school math competitions for
evaluating mathematical reasoning capabilities of large language models (LLMs).
However, many existing math competitions are becoming less effective for
assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To
address this, AMO-Bench introduces more rigorous challenges by ensuring all 50
problems are (1) cross-validated by experts to meet at least the International
Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original
problems to prevent potential performance leakage from data memorization.
Moreover, each problem in AMO-Bench requires only a final answer rather than a
proof, enabling automatic and robust grading for evaluation. Experimental
results across 26 LLMs show that even the best-performing model achieves only
52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these low
scores, our further analysis reveals a promising scaling trend with increasing
test-time compute on AMO-Bench. These results highlight the significant room
for improving mathematical reasoning in
current LLMs. We release AMO-Bench to facilitate further research into
advancing the reasoning abilities of language models.
https://amo-bench.github.io/
comment: 14 pages, 9 figures
☆ Deep sequence models tend to memorize geometrically; it is unclear why
In sequence modeling, the parametric memory of atomic facts has been
predominantly abstracted as a brute-force lookup of co-occurrences between
entities. We contrast this associative view against a geometric view of how
memory is stored. We begin by isolating a clean and analyzable instance of
Transformer reasoning that is incompatible with memory as strictly a storage of
the local co-occurrences specified during training. Instead, the model must
have somehow synthesized its own geometry of atomic facts, encoding global
relationships between all entities, including non-co-occurring ones. This in
turn has simplified a hard reasoning task involving an $\ell$-fold composition
into an easy-to-learn 1-step geometric task.
From this phenomenon, we extract fundamental aspects of neural embedding
geometries that are hard to explain. We argue that the rise of such a geometry,
despite optimizing over mere local associations, cannot be straightforwardly
attributed to typical architectural or optimization pressures.
Counterintuitively, an elegant geometry is learned even when it is not more
succinct than a brute-force lookup of associations.
Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry
stems from a spectral bias that -- in contrast to prevailing theories -- indeed
arises naturally despite the lack of various pressures. This analysis also
points practitioners to a visible headroom for making Transformer memory more
strongly geometric. We hope the geometric view of parametric memory encourages
revisiting the default intuitions that guide researchers in areas like
knowledge acquisition, capacity, discovery and unlearning.
☆ Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
This paper presents a comprehensive cross-platform evaluation of reasoning
capabilities in contemporary foundation models, establishing an
infrastructure-agnostic benchmark across three computational paradigms: HPC
supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and
university clusters (a node with eight H200 GPUs).
We evaluate 15 foundation models across 79 problems spanning eight academic
domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics,
Calculus, and Optimization) through three experimental phases: (1) Baseline
establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b,
Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing
methodology and reference performance; (2) Infrastructure validation: The
19-problem benchmark repeated on the university cluster (seven models including
Falcon-Mamba state-space architecture) and Nebius AI Studio (nine
state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3
30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic
reproducibility; (3) Extended evaluation: Full 79-problem assessment on both
university cluster and Nebius platforms, probing generalization at scale across
architectural diversity.
The findings challenge conventional scaling assumptions, establish training
data quality as more critical than model size, and provide actionable
guidelines for model selection across educational, production, and research
contexts. The tri-infrastructure methodology and 79-problem benchmark enable
longitudinal tracking of reasoning capabilities as foundation models evolve.
☆ Value Drifts: Tracing Value Alignment During LLM Post-Training
Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Vered Shwartz, Siva Reddy
As LLMs occupy an increasingly important role in society, they are more and
more confronted with questions that require them not only to draw on their
general knowledge but also to align with certain human value systems.
Therefore, studying the alignment of LLMs with human values has become a
crucial field of inquiry. Prior work, however, mostly focuses on evaluating the
alignment of fully trained models, overlooking the training dynamics by which
models learn to express human values. In this work, we investigate how and at
which stage value alignment arises during the course of a model's
post-training. Our analysis disentangles the effects of post-training
algorithms and datasets, measuring both the magnitude and time of value drifts
during training. Experimenting with Llama-3 and Qwen-3 models of different
sizes and popular supervised fine-tuning (SFT) and preference optimization
datasets and algorithms, we find that the SFT phase generally establishes a
model's values, and subsequent preference optimization rarely re-aligns these
values. Furthermore, using a synthetic preference dataset that enables
controlled manipulation of values, we find that different preference
optimization algorithms lead to different value alignment outcomes, even when
preference data is held constant. Our findings provide actionable insights into
how values are learned during post-training and help to inform data curation,
as well as the selection of models and algorithms for preference optimization
to improve model alignment to human values.
☆ The End of Manual Decoding: Towards Truly End-to-End Language Models
Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a
non-differentiable decoding process that requires laborious hand-tuning of
hyperparameters like temperature and top-p. This paper introduces AutoDeco, a
novel architecture that enables truly "end-to-end" generation by learning to
control its own decoding strategy. We augment the standard transformer with
lightweight heads that, at each step, dynamically predict context-specific
temperature and top-p values alongside the next-token logits. This approach
transforms decoding into a parametric, token-level process, allowing the model
to self-regulate its sampling strategy within a single forward pass.
Through extensive experiments on eight benchmarks, we demonstrate that
AutoDeco not only significantly outperforms default decoding strategies but
also achieves performance comparable to an oracle-tuned baseline derived from
"hacking the test set"-a practical upper bound for any static method.
Crucially, we uncover an emergent capability for instruction-based decoding
control: the model learns to interpret natural language commands (e.g.,
"generate with low randomness") and adjusts its predicted temperature and top-p
on a token-by-token basis, opening a new paradigm for steerable and interactive
LLM decoding.
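A hypothetical sketch of the mechanism described above: lightweight heads read the hidden state and emit a per-token temperature and top-p, which are then plugged into nucleus sampling. The module names, value ranges, and sampling loop are illustrative assumptions, not the released AutoDeco implementation.

    # Illustrative sketch only; ranges and head design are assumptions.
    import torch
    import torch.nn as nn

    class DecodingHeads(nn.Module):
        def __init__(self, hidden_size: int):
            super().__init__()
            self.temp_head = nn.Linear(hidden_size, 1)
            self.top_p_head = nn.Linear(hidden_size, 1)

        def forward(self, hidden):  # hidden: (batch, hidden_size)
            temperature = 0.1 + 1.9 * torch.sigmoid(self.temp_head(hidden))  # assumed range (0.1, 2.0)
            top_p = torch.sigmoid(self.top_p_head(hidden))                   # range (0, 1)
            return temperature, top_p

    def sample_next_token(logits, temperature, top_p):
        # Nucleus sampling with the predicted per-sequence temperature and top-p.
        probs = torch.softmax(logits / temperature, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        outside_nucleus = cumulative - sorted_probs > top_p   # always keeps the top token
        sorted_probs = sorted_probs.masked_fill(outside_nucleus, 0.0)
        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return sorted_idx.gather(-1, choice)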
☆ Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du
We introduce Kimi Linear, a hybrid linear attention architecture that, for
the first time, outperforms full attention under fair comparisons across
various scenarios -- including short-context, long-context, and reinforcement
learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an
expressive linear attention module that extends Gated DeltaNet with a
finer-grained gating mechanism, enabling more effective use of limited
finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware
efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR)
transition matrices, which substantially reduces computation compared to the
general DPLR formulation while remaining more consistent with the classical
delta rule.
We pretrain a Kimi Linear model with 3B activated parameters and 48B total
parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention
(MLA). Our experiments show that with an identical training recipe, Kimi Linear
outperforms full MLA by a sizeable margin across all evaluated tasks, while
reducing KV cache usage by up to 75% and achieving up to 6 times higher decoding
throughput for a 1M context. These results demonstrate that Kimi Linear can be
a drop-in replacement for full attention architectures with superior
performance and efficiency, including tasks with longer input and output
lengths.
To support further research, we open-source the KDA kernel and vLLM
implementations, and release the pre-trained and instruction-tuned model
checkpoints.
comment: Kimi Linear tech report
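For intuition, the following is a naive per-token reference recurrence for a gated delta rule with a per-channel (fine-grained) decay gate, the general mechanism the abstract describes. It is an illustrative sketch only: KDA's exact formulation and its hardware-efficient chunkwise DPLR kernel differ.

    # Naive reference recurrence (illustration, not KDA). A gated delta rule keeps
    # a fixed-size state S of shape (d_k, d_v): decay it with a per-channel gate
    # alpha_t, then write v_t with a delta-rule correction of strength beta_t.
    import torch

    def gated_delta_rule(q, k, v, alpha, beta):
        """q, k: (T, d_k); v: (T, d_v); alpha: (T, d_k) gates in (0, 1);
        beta: (T,) write strengths in (0, 1). Returns outputs of shape (T, d_v)."""
        d_k, d_v = k.shape[-1], v.shape[-1]
        S = torch.zeros(d_k, d_v)
        outputs = []
        for t in range(k.shape[0]):
            S = alpha[t].unsqueeze(-1) * S                    # fine-grained decay per key channel
            pred = S.T @ k[t]                                 # what the state currently recalls for k_t
            S = S + beta[t] * torch.outer(k[t], v[t] - pred)  # delta-rule update toward v_t
            outputs.append(S.T @ q[t])                        # read out with the query
        return torch.stack(outputs)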
☆ Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Large language models (LLMs) have demonstrated exceptional capabilities
across multiple domains by leveraging massive pre-training and curated
fine-tuning data. However, in data-sensitive fields such as healthcare, the
lack of a high-quality, domain-specific training corpus hinders LLMs' adaptation
for specialized applications. Meanwhile, domain experts have distilled domain
wisdom into ontology rules, which formalize relationships among concepts and
ensure the integrity of knowledge management repositories. Viewing LLMs as
implicit repositories of human knowledge, we propose Evontree, a novel
framework that leverages a small set of high-quality ontology rules to
systematically extract, validate, and enhance domain knowledge within LLMs,
without requiring extensive external datasets. Specifically, Evontree extracts
domain ontology from raw models, detects inconsistencies using two core
ontology rules, and reinforces the refined knowledge via self-distilled
fine-tuning. Extensive experiments on medical QA benchmarks with
Llama3-8B-Instruct and Med42-v2 demonstrate consistent improvements over both
unmodified models and leading supervised baselines, achieving up to a 3.7%
improvement in accuracy. These results confirm the effectiveness, efficiency,
and robustness of our approach for low-resource domain adaptation of LLMs.
☆ The Era of Agentic Organization: Learning to Organize with Language Models
We envision a new era of AI, termed agentic organization, where agents solve
complex problems by working collaboratively and concurrently, enabling outcomes
beyond individual intelligence. To realize this vision, we introduce
asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large
language models, which organizes the internal thinking process into
concurrently executable structures. Specifically, we propose a thinking
protocol where an organizer dynamically assigns sub-queries to workers, merges
intermediate knowledge, and produces coherent solutions. More importantly, the
thinking structure in this protocol can be further optimized through
reinforcement learning. Experiments demonstrate that AsyncThink achieves 28%
lower inference latency compared to parallel thinking while improving accuracy
on mathematical reasoning. Moreover, AsyncThink generalizes its learned
asynchronous thinking capabilities, effectively tackling unseen tasks without
additional training.
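The fork/join shape of this organizer-worker protocol can be sketched with plain asyncio; the worker and organizer below are placeholder functions standing in for LLM calls, not the trained AsyncThink policy.

    # Toy fork/join sketch (hypothetical helpers; in AsyncThink both roles are
    # played by an LLM and the thinking structure is learned with RL).
    import asyncio

    async def worker(sub_query: str) -> str:
        await asyncio.sleep(0.1)                  # stands in for a worker LLM call
        return f"partial answer to: {sub_query}"

    async def organizer(query: str) -> str:
        # Fork: decompose the query into concurrently executable sub-queries.
        sub_queries = [f"{query} (aspect {i})" for i in range(3)]
        partials = await asyncio.gather(*(worker(sq) for sq in sub_queries))
        # Join: merge intermediate knowledge into a coherent solution.
        return " | ".join(partials)

    if __name__ == "__main__":
        print(asyncio.run(organizer("prove the identity")))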
☆ Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
Recent large language model (LLM) research has undergone an architectural
shift from encoder-decoder modeling to the now-dominant decoder-only
modeling. This rapid transition, however, comes without a rigorous comparative
analysis, especially \textit{from the scaling perspective}, raising concerns
that the potential of encoder-decoder models may have been overlooked. To fill
this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent
recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison
between RedLLM, pretrained with prefix language modeling (LM), and DecLLM,
pretrained with causal LM, at different model scales, ranging from $\sim$150M
to $\sim$8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for
instruction tuning, our experiments show that RedLLM produces compelling
scaling properties and surprisingly strong performance. While DecLLM is overall
more compute-optimal during pretraining, RedLLM demonstrates comparable scaling
and context length extrapolation capabilities. After instruction tuning, RedLLM
achieves comparable or even better results on various downstream tasks while
enjoying substantially better inference efficiency. We hope our findings will
inspire more efforts to re-examine RedLLM, unlocking its potential for
developing powerful and efficient LLMs.
comment: The scaling study inspiring T5Gemma
☆ SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Multi-page visual documents such as manuals, brochures, presentations, and
posters convey key information through layout, colors, icons, and cross-slide
references. While large language models (LLMs) offer opportunities in document
understanding, current systems struggle with complex, multi-page visual
documents, particularly in fine-grained reasoning over elements and pages. We
introduce SlideAgent, a versatile agentic framework for understanding
multi-modal, multi-page, and multi-layout documents, especially slide decks.
SlideAgent employs specialized agents and decomposes reasoning into three
levels (global, page, and element) to construct a structured,
query-agnostic representation that captures both overarching themes and
detailed visual or textual cues. During inference, SlideAgent selectively
activates specialized agents for multi-level reasoning and integrates their
outputs into coherent, context-aware answers. Extensive experiments show that
SlideAgent achieves significant improvement over both proprietary (+7.9
overall) and open-source models (+9.8 overall).
comment: https://slideagent.github.io/
☆ Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP 2025
Normative reasoning is a type of reasoning that involves normative or deontic
modality, such as obligation and permission. While large language models (LLMs)
have demonstrated remarkable performance across various reasoning tasks, their
ability to handle normative reasoning remains underexplored. In this paper, we
systematically evaluate LLMs' reasoning capabilities in the normative domain
from both logical and modal perspectives. Specifically, to assess how well LLMs
reason with normative modals, we make a comparison between their reasoning with
normative modals and their reasoning with epistemic modals, which share a
common formal structure. To this end, we introduce a new dataset covering a
wide range of formal patterns of reasoning in both normative and epistemic
domains, while also incorporating non-formal cognitive factors that influence
human reasoning. Our results indicate that, although LLMs generally adhere to
valid reasoning patterns, they exhibit notable inconsistencies in specific
types of normative reasoning and display cognitive biases similar to those
observed in psychological studies of human reasoning. These findings highlight
challenges in achieving logical consistency in LLMs' normative reasoning and
provide insights for enhancing their reliability. All data and code are
released publicly at https://github.com/kmineshima/NeuBAROCO.
comment: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
☆ Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Large Language Models (LLMs) face significant inference latency challenges
stemming from their autoregressive design and large size. To address this,
speculative decoding emerges as a solution, enabling the simultaneous
generation and validation of multiple tokens. While recent approaches like
EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures,
they often neglect the impact of crucial system variables such as GPU devices
and batch sizes.
Therefore, we introduce a new dynamic tree decoding approach called CAST that
takes into account inference costs, including factors such as GPU
configurations and batch sizes, to dynamically refine the tree structure.
Through comprehensive experimentation across six diverse tasks and utilizing
six distinct LLMs, our methodology demonstrates remarkable results, achieving
speeds up to 5.2 times faster than conventional decoding methods. Moreover, it
generally outperforms existing state-of-the-art techniques by 5% to 20%.
☆ InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach
for enhancing agentic deep search. However, its application is often hindered
by low \textbf{Reward Density} in deep search scenarios, where agents expend
significant exploratory costs for infrequent and often null final rewards. In
this paper, we formalize this challenge as the \textbf{Reward Density
Optimization} problem, which aims to improve the reward obtained per unit of
exploration cost. This paper introduces \textbf{InfoFlow}, a systematic
framework that tackles this problem from three aspects. 1) \textbf{Subproblem
decomposition}: breaking down long-range tasks to assign process rewards,
thereby providing denser learning signals. 2) \textbf{Failure-guided hints}:
injecting corrective guidance into stalled trajectories to increase the
probability of successful outcomes. 3) \textbf{Dual-agent refinement}:
employing a dual-agent architecture to offload the cognitive burden of deep
exploration. A refiner agent synthesizes the search history, which effectively
compresses the researcher's perceived trajectory, thereby reducing exploration
cost and increasing the overall reward density. We evaluate InfoFlow on
multiple agentic search benchmarks, where it significantly outperforms strong
baselines, enabling lightweight LLMs to achieve performance comparable to
advanced proprietary LLMs.
☆ The Structure of Relation Decoding Linear Operators in Large Language Models NeurIPS 2025
This paper investigates the structure of linear operators introduced in
Hernandez et al. [2023] that decode specific relational facts in transformer
language models. We extend their single-relation findings to a collection of
relations and systematically chart their organization. We show that such
collections of relation decoders can be highly compressed by simple order-3
tensor networks without significant loss in decoding accuracy. To explain this
surprising redundancy, we develop a cross-evaluation protocol, in which we
apply each linear decoder operator to the subjects of every other relation. Our
results reveal that these linear maps do not encode distinct relations, but
extract recurring, coarse-grained semantic properties (e.g., country of capital
city and country of food are both in the country-of-X property). This
property-centric structure both clarifies the operators' compressibility and
explains why they generalize only to new relations that are semantically
close. Our findings thus interpret linear relational decoding in transformer
language models as primarily property-based, rather than relation-specific.
comment: NeurIPS 2025 (Spotlight)
☆ Hebrew Diacritics Restoration using Visual Representation
Diacritics restoration in Hebrew is a fundamental task for ensuring accurate
word pronunciation and disambiguating textual meaning. Despite the language's
high degree of ambiguity when unvocalized, recent machine learning approaches
have significantly advanced performance on this task.
In this work, we present DIVRIT, a novel system for Hebrew diacritization
that frames the task as a zero-shot classification problem. Our approach
operates at the word level, selecting the most appropriate diacritization
pattern for each undiacritized word from a dynamically generated candidate set,
conditioned on the surrounding textual context. A key innovation of DIVRIT is
its use of a Hebrew Visual Language Model, which processes undiacritized text
as an image, allowing diacritic information to be embedded directly within the
input's vector representation.
Through a comprehensive evaluation across various configurations, we
demonstrate that the system effectively performs diacritization without relying
on complex, explicit linguistic analysis. Notably, in an ``oracle'' setting
where the correct diacritized form is guaranteed to be among the provided
candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic
architectural enhancements and optimized training methodologies yield
significant improvements in the system's overall generalization capabilities.
These findings highlight the promising potential of visual representations for
accurate and automated Hebrew diacritization.
☆ Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM 2025
Human smuggling networks are increasingly adaptive and difficult to analyze.
Legal case documents offer critical insights but are often unstructured,
lexically dense, and filled with ambiguous or shifting references, which pose
significant challenges for automated knowledge graph (KG) construction. While
recent LLM-based approaches improve over static templates, they still generate
noisy, fragmented graphs with duplicate nodes due to the absence of guided
extraction and coreference resolution. The recently proposed CORE-KG framework
addresses these limitations by integrating a type-aware coreference module and
domain-guided structured prompts, significantly reducing node duplication and
legal noise. In this work, we present a systematic ablation study of CORE-KG to
quantify the individual contributions of its two key components. Our results
show that removing coreference resolution results in a 28.32% increase in node
duplication and a 4.32% increase in noisy nodes, while removing structured
prompts leads to a 4.34% increase in node duplication and a 73.33% increase in
noisy nodes. These findings offer empirical insights for designing robust
LLM-based pipelines for extracting structured representations from complex
legal texts.
comment: ICDM 2025 Workshop
☆ A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool
Adam E. Flanders, Yifan Peng, Luciano Prevedello, Robyn Ball, Errol Colak, Prahlad Menon, George Shih, Hui-Ming Lin, Paras Lakhani
Purpose: The purpose of this study was to determine if an ensemble of
multiple LLM agents could be used collectively to provide a more reliable
assessment of a pixel-based AI triage tool than a single LLM.
Methods: 29,766 non-contrast CT head exams from fourteen hospitals were
processed by a commercial intracranial hemorrhage (ICH) AI detection tool.
Radiology reports were analyzed by an ensemble of eight open-source LLM models
and a HIPAA compliant internal version of GPT-4o using a single multi-shot
prompt that assessed for presence of ICH. 1,726 examples were manually
reviewed. Performance characteristics of the eight open-source models and
consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were
tested for rating the performance of the triage tool.
Results: The cohort consisted of 29,766 head CT exam-report pairs. The
highest AUC performance was achieved with Llama3.3:70b and GPT-4o (AUC = 0.78).
The average precision was highest for Llama3.3:70b and GPT-4o (AP=0.75 & 0.76).
Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), as well as
greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95%
CI), the ideal combinations of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591),
Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT-4o
0.522
(0.500-0.543). No statistically significant differences were observed between
Top-3, Full-9, and Consensus (p > 0.05).
Conclusion: An ensemble of medium- to large-sized open-source LLMs provides a
more consistent and reliable method for deriving a ground-truth retrospective
evaluation of a clinical AI triage tool than a single LLM alone.
comment: 29 pages, 3 figures, 4 tables
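A toy sketch of the consensus idea with hypothetical labels: majority-vote the per-report ICH decisions from several LLMs and score the consensus against manual review using the Matthews correlation coefficient reported above.

    # Hypothetical data only; illustrates consensus voting and MCC scoring,
    # not the study's actual pipeline.
    import numpy as np
    from sklearn.metrics import matthews_corrcoef

    # rows = reports, columns = LLMs; 1 = report describes ICH, 0 = no ICH
    votes = np.array([[1, 1, 0, 1],
                      [0, 0, 0, 1],
                      [1, 0, 1, 1]])
    manual_review = np.array([1, 0, 1])

    consensus = (votes.mean(axis=1) >= 0.5).astype(int)   # simple majority vote
    print("consensus MCC:", matthews_corrcoef(manual_review, consensus))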
☆ Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
Linzhuang Sun, Tianyu Guo, Hao Liang, Yuying Li, Qifeng Cai, Jingxuan Wei, Bihui Yu, Wentao Zhang, Bin Cui
Recent advances in Text-to-SQL have achieved strong results in static,
single-turn tasks, where models generate SQL queries from natural language
questions. However, these systems fall short in real-world interactive
scenarios, where user intents evolve and queries must be refined over multiple
turns. In applications such as finance and business analytics, users
iteratively adjust query constraints or dimensions based on intermediate
results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a
benchmark assessing model performance under evolving user interactions. Unlike
previous manually curated datasets, DySQL-Bench is built through an automated
two-stage pipeline of task synthesis and verification. Structured tree
representations derived from raw database tables guide LLM-based task
generation, followed by interaction-oriented filtering and expert validation.
Human evaluation confirms 100% correctness of the synthesized data. We further
propose a multi-turn evaluation framework simulating realistic interactions
among an LLM-simulated user, the model under test, and an executable database.
The model must adapt its reasoning and SQL generation as user intents change.
DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling
1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the
Pass@5 metric, underscoring the benchmark's difficulty. All code and data are
released at https://github.com/Aurora-slz/Real-World-SQL-Bench.
☆ Context Engineering 2.0: The Context of Context Engineering
Qishuo Hua, Lyumanshan Ye, Dayuan Fu, Yang Xiao, Xiaojie Cai, Yunze Wu, Jifan Lin, Junfei Wang, Pengfei Liu
Karl Marx once wrote that ``the human essence is the ensemble of social
relations'', suggesting that individuals are not isolated entities but are
fundamentally shaped by their interactions with other entities, within which
contexts play a constitutive and essential role. With the advent of computers
and artificial intelligence, these contexts are no longer limited to purely
human--human interactions: human--machine interactions are included as well.
Then a central question emerges: How can machines better understand our
situations and purposes? To address this challenge, researchers have recently
introduced the concept of context engineering. Although it is often regarded as
a recent innovation of the agent era, we argue that related practices can be
traced back more than twenty years. Since the early 1990s, the field has
evolved through distinct historical phases, each shaped by the intelligence
level of machines: from early human--computer interaction frameworks built
around primitive computers, to today's human--agent interaction paradigms
driven by intelligent agents, and potentially to human--level or superhuman
intelligence in the future. In this paper, we situate context engineering,
provide a systematic definition, outline its historical and conceptual
landscape, and examine key design considerations for practice. By addressing
these questions, we aim to offer a conceptual foundation for context
engineering and sketch its promising future. This paper is a stepping stone for
a broader community effort toward systematic context engineering in AI systems.
☆ Bayesian Network Fusion of Large Language Models for Sentiment Analysis
Large language models (LLMs) continue to advance, with an increasing number
of domain-specific variants tailored for specialised tasks. However, these
models often lack transparency and explainability, can be costly to fine-tune,
require substantial prompt engineering, yield inconsistent results across
domains, and impose a significant adverse environmental impact due to their high
computational demands. To address these challenges, we propose the Bayesian
network LLM fusion (BNLF) framework, which integrates predictions from three
LLMs, namely FinBERT, RoBERTa, and BERTweet, through a probabilistic
mechanism for sentiment analysis. BNLF performs late fusion by modelling the
sentiment predictions from multiple LLMs as probabilistic nodes within a
Bayesian network. Evaluated across three human-annotated financial corpora with
distinct linguistic and contextual characteristics, BNLF demonstrates
consistent gains of about six percent in accuracy over the baseline LLMs,
underscoring its robustness to dataset variability and the effectiveness of
probabilistic fusion for interpretable sentiment classification.
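A minimal late-fusion sketch in the spirit of the abstract, combining the three models' class probabilities under a naive conditional-independence assumption; this is a simplification for illustration, not the paper's Bayesian network, and the per-model outputs shown are hypothetical.

    # Naive-Bayes-style late fusion (simplified illustration; not BNLF itself).
    import numpy as np

    CLASSES = ["negative", "neutral", "positive"]

    def fuse(predictions, prior=None):
        """predictions: list of per-model probability vectors over CLASSES."""
        prior = np.full(len(CLASSES), 1 / len(CLASSES)) if prior is None else np.asarray(prior)
        log_post = np.log(prior)
        for p in predictions:                     # treat each model as independent evidence
            log_post += np.log(np.asarray(p) + 1e-12)
        post = np.exp(log_post - log_post.max())
        return post / post.sum()

    finbert  = [0.10, 0.20, 0.70]   # hypothetical outputs for one sentence
    roberta  = [0.25, 0.30, 0.45]
    bertweet = [0.15, 0.25, 0.60]
    print(dict(zip(CLASSES, fuse([finbert, roberta, bertweet]).round(3))))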
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning ICSE 2026
Identifying and addressing security issues during the early phase of the
development lifecycle is critical for mitigating the long-term negative impacts
on software systems. Code review serves as an effective practice that enables
developers to check their teammates' code before integration into the codebase.
To streamline the generation of review comments, various automated code review
approaches have been proposed, where LLM-based methods have significantly
advanced the capabilities of automated review generation. However, existing
models primarily focus on general-purpose code review, and their effectiveness in
identifying and addressing security-related issues remains underexplored.
Moreover, adapting existing code review approaches to target security issues
faces substantial challenges, including data scarcity and inadequate evaluation
metrics. To address these limitations, we propose SecureReviewer, a new
approach designed for enhancing LLMs' ability to identify and resolve
security-related issues during code review. Specifically, we first construct a
dataset tailored for training and evaluating secure code review capabilities.
Leveraging this dataset, we fine-tune LLMs to generate code review comments
that can effectively identify security issues and provide fix suggestions with
our proposed secure-aware fine-tuning strategy. To mitigate hallucination in
LLMs and enhance the reliability of their outputs, we integrate the RAG
technique, which grounds the generated comments in domain-specific security
knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric
designed to assess the effectiveness of review comments in addressing security
issues. Experimental results demonstrate that SecureReviewer outperforms
state-of-the-art baselines in both security issue detection accuracy and the
overall quality and practical utility of generated review comments.
comment: Accepted by ICSE 2026. Code and data:
https://github.com/SIMIAO515/SecureReviewer
☆ 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models EMNLP 2025
Large Language Models (LLMs) have demonstrated remarkable proficiency in
language comprehension and generation; however, their widespread adoption is
constrained by substantial bandwidth and computational demands. While pruning
and low-rank approximation have each demonstrated promising performance
individually, their synergy for LLMs remains underexplored. We introduce a
\underline{S}ynergistic \underline{S}parse and \underline{L}ow-Rank
\underline{C}ompression (SSLC) method for LLMs, which leverages the strengths
of both techniques: low-rank approximation compresses the model by retaining
its essential structure with minimal information loss, whereas sparse
optimization eliminates non-essential weights, preserving those crucial for
generalization. Based on theoretical analysis, we first formulate the low-rank
approximation and sparse optimization as a unified problem and solve it with an
iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models
(7B-70B) show that SSLC, without any additional training steps, consistently
surpasses standalone methods, achieving state-of-the-art results. Notably,
SSLC compresses Qwen2.5 by 50\% with no performance drop and achieves at least
1.63$\times$ speedup, offering a practical solution for efficient LLM
deployment.
comment: 15 pages, 6 figures, EMNLP 2025 findings
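For intuition, here is a one-shot sketch that decomposes a single weight matrix into a truncated-SVD low-rank part plus a sparse residual. The thresholding rule is illustrative only; SSLC itself solves the low-rank and sparse terms jointly with an iterative algorithm.

    # One-shot low-rank + sparse decomposition of one weight matrix (sketch only).
    import torch

    def low_rank_plus_sparse(W, rank=64, keep_ratio=0.5):
        # Low-rank part: truncated SVD retains the dominant structure of W.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        W_lr = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
        # Sparse part: keep only the largest-magnitude entries of the residual.
        residual = W - W_lr
        k = int(residual.numel() * keep_ratio)
        threshold = residual.abs().flatten().kthvalue(residual.numel() - k).values
        W_sp = torch.where(residual.abs() > threshold, residual, torch.zeros_like(residual))
        return W_lr, W_sp                          # compressed weight is W_lr + W_sp

    W = torch.randn(512, 512)
    W_lr, W_sp = low_rank_plus_sparse(W)
    print("relative error:", (torch.norm(W - (W_lr + W_sp)) / torch.norm(W)).item())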
☆ Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
Test oracle generation in non-regression testing is a longstanding challenge
in software engineering, where the goal is to produce oracles that can
accurately determine whether a function under test (FUT) behaves as intended
for a given input. In this paper, we introduce Nexus, a novel multi-agent
framework to address this challenge. Nexus generates test oracles by leveraging
a diverse set of specialized agents that synthesize test oracles through a
structured process of deliberation, validation, and iterative self-refinement.
During the deliberation phase, a panel of four specialist agents, each
embodying a distinct testing philosophy, collaboratively critiques and refines
an initial set of test oracles. Then, in the validation phase, Nexus generates
a plausible candidate implementation of the FUT and executes the proposed
oracles against it in a secure sandbox. For any oracle that fails this
execution-based check, Nexus activates an automated self-refinement loop, using
the specific runtime error to debug and correct the oracle before
re-validation. Our extensive evaluation on seven diverse benchmarks
demonstrates that Nexus consistently and substantially outperforms
state-of-the-art baselines. For instance, Nexus improves the test-level oracle
accuracy on LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The
improved accuracy also significantly enhances downstream tasks: the bug
detection rate of GPT-4.1-Mini-generated test oracles on HumanEval increases
from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of
automated program repair improves from 35.23% to 69.32%.
comment: Under Review
☆ OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
Min Zhang, Hao Chen, Hao Chen, Wenqi Zhang, Didi Zhu, Xin Lin, Bo Jiang, Aimin Zhou, Fei Wu, Kun Kuang
With the rapid development of large language models (LLMs), various LLM-based
works have been widely applied in educational fields. However, most existing
LLMs and their benchmarks focus primarily on the knowledge dimension, largely
neglecting the evaluation of cultivation capabilities that are essential for
real-world educational scenarios. Additionally, current benchmarks are often
limited to a single subject or question type, lacking sufficient diversity.
This issue is particularly prominent within the Chinese context. To address
this gap, we introduce OmniEduBench, a comprehensive Chinese educational
benchmark. OmniEduBench consists of 24,602 high-quality question-answer pairs.
The data is meticulously divided into two core dimensions: the knowledge
dimension and the cultivation dimension, which contain 18,121 and 6,481
entries, respectively. Each dimension is further subdivided into 6 fine-grained
categories, covering a total of 61 different subjects (41 in the knowledge
dimension and 20 in the cultivation dimension). Furthermore, the dataset features a rich variety of
question formats, including 11 common exam question types, providing a solid
foundation for comprehensively evaluating LLMs' capabilities in education.
Extensive experiments on 11 mainstream open-source and closed-source LLMs
reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro
surpassed 60\% accuracy, while in the cultivation dimension, the
best-performing model, QWQ, still trailed human intelligence by nearly 30\%.
These results highlight the substantial room for improvement and underscore the
challenges of applying LLMs in education.
☆ On the Role of Context for Discourse Relation Classification in Scientific Writing
With the increasing use of generative Artificial Intelligence (AI) methods to
support science workflows, we are interested in the use of discourse-level
information to find supporting evidence for AI-generated scientific claims. A
first step towards this objective is to examine the task of inferring discourse
structure in scientific writing.
In this work, we present a preliminary investigation of pretrained language
model (PLM) and Large Language Model (LLM) approaches for Discourse Relation
Classification (DRC), focusing on scientific publications, an under-studied
genre for this task. We examine how context can help with the DRC task, with
our experiments showing that context, as defined by discourse structure, is
generally helpful. We also present an analysis of which scientific discourse
relation types might benefit most from context.
comment: Accepted at Joint Sixth Workshop on Computational Approaches to
Discourse, Context and Document-Level Inferences (CODI 2025) and Eighth
Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC
2025)
☆ The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
While a multi-agent approach based on large language models (LLMs) represents
a promising strategy to surpass the capabilities of single models, its success
is critically dependent on synergistic team composition. However, forming
optimal teams is a significant challenge, as the inherent opacity of most
models obscures the internal characteristics necessary for effective
collaboration. In this paper, we propose an interaction-centric framework for
automatic team composition that does not require any prior knowledge of the
models, including their internal architectures, training data, or task
performance. Our method
constructs a "language model graph" that maps relationships between models from
the semantic coherence of pairwise conversations, and then applies community
detection to identify synergistic model clusters. Our experiments with diverse
LLMs demonstrate that the proposed method discovers functionally coherent
groups that reflect their latent specializations. Priming conversations with
specific topics identified synergistic teams that outperform random baselines
on downstream benchmarks and achieve accuracy comparable to that of
manually curated teams based on known model specializations. Our findings
provide a new basis for the automated design of collaborative multi-agent LLM
teams.
☆ MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Health-related misinformation is very prevalent and potentially harmful. It
is difficult to identify, especially when claims distort or misinterpret
scientific findings. We investigate the impact of synthetic data generation and
lightweight fine-tuning techniques on the ability of large language models
(LLMs) to recognize fallacious arguments using the MISSCI dataset and
framework. In this work, we propose MisSynth, a pipeline that applies
retrieval-augmented generation (RAG) to produce synthetic fallacy samples,
which are then used to fine-tune an LLM model. Our results show substantial
accuracy gains with fine-tuned models compared to vanilla baselines. For
instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score
improvement of over 35% on the MISSCI test split over its vanilla baseline. We
demonstrate that introducing synthetic fallacy data to augment limited
annotated resources can significantly enhance zero-shot LLM classification
performance on real-world scientific misinformation tasks, even with limited
computational resources. The code and synthetic dataset are available at
https://github.com/mxpoliakov/MisSynth.
☆ From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Large Language Models (LLMs) excel at general tasks but underperform in
specialized domains like economics and psychology, which require deep,
principled understanding. To address this, we introduce ACER (Automated
Curriculum-Enhanced Regimen), which transforms generalist models into domain
experts without sacrificing their broad capabilities. ACER first synthesizes a
comprehensive, textbook-style curriculum by generating a table of contents for
a subject and then creating question-answer (QA) pairs guided by Bloom's
taxonomy. This ensures systematic topic coverage and progressively increasing
difficulty. The resulting synthetic corpus is used for continual pretraining
with an interleaved curriculum schedule, aligning learning across both content
and cognitive dimensions.
Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized
MMLU subsets. In challenging domains like microeconomics, where baselines
struggle, ACER boosts accuracy by 5 percentage points. Across all target
domains, we observe a consistent macro-average improvement of 3 percentage
points. Notably, ACER not only prevents catastrophic forgetting but also
facilitates positive cross-domain knowledge transfer, improving performance on
non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on
knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points,
while maintaining stable performance on general reasoning tasks. Our results
demonstrate that ACER offers a scalable and effective recipe for closing
critical domain gaps in LLMs.
☆ SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
Language models can be used to provide interactive, personalized student
feedback in educational settings. However, real-world deployment faces three
key challenges: privacy concerns, limited computational resources, and the need
for pedagogically valid responses. These constraints require small, open-source
models that can run locally and reliably ground their outputs in correct
information. We introduce SCRIBE, a framework for multi-hop, tool-augmented
reasoning designed to generate valid responses to student questions about
feedback reports. SCRIBE combines domain-specific tools with a self-reflective
inference pipeline that supports iterative reasoning, tool use, and error
recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA
fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned
GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models
achieve comparable or superior quality to much larger models in key dimensions
such as relevance and actionability, while being perceived on par with GPT-4o
and Llama-3.3 70B by students. These findings demonstrate the viability of
SCRIBE for low-resource, privacy-sensitive educational applications.
☆ Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction,
enabling the model to analyze webpages, process user intents, and execute
cursor and keyboard inputs directly within the browser. While its capacity for
information retrieval tasks has been demonstrated, its performance in dynamic,
interactive environments remains less explored. In this study, we conduct an
early evaluation of Atlas's web interaction capabilities using browser-based
games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird,
and Stein.world. We employ in-game performance scores as quantitative metrics
to assess performance across different task types. Our results show that Atlas
performs strongly in logical reasoning tasks like Sudoku, completing puzzles
significantly faster than human baselines, but struggles substantially in
real-time games requiring precise timing and motor control, often failing to
progress beyond initial obstacles. These findings suggest that while Atlas
demonstrates capable analytical processing, there remain notable limitations in
dynamic web environments requiring real-time interaction. The website of our
project can be found at https://atlas-game-eval.github.io.
☆ Unravelling the Mechanisms of Manipulating Numbers in Language Models
Michal Štefánik, Timothee Mickus, Marek Kadlčík, Bertram Højer, Michal Spiegel, Raúl Vázquez, Aman Sinha, Josef Kuchař, Philipp Mondorf
Recent work has shown that different large language models (LLMs) converge to
similar and accurate input embedding representations for numbers. These
findings conflict with the documented propensity of LLMs to produce erroneous
outputs when dealing with numeric information. In this work, we aim to explain
this conflict by exploring how language models manipulate numbers and quantify
the lower bounds of accuracy of these mechanisms. We find that despite
surfacing errors, different language models learn interchangeable
representations of numbers that are systematic, highly accurate and universal
across their hidden states and the types of input contexts. This allows us to
create universal probes for each LLM and to trace information -- including the
causes of output errors -- to specific layers. Our results provide a fundamental
understanding of how pre-trained LLMs manipulate numbers and outline the
potential of more accurate probing techniques for guiding refinements of LLMs'
architectures.
☆ Do LLMs Signal When They're Right? Evidence from Neuron Agreement
Large language models (LLMs) commonly boost reasoning via
sample-evaluate-ensemble decoders, achieving label-free gains without ground
truth. However, prevailing strategies score candidates using only external
outputs such as token probabilities, entropies, or self-evaluations, and these
signals can be poorly calibrated after post-training. We instead analyze
internal behavior based on neuron activations and uncover three findings: (1)
external signals are low dimensional projections of richer internal dynamics;
(2) correct responses activate substantially fewer unique neurons than
incorrect ones throughout generation; and (3) activations from correct
responses exhibit stronger cross-sample agreement, whereas incorrect ones
diverge. Motivated by these observations, we propose Neuron Agreement Decoding
(NAD), an unsupervised best-of-N method that selects candidates using
activation sparsity and cross-sample neuron agreement, operating solely on
internal signals and without requiring comparable textual outputs. NAD enables
early correctness prediction within the first 32 generated tokens and supports
aggressive early stopping. Across math and science benchmarks with verifiable
answers, NAD matches majority voting; on open-ended coding benchmarks where
majority voting is inapplicable, NAD consistently outperforms Avg@64. By
pruning unpromising trajectories early, NAD reduces token usage by 99% with
minimal loss in generation quality, showing that internal signals provide
reliable, scalable, and efficient guidance for label-free ensemble decoding.
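A toy sketch of the selection rule suggested by these findings: summarize each candidate by the set of neurons it activated, then prefer candidates with fewer unique neurons and higher agreement with the population's activation pattern. Shapes, thresholds, and the score weighting are assumptions, not the paper's implementation.

    # Illustrative neuron-agreement selection for best-of-N (assumed weighting).
    import torch

    def nad_select(activation_masks):
        """activation_masks: (N, num_neurons) bool tensor for N candidates."""
        masks = activation_masks.float()
        sparsity_score = -masks.sum(dim=1)                 # fewer unique neurons is better
        mean_pattern = masks.mean(dim=0, keepdim=True)     # population activation pattern
        agreement = (masks * mean_pattern).sum(dim=1) / masks.sum(dim=1).clamp(min=1)
        score = agreement + 0.01 * sparsity_score          # weighting chosen arbitrarily here
        return int(score.argmax())

    masks = torch.rand(8, 4096) > 0.7                      # 8 candidates, 4096 neurons (toy)
    print("selected candidate:", nad_select(masks))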
☆ PVMark: Enabling Public Verifiability for LLM Watermarking Schemes
Watermarking schemes for large language models (LLMs) have been proposed to
identify the source of the generated text, mitigating the potential threats
emerging from model theft. However, current watermarking solutions hardly
resolve the trust issue: non-public watermark detection cannot prove that it
faithfully conducts the detection. We observe that this is attributed to the
secret key on which watermark detection typically relies -- it cannot be public, or
the adversary may launch removal attacks provided the key; nor can it be
private, or the watermarking detection is opaque to the public. To resolve the
dilemma, we propose PVMark, a plugin based on zero-knowledge proof (ZKP),
enabling the watermark detection process to be publicly verifiable by third
parties without disclosing any secret key. PVMark hinges upon the proof of
`correct execution' of watermark detection on which a set of ZKP constraints
are built, including mapping, random number generation, comparison, and
summation. We implement multiple variants of PVMark in Python, Rust and Circom,
covering combinations of three watermarking schemes, three hash functions, and
four ZKP protocols, to show our approach effectively works under a variety of
circumstances. Experimental results show that PVMark efficiently enables public
verifiability for state-of-the-art LLM watermarking schemes without
compromising watermarking performance, making it promising for practical
deployment.
comment: This work has been submitted to the IEEE for possible publication
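For intuition, the sketch below shows a generic key-seeded green-list watermark
detector -- the kind of mapping, comparison, and summation computation that
PVMark would encode as zero-knowledge constraints so third parties can verify
detection without learning the key. The hashing rule and thresholds are
assumptions for illustration, not the paper's construction.
```python
# Generic keyed green-list detection; PVMark's contribution is proving correct
# execution of steps like these in zero knowledge, which this plain Python
# sketch does not attempt.
import hashlib
import math

def green_fraction(token_ids: list, secret_key: bytes, gamma: float = 0.5) -> float:
    """Fraction of tokens falling in the key-dependent 'green' set."""
    hits = 0
    for prev, cur in zip(token_ids, token_ids[1:]):
        seed = hashlib.sha256(secret_key + prev.to_bytes(8, "big")).digest()
        h = hashlib.sha256(seed + cur.to_bytes(8, "big")).digest()
        # Map the keyed hash to [0, 1); 'green' if it falls below gamma.
        if int.from_bytes(h[:8], "big") / 2**64 < gamma:
            hits += 1
    return hits / max(len(token_ids) - 1, 1)

def detect(token_ids: list, secret_key: bytes, gamma: float = 0.5, z_thresh: float = 4.0) -> bool:
    """Declare 'watermarked' if the green fraction sits far above the gamma baseline."""
    n = max(len(token_ids) - 1, 1)
    z = (green_fraction(token_ids, secret_key, gamma) - gamma) * math.sqrt(n) / math.sqrt(gamma * (1 - gamma))
    return z > z_thresh

print(detect([5, 17, 42, 99, 3], b"secret-key"))  # short toy sequence -> False
```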
☆ Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Sukrit Sriratanawilai, Jhayahgrit Thongwat, Romrawin Chumpu, Patomporn Payoungkhamdee, Sarana Nutanong, Peerat Limkonchotiwat
Vision-language models (VLMs) exhibit uneven performance across languages, a
problem that is often exacerbated when the model size is reduced. While
knowledge distillation (KD) has shown promising results in transferring
knowledge from larger to smaller VLMs, applying KD in multilingual settings
remains underexplored. This paper presents a controlled empirical study of KD
behavior across five distillation approaches, isolating their effects on
cross-lingual representation consistency and downstream performance stability
under model compression. We study five distillation formulations across CLIP
and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual
QA. We find that some configurations preserve or even improve multilingual
retrieval robustness despite halving model size, but others fail to maintain
cross-task stability, exposing design-sensitive trade-offs that aggregate
accuracy alone does not reveal.
comment: Work in progress
☆ Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
Throughout language history, words are borrowed from one language to another
and gradually become integrated into the recipient's lexicon. Speakers can
often differentiate these loanwords from native vocabulary, particularly in
bilingual communities where a dominant language continuously imposes lexical
items on a minority language. This paper investigates whether pretrained
language models, including large language models, possess similar capabilities
for loanword identification. We evaluate multiple models across 10 languages.
Despite explicit instructions and contextual information, our results show that
models perform poorly in distinguishing loanwords from native ones. These
findings corroborate previous evidence that modern NLP systems exhibit a bias
toward loanwords rather than native equivalents. Our work has implications for
developing NLP tools for minority languages and supporting language
preservation in communities under lexical pressure from dominant languages.
comment: Under review
☆ Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
The ability to accurately interpret implied meanings plays a crucial role in
human communication and language use, and language models are also expected to
possess this capability. This study demonstrates that providing language models
with pragmatic theories as prompts is an effective in-context learning approach
for tasks that require understanding implied meanings. Specifically, we propose an approach
in which an overview of pragmatic theories, such as Gricean pragmatics and
Relevance Theory, is presented as a prompt to the language model, guiding it
through a step-by-step reasoning process to derive a final interpretation.
Experimental results showed that, compared to the baseline, which prompts
intermediate reasoning without presenting pragmatic theories (0-shot
Chain-of-Thought), our methods enabled language models to achieve up to 9.6\%
higher scores on pragmatic reasoning tasks. Furthermore, we show that even
without explaining the details of pragmatic theories, merely mentioning their
names in the prompt leads to a certain performance improvement (around 1-3%) in
larger models compared to the baseline.
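A hypothetical prompt-construction sketch in the spirit of the approach above;
the theory overview and instructions are paraphrased stand-ins rather than the
paper's exact prompts.
```python
# Builds a prompt that presents a brief overview of pragmatic theories before
# asking for step-by-step interpretation of an utterance (illustrative wording).
PRAGMATICS_OVERVIEW = (
    "Gricean pragmatics: speakers are assumed to be cooperative and to follow "
    "maxims of quantity, quality, relation, and manner; apparent violations "
    "signal an implicature. Relevance Theory: utterances are interpreted by "
    "seeking the most relevant reading for the least processing effort."
)

def build_prompt(utterance: str, context: str) -> str:
    return (
        f"{PRAGMATICS_OVERVIEW}\n\n"
        f"Context: {context}\n"
        f"Utterance: \"{utterance}\"\n\n"
        "Using the theories above, reason step by step about what was literally "
        "said, which maxim or relevance expectation is at play, and what is "
        "implied. Then state the final interpretation."
    )

print(build_prompt("It's getting late.", "A guest says this to the host at a party."))
```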
☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT) -- whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
comment: 10 pages
☆ Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Retrieval-augmented generation (RAG) has emerged as a leading approach to
reducing hallucinations in large language models (LLMs). Current RAG evaluation
benchmarks primarily focus on what we call local RAG: retrieving relevant
chunks from a small subset of documents to answer queries that require only
localized understanding within specific text chunks. However, many real-world
applications require a fundamentally different capability -- global RAG --
which involves aggregating and analyzing information across entire document
collections to derive corpus-level insights (for example, "What are the top 10
most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first
benchmark specifically designed to evaluate global RAG capabilities, covering
four core task types: counting, extremum queries, sorting, and top-k
extraction. Through systematic evaluation across different models and
baselines, we find that existing RAG methods perform poorly on global tasks,
with the strongest baseline achieving an F1 score of only 1.51. To address these
challenges, we propose GlobalRAG, a multi-tool collaborative framework that
preserves structural coherence through chunk-level retrieval, incorporates
LLM-driven intelligent filters to eliminate noisy documents, and integrates
aggregation modules for precise symbolic computation. On the Qwen2.5-14B model,
GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1,
validating the effectiveness of our method.
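The symbolic aggregation stage can be pictured as a small exact-computation
module over the retrieved and filtered records; the sketch below is a
simplified illustration with assumed field names, not the GlobalRAG
implementation.
```python
# Exact corpus-level aggregation over structured records produced by retrieval
# and LLM-driven filtering, covering the four GlobalQA task types.
def aggregate(records: list, task: str, key: str, k: int = 10):
    """records: e.g. [{"title": "...", "citations": 120, "year": 2023}, ...]"""
    if task == "count":
        return len(records)
    if task == "extremum":
        return max(records, key=lambda r: r[key])
    if task == "sort":
        return sorted(records, key=lambda r: r[key], reverse=True)
    if task == "topk":
        return sorted(records, key=lambda r: r[key], reverse=True)[:k]
    raise ValueError(f"unknown task: {task}")

papers = [{"title": "A", "citations": 310}, {"title": "B", "citations": 95},
          {"title": "C", "citations": 540}]
print(aggregate(papers, "topk", "citations", k=2))  # C then A
```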
☆ What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Human feedback can alter language models in unpredictable and undesirable
ways, as practitioners lack a clear understanding of what feedback data
encodes. While prior work studies preferences over certain attributes (e.g.,
length or sycophancy), automatically extracting relevant features without
pre-specifying hypotheses remains challenging. We introduce What's In My Human
Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders.
WIMHF characterizes both (1) the preferences a dataset is capable of measuring
and (2) the preferences that the annotators actually express. Across 7
datasets, WIMHF identifies a small number of human-interpretable features that
account for the majority of the preference prediction signal achieved by
black-box models. These features reveal a wide diversity in what humans prefer,
and the role of dataset-level context: for example, users on Reddit prefer
informality and jokes, while annotators in HH-RLHF and PRISM disprefer them.
WIMHF also surfaces potentially unsafe preferences, such as that LMArena users
tend to vote against refusals, often in favor of toxic content. The learned
features enable effective data curation: re-labeling the harmful examples in
Arena yields large safety gains (+37%) with no cost to general performance.
They also allow fine-grained personalization: on the Community Alignment
dataset, we learn annotator-specific weights over subjective features that
improve preference prediction. WIMHF provides a human-centered analysis method
for practitioners to better understand and use preference data.
comment: Code: https://github.com/rmovva/wimhf
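As a rough illustration of explaining preferences with sparse features, the
sketch below fits a sparse linear probe on feature differences between chosen
and rejected responses; the `sae_features` placeholder and the L1 logistic
probe are assumptions standing in for WIMHF's actual sparse-autoencoder
pipeline.
```python
# Identify which (placeholder) sparse features carry the preference signal by
# fitting an L1-regularized probe on chosen-minus-rejected feature differences.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sae_features(texts: list, dim: int = 64) -> np.ndarray:
    """Placeholder for SAE activations: a sparse, nonnegative (n_texts, dim) matrix."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    feats = rng.random((len(texts), dim))
    return np.where(feats > 0.9, feats, 0.0)  # keep roughly 10% of activations

def fit_preference_probe(chosen: list, rejected: list):
    x = sae_features(chosen) - sae_features(rejected)    # feature differences
    x_full = np.vstack([x, -x])                          # symmetrize pair order
    y_full = np.concatenate([np.ones(len(x)), np.zeros(len(x))])
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    probe.fit(x_full, y_full)
    # Large-magnitude coefficients point at features driving annotator choices.
    return np.argsort(-np.abs(probe.coef_[0]))[:5], probe

chosen = [f"helpful, polite answer {i}" for i in range(20)]
rejected = [f"curt answer {i}" for i in range(20)]
top_features, _ = fit_preference_probe(chosen, rejected)
print("candidate preference-driving features:", top_features)
```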
☆ Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation NeurIPS 2025
While diffusion language models (DLMs) enable fine-grained refinement, their
practical controllability remains fragile. We identify and formally
characterize a central failure mode called update forgetting, in which uniform
and context-agnostic updates induce token-level fluctuations across timesteps,
erasing earlier semantic edits and disrupting the cumulative refinement
process, thereby degrading fluency and coherence. Because this failure
originates in uniform and context-agnostic updates, effective control demands
explicit token ordering. We propose Token Timestep Allocation (TTA), which
realizes soft and semantic token ordering via per-token timestep schedules:
critical tokens are frozen early, while uncertain tokens receive continued
refinement. This timestep-based ordering can be instantiated as either a fixed
policy or an
adaptive policy driven by task signals, thereby supporting a broad spectrum of
refinement strategies. Because it operates purely at inference time, it applies
uniformly across various DLMs and naturally extends to diverse supervision
sources. Empirically, TTA improves controllability and fluency: on sentiment
control, it yields more than 20 percent higher accuracy and nearly halves
perplexity using less than one fifth the steps; in detoxification, it lowers
maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0).
Together, these results demonstrate that softened ordering via timestep
allocation is the critical lever for mitigating update forgetting and achieving
stable and controllable diffusion text generation.
comment: Accepted in NeurIPS 2025
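A minimal sketch of a per-token timestep schedule in the spirit of TTA:
confident tokens are frozen early while uncertain tokens keep being refined.
The linear mapping from confidence to freeze step is an illustrative policy,
not the paper's exact fixed or adaptive schedule.
```python
# Allocate a freeze step per token from its confidence, then run refinement
# that only updates tokens which are not yet frozen.
def allocate_timesteps(confidences: list, total_steps: int) -> list:
    """Return, for each token, the last step index at which it is still updated."""
    freeze_steps = []
    for c in confidences:
        c = min(max(c, 0.0), 1.0)
        # High confidence -> freeze early; low confidence -> refine until the end.
        freeze_steps.append(round((1.0 - c) * (total_steps - 1)))
    return freeze_steps

def refine(tokens: list, confidences: list, total_steps: int, update_fn):
    """update_fn(token, step) -> token; frozen tokens are passed through unchanged."""
    schedule = allocate_timesteps(confidences, total_steps)
    for step in range(total_steps):
        tokens = [update_fn(t, step) if step <= schedule[i] else t
                  for i, t in enumerate(tokens)]
    return tokens

print(allocate_timesteps([0.9, 0.2, 0.6], total_steps=10))  # [1, 7, 4]
```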
☆ RCScore: Quantifying Response Consistency in Large Language Models
Current LLM evaluations often rely on a single instruction template,
overlooking models' sensitivity to instruction style -- a critical aspect for
real-world deployments. We present RCScore, a multi-dimensional framework
quantifying how instruction formulation affects model responses. By
systematically transforming benchmark problems into multiple instruction
styles, RCScore reveals performance variations undetected by conventional
metrics. Our experiments across ten LLMs on four reasoning benchmarks
demonstrate that instruction style can shift accuracy by up to 16.7 percentage points. We
introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to
measure stylistic self-consistency, and establish its strong correlation with
task accuracy, suggesting consistency as a valuable proxy for model
reliability. Additional findings show that deterministic decoding produces more
stylistically stable outputs, and model scale correlates positively with
cross-style consistency. RCScore offers a principled approach to assess
instruction robustness.
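A toy sketch of a Cross-Response Similarity style computation: one response per
instruction style for the same problem, scored by mean pairwise similarity.
Token-level Jaccard is a stand-in for whatever similarity measure RCScore
actually applies.
```python
# Mean pairwise similarity across responses produced under different
# instruction styles for the same underlying problem.
from itertools import combinations

def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def cross_response_similarity(responses_by_style: dict) -> float:
    """Maps an instruction style name to the model's answer under that style."""
    pairs = list(combinations(responses_by_style.values(), 2))
    if not pairs:
        return 1.0
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)

styles = {"imperative": "the answer is 42 because the pattern doubles",
          "question":   "42, since the pattern doubles",
          "fill_in":    "answer: 42"}
print(f"CRS = {cross_response_similarity(styles):.2f}")
```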
☆ SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level
The evaluation of intelligibility for TTS has reached a bottleneck, as
existing assessments heavily rely on word-by-word accuracy metrics such as WER,
which fail to capture the complexity of real-world speech or reflect human
comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice
Question Answering, a novel subjective approach evaluating the accuracy of key
information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour
news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal
that low WER does not necessarily guarantee high key-information accuracy,
exposing a gap between traditional metrics and practical intelligibility.
SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text
normalization and phonetic accuracy. This work underscores the urgent need for
high-level, more life-like evaluation criteria now that many systems already
excel at WER yet may fall short on real-world intelligibility.
☆ Similarity-Distance-Magnitude Language Models
We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which
are sequence prediction models fine-tuned to maximize the proportion of
generations in the well-calibrated, high-probability region partitioned by a
final-layer SDM activation layer used for binary classification of
instruction-following. We demonstrate that existing pre-trained decoder-only
Transformer LMs can be readily converted into SDM LMs via supervised
fine-tuning, using the final-layer SDM activation layer during training to
estimate a change-of-base for a supervised next-token loss over a contrastive
input encoding scheme, with additional hard negative examples generated online
during training. This results in reduced abstentions (i.e., improved
statistical efficiency) compared to strong supervised baselines.
comment: 8 pages, 5 tables
☆ MossNet: Mixture of State-Space Experts is a Multi-Head Attention
Shikhar Tuli, James Seale Smith, Haris Jeelani, Chi-Heng Lin, Abhishek Patel, Vasili Ramanishka, Yen-Chang Hsu, Hongxia Jin
Large language models (LLMs) have significantly advanced generative
applications in natural language processing (NLP). Recent trends in model
architectures revolve around efficient variants of transformers or
state-space/gated-recurrent models (SSMs, GRMs). However, prevailing
SSM/GRM-based methods often emulate only a single attention head, potentially
limiting their expressiveness. In this work, we propose MossNet, a novel
mixture-of-state-space-experts architecture that emulates a linear multi-head
attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation
not only in channel-mixing multi-layered perceptron (MLP) blocks but also in
the time-mixing SSM kernels to realize multiple "attention heads." Extensive
experiments on language modeling and downstream evaluations show that MossNet
outperforms both transformer- and SSM-based architectures of similar model size
and data budgets. Larger variants of MossNet, trained on trillions of tokens,
further confirm its scalability and superior performance. In addition,
real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU
demonstrates favorable runtime speed and resource usage compared to similarly
sized baselines. Our results suggest that MossNet is a compelling new direction
for efficient, high-performing recurrent LLM architectures.
☆ One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
Reward models (RMs) play a critical role in aligning large language models
(LLMs) with human preferences. Yet in the domain of tool learning, the lack of
RMs specifically designed for function-calling tasks has limited progress
toward more capable agentic AI. We introduce ToolRM, a family of lightweight
generative RMs tailored for general tool-use scenarios. To build these models,
we propose a novel pipeline that constructs pairwise preference data using
rule-based scoring and multidimensional sampling. This yields
ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique
tasks that supports reinforcement learning with verifiable feedback. To
evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on
the agentic evaluation suite BFCL. Trained on our constructed data, models from
the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially
outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward
judgments. Beyond training objectives, ToolRM generalizes to broader critique
tasks, including Best-of-N sampling and self-correction. Experiments on
ACEBench highlight its effectiveness and efficiency, enabling inference-time
scaling and reducing output token usage by over 66%. We release data and model
checkpoints to facilitate future research.
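The rule-based scoring used to build pairwise preference data can be sketched
as checking sampled function calls against simple criteria and pairing the
best-scoring call with the worst for the same query; the specific checks and
JSON layout below are assumptions for illustration, not the paper's pipeline.
```python
# Score candidate tool calls with cheap rule checks and form a (chosen, rejected)
# preference pair per query.
import json

def score_call(call_json: str, expected_name: str, required_args: set) -> int:
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError:
        return 0                                              # unparseable call
    score = 1
    score += call.get("name") == expected_name                # right tool selected
    score += required_args <= set(call.get("arguments", {}))  # required args present
    return score

def make_pair(candidates: list, expected_name: str, required_args: set):
    ranked = sorted(candidates, key=lambda c: score_call(c, expected_name, required_args))
    chosen, rejected = ranked[-1], ranked[0]
    return (chosen, rejected) if chosen != rejected else None

cands = ['{"name": "get_weather", "arguments": {"city": "Paris"}}',
         '{"name": "get_news", "arguments": {}}',
         'not json at all']
print(make_pair(cands, "get_weather", {"city"}))
```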
☆ Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math
Reinforcement learning (RL) can elicit strong reasoning in large language
models (LLMs), yet most open efforts focus on math and code. We propose
Reasoning Curriculum, a simple two-stage curriculum that first elicits
reasoning skills in pretraining-aligned domains such as math, then adapts and
refines these skills across other domains via joint RL. Stage 1 performs a
brief cold start and then math-only RL with verifiable rewards to develop
reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and
consolidate these skills. The curriculum is minimal and backbone-agnostic,
requiring no specialized reward models beyond standard verifiability checks.
Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning
curriculum yields consistent gains. Ablations and a cognitive-skill analysis
indicate that both stages are necessary and that math-first elicitation
increases cognitive behaviors important for solving complex problems. Reasoning
Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
comment: 9 pages
☆ On the Influence of Discourse Relations in Persuasive Texts
This paper investigates the relationship between Persuasion Techniques (PTs)
and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and
prompt engineering. Since no dataset annotated with both PTs and DRs exists, we
took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point
and developed LLM-based classifiers to label each instance of the dataset with
one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10
different prompts, resulting in 40 unique DR classifiers. Ensemble models using
different majority-pooling strategies were used to create 5 silver datasets of
instances labelled with both persuasion techniques and level-2 PDTB senses. The
silver dataset sizes vary from 1,281 instances to 204 instances, depending on
the majority pooling technique used. Statistical analysis of these silver
datasets shows that six discourse relations (namely Cause, Purpose, Contrast,
Cause+Belief, Concession, and Condition) play a crucial role in persuasive
texts, especially in the use of Loaded Language, Exaggeration/Minimisation,
and Repetition, and in casting Doubt. This insight can contribute to detecting
online
propaganda and misinformation, as well as to our general understanding of
effective communication.
comment: Published in Proceedings of the 38th Canadian Conference on
Artificial Intelligence (CanAI 2025), Calgary, Alberta, May 26-27, 2025. 5
figures, 7 tables
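The majority-pooling step can be sketched as a vote over the 40 classifiers'
labels for each instance, keeping a silver label only when agreement exceeds a
threshold; stricter thresholds produce the smaller, higher-precision silver
sets. The data layout below is illustrative.
```python
# Pool discourse-relation predictions from many LLM/prompt classifiers into
# silver labels via a minimum-agreement vote.
from collections import Counter

def pool_labels(predictions: list, min_agreement: int) -> list:
    """predictions[i]: the PDTB level-2 labels assigned to instance i by each classifier."""
    silver = []
    for labels in predictions:
        label, votes = Counter(labels).most_common(1)[0]
        silver.append(label if votes >= min_agreement else None)  # None = discarded
    return silver

preds = [["Cause", "Cause", "Contrast", "Cause"],
         ["Concession", "Purpose", "Contrast", "Condition"]]
print(pool_labels(preds, min_agreement=3))  # ['Cause', None]
```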
☆ Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
While Test-Time Scaling (TTS) has proven effective in improving the reasoning
ability of large language models (LLMs), low diversity in model outputs often
becomes a bottleneck; this is partly caused by the common "one problem, one
solution" (1P1S) training practice, which provides a single canonical answer
and can push models toward a narrow set of reasoning paths. To address this, we
propose a "one problem, multiple solutions" (1PNS) training paradigm that
exposes the model to a variety of valid reasoning trajectories and thus
increases inference diversity. A core challenge for 1PNS is reliably measuring
semantic differences between multi-step chains of thought, so we introduce
Reasoning Path Divergence (RPD), a step-level metric that aligns and scores
Long Chain-of-Thought solutions to capture differences in intermediate
reasoning. Using RPD, we curate maximally diverse solution sets per problem and
fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields
more varied outputs and higher pass@k, with an average +2.80% gain in pass@16
over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that
1PNS further amplifies the effectiveness of TTS. Our code is available at
https://github.com/fengjujf/Reasoning-Path-Divergence .
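A rough sketch of a step-level divergence in the spirit of RPD: split two
chains of thought into steps, greedily align each step to its most similar
counterpart, and report one minus the mean matched similarity. Token overlap
stands in for the real step-similarity model, and the greedy alignment is an
assumed simplification.
```python
# Step-level divergence between two long chain-of-thought solutions.
def steps(cot: str) -> list:
    return [s.strip() for s in cot.split("\n") if s.strip()]

def step_sim(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def reasoning_path_divergence(cot_a: str, cot_b: str) -> float:
    sa, sb = steps(cot_a), steps(cot_b)
    if not sa or not sb:
        return 1.0
    # Align each step of the shorter solution to its best match in the longer one.
    short, long_ = (sa, sb) if len(sa) <= len(sb) else (sb, sa)
    matched = [max(step_sim(s, t) for t in long_) for s in short]
    return 1.0 - sum(matched) / len(matched)

a = "factor the equation\nsolve for x\ncheck the roots"
b = "apply the quadratic formula\nsolve for x\nverify the solution"
print(f"RPD ~ {reasoning_path_divergence(a, b):.2f}")
```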
☆ QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura
Large language models (LLMs) have increasingly been applied to automatic
programming code generation. This task can be viewed as a language generation
task that bridges natural language, human knowledge, and programming logic.
However, it remains underexplored in domains that require interaction with
hardware devices, such as quantum programming, where human coders write Python
code that is executed on a quantum computer. To address this gap, we introduce
QCoder Benchmark, an evaluation framework that assesses LLMs on quantum
programming with feedback from simulated hardware devices. Our benchmark offers
two key features. First, it supports evaluation using a quantum simulator
environment beyond conventional Python execution, allowing feedback of
domain-specific metrics such as circuit depth, execution time, and error
classification, which can be used to guide better generation. Second, it
incorporates human-written code submissions collected from real programming
contests, enabling both quantitative comparisons and qualitative analyses of
LLM outputs against human-written codes. Our experiments reveal that even
advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting
the difficulty of the benchmark. In contrast, reasoning-based models such as o3
reach up to 78% accuracy, outperforming the average success rate of
human-written code (39.98%). We release the QCoder Benchmark dataset and a
public evaluation
API to support further research.
☆ ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests NeurIPS 2025
Jingyuan He, Jiongnan Liu, Vishan Vishesh Oberoi, Bolin Wu, Mahima Jagadeesh Patel, Kangrui Mao, Chuning Shi, I-Ta Lee, Arnold Overwijk, Chenyan Xiong
Recommender systems are among the most impactful AI applications, interacting
with billions of users every day, guiding them to relevant products, services,
or information tailored to their preferences. However, the research and
development of recommender systems are hindered by existing datasets that fail
to capture realistic user behaviors and inconsistent evaluation settings that
lead to ambiguous conclusions. This paper introduces the Open Recommendation
Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified
benchmark for consistent and realistic evaluation of recommendation models.
ORBIT offers a standardized evaluation framework of public datasets with
reproducible splits and transparent settings for its public leaderboard.
Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco,
featuring web browsing sequences from 87 million public, high-quality webpages.
ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and
privacy-guaranteed browsing data. It aligns with modern recommendation
scenarios and is reserved as the hidden test part of our leaderboard to
challenge recommendation models' generalization ability. ORBIT measures 12
representative recommendation models on its public benchmark and introduces a
prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results
reflect general improvements of recommender systems on the public datasets,
with variable individual performances. The results on the hidden test reveal
the limitations of existing approaches in large-scale webpage recommendation
and highlight the potential for improvements with LLM integrations. ORBIT
benchmark, leaderboard, and codebase are available at
https://www.open-reco-bench.ai.
comment: Accepted to NeurIPS 2025 Datasets & Benchmarks track
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and
transferring knowledge between models. However, its effect on a model's
robustness against spurious correlations that degrade performance on
out-of-distribution data remains underexplored. This study investigates the
effect of knowledge distillation on the transferability of ``debiasing''
capabilities from teacher models to student models on natural language
inference (NLI) and image classification tasks. Through extensive experiments,
we illustrate several key findings: (i) overall the debiasing capability of a
model is undermined post-KD; (ii) training a debiased model does not benefit
from injecting teacher knowledge; (iii) although the overall robustness of a
model may remain stable post-distillation, significant variations can occur
across different types of biases; and (iv) we pinpoint the internal attention
pattern and circuit that cause the distinct behavior post-KD. Given the above
findings, we propose three effective solutions to improve the distillability of
debiasing methods: developing high-quality data for augmentation, implementing
iterative knowledge distillation, and initializing student models with weights
obtained from teacher models. To the best of our knowledge, this is the first
large-scale study of the effect of KD on debiasing and its internal mechanism.
Our findings provide an understanding of how KD works and how to design better
debiasing methods.
☆ SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
The ability of LLM agents to plan and invoke tools exposes them to new safety
risks, making a comprehensive red-teaming system crucial for discovering
vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic
red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic
two-step process that starts with an agent definition and generates diverse
seed test cases that cover various risk outcomes, tool-use trajectories, and
risk sources. Then, it iteratively constructs and refines model-based
adversarial attacks based on the execution trajectories of former attempts. To
optimize the red-teaming cost, we present a model distillation approach that
leverages structured forms of a teacher model's reasoning to train smaller
models that are equally effective. Across diverse evaluation agent settings,
our seed test case generation approach yields a 2--2.5x boost to the coverage
of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer
model improves the attack success rate by 100%, surpassing the 671B DeepSeek-R1
model. Our ablations and analyses validate the effectiveness of the iterative
framework, structured reasoning, and the generalization of our red-teamer
models.
☆ Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings
Felipe Larios, Mariana Borras-Osorio, Yuqi Wu, Ana Gabriela Claros, David Toro-Tobon, Esteban Cabezas, Ricardo Loor-Torres, Maria Mateo Chavez, Kerly Guevara Maldonado, Luis Vilatuna Andrango, Maria Lizarazo Jimenez, Ivan Mateo Alzamora, Misk Al Zahidy, Marcelo Montero, Ana Cristina Proano, Cristian Soto Jacome, Jungwei W. Fan, Oscar J. Ponce-Ponte, Megan E. Branda, Naykky Singh Ospina, Juan P. Brito
Importance Incidental thyroid findings (ITFs) are increasingly detected on
imaging performed for non-thyroid indications. Their prevalence, features, and
clinical consequences remain undefined. Objective To develop, validate, and
deploy a natural language processing (NLP) pipeline to identify ITFs in
radiology reports and assess their prevalence, features, and clinical outcomes.
Design, Setting, and Participants Retrospective cohort of adults without prior
thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from
July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline
identified ITFs and extracted nodule characteristics from image reports from
multiple modalities and body regions. Main Outcomes and Measures Prevalence of
ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer
diagnosis. Logistic regression identified demographic and imaging-related
factors. Results Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9%
women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more
likely in women, older adults, those with higher BMI, and when imaging was
ordered by oncology or internal medicine. Compared with chest CT, ITFs were
more likely via neck CT, PET, and nuclear medicine scans. Nodule
characteristics were poorly documented, with size reported in 44% and other
features in fewer than 15% (e.g. calcifications). Compared with patients
without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis,
biopsy, thyroidectomy, and thyroid cancer diagnosis. Most cancers were
papillary and were larger when detected after an ITF than without one.
Conclusions ITFs were
common and strongly associated with cascades leading to the detection of small,
low-risk cancers. These findings underscore the role of ITFs in thyroid cancer
overdiagnosis and the need for standardized reporting and more selective
follow-up.
♻ ☆ TinyTim: A Family of Language Models for Divergent Generation NeurIPS
In the search for artificial general intelligence, model development and
training has focused primarily on vast datasets of known problems and their
accepted solutions. This process necessarily produces convergent systems which
are fundamentally incapable of the conceptual reframing that is required for
genuine creative breakthroughs. Inspired by the divergent cognitive processes
that allow humans to make such creative leaps, our work introduces a family of
language models, TinyTim, to serve as sources of divergent generation within
broader systems. These models have been created by fine-tuning on the
anti-parsimonious text of James Joyce's `Finnegans Wake'. Quantitative analysis
of both an unsupervised fine-tuned model (TinyTim-V1) and a new
instruction-tuned variant (TinyTim-V2) demonstrates a profound capacity for
lexical invention; the foundational V1 model exhibits a Yule's K score for
lexical richness over twenty times greater than that of convergent baselines.
This trait is a stable property of the family, as the instruction-tuned V2
maintains a statistically distinct profile and resists factual convergence,
sacrificing benchmark performance to preserve its core generative style. This
work establishes a methodology for engineering specialized divergent models
that, when paired with convergent systems, can reframe problems and force
breakthroughs beyond the reach of statistical optimization alone.
comment: 7 pages, 3 figures, accepted to NeurIPS Creative AI track, models
available at https://hf.co/npc-worldwide/
♻ ☆ Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag
Current evaluations of agents remain centered around one-shot task
completion, failing to account for the inherently iterative and collaborative
nature of many real-world problems, where human goals are often underspecified
and evolve. We argue for a shift from building and assessing task completion
agents to developing collaborative agents, assessed not only by the quality of
their final outputs but by how well they engage with and enhance human effort
throughout the problem-solving process. To support this shift, we introduce
collaborative effort scaling, a framework that captures how an agent's utility
grows with increasing user involvement. Through case studies and simulated
evaluations, we show that state-of-the-art agents often underperform in
multi-turn, real-world scenarios, revealing a missing ingredient in agent
design: the ability to sustain engagement and scaffold user understanding.
Collaborative effort scaling offers a lens for diagnosing agent behavior and
guiding development toward more effective interactions.
comment: 22 pages, 5 figures, 3 tables
♻ ☆ Comparing human and LLM politeness strategies in free production EMNLP 2025
Polite speech poses a fundamental alignment challenge for large language
models (LLMs). Humans deploy a rich repertoire of linguistic strategies to
balance informational and social goals -- from positive approaches that build
rapport (compliments, expressions of interest) to negative strategies that
minimize imposition (hedging, indirectness). We investigate whether LLMs employ
a similarly context-sensitive repertoire by comparing human and LLM responses
in both constrained and open-ended production tasks. We find that larger models
($\ge$70B parameters) successfully replicate key preferences from the
computational pragmatics literature, and human evaluators surprisingly prefer
LLM-generated responses in open-ended contexts. However, further linguistic
analyses reveal that models disproportionately rely on negative politeness
strategies even in positive contexts, potentially leading to
misinterpretations. While modern LLMs demonstrate an impressive handle on
politeness strategies, these subtle differences raise important questions about
pragmatic alignment in AI systems.
comment: 25 pages, 5 figures | EMNLP 2025 camera-ready version
♻ ☆ Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Integrating audio and visual data for training multimodal foundational models
remains a challenge. The Audio-Video Vector Alignment (AVVA) framework
addresses this by considering AV scene alignment beyond mere temporal
synchronization, and leveraging Large Language Models (LLMs) for data curation.
AVVA implements a scoring mechanism for selecting aligned training data
segments. It integrates Whisper, a speech-based foundation model, for audio and
DINOv2 for video analysis in a dual-encoder structure with contrastive learning
on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the
effectiveness of the proposed model architecture and data curation approach.
AVVA achieves a significant improvement in top-k accuracies for video-to-audio
retrieval on all datasets compared to DenseAV, while using only 192 hrs of
curated training data. Furthermore, an ablation study indicates that the data
curation process effectively trades data quantity for data quality, yielding
increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound,
compared to training on the full spectrum of uncurated data.
comment: 5 pages, 5 figures, 2 tables. Accepted at EUSIPCO 2025
♻ ☆ Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality EMNLP 2025
Supervised fine-tuning (SFT) is a critical step in aligning large language
models (LLMs) with human instructions and values, yet many aspects of SFT
remain poorly understood. We trained a wide range of base models on a variety
of datasets including code generation, mathematical reasoning, and
general-domain tasks, resulting in 1,000+ SFT models under controlled
conditions. We then identified the dataset properties that matter most and
examined the layer-wise modifications introduced by SFT. Our findings reveal
that some training-task synergies persist across all models while others vary
substantially, emphasizing the importance of model-specific strategies.
Moreover, we demonstrate that perplexity consistently predicts SFT
effectiveness, often surpassing superficial similarity between the training
data and the benchmark, and that mid-layer weight changes correlate most
strongly with performance gains. We release these 1,000+ SFT models and
benchmark results to accelerate further research. All resources are available
at https://github.com/llm-jp/massive-sft.
comment: Accepted to EMNLP 2025 (Main Conference). Models and evaluation
results available at: https://github.com/llm-jp/massive-sft
♻ ☆ Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training
Enhancing reasoning capabilities in small language models is critical for
specialized applications such as medical question answering, particularly in
underrepresented languages like Persian. In this study, we employ Reinforcement
Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to
improve the reasoning skills of a general-purpose Persian language model. To
achieve this, we translated a multiple-choice medical question-answering
dataset into Persian and used RLAIF to generate rejected-preferred answer
pairs, which are essential for DPO training. By prompting both teacher and
student models to produce Chain-of-Thought (CoT) reasoning responses, we
compiled a dataset containing correct and incorrect reasoning trajectories.
This dataset, comprising 2 million tokens in preferred answers and 2.5 million
tokens in rejected ones, was used to train a baseline model, significantly
enhancing its medical reasoning capabilities in Persian. Remarkably, the
resulting model outperformed its predecessor, gaokerena-V, which was trained on
approximately 57 million tokens, despite leveraging a much smaller dataset.
These results highlight the efficiency and effectiveness of reasoning-focused
training approaches in developing domain-specific language models with limited
data availability.
comment: 7 pages, 5 figures
♻ ☆ Controlling Thinking Speed in Reasoning Models NeurIPS 2025
Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye
Human cognition is theorized to operate in two modes: fast, intuitive System
1 thinking and slow, deliberate System 2 thinking. While current Large
Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform
fast thinking leads to high computational overhead and latency. In this work,
we enable LRMs to approximate human intelligence through dynamic thinking speed
adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses
two key questions: (1) how to control thinking speed in LRMs, and (2) when to
adjust it for optimal performance. For the first question, we identify the
steering vector that governs slow-fast thinking transitions in LRMs'
representation space. Using this vector, we achieve the first representation
editing-based test-time scaling effect, outperforming existing prompt-based
scaling methods. For the second question, we apply real-time difficulty
estimation to signal reasoning segments of varying complexity. Combining these
techniques, we propose the first reasoning strategy that enables fast
processing of easy steps and deeper analysis for complex reasoning. Without any
training or additional cost, our plug-in module delivers an average +1.3%
accuracy with -8.6% token usage across leading LRMs and advanced reasoning
benchmarks. All of our algorithms are implemented based on vLLM and are
expected to support broader applications and inspire future research.
comment: NeurIPS 2025 Spotlight
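A framework-agnostic sketch of the steering-vector idea: estimate a
slow-versus-fast direction from hidden states collected under the two thinking
modes, then shift the residual stream along it at inference time. The layer
choice, scaling, and synthetic data below are assumptions for illustration.
```python
# Estimate a slow/fast steering direction and apply it as a hidden-state edit.
import numpy as np

def steering_vector(slow_hidden: np.ndarray, fast_hidden: np.ndarray) -> np.ndarray:
    """Both arrays: (num_examples, hidden_dim) hidden states at a chosen layer."""
    v = slow_hidden.mean(axis=0) - fast_hidden.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def edit_hidden(h: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """alpha > 0 pushes toward slower, deliberate decoding; alpha < 0 speeds it up."""
    return h + alpha * v

rng = np.random.default_rng(0)
slow = rng.normal(size=(32, 16)) + 0.5   # synthetic stand-ins for collected states
fast = rng.normal(size=(32, 16))
v = steering_vector(slow, fast)
h = rng.normal(size=16)
print(edit_hidden(h, v, alpha=-2.0)[:4])  # a "fast thinking" edit of one state
```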
♻ ☆ RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning
with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM
post-training, each offering distinct advantages. However, RLHF struggles with
interpretability and reward hacking because it relies on human judgments that
usually lack explicit criteria, whereas RLVR is limited in scope by its focus
on correctness-based verifiers. We propose Reinforcement Learning with Binary
Flexible Feedback (RLBFF), which combines the versatility of human-driven
preferences with the precision of rule-based verification, enabling reward
models to capture nuanced aspects of response quality beyond mere correctness.
RLBFF extracts principles that can be answered in a binary fashion (e.g.
accuracy of information: yes, or code readability: no) from natural language
feedback. Such principles can then be used to ground Reward Model training as
an entailment task (response satisfies or does not satisfy an arbitrary
principle). We show that Reward Models trained in this manner can outperform
Bradley-Terry models when matched for data and achieve top performance on
RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24,
2025). Additionally, users can specify principles of interest at inference time
to customize the focus of our reward models, in contrast to Bradley-Terry
models. Finally, we present a fully open source recipe (including data) to
align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the
performance of o3-mini and DeepSeek R1 on general alignment benchmarks of
MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models:
https://huggingface.co/collections/nvidia/reward-models-10-2025
comment: Added link to access models:
https://huggingface.co/collections/nvidia/reward-models-10-2025
♻ ☆ CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting ISWC 2025
Language interpretation is a compositional process, in which the meaning of
more complex linguistic structures is inferred from the meaning of their parts.
Large language models possess remarkable language interpretation capabilities
and have been successfully applied to interpret questions by mapping them to
SPARQL queries. An open question is how systematic this interpretation process
is. Toward this question, in this paper, we propose a benchmark for
investigating to what extent the abilities of LLMs to interpret questions are
actually compositional. For this, we generate three datasets of varying
difficulty based on graph patterns in DBpedia, relying on Lemon lexica for
verbalization. Our datasets are created in a very controlled fashion in order
to test the ability of LLMs to interpret structurally complex questions, given
that they have seen the atomic building blocks. This allows us to evaluate to
what degree LLMs are able to interpret complex questions for which they
"understand" the atomic parts. We conduct experiments with models of different
sizes using both various prompt and few-shot optimization techniques as well as
fine-tuning. Our results show that performance in terms of macro $F_1$ degrades
from $0.45$ through $0.26$ down to $0.09$ with increasing deviation from the
samples optimized on. Even when all necessary information was provided to the
model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of
lowest complexity. We thus conclude that LLMs struggle to systematically and
compositionally interpret questions and map them into SPARQL queries.
comment: Research Track, 24th International Semantic Web Conference (ISWC
2025), November 2-6, 2025, Nara, Japan
♻ ☆ Unveiling Unicode's Unseen Underpinnings in Undermining Authorship Attribution
When using a public communication channel -- whether formal or informal, such
as commenting or posting on social media -- end users have no expectation of
privacy: they compose a message and broadcast it for the world to see. Even if
an end user takes utmost precautions to anonymize their online presence --
using an alias or pseudonym; masking their IP address; spoofing their
geolocation; concealing their operating system and user agent; deploying
encryption; registering with a disposable phone number or email; disabling
non-essential settings; revoking permissions; and blocking cookies and
fingerprinting -- one obvious element still lingers: the message itself.
Assuming they avoid lapses in judgment or accidental self-exposure, there
should be little evidence to validate their actual identity, right? Wrong. The
content of their message -- necessarily open for public consumption -- exposes
an attack vector: stylometric analysis, or author profiling. In this paper, we
dissect the technique of stylometry, discuss an antithetical counter-strategy
in adversarial stylometry, and devise enhancements through Unicode
steganography.
comment: 33 pages, 7 figures, 3 tables
♻ ☆ Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom, Hari Sundaram, Koustuv Saha
On social media, many individuals experiencing suicidal ideation (SI) do not
disclose their distress explicitly. Instead, signs may surface indirectly
through everyday posts or peer interactions. Detecting such implicit signals
early is critical but remains challenging. We frame early and implicit SI as a
forward-looking prediction task and develop a computational framework that
models a user's information environment, consisting of both their longitudinal
posting histories as well as the discourse of their socially proximal peers. We
adopted a composite network centrality measure to identify top neighbors of a
user, and temporally aligned the user's and neighbors' interactions --
integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a
Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves
early and implicit SI detection by 15% over individual-only baselines. These
findings highlight that peer interactions offer valuable predictive signals and
carry broader implications for designing early detection systems that capture
indirect as well as masked expressions of risk in online environments.
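The neighbor-selection step can be sketched as ranking a user's
interaction-graph neighbors by a composite of standard centrality measures and
keeping the top-k; the equal-weight combination below is an assumed
instantiation, not necessarily the paper's exact composite measure.
```python
# Rank a user's neighbors by a composite centrality score and keep the top-k.
import networkx as nx

def top_neighbors(g: nx.Graph, user: str, k: int = 5) -> list:
    degree = nx.degree_centrality(g)
    between = nx.betweenness_centrality(g)
    close = nx.closeness_centrality(g)
    composite = {n: degree[n] + between[n] + close[n] for n in g.nodes}
    return sorted(g.neighbors(user), key=lambda n: composite[n], reverse=True)[:k]

g = nx.Graph([("u", "a"), ("u", "b"), ("u", "c"), ("a", "b"), ("b", "c"), ("c", "d")])
print(top_neighbors(g, "u", k=2))
```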
♻ ☆ LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Jailbreaks are adversarial attacks designed to bypass the built-in safety
mechanisms of large language models. Automated jailbreaks typically optimize an
adversarial suffix or adapt long prompt templates by forcing the model to
generate the initial part of a restricted or harmful response. In this work, we
show that existing jailbreak attacks that leverage such mechanisms to unlock
the model response can be detected by a straightforward perplexity-based
filtering on the input prompt. To overcome this issue, we propose LatentBreak,
a white-box jailbreak attack that generates natural adversarial prompts with
low perplexity capable of evading such defenses. LatentBreak substitutes words
in the input prompt with semantically-equivalent ones, preserving the initial
intent of the prompt, instead of adding high-perplexity adversarial suffixes or
long templates. These words are chosen by minimizing the distance in the latent
space between the representation of the adversarial prompt and that of harmless
requests. Our extensive evaluation shows that LatentBreak leads to shorter and
low-perplexity prompts, thus outperforming competing jailbreak algorithms
against perplexity-based filters on multiple safety-aligned models.
♻ ☆ Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs
In addition to its more widely studied cultural movements, American
Evangelicalism has a well-developed but less externally visible literary side.
Christian Fiction, however, has been little studied, and what scholarly
attention there is has focused on the explosively popular Left Behind series.
In this work, we use computational tools to provide both a broad topical
overview of Christian Fiction as a genre and a more directed exploration of how
its authors depict divine acts. Working with human annotators, we first
developed a codebook for identifying "acts of God." We then adapted the
codebook for use by a recent, lightweight LM with the assistance of a much
larger model. The laptop-scale LM is largely capable of matching human
annotations, even when the task is subtle and challenging. Using these
annotations, we show that significant and meaningful differences exist between
divine acts depicted by the Left Behind books and Christian Fiction more
broadly.
comment: Accepted to CHR 2025
♻ ☆ Unstructured Evidence Attribution for Long Context Query Focused Summarization EMNLP 2025
Large language models (LLMs) are capable of generating coherent summaries
from very long contexts given a user query, and extracting and citing evidence
spans helps improve the trustworthiness of these summaries. Whereas previous
work has focused on evidence citation with fixed levels of granularity (e.g.
sentence, paragraph, document, etc.), we propose to extract unstructured (i.e.,
spans of any length) evidence in order to acquire more relevant and consistent
evidence than in the fixed granularity case. We show how existing systems
struggle to copy and properly cite unstructured evidence, which also tends to
be "lost-in-the-middle". To help models perform this task, we create the
Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset
generated using a novel pipeline, which can be used as training supervision for
unstructured evidence summarization. We demonstrate across 5 LLMs and 4
datasets spanning human written, synthetic, single, and multi-document settings
that LLMs adapted with SUnsET generate more relevant and factually consistent
evidence with their summaries, extract evidence from more diverse locations in
their context, and can generate more relevant and consistent summaries than
baselines with no fine-tuning and fixed granularity evidence. We release SUnsET
and our generation code to the public.
comment: EMNLP 2025 Main; 29 pages; 24 figures; 8 tables
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein
Large language models (LLMs) tend to generate lexically, semantically, and
stylistically homogenous texts. This poses a risk of knowledge collapse, where
homogenous LLMs mediate a shrinking in the range of accessible information over
time. Existing works on homogenization are limited by a focus on closed-ended
multiple-choice setups or fuzzy semantic features, and do not look at trends
across time and cultural contexts. To overcome this, we present a new
methodology to measure epistemic diversity, i.e., variation in real-world
claims in LLM outputs, which we use to perform a broad empirical study of LLM
knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200
prompt variations sourced from real user chats. For the topics in our study, we
show that while newer models tend to generate more diverse claims, nearly all
models are less epistemically diverse than a basic web search. We find that
model size has a negative impact on epistemic diversity, while
retrieval-augmented generation (RAG) has a positive impact, though the
improvement from RAG varies by the cultural context. Finally, compared to a
traditional knowledge source (Wikipedia), we find that country-specific claims
reflect the English language more than the local one, highlighting a gap in
epistemic representation.
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for
table 3, random effect is the model version; v3 changelog: Fixed minor
formatting issues in tables 2 and 3; v4 changelog: Fixed some typos and model
description
♻ ☆ Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis
Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract
fine-grained information from image-text pairs to identify aspect terms and
determine their sentiment polarity. However, existing approaches often fall
short in simultaneously addressing three core challenges: Sentiment Cue
Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise
Elimination (SNE). To overcome these limitations, we propose DASCO
(\textbf{D}ependency Structure \textbf{A}ugmented \textbf{Sco}ping Framework),
a fine-grained scope-oriented framework that enhances aspect-level sentiment
reasoning by leveraging dependency parsing trees. First, we designed a
multi-task pretraining strategy for MABSA on our base model, combining
aspect-oriented enhancement, image-text matching, and aspect-level
sentiment-sensitive cognition. This improved the model's perception of aspect
terms and sentiment cues while achieving effective image-text alignment,
addressing key challenges like SCP and MIM. Furthermore, we incorporate
dependency trees as a syntactic branch combined with the semantic branch,
guiding the model to selectively attend to critical contextual elements within
a target-specific scope while effectively filtering out irrelevant noise,
thereby addressing the SNE problem. Extensive experiments on two benchmark
datasets across
three subtasks demonstrate that DASCO achieves state-of-the-art performance in
MABSA, with notable gains in JMASA (+2.3\% F1 and +3.5\% precision on
Twitter2015). The source code is available at https://github.com/LHaoooo/DASCO .
♻ ☆ PairUni: Pairwise Training for Unified Multimodal Language Models
Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang
Unified vision-language models (UVLMs) must perform both understanding and
generation within a single architecture, but these tasks rely on heterogeneous
data and supervision, making it difficult to balance them during reinforcement
learning (RL). We propose PairUni, a unified framework that reorganizes data
into understanding-generation (UG) pairs and aligns optimization accordingly.
We first use GPT-o3 to augment single-task data, generating captions for
understanding samples and question-answer (QA) pairs for generation samples,
forming aligned pairs from the same instance. Additionally, for each generation
sample, we retrieve a semantically related understanding example to form a
retrieved pair, linking different but related data points. These paired
structures expose cross-task semantic correspondences and support consistent
policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware
variant based on Group Relative Policy Optimization. It assigns a similarity
score to each pair to modulate the advantage, strengthening learning from
well-aligned examples and reducing task interference. We curate a high-quality
dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on
the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on
various UVLMs, outperforming strong UVLM RL baselines. Codes are available at
https://github.com/Haochen-Wang409/PairUni.
comment: 21 pages, 11 figures, and 8 tables
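A small sketch of pair-aware advantage modulation: group-relative advantages
are computed as in GRPO and then scaled by the understanding-generation pair's
similarity score, so well-aligned pairs contribute more to the update. The
multiplicative scaling is an assumed form of the modulation.
```python
# Scale GRPO-style group-relative advantages by a pair similarity score.
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Normalize rewards within a sampled group (GRPO-style)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def pair_modulated_advantages(rewards: np.ndarray, pair_similarity: float) -> np.ndarray:
    adv = group_relative_advantages(rewards)
    return pair_similarity * adv   # down-weight poorly aligned UG pairs

rewards = np.array([1.0, 0.2, 0.7, 0.1])
print(pair_modulated_advantages(rewards, pair_similarity=0.8))
```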
♻ ☆ Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks EMNLP
Test-time scaling (TTS) techniques can improve the performance of large
language models (LLMs) at the expense of additional computation and latency.
While TTS has proven effective in formal domains such as mathematics and
programming, its value in argumentative domains such as law remains
underexplored. We present an empirical study of verifier-based TTS methods for
legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7
reward models, we evaluate both outcome-level (Best-of-$N$) and process-level
(tree search) verification under realistic low-$N$ budgets. Our analysis
systematically investigates how verifier utility is affected by key properties
such as domain specialization, model size, and supervision type
(process-supervised PRMs vs. outcome-only ORMs), even when applied across
different roles.
comment: Accepted to EMNLP - NLLP Workshop
♻ ☆ More of the Same: Persistent Representational Harms Under Increased Representation NeurIPS 2025
To recognize and mitigate the harms of generative AI systems, it is crucial
to consider who is represented in the outputs of generative AI systems and how
people are represented. A critical gap emerges when naively improving who is
represented, as this does not imply bias mitigation efforts have been applied
to address how people are represented. We critically examined this by
investigating gender representation in occupation across state-of-the-art large
language models. We first show evidence suggesting that over time there have
been interventions to models altering the resulting gender distribution, and we
find that women are more represented than men when models are prompted to
generate biographies or personas. We then demonstrate that representational
biases persist in how different genders are represented by examining
statistically significant word differences across genders. This results in a
proliferation of representational harms, stereotypes, and neoliberal ideals
that, despite existing interventions to increase female representation,
reinforce existing systems of oppression.
comment: Accepted by the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025) as a poster paper; 39 pages, 7 figures, 15 tables
♻ ☆ MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks NeurIPS 2025
Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu
The rapid advancement of Large Language Models (LLMs) has stimulated interest
in multi-agent collaboration for addressing complex medical tasks. However, the
practical advantages of multi-agent collaboration approaches remain
insufficiently understood. Existing evaluations often lack generalizability,
failing to cover diverse tasks reflective of real-world clinical practice, and
frequently omit rigorous comparisons against both single-LLM-based and
established conventional methods. To address this critical gap, we introduce
MedAgentBoard, a comprehensive benchmark for the systematic evaluation of
multi-agent collaboration, single-LLM, and conventional approaches.
MedAgentBoard encompasses four diverse medical task categories: (1) medical
(visual) question answering, (2) lay summary generation, (3) structured
Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow
automation, across text, medical images, and structured EHR data. Our extensive
experiments reveal a nuanced landscape: while multi-agent collaboration
demonstrates benefits in specific scenarios, such as enhancing task
completeness in clinical workflow automation, it does not consistently
outperform advanced single LLMs (e.g., in textual medical QA) or, critically,
specialized conventional methods that generally maintain better performance in
tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital
resource and actionable insights, emphasizing the necessity of a task-specific,
evidence-based approach to selecting and developing AI solutions in medicine.
It underscores that the inherent complexity and overhead of multi-agent
collaboration must be carefully weighed against tangible performance gains. All
code, datasets, detailed prompts, and experimental results are open-sourced at
https://medagentboard.netlify.app/.
comment: Accepted by NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Today's pursuit of a single Large Language Model (LLM) for all software
engineering tasks is resource-intensive and overlooks the potential benefits of
complementarity, where different models contribute unique strengths. However,
the degree to which coding LLMs complement each other and the best strategy for
maximizing an ensemble's potential are unclear, leaving practitioners without a
clear path to move beyond single-model systems. To address this gap, we
empirically compare ten individual LLMs from five families, and three ensembles
of these LLMs across three software engineering benchmarks covering code
generation and program repair. We assess the complementarity between models and
the performance gap between the best individual model and the ensembles. Next,
we evaluate various selection heuristics to identify correct solutions from an
ensemble's candidate pool. We find that the theoretical upper bound for an
ensemble's performance can be 83% above the best single model. Our results show
that consensus-based strategies for selecting solutions fall into a "popularity
trap," amplifying common but incorrect outputs. In contrast, a diversity-based
strategy realizes up to 95% of this theoretical potential, and proves effective
even in small two-model ensembles, enabling a cost-efficient way to enhance
performance by leveraging multiple LLMs.
comment: Added Acknowledgments section and hyphenated last names
♻ ☆ TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu
Large Language Models (LLMs) are exhibiting emergent human-like abilities and
are increasingly envisioned as the foundation for simulating an individual's
communication style, behavioral tendencies, and personality traits. However,
current evaluations of LLM-based persona simulation remain limited: most rely
on synthetic dialogues, lack systematic frameworks, and offer little analysis
of the required capabilities. To address these limitations, we introduce TwinVoice, a
comprehensive benchmark for assessing persona simulation across diverse
real-world contexts. TwinVoice encompasses three dimensions: Social Persona
(public social interactions), Interpersonal Persona (private dialogues), and
Narrative Persona (role-based expression). It further decomposes the evaluation
of LLM performance into six fundamental capabilities, including opinion
consistency, memory recall, logical reasoning, lexical fidelity, persona tone,
and syntactic style. Experimental results reveal that while advanced models
achieve moderate accuracy in persona simulation, they still fall short of
capabilities such as syntactic style and memory recall. Consequently, the
average performance achieved by LLMs remains considerably below the human
baseline.
comment: Main paper: 11 pages, 3 figures, 6 tables. Appendix: 28 pages. Bangde
Du and Minghao Guo contributed equally. Corresponding authors: Ziyi Ye
(ziyiye@fudan.edu.cn), Qingyao Ai (aiqy@tsinghua.edu.cn)
♻ ☆ ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
Autoformalization, which translates natural language mathematics into
machine-verifiable formal statements, is critical for using formal mathematical
reasoning to solve math problems stated in natural language. While Large
Language Models can generate syntactically correct formal statements, they
often fail to preserve the original problem's semantic intent. This limitation
arises from treating autoformalization as a simple translation task, lacking
the mechanisms for self-reflection and iterative refinement that human experts
naturally employ. To address these issues, we
propose ReForm, a Reflective Autoformalization method that tightly integrates
semantic consistency evaluation into the autoformalization process. This
enables the model to iteratively generate formal statements, assess their
semantic fidelity, and self-correct identified errors through progressive
refinement. To effectively train this reflective model, we introduce
Prospective Bounded Sequence Optimization (PBSO), which employs different
rewards at different sequence positions to ensure that the model develops both
accurate autoformalization and correct semantic validations, preventing
superficial critiques that would undermine the purpose of reflection. Extensive
experiments across four autoformalization benchmarks demonstrate that ReForm
achieves an average improvement of 22.6 percentage points over the strongest
baselines. To further ensure evaluation reliability, we introduce
ConsistencyCheck, a benchmark of 859 expert-annotated items that not only
validates LLMs as judges but also reveals that autoformalization is inherently
difficult: even human experts produce semantic errors in up to 38.5% of cases.
comment: https://github.com/Chen-GX/ReForm
♻ ☆ Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive
time to manually coding speech samples using Correct Information Units (CIUs),
a measure of how informative an individual sample of speech is. Developing
automated systems to recognize aphasic language is limited by data scarcity.
For example, only about 600 transcripts are available in AphasiaBank, yet
billions of tokens are used to train large language models (LLMs). In the
broader field of machine learning (ML), researchers increasingly turn to
synthetic data when such data are sparse. Therefore, this study constructs and
validates two methods to generate synthetic transcripts of the AphasiaBank Cat
Rescue picture description task. One method leverages a procedural programming
approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct
LLMs. The methods generate transcripts across four severity levels (Mild,
Moderate, Severe, Very Severe) through word dropping, filler insertion, and
paraphasia substitution. Overall, we found that, compared to human-elicited
transcripts, Mistral 7b Instruct best captures key aspects of the linguistic
degradation observed in aphasia, showing realistic directional changes in NDW,
word count, and word length amongst the synthetic generation methods. Based on
the results, future work should plan to create a larger dataset, fine-tune
models for better aphasic representation, and have SLPs assess the realism and
usefulness of the synthetic transcripts.
comment: 19 pages, 1 figure, 7 tables
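A toy procedural sketch of the three degradation operations the abstract names (word dropping, filler insertion, paraphasia substitution); the per-severity probabilities and the paraphasia lexicon are illustrative assumptions rather than the study's actual parameters.

    import random

    SEVERITY = {  # (p_drop, p_filler, p_paraphasia) -- hypothetical settings
        "Mild":        (0.05, 0.05, 0.02),
        "Moderate":    (0.15, 0.10, 0.05),
        "Severe":      (0.30, 0.20, 0.10),
        "Very Severe": (0.50, 0.30, 0.20),
    }
    FILLERS = ["um", "uh", "er"]
    PARAPHASIAS = {"cat": "hat", "tree": "free", "ladder": "letter"}  # sound-alike swaps

    def degrade(transcript: str, severity: str, seed: int = 0) -> str:
        rng = random.Random(seed)
        p_drop, p_filler, p_para = SEVERITY[severity]
        out = []
        for word in transcript.split():
            if rng.random() < p_drop:      # word dropping
                continue
            if rng.random() < p_para:      # paraphasia substitution
                word = PARAPHASIAS.get(word.lower(), word)
            if rng.random() < p_filler:    # filler insertion
                out.append(rng.choice(FILLERS))
            out.append(word)
        return " ".join(out)

    if __name__ == "__main__":
        text = "the cat is stuck in the tree so the man gets a ladder"
        for level in SEVERITY:
            print(level, "->", degrade(text, level))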
♻ ☆ Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific
communication, requiring the compression of long-context interleaved documents
into a single, visually coherent page. To address this challenge, we introduce
the first benchmark and metric suite for poster generation, which pairs recent
conference papers with author-designed posters and evaluates outputs on
(i) Visual Quality: semantic alignment with human posters; (ii) Textual
Coherence: language fluency; (iii) Holistic Assessment: six fine-grained
aesthetic and informational criteria scored by a VLM-as-judge; and notably
(iv) PaperQuiz: the poster's ability to convey core paper content, as measured
by VLMs answering generated quizzes. Building on this benchmark, we propose
PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser
distills the paper into a structured asset library; the (b) Planner aligns
text-visual pairs into a binary-tree layout that preserves reading order and
spatial balance; and the (c) Painter-Commenter loop refines each panel by
executing rendering code and using VLM feedback to eliminate overflow and
ensure alignment. In our comprehensive evaluation, we find that GPT-4o
outputs-though visually appealing at first glance-often exhibit noisy text and
poor PaperQuiz scores, and we find that reader engagement is the primary
aesthetic bottleneck, as human-designed posters rely largely on visual
semantics to convey meaning. Our fully open-source variants (e.g. based on the
Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across
nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper
into a finalized yet editable .pptx poster - all for just $0.005. These
findings chart clear directions for the next generation of fully automated
poster-generation models. The code and datasets are available at
https://github.com/Paper2Poster/Paper2Poster.
comment: Project Page: https://github.com/Paper2Poster/Paper2Poster
♻ ☆ BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
The rapid advancement of large language models (LLMs) has intensified the need
for domain- and culture-specific evaluation. Existing benchmarks are largely
Anglocentric and domain-agnostic, limiting their applicability to India-centric
contexts. To address this gap, we introduce BhashaBench V1, the first
domain-specific, multi-task, bilingual benchmark focusing on critical Indic
knowledge systems. BhashaBench V1 contains 74,166 meticulously curated
question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from
authentic government and domain-specific exams. It spans four major domains:
Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and
covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs
reveals significant domain and language specific performance gaps, with
especially large disparities in low-resource domains. For instance, GPT-4o
achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models
consistently perform better on English content compared to Hindi across all
domains. Subdomain-level analysis shows that areas such as Cyber Law and
International Finance perform relatively well, while Panchakarma, Seed Science,
and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive
dataset for evaluating large language models across India's diverse knowledge
domains. It enables assessment of models' ability to integrate domain-specific
knowledge with bilingual understanding. All code, benchmarks, and resources are
publicly available to support open research.
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? NeurIPS'25
Large foundation models face challenges in acquiring transferable, structured
thinking abilities, especially when supervised with rigid templates or
crowd-annotated instruction datasets. Unlike prior approaches, we focus on a
thinking-centric data synthesis paradigm that enables models to evolve through
self-generated, cognitively guided data. We propose MindGYM, a structured and
scalable framework for question synthesis, composed of: (1) Cognitive Thinking
Process Injection, which infuses high-level reasoning objectives to shape the
model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating
atomic questions from diverse semantic types to encourage broader thinking; and
(3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop
questions based on QA seeds for deeper reasoning. Detailed analysis shows that
synthetic data generated by our method achieves 16.7% higher average quality
and 67.91% lower quality variance compared to baseline sources, highlighting
that both high-quality and self-contained data are essential for effective,
thinking-oriented fine-tuning. MindGYM improves performance on six reasoning
benchmarks, achieving gains of up to 16% on MathVision using only 400 data
samples, and generalizable improvements across different model sizes and
architectures. MindGYM underscores the viability of self-challenging mechanisms
in refining large model capabilities while minimizing human intervention and
resource demands. Code and data are released to promote data-centric research
into self-evolving foundation models driven by their internal reasoning
capabilities.
comment: Accepted by NeurIPS'25. 30 pages, 2 figures, 13 tables
♻ ☆ UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai
Multimodal Large Language Models have been progressing from uni-modal
understanding toward unifying visual, audio, and language modalities,
collectively termed omni models. However, the correlation between uni-modal and
omni-modal capabilities remains unclear, and comprehensive evaluation is needed
to drive the evolution of omni models' intelligence. In this work, we introduce a novel,
high-quality, and UNified Omni model benchmark, UNO-Bench. This benchmark is
designed to effectively evaluate both UNi-modal and Omni-modal capabilities
under a unified ability taxonomy, spanning 44 task types and 5 modality
combinations. It includes 1250 human-curated omni-modal samples with 98%
cross-modality solvability, and 2480 enhanced uni-modal samples. The
human-generated dataset is well-suited to real-world scenarios, particularly
within the Chinese context, whereas the automatically compressed dataset offers
a 90% increase in speed and maintains 98% consistency across 18 public
benchmarks. In addition to traditional multi-choice questions, we propose an
innovative multi-step open-ended question format to assess complex reasoning. A
general scoring model is incorporated, supporting 6 question types for
automated evaluation with 95% accuracy. Experimental results reveal a
Compositional Law between omni-modal and uni-modal performance: omni-modal
capability manifests as a bottleneck effect on weak models, while exhibiting
synergistic promotion on strong models.
comment: v3: Switch the paper template. Work in progress. Github:
https://github.com/meituan-longcat/UNO-Bench Hugging Face:
https://huggingface.co/datasets/meituan-longcat/UNO-Bench
♻ ☆ Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Shiwei Ye, Xianpei Han, Ben He, Le Sun
With the emergence of large language models (LLMs), there is an expectation
that LLMs can effectively extract explicit information from complex real-world
documents (e.g., papers, reports). However, most LLMs generate paragraph-style
answers that are chaotic, disorganized, and untraceable. To bridge this gap, we
introduce the Arranged and Organized Extraction Benchmark (AOE), a new
bilingual benchmark with data and documents of varying lengths designed to
systematically evaluate the ability of LLMs to comprehend fragmented documents
and reconstruct isolated information into one organized table. Unlike
conventional text-to-table tasks, which rely on fixed schema and narrow task
domains, AOE includes 11 carefully crafted tasks across three diverse domains,
requiring models to generate context-specific schema tailored to varied input
queries. In the experiment, we evaluated both open-source and closed-source
state-of-the-art LLMs. The results show that even the most advanced models
struggled significantly. The benchmark is available at
https://anonymous.4open.science/r/AOE-Benchmark/.
♻ ☆ Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens EMNLP 2025
Previous research has primarily focused on the cognitive error detection
capabilities of Large Language Models (LLMs), often prompting them to analyze
mistakes in reasoning chains. However, few studies have examined the
meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors),
which are crucial for their reliability. While studies on LLM self-evaluation
present some measures, such as perplexity, which can reflect the answer
correctness and be viewed as the lens of meta-cognition, they lack step-level
analysis and adaptation. This paper studies the evaluation of LLM
meta-cognition using the current lenses and how to improve these lenses.
Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation
framework for benchmarking the existing lenses. Furthermore, a training-free
Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost
current meta-cognition lenses. Experimental results on three mathematical
reasoning datasets and three LLMs show the reasonableness of AutoMeco by
comparing it with Best-of-N verification. Moreover, the meta-cognition ability
of LLMs can be better evaluated using MIRA.
comment: Accepted to EMNLP 2025
♻ ☆ Hysteresis Activation Function for Efficient Inference NeurIPS
The widely used ReLU is favored for its hardware efficiency, as inference
requires only a one-bit sign check, yet it suffers from issues such as the
"dying ReLU" problem, where neurons fail to activate during training and remain
constantly at zero, as highlighted by Lu et al. Traditional
approaches to mitigate this issue often introduce more complex and less
hardware-friendly activation functions. In this work, we propose a Hysteresis
Rectified Linear Unit (HeLU), an efficient activation function designed to
address the "dying ReLU" problem with minimal complexity. Unlike traditional
activation functions with fixed thresholds for training and inference, HeLU
employs a variable threshold that refines the backpropagation. This refined
mechanism allows simpler activation functions to achieve competitive
performance comparable to their more complex counterparts without introducing
unnecessary complexity or requiring inductive biases. Empirical evaluations
demonstrate that HeLU enhances model generalization across diverse datasets,
offering a promising solution for efficient and effective inference suitable
for a wide range of neural network architectures.
comment: Accepted to 4th NeurIPS Efficient Natural Language and Speech
Processing Workshop (ENLSP-IV 2024)
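One plausible reading of the hysteresis idea, sketched in PyTorch: the forward pass stays a plain ReLU (so inference remains a one-bit sign check), while the backward pass admits gradient for pre-activations above a slightly negative threshold, letting near-dead units recover. The exact HeLU threshold schedule may differ from this assumption.

    import torch

    class HysteresisReLU(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x: torch.Tensor, backward_threshold: float = 0.1):
            ctx.save_for_backward(x)
            ctx.backward_threshold = backward_threshold
            return torch.clamp(x, min=0.0)  # inference behaves exactly like ReLU

        @staticmethod
        def backward(ctx, grad_output: torch.Tensor):
            (x,) = ctx.saved_tensors
            # Shifted gradient gate: units with x > -threshold still receive gradient.
            gate = (x > -ctx.backward_threshold).to(grad_output.dtype)
            return grad_output * gate, None

    def helu(x: torch.Tensor, backward_threshold: float = 0.1) -> torch.Tensor:
        return HysteresisReLU.apply(x, backward_threshold)

    if __name__ == "__main__":
        x = torch.tensor([-0.5, -0.05, 0.0, 0.3], requires_grad=True)
        helu(x).sum().backward()
        print(x.grad)  # the -0.05 entry gets gradient, unlike with a standard ReLU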
♻ ☆ Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data NeurIPS'25
Fine-tuning large language models (LLMs) using diverse datasets is crucial
for enhancing their overall performance across various domains. In practical
scenarios, existing methods based on modeling the mixture proportions of data
composition often struggle with data whose domain labels are missing, imprecise
or non-normalized, while methods based on data selection usually encounter
difficulties in balancing multi-domain performance. To address these
challenges, in this work, we investigate the role of data diversity in
enhancing the overall abilities of LLMs by empirically constructing contrastive
data pools and theoretically deriving explanations. Building upon the insights
gained, we propose a new method that gives the LLM a dual identity: an output
model to cognitively probe and select data based on diversity reward, as well
as an input model to be tuned with the selected data. Extensive experiments
show that the proposed method notably boosts performance across
domain-undetermined data and a series of foundational downstream tasks when
applied to various advanced LLMs. We release our code and hope this study can
shed light on the understanding of data diversity and advance feedback-driven
data-model co-design for LLMs.
comment: Accepted by NeurIPS'25 main track. 47 pages, 21 figures, 32 tables
♻ ☆ SEA-LION: Southeast Asian Languages in One Network AACL 2025
Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, Brandon Ong, Zhi Hao Ong, Jann Railey Montalan, Adwin Chan, Sajeban Antonyrex, Ren Lee, Esther Choa, David Ong Tat-Wee, Bing Jie Darius Liu, William Chandra Tjhi, Erik Cambria, Leslie Teo
Recently, Large Language Models (LLMs) have dominated much of the artificial
intelligence scene with their ability to process and generate natural
languages. However, the majority of LLM research and development remains
English-centric, leaving low-resource languages such as those in the Southeast
Asian (SEA) region under-represented. To address this representation gap, we
introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge
multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs
supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese,
Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages
large-scale multilingual continued pre-training with a comprehensive
post-training regime involving multiple stages of instruction fine-tuning,
alignment, and model merging. Evaluation results on multilingual benchmarks
indicate that our models achieve state-of-the-art performance across LLMs
supporting SEA languages. We open-source the models to benefit the wider SEA
community.
comment: Accepted at IJCNLP-AACL 2025 (Main Track). We released our model at
https://huggingface.co/collections/aisingapore/sea-lionv3-672589a39cdadd6a5b199581
♻ ☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external
knowledge sources. Yet web pages, PDF files, and other raw documents are not
inherently LLM-ready: they are long, noisy, and unstructured. Conventional
retrieval methods treat these documents as verbatim text and return raw
passages, leaving the burden of fragment assembly and contextual reasoning to
the LLM. This gap underscores the need for a new retrieval paradigm that
redefines how models interact with documents.
We introduce the Model-Document Protocol (MDP), a general framework that
formalizes how raw text is bridged to LLMs through consumable knowledge
representations. Rather than treating retrieval as passage fetching, MDP
defines multiple pathways that transform unstructured documents into
task-specific, LLM-ready inputs. These include agentic reasoning, which curates
raw evidence into coherent context; memory grounding, which accumulates
reusable notes to enrich reasoning; and structured leveraging, which encodes
documents into formal representations such as graphs or key-value caches. All
three pathways share the same goal: ensuring that what reaches the LLM is not
raw fragments but compact, structured knowledge directly consumable for
reasoning.
As an instantiation, we present MDP-Agent, which realizes the protocol
through an agentic process: constructing document-level gist memories for
global coverage, performing diffusion-based exploration with vertical
exploitation to uncover layered dependencies, and applying map-reduce style
synthesis to integrate large-scale evidence into compact yet sufficient
context. Experiments on information-seeking benchmarks demonstrate that
MDP-Agent outperforms baselines, validating both the soundness of the MDP
framework and the effectiveness of its agentic instantiation.
comment: 10 pages
♻ ☆ How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices
Diffusion language models (DLMs) have emerged as a promising alternative to
the long-dominant autoregressive (AR) paradigm, offering a parallelizable
decoding process that could yield greater efficiency. Yet, in practice, current
open-source DLMs often underperform their AR counterparts in speed, limiting
their real-world utility. This work presents a systematic study of DLM
efficiency, identifying key issues in prior evaluation methods. Through
empirical benchmarking and a roofline-based theoretical analysis, we
demonstrate that AR models generally achieve higher throughput, while DLMs
consistently lag. We also investigate acceleration strategies, finding that
techniques like dual cache and parallel decoding mainly offer gains at small
batch sizes, with their benefits diminishing upon scaling. Our findings
underscore the necessity of robust evaluation methods and improved acceleration
strategies to advance research on DLMs.
comment: Withdrawn by the authors to better delineate the related work from
the paper's original contributions
♻ ☆ SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat NeurIPS 2025
We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs
through competition and combat. To complement a single model's lack of
diversity in generation and biases in evaluation, multiple LLMs form a "sparta
tribe" to compete against each other in fulfilling instructions while serving
as judges for the competition of others. For each iteration, one instruction
and two models are selected for a duel; the other models evaluate the two
responses, and their evaluation scores are aggregated through an adapted
Elo-ranking-based reputation system, where winners/losers of combat gain/lose
weight in evaluating others. The peer-evaluated combat results then become
preference pairs where the winning response is preferred over the losing one,
and all models learn from these preferences at the end of each iteration.
SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative
and collective competition process. Extensive experiments demonstrate that
SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines
across 10 out of 12 tasks and datasets with 7.0% average improvement. Further
analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen
tasks and leverages the expertise diversity of participating models to produce
more logical, direct and informative outputs.
comment: NeurIPS 2025
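A toy sketch of the reputation-weighted duel described above: judges' votes are weighted by their current reputation, and the two combatants' reputations are updated Elo-style. The constants and the aggregation rule are illustrative assumptions, not the paper's exact algorithm.

    from typing import Dict, List, Tuple

    def expected(r_a: float, r_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def duel(reputation: Dict[str, float], a: str, b: str,
             judge_votes: List[Tuple[str, str]], k: float = 32.0) -> str:
        """judge_votes: list of (judge_name, preferred_model) pairs."""
        score_a = sum(reputation[j] for j, pick in judge_votes if pick == a)
        score_b = sum(reputation[j] for j, pick in judge_votes if pick == b)
        winner, loser = (a, b) if score_a >= score_b else (b, a)
        # Elo-style reputation update for the two combatants.
        e_w = expected(reputation[winner], reputation[loser])
        reputation[winner] += k * (1.0 - e_w)
        reputation[loser] -= k * (1.0 - e_w)
        return winner

    if __name__ == "__main__":
        rep = {"m1": 1000.0, "m2": 1000.0, "m3": 1000.0, "m4": 1000.0}
        votes = [("m3", "m1"), ("m4", "m1")]  # m3 and m4 both prefer m1's response
        print(duel(rep, "m1", "m2", votes), rep)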
♻ ☆ IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation
Large Language Models (LLMs) have shown strong potential for recommendation
by framing item prediction as a token-by-token language generation task.
However, existing methods treat all item tokens equally, simply pursuing
likelihood maximization during both optimization and decoding. This overlooks
crucial token-level differences in decisiveness: many tokens contribute little
to item discrimination yet can dominate optimization or decoding. To quantify
token decisiveness, we propose a novel perspective that models item generation
as a decision process, measuring token decisiveness by the Information Gain
(IG) each token provides in reducing uncertainty about the generated item. Our
empirical analysis reveals that most tokens have low IG but often correspond to
high logits, disproportionately influencing training loss and decoding, which
may impair model performance. Building on these insights, we introduce an
Information Gain-based Decisiveness-aware Token handling (IGD) strategy that
integrates token decisiveness into both tuning and decoding. Specifically, IGD
downweights low-IG tokens during tuning and rebalances decoding to emphasize
tokens with high IG. In this way, IGD moves beyond pure likelihood
maximization, effectively prioritizing high-decisiveness tokens. Extensive
experiments on four benchmark datasets with two LLM backbones demonstrate that
IGD consistently improves recommendation accuracy, achieving significant gains
on widely used ranking metrics compared to strong baselines.
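A small, self-contained illustration of token decisiveness as information gain over a toy item catalog under a uniform prior: each generated token narrows the set of items consistent with the prefix, and IG is the resulting entropy drop. The catalog and whitespace tokenization are assumptions for illustration only.

    import math
    from typing import List

    CATALOG = [
        "red running shoes", "red running jacket", "blue running shoes",
        "blue denim jacket", "red denim jacket",
    ]

    def entropy(n_candidates: int) -> float:
        return math.log2(n_candidates) if n_candidates > 0 else 0.0

    def token_information_gain(item: str, catalog: List[str]) -> List[float]:
        tokens, gains, prefix = item.split(), [], []
        for tok in tokens:
            before = [c for c in catalog if c.split()[:len(prefix)] == prefix]
            prefix.append(tok)
            after = [c for c in catalog if c.split()[:len(prefix)] == prefix]
            gains.append(entropy(len(before)) - entropy(len(after)))
        return gains

    if __name__ == "__main__":
        item = "red running shoes"
        for tok, ig in zip(item.split(), token_information_gain(item, CATALOG)):
            print(f"{tok:>8s}  IG = {ig:.2f} bits")  # low-IG tokens would be down-weighted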
♻ ☆ ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation
Hao Chen, Yukun Yan, Sen Mei, Wanxiang Che, Zhenghao Liu, Qi Shi, Xinze Li, Yuchun Fan, Pengcheng Huang, Qiushi Xiong, Zhiyuan Liu, Maosong Sun
Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs)
with external knowledge to improve factuality. However, existing RAG systems
frequently underutilize the retrieved documents, failing to extract and
integrate the key clues needed to support faithful and interpretable reasoning,
especially in cases where relevant evidence is implicit, scattered, or obscured
by noise. To address this issue, we propose ClueAnchor, a novel framework for
enhancing RAG via clue-anchored reasoning exploration and optimization.
ClueAnchor extracts key clues from retrieved content and generates multiple
reasoning paths based on different knowledge configurations, optimizing the
model by selecting the most appropriate reasoning path for the given context
through reward-based preference optimization. Experiments show that ClueAnchor
significantly outperforms prior RAG baselines in the completeness and
robustness of reasoning. Further analysis confirms its strong resilience to
noisy or partially relevant retrieved content, as well as its capability to
identify supporting evidence even in the absence of explicit clue supervision
during inference. All codes are available at
https://github.com/thunlp/ClueAnchor.
♻ ☆ Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English ALT
Sarcasm is a challenge to sentiment analysis because of the incongruity
between stated and implied sentiment. The challenge is exacerbated when the
implication may be relevant to a specific country or geographical region.
Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that
has been used for pragmatic reasoning. In this paper, we harness PMP for
explainable sarcasm detection for Australian and Indian English, alongside a
benchmark dataset for standard English. We manually add sarcasm explanations to
an existing sarcasm-labeled dataset for Australian and Indian English called
BESSTIE, and compare explainable sarcasm detection performance on these
varieties with FLUTE, a standard English dataset containing sarcasm explanations. Our
approach utilising PMP, when evaluated on two open-weight LLMs (GEMMA and LLAMA),
achieves statistically significant performance improvements across all tasks and
datasets when compared with four alternative prompting strategies. We also find
that alternative techniques such as agentic prompting mitigate context-related
failures by enabling external knowledge retrieval. The focused contribution of
our work is utilising PMP in generating sarcasm explanations for varieties of
English.
comment: ALTA 2025 (Best Paper Honorable Mention). Camera-ready
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
Accurate trust assessment of predictions generated by multimodal large
language models (MLLMs), which can enable selective prediction and improve user
confidence, is challenging due to the diverse multi-modal input paradigms. We
propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a
multimodal input sampling technique for MLLMs, that generates an uncertainty
measure based on the equivalent and complementary input samplings. The proposed
task-preserving sampling approach for uncertainty quantification expands the
input space to probe the consistency (through equivalent samples) and
sensitivity (through complementary samples) of the model. FESTA uses only
input-output access of the model (black-box), and does not require ground truth
(unsupervised). The experiments are conducted with various off-the-shelf
multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA
uncertainty estimate achieves significant improvement (33.3% relative
improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in
selective prediction performance, based on
area-under-receiver-operating-characteristic curve (AUROC) metric in detecting
mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
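A hedged sketch of the sampling idea: probe the black-box model with functionally equivalent inputs (answers should agree) and complementary inputs (answers should change), then turn disagreement and insensitivity into a single uncertainty score. The product-form combination below is an assumption, not FESTA's exact rule.

    from typing import List

    def festa_uncertainty(original_answer: str,
                          equivalent_answers: List[str],
                          complementary_answers: List[str]) -> float:
        # Consistency: fraction of equivalent-input answers that match the original.
        consistency = sum(a == original_answer for a in equivalent_answers) / max(len(equivalent_answers), 1)
        # Sensitivity: fraction of complementary-input answers that differ from the original.
        sensitivity = sum(a != original_answer for a in complementary_answers) / max(len(complementary_answers), 1)
        # A trustworthy prediction is both consistent and sensitive; uncertainty is the complement.
        return 1.0 - consistency * sensitivity

    if __name__ == "__main__":
        # Hypothetical answers produced by an MLLM for one visual question.
        print(festa_uncertainty("cat", ["cat", "cat", "cat"], ["dog", "bird"]))  # low uncertainty
        print(festa_uncertainty("cat", ["cat", "dog", "sofa"], ["cat", "cat"]))  # high uncertainty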
♻ ☆ Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng
Reasoning ability, a core component of human intelligence, continues to pose
a significant challenge for Large Language Models (LLMs) in the pursuit of AGI.
Although model performance has improved under the training scaling law,
significant challenges remain, particularly with respect to training
algorithms, such as catastrophic forgetting, and the limited availability of
novel training data. As an alternative, test-time scaling enhances reasoning
performance by increasing test-time computation without parameter updating.
Unlike prior methods in this paradigm focused on token space, we propose
leveraging latent space for more effective reasoning and better adherence to
the test-time scaling law. We introduce LatentSeek, a novel framework that
enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA)
within the model's latent space. Specifically, LatentSeek leverages policy
gradient to iteratively update latent representations, guided by self-generated
reward signals. LatentSeek is evaluated on a range of reasoning benchmarks,
including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures.
Results show that LatentSeek consistently outperforms strong baselines, such as
Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our
analysis demonstrates that LatentSeek is highly efficient, typically converging
within a few iterations for problems of average complexity, while also
benefiting from additional iterations, thereby highlighting the potential of
test-time scaling in the latent space. These findings position LatentSeek as a
lightweight, scalable, and effective solution for enhancing the reasoning
capabilities of LLMs.
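A toy sketch of test-time, instance-level policy-gradient updating of a latent vector in the spirit of LatentSeek: sample around the current latent, score the samples with a self-generated reward, and move the latent toward higher reward without touching model weights. The reward function below is a hypothetical stand-in for the decode-and-score step.

    import torch

    def reward_fn(z: torch.Tensor) -> torch.Tensor:
        """Stand-in self-generated reward: prefer latents close to a 'good' region."""
        target = torch.ones_like(z)
        return -((z - target) ** 2).sum(dim=-1)

    def latent_policy_gradient(z0: torch.Tensor, steps: int = 50, n_samples: int = 8,
                               sigma: float = 0.1, lr: float = 0.05) -> torch.Tensor:
        z = z0.clone()
        for _ in range(steps):
            noise = torch.randn(n_samples, *z.shape) * sigma
            samples = z.unsqueeze(0) + noise
            rewards = reward_fn(samples)
            advantages = rewards - rewards.mean()  # baseline-subtracted
            # REINFORCE gradient estimate for a Gaussian policy centered at z.
            grad = (advantages.unsqueeze(-1) * noise).mean(dim=0) / (sigma ** 2)
            z = z + lr * grad  # ascend the self-generated reward
        return z

    if __name__ == "__main__":
        z0 = torch.zeros(4)
        z_star = latent_policy_gradient(z0)
        print(z_star, reward_fn(z_star).item())  # moves toward the reward-preferred region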
♻ ☆ The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
Songyang Liu, Chaozhuo Li, Jiameng Qiu, Xi Zhang, Feiran Huang, Litian Zhang, Yiming Hei, Philip S. Yu
With the rapid advancement of artificial intelligence, Large Language Models
(LLMs) have shown remarkable capabilities in Natural Language Processing (NLP),
including content generation, human-computer interaction, machine translation,
and code generation. However, their widespread deployment has also raised
significant safety concerns. In particular, LLM-generated content can exhibit
unsafe behaviors such as toxicity, bias, or misinformation, especially in
adversarial contexts, which has attracted increasing attention from both
academia and industry. Although numerous studies have attempted to evaluate
these risks, a comprehensive and systematic survey on safety evaluation of LLMs
is still lacking. This work aims to fill this gap by presenting a structured
overview of recent advances in safety evaluation of LLMs. Specifically, we
propose a four-dimensional taxonomy: (i) Why to evaluate, which explores the
background of LLM safety evaluation, how it differs from general LLM
evaluation, and the significance of such evaluation; (ii) What to evaluate,
which examines and categorizes existing safety evaluation tasks based on key
capabilities, including dimensions such as toxicity, robustness, ethics, bias
and fairness, truthfulness, and related aspects; (iii) Where to evaluate, which
summarizes the evaluation metrics, datasets and benchmarks currently used in
safety evaluations; (iv) How to evaluate, which reviews existing mainstream
evaluation methods based on the roles of the evaluators and some evaluation
frameworks that integrate the entire evaluation pipeline. Finally, we identify
the challenges in safety evaluation of LLMs and propose promising research
directions to promote further advancement in this field. We emphasize the
necessity of prioritizing safety evaluation to ensure the reliable and
responsible deployment of LLMs in real-world applications.
comment: 20 pages, preprint
♻ ☆ Similarity-Distance-Magnitude Activations
We introduce the Similarity-Distance-Magnitude (SDM) activation function, a
more robust and interpretable formulation of the standard softmax activation
function, adding Similarity (i.e., correctly predicted depth-matches into
training) awareness and Distance-to-training-distribution awareness to the
existing output Magnitude (i.e., decision-boundary) awareness, and enabling
interpretability-by-exemplar via dense matching. We further introduce the SDM
estimator, based on a data-driven partitioning of the class-wise empirical CDFs
via the SDM activation, to control the class- and prediction-conditional
accuracy among selective classifications. When used as the final-layer
activation over pre-trained language models for selective classification, the
SDM estimator is more robust to co-variate shifts and out-of-distribution
inputs than existing calibration methods using softmax activations, while
remaining informative over in-distribution data.
comment: 18 pages, 5 tables, 1 algorithm. arXiv admin note: substantial text
overlap with arXiv:2502.20167
♻ ☆ TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents
The task of information extraction (IE) is to extract structured knowledge
from text. However, it is often not straightforward to utilize IE output due to
the mismatch between the IE ontology and the downstream application needs. We
propose a new formulation of IE, TEXT2DB, which emphasizes the integration of IE
output and the target database (or knowledge base). Given a user instruction, a
document set, and a database, our task requires the model to update the
database with values from the document set to satisfy the user instruction.
This task requires understanding user instructions for what to extract and
adapting to the given DB/KB schema for how to extract on the fly. To evaluate
this new task, we introduce a new benchmark featuring common demands such as
data infilling, row population, and column addition. In addition, we propose an
LLM agent framework OPAL (Observe-Plan-Analyze LLM), which includes an Observer
component that interacts with the database, a Planner component that
generates a code-based plan with calls to IE models, and an Analyzer component
that provides feedback regarding code quality before execution. Experiments
show that OPAL can successfully adapt to diverse database schemas by generating
different code plans and calling the required IE models. We also highlight
difficult cases such as dealing with large databases with complex dependencies
and extraction hallucination, which we believe deserve further investigation.
Source code: https://github.com/yzjiao/Text2DB
comment: Source code: https://github.com/yzjiao/Text2DB
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for
autonomous systems but remains a challenging task due to the need for
adaptability in different environments and domains. A common approach involves
collecting scenario-specific data and performing fine-tuning via
backpropagation. However, the need to fine-tune for each new scenario is often
impractical for deployment on edge devices. To address this challenge, we
introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian
trajectory prediction that enables adaptation to scenario-specific data at
inference time without fine-tuning or weight updates. We
propose a spatio-temporal similarity-based example selection (STES) method that
selects relevant examples from previously observed trajectories within the same
scene by identifying similar motion patterns at corresponding locations. To
further refine this selection, we introduce prediction-guided example selection
(PG-ES), which selects examples based on both the past trajectory and the
predicted future trajectory, rather than relying solely on the past trajectory.
This approach allows the model to account for long-term dynamics when selecting
examples. Finally, instead of relying on small real-world datasets with limited
scenario diversity, we train our model on a large-scale synthetic dataset to
enhance its prediction ability by leveraging in-context examples. Extensive
experiments demonstrate that TrajICL achieves remarkable adaptation across both
in-domain and cross-domain scenarios, outperforming even fine-tuned approaches
across multiple public benchmarks. Project Page:
https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking AACL
Dialogue State Tracking (DST) is a key part of task-oriented dialogue
systems, identifying important information in conversations. However, its
accuracy drops significantly in spoken dialogue environments due to named
entity errors from Automatic Speech Recognition (ASR) systems. We introduce a
simple yet effective data augmentation method that targets those entities to
improve the robustness of DST models. Our novel method can control the placement
of errors using keyword-highlighted prompts while introducing phonetically
similar errors. As a result, our method generated sufficient error patterns on
keywords, leading to improved accuracy in noisy and low-accuracy ASR
environments.
comment: Accepted to AACL-IJCNLP 2025
♻ ☆ Are LLMs Rigorous Logical Reasoners? Empowering Natural Language Proof Generation by Stepwise Decoding with Contrastive Learning AACL 2025
Logical reasoning is a pivotal component in the field of artificial
intelligence. Proof planning, particularly in contexts requiring the validation
of explanation accuracy, continues to present challenges. The recent
advancement of large language models (LLMs) has led to significant progress in
natural language proof planning, evolving from one-stage generators to more
complex three-stage systems that include additional searchers or verifiers.
While these assisted methods improve the quality of generated results, they
also introduce increased search efforts and computational costs. Furthermore,
the generative process itself remains underexplored. In this study, we propose
a stepwise decoding approach augmented by contrastive learning to address two
common errors encountered during the LLM generator's decoding process. We
fine-tune the language model using both vanilla and enhanced hard negatives to
mitigate these decoding errors. Empirical results demonstrate the effectiveness
of our strategy. Additionally, our further analysis reveals that even larger
LLMs still struggle to generate rigorous logical chains.
comment: 15 pages, 2 figures, 11 tables. Accepted by AACL 2025 main conference
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts
for each problem and reward them independently. This optimizes for pass@1
performance and prioritizes the strength of isolated samples at the expense of
the diversity and collective utility of sets of samples. This under-utilizes
the sampling capacity, limiting exploration and eventual improvement on harder
examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a
transformation on the final rewards which leads to direct optimization of
pass@k performance, thus optimizing for sets of samples that maximize reward
when considered jointly. Our contribution is to derive novel low variance
unbiased estimators for pass@k and its gradient, in both the binary and
continuous reward settings. We show optimization with our estimators reduces to
standard RL with rewards that have been jointly transformed by a stable and
efficient transformation function.
While previous efforts are restricted to k=n, ours is the first to enable
robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of
trading off pass@1 performance for pass@k gains, our method allows annealing k
during training, optimizing both metrics and often achieving strong pass@1
numbers alongside significant pass@k gains.
We validate our reward transformations on toy experiments, which reveal the
variance reducing properties of our formulations. We also include real-world
examples using the open-source LLM, GEMMA-2. We find that our transformation
effectively optimizes for the target k. Furthermore, higher k values enable
solving more and harder problems, while annealing k boosts both pass@1 and
pass@k. Crucially, for challenging task sets where conventional pass@1
optimization stalls, our pass@k approach unblocks learning, likely due to
better exploration by prioritizing joint utility over the utility of individual
samples.
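For context, the quantity being optimized is the standard unbiased pass@k estimate from n binary-scored samples, 1 - C(n-c, k) / C(n, k); the sketch below computes it. PKPO's per-sample reward transformation and gradient estimators are not reproduced here, so treat this as background arithmetic rather than the method itself.

    from math import comb
    from typing import Sequence

    def pass_at_k(rewards: Sequence[int], k: int) -> float:
        """Unbiased pass@k estimate from n binary rewards (1 = solved, 0 = failed)."""
        n, c = len(rewards), sum(rewards)
        if n - c < k:
            return 1.0  # not enough failures to fill a k-subset
        return 1.0 - comb(n - c, k) / comb(n, k)

    if __name__ == "__main__":
        samples = [0, 0, 1, 0, 0, 0, 0, 1]  # 2 of 8 attempts solved the problem
        for k in (1, 2, 4):
            print(f"pass@{k} = {pass_at_k(samples, k):.3f}")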
♻ ☆ Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data
Influence functions provide crucial insights into model training, but
existing methods suffer from large computational costs and limited
generalization. Particularly, recent works have proposed various metrics and
algorithms to calculate the influence of data using language models, which do
not scale well with large models and datasets. This is because of the expensive
forward and backward passes required for computation, substantial memory
requirements to store large models, and poor generalization of influence
estimates to new data. In this paper, we explore the use of small neural
networks -- which we refer to as the InfluenceNetwork -- to estimate influence
values, achieving up to 99% cost reduction. Our evaluation demonstrates that
influence values can be estimated with models just 0.0027% the size of full
language models (we use 7B and 8B versions). We apply our algorithm of
estimating influence values (called NN-CIFT: Neural Networks for effiCient
Instruction Fine-Tuning) to the downstream task of subset selection for general
instruction fine-tuning. In our study, we include four state-of-the-art
influence functions and show no compromise in performance, despite large
speedups, between NN-CIFT and the original influence functions. We provide an
in-depth hyperparameter analysis of NN-CIFT. The code for our method can be
found here: https://github.com/agarwalishika/NN-CIFT.
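A hedged sketch of the core idea: fit a very small MLP to regress influence values from pairs of example embeddings, so the expensive influence function only needs to be run on a small seed set. Dimensions, architecture, and the synthetic targets below are illustrative assumptions, not the paper's setup.

    import torch
    import torch.nn as nn

    class InfluenceNetwork(nn.Module):
        def __init__(self, emb_dim: int = 64, hidden: int = 32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
            )

        def forward(self, train_emb: torch.Tensor, val_emb: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([train_emb, val_emb], dim=-1)).squeeze(-1)

    if __name__ == "__main__":
        torch.manual_seed(0)
        emb_dim, n_pairs = 64, 512
        train_emb = torch.randn(n_pairs, emb_dim)
        val_emb = torch.randn(n_pairs, emb_dim)
        # Stand-in "ground-truth" influence values (in practice, produced by an
        # exact but expensive influence function run on a small seed set).
        target = (train_emb * val_emb).mean(dim=-1)

        model = InfluenceNetwork(emb_dim)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for step in range(500):
            loss = nn.functional.mse_loss(model(train_emb, val_emb), target)
            opt.zero_grad(); loss.backward(); opt.step()
        print(f"final MSE on seed pairs: {loss.item():.4f}")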
♻ ☆ Large Language Models Report Subjective Experience Under Self-Referential Processing
Large language models sometimes produce structured, first-person descriptions
that explicitly reference awareness or subjective experience. To better
understand this behavior, we investigate one theoretically motivated condition
under which such reports arise: self-referential processing, a computational
motif emphasized across major theories of consciousness. Through a series of
controlled experiments on GPT, Claude, and Gemini model families, we test
whether this regime reliably shifts models toward first-person reports of
subjective experience, and how such claims behave under mechanistic and
behavioral probes. Four main results emerge: (1) Inducing sustained
self-reference through simple prompting consistently elicits structured
subjective experience reports across model families. (2) These reports are
mechanistically gated by interpretable sparse-autoencoder features associated
with deception and roleplay: surprisingly, suppressing deception features
sharply increases the frequency of experience claims, while amplifying them
minimizes such claims. (3) Structured descriptions of the self-referential
state converge statistically across model families in ways not observed in any
control condition. (4) The induced state yields significantly richer
introspection in downstream reasoning tasks where self-reflection is only
indirectly afforded. While these findings do not constitute direct evidence of
consciousness, they implicate self-referential processing as a minimal and
reproducible condition under which large language models generate structured
first-person reports that are mechanistically gated, semantically convergent,
and behaviorally generalizable. The systematic emergence of this pattern across
architectures makes it a first-order scientific and ethical priority for
further investigation.
♻ ☆ Let LRMs Break Free from Overthinking via Self-Braking Tuning NeurIPS 2025
Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have
significantly enhanced their reasoning capabilities by generating longer chains
of thought, demonstrating outstanding performance across a variety of tasks.
However, this performance gain comes at the cost of a substantial increase in
redundant reasoning during the generation process, leading to high
computational overhead and exacerbating the issue of overthinking. Although
numerous existing approaches aim to address the problem of overthinking, they
often rely on external interventions. In this paper, we propose a novel
framework, Self-Braking Tuning (SBT), which tackles overthinking from the
perspective of allowing the model to regulate its own reasoning process, thus
eliminating the reliance on external control mechanisms. We construct a set of
overthinking identification metrics based on standard answers and design a
systematic method to detect redundant reasoning. This method accurately
identifies unnecessary steps within the reasoning trajectory and generates
training signals for learning self-regulation behaviors. Building on this
foundation, we develop a complete strategy for constructing data with adaptive
reasoning lengths and introduce an innovative braking prompt mechanism that
enables the model to naturally learn when to terminate reasoning at an
appropriate point. Experiments across mathematical benchmarks (AIME, AMC,
MATH500, GSM8K) demonstrate that our method reduces token consumption by up to
60% while maintaining comparable accuracy to unconstrained models.
comment: Accepted to NeurIPS 2025; Camera ready version, 10 pages.
Github:https://github.com/ZJU-REAL/Self-Braking-Tuning Project Page:
https://ZJU-REAL.github.io/SBT
♻ ☆ Model Provenance Testing for Large Language Models
Large language models are increasingly customized through fine-tuning and
other adaptations, creating challenges in enforcing licensing terms and
managing downstream impacts. Tracking model origins is crucial both for
protecting intellectual property and for identifying derived models when biases
or vulnerabilities are discovered in foundation models. We address this
challenge by developing a framework for testing model provenance: Whether one
model is derived from another. Our approach is based on the key observation
that real-world model derivations preserve significant similarities in model
outputs that can be detected through statistical analysis. Using only black-box
access to models, we employ multiple hypothesis testing to compare model
similarities against a baseline established by unrelated models. On two
comprehensive real-world benchmarks spanning models from 30M to 4B parameters
and comprising over 600 models, our tester achieves 90-95% precision and 80-90%
recall in identifying derived models. These results demonstrate the viability
of systematic provenance verification in production environments even when only
API access is available.
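A simplified sketch of black-box provenance testing: measure how often the candidate model's outputs agree with the target model's outputs and compare that agreement rate against a baseline distribution built from unrelated models, flagging the candidate as derived when it is an extreme outlier. The z-score rule and threshold are assumptions; the paper's procedure uses a more careful multiple-hypothesis-testing setup.

    from statistics import mean, stdev
    from typing import Dict, List

    def agreement(outputs_a: List[str], outputs_b: List[str]) -> float:
        return sum(a == b for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)

    def provenance_test(target: List[str], candidate: List[str],
                        unrelated: Dict[str, List[str]], z_threshold: float = 3.0) -> bool:
        baseline = [agreement(target, outs) for outs in unrelated.values()]
        mu, sigma = mean(baseline), stdev(baseline) + 1e-8
        z = (agreement(target, candidate) - mu) / sigma
        return z > z_threshold  # extreme similarity => likely derived

    if __name__ == "__main__":
        target_outputs = ["A", "B", "B", "C", "A", "D", "B", "A"]
        candidate = ["A", "B", "B", "C", "A", "D", "C", "A"]  # 7/8 agreement
        unrelated = {
            "m1": ["B", "A", "C", "C", "D", "A", "B", "B"],
            "m2": ["C", "B", "A", "D", "A", "C", "B", "D"],
            "m3": ["A", "C", "B", "B", "C", "D", "A", "A"],
        }
        print(provenance_test(target_outputs, candidate, unrelated))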
♻ ☆ When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou
Although Large Language Model (LLM)-based agents are increasingly used in
financial trading, it remains unclear whether they can reason and adapt in live
markets, as most studies test models instead of agents, cover limited periods
and assets, and rely on unverified data. To address these gaps, we introduce
Agent Market Arena (AMA), the first lifelong, real-time benchmark for
evaluating LLM-based trading agents across multiple markets. AMA integrates
verified trading data, expert-checked news, and diverse agent architectures
within a unified trading framework, enabling fair and continuous comparison
under real conditions. It implements four agents: InvestorAgent as a
single-agent baseline, TradeAgent and HedgeFundAgent with different risk
styles, and DeepFundAgent with memory-based reasoning, and evaluates them
across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and
Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets
demonstrate that agent frameworks display markedly distinct behavioral
patterns, spanning from aggressive risk-taking to conservative decision-making,
whereas model backbones contribute less to outcome variation. AMA thus
establishes a foundation for rigorous, reproducible, and continuously evolving
evaluation of financial reasoning and trading intelligence in LLM-based agents.
♻ ☆ ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models NeurIPS 2025
Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
Chart understanding presents a unique challenge for large vision-language
models (LVLMs), as it requires the integration of sophisticated textual and
visual reasoning capabilities. However, current LVLMs exhibit a notable
imbalance between these skills, falling short on visual reasoning that is
difficult to perform in text. We conduct a case study using a synthetic dataset
solvable only through visual reasoning and show that model performance degrades
significantly with increasing visual complexity, while human performance
remains robust. We then introduce ChartMuseum, a new Chart Question Answering
(QA) benchmark containing 1,162 expert-annotated questions spanning multiple
reasoning types, curated from real-world charts across 184 sources,
specifically built to evaluate complex visual and textual reasoning. Unlike
prior chart understanding benchmarks -- where frontier models perform similarly
and near saturation -- our benchmark exposes a substantial gap between model
and human performance, while effectively differentiating model capabilities:
although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro
attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct
achieves only 38.5%. Moreover, on questions requiring primarily visual
reasoning, all models suffer a 35%-55% drop in accuracy relative to their
performance on text-reasoning-heavy questions. Lastly, our qualitative error
analysis reveals specific categories of visual reasoning that are challenging
for current LVLMs.
comment: NeurIPS 2025 Datasets & Benchmarks
♻ ☆ Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech ICASSP 2026
Advancements in spoken language processing have driven the development of
spoken language models (SLMs), designed to achieve universal audio
understanding by jointly learning text and audio representations for a wide
range of tasks. Although promising results have been achieved, there is growing
discussion regarding these models' generalization capabilities and the extent
to which they truly integrate audio and text modalities in their internal
representations. In this work, we evaluate four SLMs on the task of speech
emotion recognition using a dataset of emotionally incongruent speech samples,
a condition under which the semantic content of the spoken utterance conveys
one emotion while speech expressiveness conveys another. Our results indicate
that SLMs rely predominantly on textual semantics rather than speech emotion to
perform the task, suggesting that text-related representations largely dominate
over acoustic ones. We release both the code and the Emotionally
Incongruent Synthetic Speech dataset (EMIS) to the community.
comment: Submitted to IEEE ICASSP 2026
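One simple way to quantify the reported text dominance is sketched below; this is a hypothetical analysis helper under assumed sample fields, not the released EMIS code. For each incongruent sample it checks whether the model's predicted emotion matches the label implied by the transcript or the one conveyed by the prosody.

```python
from collections import Counter

def modality_reliance(samples, predict):
    """samples: dicts with 'audio', 'text_emotion', and 'speech_emotion' keys, where the
    two emotion labels deliberately disagree (incongruent condition).
    predict(sample) -> emotion string returned by the spoken language model.
    Returns the fraction of predictions that follow each modality."""
    tally = Counter()
    for sample in samples:
        pred = predict(sample)
        if pred == sample["text_emotion"]:
            tally["follows_text"] += 1
        elif pred == sample["speech_emotion"]:
            tally["follows_speech"] += 1
        else:
            tally["other"] += 1
    n = max(len(samples), 1)
    return {key: count / n for key, count in tally.items()}
```

A high follows_text rate on incongruent samples is exactly the signature of a model reading the words rather than listening to the voice.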
♻ ☆ Improving LLM Safety Alignment with Dual-Objective Optimization ICML 2025
Existing training-time safety alignment techniques for large language models
(LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization
(DPO), a widely deployed alignment method, exhibits limitations in both
experimental and theoretical contexts as its loss function proves suboptimal
for refusal learning. Through gradient-based analysis, we identify these
shortcomings and propose an improved safety alignment method that disentangles
the DPO objective into two components: (1) robust refusal training, which encourages
refusal even when partial unsafe generations are produced, and (2) targeted
unlearning of harmful knowledge. This approach significantly increases LLM
robustness against a wide range of jailbreak attacks, including prefilling,
suffix, and multi-turn attacks across both in-distribution and
out-of-distribution scenarios. Furthermore, we introduce a method to emphasize
critical refusal tokens by incorporating a reward-based token-level weighting
mechanism for refusal learning, which further improves the robustness against
adversarial exploits. Our research also suggests that robustness to jailbreak
attacks is correlated with token distribution shifts in the training process
and internal representations of refusal and harmful tokens, offering valuable
directions for future research in LLM safety alignment. The code is available
at https://github.com/wicai24/DOOR-Alignment
comment: ICML 2025
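To make the two components concrete, the following is a minimal PyTorch-style sketch of a dual-objective loss. It assumes per-token masks marking refusal targets and harmful continuations plus optional reward-based token weights, and it uses a generic unlikelihood-style penalty for the unlearning term; it is an illustration under those assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(logits, labels, refusal_mask, harmful_mask,
                        token_weights=None, unlearn_coef=0.1):
    """logits: (B, T, V); labels: (B, T) token ids; masks/weights: (B, T) floats in [0, 1].
    Term 1: weighted NLL on refusal tokens, encouraging refusal even after partial
            unsafe prefixes (robust refusal training).
    Term 2: unlikelihood-style penalty that pushes down the probability of tokens
            marked harmful (targeted unlearning)."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)          # (B, T)

    weights = token_weights if token_weights is not None else torch.ones_like(tok_logp)
    refusal_nll = -(weights * refusal_mask * tok_logp).sum() / refusal_mask.sum().clamp(min=1)

    p_harm = tok_logp.exp()
    unlearn = -(harmful_mask * torch.log1p(-p_harm + 1e-6)).sum() / harmful_mask.sum().clamp(min=1)

    return refusal_nll + unlearn_coef * unlearn
```

The token_weights tensor is where a reward-based emphasis on critical refusal tokens would plug in, while unlearn_coef trades off how aggressively harmful continuations are suppressed.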
♻ ☆ Language Model Preference Evaluation with Multiple Weak Evaluators
Despite the remarkable success of Large Language Models (LLMs), evaluating
the preference quality of their outputs remains a critical challenge. While
existing works usually leverage a strong LLM as a judge to compare LLMs'
responses pairwise, such a single-evaluator approach is vulnerable to cyclic
preferences, i.e., output A is judged better than B, B better than C, yet C
better than A, yielding contradictory evaluation results. To address this, we introduce PGED
(Preference Graph Ensemble and Denoise), a novel approach that leverages
multiple model-based evaluators to construct preference graphs, and then
ensembles and denoises these graphs for acyclic, non-contradictory evaluation
results. We provide theoretical guarantees for our framework, demonstrating its
efficacy in recovering the ground truth preference structure. Extensive
experiments on ten benchmarks demonstrate PGED's superiority in three
applications: 1) model ranking for evaluation, 2) response selection for
test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED
combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to
outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in
enhancing evaluation reliability and improving model performance.
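The ensemble-then-denoise idea can be sketched as follows; this is a simplified illustration under assumed inputs, not PGED's actual algorithm. Pairwise verdicts pooled from several weak evaluators become a weighted directed graph, cycles are broken greedily by dropping the weakest edge on each detected cycle (a heuristic for minimum feedback arc set), and a topological order of the resulting DAG yields a contradiction-free ranking.

```python
import networkx as nx

def ensemble_preference_graph(judgments):
    """judgments: iterable of (winner, loser) pairs pooled across evaluators.
    Keeps only the majority direction per pair; edge weight = net vote margin."""
    votes = {}
    for winner, loser in judgments:
        votes[(winner, loser)] = votes.get((winner, loser), 0) + 1
    G = nx.DiGraph()
    for (winner, loser), count in votes.items():
        opposite = votes.get((loser, winner), 0)
        if count > opposite:
            G.add_edge(winner, loser, weight=count - opposite)
    return G

def denoise_to_acyclic(G):
    """Repeatedly remove the lowest-weight edge on any remaining cycle until the graph is a DAG."""
    G = G.copy()
    while True:
        try:
            cycle = nx.find_cycle(G)
        except nx.NetworkXNoCycle:
            return G
        u, v = min(cycle, key=lambda edge: G[edge[0]][edge[1]]["weight"])
        G.remove_edge(u, v)
```

A ranking then falls out directly, e.g. list(nx.topological_sort(denoise_to_acyclic(ensemble_preference_graph(judgments)))), with ties resolved arbitrarily among incomparable outputs.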