Multi-Objective and Pluralist Human Alignment of AI Models
By Itai Shapira
& Nuoya Xiong
& Aarti Singh
Human feedback is rarely governed by a single coherent notion of “the right answer.” Preferences vary systematically across contexts, users, and communities, and these disagreements are often substantive rather than noise. Standard RLHF typically fits one reward model, which effectively compresses pluralistic judgments across multiple objectives into a single majoritarian compromise. Downstream optimization then concentrates behavior around that compromise, which can reduce representational diversity and marginalize minority perspectives.
At AI-SDM, we are pursuing several efforts to efficiently align AI models under diverse human feedback arising from multiple objectives and/or multiple stakeholder perspectives. We treat alignment from diverse human feedback as a preference-aggregation problem, in the spirit of social choice theory.
Preferences are governed by multiple objectives, and comparing choices along a single objective is easier than making holistic comparisons across all objectives at once.
We develop efficient methods for aligning AI models with diverse objectives and preferences that go beyond simple linear aggregation rules, allowing optimization of the worst-case objective and other nonlinear weighted p-mean combinations arising from natural social-choice axioms. Non-linear aggregation is computationally expensive in a reward-based setup because the policy must be retrained whenever the aggregation parameters change. Our alignment method, called projection optimization, addresses this problem by transforming the non-linear aggregation maximization problem into a series of sub-problems that involve only linear aggregation, making it computationally efficient to solve. We further extend the framework to multi-group scenarios, where each group assigns distinct weights to the objectives; our method can either achieve consensus or maximize the aggregated objective across all groups. Theoretically, we show that our algorithmic framework achieves sublinear regret. Empirically, leveraging these theoretical insights, we propose a nearly training-free algorithm that directly aggregates the optimal policies for the individual objectives.
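To make the linearization idea concrete, here is a minimal numerical sketch. The weighted p-mean is concave in the expected rewards (for p ≤ 1), so a Frank-Wolfe-style loop that repeatedly solves a linear aggregation subproblem can maximize it over mixtures of base policies. The reward matrix, the mixture parameterization, and the Frank-Wolfe scheme below are illustrative assumptions for exposition, not the exact projection-optimization algorithm.

```python
import numpy as np

def p_mean(v, w, p):
    # Weighted power mean (p != 0); p = 1 is linear aggregation,
    # p -> -inf approaches the worst-case objective.
    return float((w @ np.power(v, p)) ** (1.0 / p))

def maximize_p_mean(R, w, p, iters=200):
    """Illustrative sketch: maximize the weighted p-mean of expected
    rewards over mixtures of base policies by repeatedly solving a
    LINEAR aggregation subproblem (Frank-Wolfe-style linearization).

    R[i, j] : expected reward of base policy i on objective j (positive).
    """
    n, _ = R.shape
    x = np.full(n, 1.0 / n)            # start from the uniform mixture
    for t in range(iters):
        v = x @ R                       # expected reward per objective
        grad = w * np.power(v, p - 1)   # gradient direction (up to a positive
                                        # scalar), so each subproblem is
        scores = R @ grad               # linear in the policy mixture
        best = int(np.argmax(scores))   # best response to the linearization
        step = 2.0 / (t + 2)
        x = (1.0 - step) * x
        x[best] += step
    return x, p_mean(x @ R, w, p)
```

In a toy instance with two complementary policies, R = [[1.0, 0.1], [0.1, 1.0]], equal weights, and p = -1 (harmonic mean), the loop settles near an even mixture, whose harmonic-mean score (about 0.55) far exceeds that of either pure policy (about 0.18).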
We also demonstrate that the aggregation weights themselves can be learned, using a logistic model of which objective an annotator chooses to provide a preference on. More generally, we quantify the sample complexity of learning aggregation rules from ordinal or cardinal data on past decisions under a nonlinear weighted p-means framework.
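As a toy illustration of the weight-learning idea, under the simplifying assumption that an annotator picks the objective to compare on with probability equal to its aggregation weight, a multinomial-logit model with intercepts only recovers the weights as the softmax of the fitted logits. The ground-truth weights and sample size below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplifying assumption: each feedback episode records which ONE objective
# the annotator chose to compare on, drawn with probability = its weight.
true_w = np.array([0.6, 0.3, 0.1])      # hypothetical ground-truth weights
choices = rng.choice(3, size=5000, p=true_w)

# For a multinomial-logit model with intercepts only, the MLE of the logits
# is the log of the empirical choice frequencies; applying the softmax to
# the fitted logits recovers the aggregation weights.
counts = np.bincount(choices, minlength=3)
logits = np.log(counts / counts.sum())
w_hat = np.exp(logits) / np.exp(logits).sum()
```

With richer data, the same logistic parameterization can condition on context features rather than using intercepts alone.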
In a parallel effort, we address the difficulty of choosing specific aggregation parameters by constructing a portfolio of policies that remains approximately optimal across a wide spectrum of aggregation rules. We provide an algorithm that builds these representative sets with theoretical guarantees on efficiency, as well as a heuristic for settings with a limited computational budget. Ultimately, this framework provides a structured summary of the policy space, empowering human decision-makers to navigate complex trade-offs without committing in advance to a single, arbitrary definition of social welfare.
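A greedy set-cover sketch conveys the portfolio idea: score every candidate policy under a grid of p-mean aggregation rules, then repeatedly add the policy that is within epsilon of optimal for the most still-uncovered rules. The toy reward vectors and the greedy selection rule below are illustrative assumptions, not our exact construction or its guarantees.

```python
import numpy as np

def build_portfolio(values, ps, w, eps=0.05):
    """Greedy sketch: select a small set of policies that is within `eps`
    of optimal for every weighted p-mean rule in the grid `ps`.

    values[i] : expected reward of policy i per objective (all positive).
    """
    def score(v, p):
        return float((w @ np.power(v, p)) ** (1.0 / p))

    # Score matrix: rows are policies, columns are aggregation rules.
    S = np.array([[score(v, p) for p in ps] for v in values])
    opt = S.max(axis=0)                 # best achievable score per rule
    uncovered = set(range(len(ps)))
    portfolio = []
    while uncovered:
        # Pick the policy that eps-covers the most still-uncovered rules;
        # the rule's own optimal policy always covers it, so this terminates.
        cover = [{j for j in uncovered if S[i, j] >= opt[j] - eps}
                 for i in range(len(values))]
        best = max(range(len(values)), key=lambda i: len(cover[i]))
        portfolio.append(best)
        uncovered -= cover[best]
    return portfolio
```

On a toy instance with two specialist policies and one balanced policy, a linear rule (p = 1) is covered by a specialist while an egalitarian rule (p = -1) requires the balanced policy, so the portfolio keeps one of each.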
Finally, we also consider an alternative approach to capture diverse perspectives across stakeholders where instead of forcing a single aggregated reward function to represent everyone, we learn a distribution over reward functions directly from aggregate pairwise comparisons, without relying on annotator identifiers or predefined groups. The key requirement is pairwise calibration: for any prompt and any two candidate responses, the fraction of reward functions that prefer one response over the other matches the observed fraction of annotators who prefer it. In this way, disagreement becomes an informative supervision signal that the model is required to preserve.
By enforcing pairwise calibration, this learned distribution faithfully captures the inherent diversity in aggregate human judgments, and each calibrated reward function induces a distinct policy.
A pairwise-calibrated reward distribution induces a corresponding distribution over aligned policies, which supports pluralism-preserving deployments. For example, a system can sample policies to reflect population-level variation, or present multiple high-quality responses when preferences are legitimately contested. We also provide guarantees showing that approximate calibration can be achieved with small ensembles, that extreme outlier reward functions can be pruned while maintaining calibration, and that learning objectives based on calibration can generalize from finite data to the underlying preference population. Empirically, the learned reward components are meaningfully diverse and better match observed preference frequencies than a single deterministic reward model. This approach allows for a faithful representation of viewpoints without requiring specific labels or predefined groups, ultimately creating AI systems that can adapt to the multifaceted nature of global human values.
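The calibration requirement itself is easy to state in code. The sketch below, with a hypothetical encoding of reward-score gaps and toy data, measures the gap between the fraction of ensemble members preferring a response and the observed annotator fraction; a single deterministic reward model can only produce preference rates of 0 or 1, so it cannot close this gap on contested pairs.

```python
import numpy as np

def calibration_error(reward_gaps, observed_frac):
    """Mean absolute gap between the ensemble's preference rate and the
    annotators' preference rate over a batch of response pairs.

    reward_gaps[m, q] : model m's reward for the first response of pair q
                        minus its reward for the second (hypothetical encoding).
    observed_frac[q]  : fraction of annotators preferring the first response.
    """
    ensemble_frac = (reward_gaps > 0).mean(axis=0)  # share of models preferring 1st
    return float(np.abs(ensemble_frac - observed_frac).mean())

# Toy data: two contested pairs, preferred by 70% and 30% of annotators.
observed = np.array([0.7, 0.3])

# A 10-member ensemble constructed to match those rates is perfectly
# calibrated on this batch...
ensemble = np.where(np.arange(10)[:, None] < np.array([7, 3]), 1.0, -1.0)

# ...while any single deterministic reward model reports rates of 0 or 1.
single = np.ones((1, 2))
```

In this toy batch the ensemble's calibration error is zero, while the single model's is 0.5, which is the disagreement a lone reward function must erase rather than preserve.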
