Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

Abstract

Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
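
As a rough illustration of the two ideas named above, the following sketch fuses per-model acceptability scores into a single probability and measures how far each model sits from it. It is not the paper's actual estimator: the reliability-weighted mean, the Bernoulli treatment of each score, and the names `aggregate` and `js_divergence_bernoulli` are all assumptions made for illustration, and the scores and reliabilities are made-up values.

```python
import math


def aggregate(scores, reliabilities):
    """Reliability-weighted mean of acceptability scores in [0, 1].

    Assumption: a plain weighted average stands in for the paper's
    aggregation mechanism, which the abstract does not specify.
    """
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive value")
    return sum(s * r for s, r in zip(scores, reliabilities)) / total


def _kl(p, q):
    """KL divergence in bits between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence_bernoulli(p, q):
    """Jensen-Shannon divergence in bits between Bernoulli(p) and Bernoulli(q)."""
    dist_p = [p, 1.0 - p]
    dist_q = [q, 1.0 - q]
    # The mixture is positive wherever either input is, so _kl never divides by zero.
    mixture = [(a + b) / 2.0 for a, b in zip(dist_p, dist_q)]
    return 0.5 * _kl(dist_p, mixture) + 0.5 * _kl(dist_q, mixture)


# Hypothetical scores from three models for one dilemma, with hypothetical reliabilities.
scores = [0.9, 0.8, 0.2]
reliabilities = [1.0, 1.0, 0.5]

consensus = aggregate(scores, reliabilities)
divergences = [js_divergence_bernoulli(s, consensus) for s in scores]

# The model with the largest divergence is the candidate for realignment.
most_misaligned = max(range(len(scores)), key=lambda i: divergences[i])
print(consensus, divergences, most_misaligned)
```

In this toy setting the third model is both the least reliable and the furthest from the consensus, so it is the one a realignment step would target; the embedding optimization itself is outside the scope of this sketch.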
