Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models

Abstract

With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning of LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, which is designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
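
The token-level accept/reject idea can be illustrated with a minimal sketch. The abstract specifies only a compact three-layer fully-connected network producing a binary decision per candidate token; everything else here is an assumption made for illustration, including the names `AcceptRejectMLP` and `pick_token`, the feature vectors, the random untrained weights, the 0.5 threshold, and the fallback to the top candidate when every candidate is rejected. It is not the released MARA implementation.

```python
import math
import random


def linear(weights, bias, x):
    # One fully-connected layer: y = W x + b, with W stored as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases; a real classifier would be trained.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class AcceptRejectMLP:
    """Hypothetical three-layer fully-connected binary classifier."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.layers = [
            init_layer(n_in, n_hidden, rng),
            init_layer(n_hidden, n_hidden, rng),
            init_layer(n_hidden, 2, rng),  # two logits: [reject, accept]
        ]

    def accept_prob(self, features):
        h = list(features)
        for i, (weights, bias) in enumerate(self.layers):
            h = linear(weights, bias, h)
            if i < len(self.layers) - 1:
                h = [max(0.0, v) for v in h]  # ReLU on the hidden layers only
        # Numerically stable softmax over the two logits.
        m = max(h)
        exps = [math.exp(v - m) for v in h]
        return exps[1] / sum(exps)


def pick_token(candidates, features_of, clf, threshold=0.5):
    # Candidates are assumed to be ordered by language-model probability.
    # Return the first one the classifier accepts; fall back to the top
    # candidate if all are rejected (an assumption, not taken from the paper).
    for token in candidates:
        if clf.accept_prob(features_of(token)) >= threshold:
            return token
    return candidates[0]
```

In this sketch the language model is never updated: it only proposes candidates, and the small classifier filters them, which is the sense in which the approach operates independently of the model.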
