Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning

Abstract

Instruction-fine-tuned large language models (LLMs) under 14B parameters continue to underperform on natural language understanding (NLU) tasks, often trailing smaller models like BERT-base on benchmarks such as GLUE and SuperGLUE. Motivated by the success of reinforcement learning in reasoning tasks (e.g., DeepSeek), we explore Proximal Policy Optimization (PPO) as a framework to improve the NLU capabilities of LLMs. We frame NLU as a reinforcement learning environment, treating token generation as a sequence of actions and optimizing for reward signals based on alignment with ground-truth labels. PPO consistently outperforms supervised fine-tuning, yielding an average improvement of 6.3 points on GLUE, and surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively. Notably, PPO-tuned models outperform GPT-4o by over 4% on average across sentiment and natural language inference tasks, including gains of 7.3% on the Mental Health dataset and 10.9% on SIGA-nli. This work highlights a promising direction for adapting LLMs to new tasks by reframing them as reinforcement learning problems, enabling learning through simple end-task rewards rather than extensive data curation.
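
To make the idea of a "simple end-task reward" concrete, the sketch below scores a generated answer against a ground-truth label. It is an illustration only: the function names (`extract_label`, `end_task_reward`), the label-matching rule, and the reward values of 1.0 and 0.0 are assumptions, not details taken from the paper, and the PPO update that would consume this reward is not shown.

```python
import re

def extract_label(generated_text, label_set):
    """Return the label from label_set that appears earliest in the text, or None.

    Hypothetical parsing rule: case-insensitive whole-word match.
    """
    text = generated_text.lower()
    best, best_pos = None, len(text) + 1
    for label in label_set:
        match = re.search(r"\b" + re.escape(label.lower()) + r"\b", text)
        if match and match.start() < best_pos:
            best, best_pos = label, match.start()
    return best

def end_task_reward(generated_text, gold_label, label_set, hit=1.0, miss=0.0):
    """Scalar reward for one generated sequence (assumed values: 1.0 / 0.0).

    The reward is `hit` when the parsed label equals the ground-truth label,
    and `miss` otherwise, including when no label can be parsed.
    """
    predicted = extract_label(generated_text, label_set)
    return hit if predicted == gold_label else miss

if __name__ == "__main__":
    labels = ["entailment", "contradiction", "neutral"]
    print(end_task_reward("The answer is entailment.", "entailment", labels))
    print(end_task_reward("contradiction", "entailment", labels))
```

In a setup like this, the reward would typically be assigned once per generated sequence, after the final token, and passed to the policy-gradient step alongside the per-token log-probabilities.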
