GRPO on the Countdown Reasoning Task

GitHub: Repo

Context

This project was completed for COMP 579 with Abhijeet Praveen and Aman Sidhu. We were initially inspired by TinyZero; partway through our work, McGill's NLP group released nanoAhaMoment, which is a great reference. Our project never achieved the results we expected, which leads me to believe there's a bug hidden somewhere in our implementation.

Abstract

We explore Group Relative Policy Optimization (GRPO) and its application to enhancing the reasoning capabilities of large language models (LLMs). This project implements GRPO in PyTorch and evaluates its effectiveness on the Countdown arithmetic reasoning benchmark using the Qwen2.5-1.5B-Instruct model. GRPO modifies Proximal Policy Optimization (PPO) by estimating advantages from group-level rewards, removing the need for a learned value function. Our experiments measure the performance of this GRPO implementation and probe its design choices. The best GRPO fine-tuned model achieved a test mean reward of 0.18 and a test mean accuracy of 10.9% with a group size of G=3, an improvement over the untrained baseline, which achieved a mean reward of 0.122 and an accuracy of 5%. These results highlight the potential of GRPO for improving LLM reasoning through more structured reinforcement learning. We release all of our code to facilitate future research into scalable, efficient methods for enhancing LLM reasoning capabilities.
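To make the group-relative advantage concrete, here is a minimal PyTorch sketch of the estimate GRPO substitutes for PPO's learned value-function baseline: each completion's reward is normalized against the other completions sampled for the same prompt. The function name `compute_group_advantages` and the tensor shapes are illustrative assumptions, not code from our repo.

```python
# Minimal sketch of GRPO's group-relative advantage estimate.
# Assumes G completions sampled per prompt, each scored with a scalar reward.
import torch

def compute_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each group:
        A_i = (r_i - mean(r_group)) / (std(r_group) + eps)

    rewards: shape (num_prompts, G), one reward per sampled completion.
    Returns advantages of the same shape; no value network is involved.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example with one prompt and group size G=3 (the group size of our best run):
rewards = torch.tensor([[1.0, 0.0, 0.1]])
print(compute_group_advantages(rewards))
```

Because the baseline is just the group's own reward statistics, completions that beat their siblings get positive advantages and the rest get negative ones, which is what lets GRPO drop PPO's separate critic.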

Built by Me (Cormac) 2025