AI-Powered Pac-Man Agent: Reinforcement Learning in Action

Team Members: Jesse Gonzalez, Arhma Baig, Megha Martin, Sophia Kuhnert
University of Technology Sydney

Abstract

This project showcases the development of an AI agent capable of playing Pac-Man using Deep Q-Networks (DQN). We built an end-to-end DQN agent for Ms. Pac-Man with a custom Gymnasium reward wrapper for pellet bonuses, survival incentives, and motion penalties.
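A minimal sketch of what such a Gymnasium reward wrapper could look like is shown below. The class name, bonus/penalty values, and the exact definition of the motion penalty (here treated as a penalty for repeating the same action) are illustrative assumptions, not the project's actual implementation.

```python
import gymnasium as gym

class PacmanRewardShaping(gym.Wrapper):
    """Hypothetical reward-shaping wrapper: pellet bonus, survival incentive, motion penalty."""

    def __init__(self, env, pellet_bonus=1.0, survival_bonus=0.01, repeat_penalty=0.05):
        super().__init__(env)
        self.pellet_bonus = pellet_bonus        # assumed value
        self.survival_bonus = survival_bonus    # assumed value
        self.repeat_penalty = repeat_penalty    # assumed stand-in for the "motion penalty"
        self._last_action = None

    def reset(self, **kwargs):
        self._last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        shaped = reward
        if reward > 0:                          # positive game score, e.g. a pellet was eaten
            shaped += self.pellet_bonus
        shaped += self.survival_bonus           # small bonus for every surviving step
        if self._last_action is not None and action == self._last_action:
            shaped -= self.repeat_penalty       # discourage repeating the same move
        self._last_action = action
        return obs, shaped, terminated, truncated, info
```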

Training runs on eight parallel environments (AsyncVectorEnv) with high-throughput experience replay. Our CNN ingests stacks of 84×84 grayscale frames and predicts Q-values for each action. Double DQN, epsilon decay, and periodic target-network updates promote stable convergence.
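A sketch of how the eight parallel environments might be assembled with Gymnasium's vector API follows. Wrapper names here follow recent Gymnasium/ALE releases (FrameStackObservation was previously called FrameStack), and the shaping wrapper is the hypothetical one sketched above, so treat this as an assumption-laden illustration rather than the project's exact setup.

```python
import gymnasium as gym
import ale_py  # provides the ALE/MsPacman-v5 environment

gym.register_envs(ale_py)  # explicit registration (required on Gymnasium >= 1.0)

def make_env():
    def _thunk():
        env = gym.make("ALE/MsPacman-v5", frameskip=1)  # frame skipping handled by the wrapper
        # Grayscale conversion, 84x84 resize, and frame skipping in one standard wrapper.
        env = gym.wrappers.AtariPreprocessing(env, screen_size=84, grayscale_obs=True)
        env = PacmanRewardShaping(env)                    # hypothetical shaping wrapper above
        env = gym.wrappers.FrameStackObservation(env, 4)  # stack 4 frames for the CNN input
        return env
    return _thunk

# Eight asynchronous game copies stepping in parallel worker processes.
envs = gym.vector.AsyncVectorEnv([make_env() for _ in range(8)])
obs, info = envs.reset(seed=0)
print(obs.shape)  # expected: (8, 4, 84, 84)
```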

Extensive logging captures per-episode reward, moving averages, loss, and mean Q, all visualised in training charts. Checkpoints are saved every 50 episodes, enabling resumable training and hyperparameter tuning.
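A minimal checkpointing sketch along these lines is given below; the function names and the set of saved fields are assumptions, with only the 50-episode cadence taken from the text.

```python
import torch

CHECKPOINT_EVERY = 50  # cadence described above; the constant name is assumed

def save_checkpoint(path, episode, online_net, target_net, optimizer, epsilon):
    """Persist everything needed to resume training from this episode."""
    torch.save({
        "episode": episode,
        "online_state_dict": online_net.state_dict(),
        "target_state_dict": target_net.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epsilon": epsilon,
    }, path)

def load_checkpoint(path, online_net, target_net, optimizer):
    """Restore networks, optimizer, and exploration state to resume training."""
    ckpt = torch.load(path, map_location="cpu")
    online_net.load_state_dict(ckpt["online_state_dict"])
    target_net.load_state_dict(ckpt["target_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["episode"], ckpt["epsilon"]
```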

Flow Chart of Our Training Loop

The diagram captures our full training loop: eight parallel Ms. Pac-Man games send compressed 84×84 frames into a CNN encoder. Those features feed a Double DQN, which chooses one of four joystick moves and sends the action back to every environment. While the agent plays, rewards and losses are streamed to the logger, completing the learn-and-act cycle shown by the neon arrows.
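For concreteness, a hedged PyTorch sketch of the encoder and the Double DQN target computation is shown below. The layer sizes follow the standard Nature-DQN architecture and gamma=0.99 is an assumed default; neither is confirmed as the project's exact configuration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Nature-DQN style encoder over 4 stacked 84x84 grayscale frames."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.net(x / 255.0)  # scale pixel values to [0, 1]

def double_dqn_targets(online, target, rewards, next_obs, dones, gamma=0.99):
    """Double DQN: the online net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        next_actions = online(next_obs).argmax(dim=1, keepdim=True)
        next_q = target(next_obs).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```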

Video

The following video demonstrates the AI agent playing Pac-Man using reinforcement learning. It showcases the agent's learning progress, strategy, and gameplay behaviour.

Training Charts

Reward Convergence Plot
Loss vs Mean Q-value
Exploration Rate vs Total Reward
Episode Length vs Total Reward

Results Discussion and Reflection

The Exploration Rate vs Total Reward scatter plot shows that as ε decays from 1.0 to 0.1, the agent transitions from random play to consistently high scores (600+ points once ε < 0.2). The learning curve (raw and moving-average) over 20,000 episodes shows rapid gains in early episodes and a plateau at ~550 average reward, indicating convergence.
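For reference, a linear schedule consistent with the 1.0 → 0.1 range seen in the plot might look like the sketch below; the decay horizon is an assumption, since the charts only fix the start and end values.

```python
def epsilon_by_episode(episode, eps_start=1.0, eps_end=0.1, decay_episodes=10_000):
    """Linear decay from eps_start to eps_end, then held constant.
    The 10,000-episode decay horizon is assumed; only the 1.0 -> 0.1 range
    comes from the training charts."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```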

The Episode Length vs Total Reward plot reveals a strong positive correlation: longer survival yields more pellets and higher cumulative reward. The Loss vs Mean Q-value chart illustrates initial loss spikes that taper off as the mean Q-value stabilises between 25 and 30, reflecting effective value learning.

Compared to a single DQN baseline, Double DQN achieved smoother initial performance, reduced reward oscillations, more efficient pellet collection, and higher scores in fewer episodes. Future work with more resources could explore PPO, reduced input resolutions (e.g., 64×64), or extended training to 50,000 episodes for potentially further gains.