In this project, you will implement Q-learning. You will test your agents first on Gridworld (from class), then apply them to a simulated robot controller (Crawler) and Pacman.
The code for this project contains the following files:
- qlearningAgents.py - Q-learning agents for Gridworld, Crawler, and Pacman.
- analysis.py - A file to put your answers to questions given in the project.
- valueIterationAgents.py - A value iteration agent for solving known MDPs.
- mdp.py - Defines methods on general MDPs.
- learningAgents.py - Defines the base classes ValueEstimationAgent and QLearningAgent, which your agents will extend.
- util.py - Utilities, including util.Counter, which is particularly useful for Q-learners.
- gridworld.py - The Gridworld implementation.
- featureExtractors.py - Classes for extracting features on (state, action) pairs. Used for the approximate Q-learning agent (in qlearningAgents.py).
- environment.py - Abstract class for general reinforcement learning environments. Used by gridworld.py.
- graphicsGridworldDisplay.py - Gridworld graphical display.
- graphicsUtils.py - Graphics utilities.
- textGridworldDisplay.py - Plug-in for the Gridworld text interface.
- crawler.py - The crawler code and test harness. You will run this but not edit it.
- graphicsCrawlerDisplay.py - GUI for the crawler robot.
Carefully read the instructions for each phase before you start. Make sure you are only implementing the procedures specifically asked for in the instructions. Do not jump ahead and implement phases that are not yet required.
You will now write a Q-learning agent, which does very little on construction, but instead learns by trial and error from interactions with the environment through its update (state, action, nextState, reward) method. A stub of a Q-learner is specified in QLearningAgent in qlearningAgents.py, and you can select it with the option '-a q'. For this question, you must implement the update, getValue, getQValue, and getPolicy methods.
Note: For getPolicy, you should break ties randomly for better behavior. The random.choice() function will help. In a particular state, actions that your agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, an unseen action may be optimal.
Important: Make sure that in your getValue and getPolicy functions, you only access Q values by calling getQValue. This abstraction will be helpful for questions 4 or 5, when you override getQValue to use features of state-action pairs rather than state-action pairs directly.
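To make these methods concrete, here is a minimal standalone sketch of the tabular Q-learning math described above. It is not the project's QLearningAgent class: q_values, alpha, discount, and the legal_actions arguments are illustrative stand-ins for the agent's attributes and self.getLegalActions(state). Note that get_value and get_policy only touch Q-values through get_q_value, mirroring the abstraction requirement above.

```python
import random
from collections import defaultdict

# Minimal, standalone sketch of the tabular Q-learning pieces described above.
# Not the project's QLearningAgent; names here are illustrative stand-ins.

q_values = defaultdict(float)   # unseen (state, action) pairs default to 0.0
alpha = 0.5                     # learning rate
discount = 0.9                  # discount factor


def get_q_value(state, action):
    return q_values[(state, action)]


def get_value(state, legal_actions):
    # Value of a state = max Q-value over legal actions (0.0 if none, e.g. terminal).
    if not legal_actions:
        return 0.0
    return max(get_q_value(state, a) for a in legal_actions)


def get_policy(state, legal_actions):
    # Best action, breaking ties randomly among equally good actions.
    if not legal_actions:
        return None
    best = get_value(state, legal_actions)
    best_actions = [a for a in legal_actions if get_q_value(state, a) == best]
    return random.choice(best_actions)


def update(state, action, next_state, next_legal_actions, reward):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + discount * max_a' Q(s',a'))
    sample = reward + discount * get_value(next_state, next_legal_actions)
    q_values[(state, action)] = (1 - alpha) * get_q_value(state, action) + alpha * sample
```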
With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard.
# Watch the Q-learner learn under manual control
python3 gridworld.py -a q -k 5 -m

The -k parameter controls the number of episodes your agent learns from. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake."
Hint: to help with debugging, you can turn off noise by using the --noise 0.0 parameter (though this obviously makes Q-learning less interesting).
If you manually steer Pacman north and then east along the optimal path for four episodes, you should see the corresponding Q-values build up along that path.
Implementation Steps:
- Read and understand the classes and methods available in these files:
  - valueIterationAgents.py - A value iteration agent for solving known MDPs
  - mdp.py - Defines methods on general MDPs
  - learningAgents.py - Defines the base classes ValueEstimationAgent and QLearningAgent, which your agents will extend
  - util.py - Utilities, including util.Counter, which is particularly useful for Q-learners
  - gridworld.py - The Gridworld implementation
  - featureExtractors.py - Classes for extracting features on (state, action) pairs
- Examine the QLearningAgent class in qlearningAgents.py.
- Document how the getValue and getPolicy procedures are initially implemented, and explain in analysis.py what the default implementation does.
- Read the hints and implement the update method, deleting only util.raiseNotDefined().
- Read the hints and implement the getQValue method, deleting only util.raiseNotDefined().
- Will you change the getValue and getPolicy procedures? Why?
- Answer in analysis.py: What was your implementation strategy for Q-Learning? Explain.
- Run the test cases underneath test_cases/q1.
- Explain your implementation strategy in analysis.py.
Complete your Q-learning agent by implementing epsilon-greedy action selection in the getAction method. Epsilon-greedy is an exploration strategy that balances exploitation (choosing the best known action) with exploration (trying random actions to discover potentially better options).
How Epsilon-Greedy Works:
With probability epsilon (ε), the agent chooses a random action uniformly from all legal actions. With probability (1 - ε), the agent chooses the action that maximizes the Q-value (i.e., follows its current best policy). This mechanism ensures the agent doesn't get stuck exploiting a suboptimal policy and continues to explore the state-action space.
Testing Your Implementation:
python3 gridworld.py -a q -k 100

Your final Q-values should resemble those of your value iteration agent, especially along well-traveled paths. However, your average returns will be lower than the Q-values predict because of the random actions and the initial learning phase.
Useful Functions:
- random.choice(list) - Chooses an element from a list uniformly at random.
- util.flipCoin(p) - Returns True with probability p and False with probability 1 - p.
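As a sketch of how these pieces fit together, here is a minimal standalone version of epsilon-greedy selection. It is not the project's getAction: epsilon, legal_actions, and greedy_action_fn are illustrative stand-ins for self.epsilon, self.getLegalActions(state), and self.getPolicy(state), and random.random() plays the role that util.flipCoin(self.epsilon) plays in the project.

```python
import random

# Minimal sketch of epsilon-greedy action selection (illustrative names only).
epsilon = 0.1


def epsilon_greedy_action(state, legal_actions, greedy_action_fn):
    # No legal actions (e.g., terminal state): nothing to choose.
    if not legal_actions:
        return None
    # Explore: with probability epsilon, pick uniformly among legal actions.
    if random.random() < epsilon:          # util.flipCoin(epsilon) in the project
        return random.choice(legal_actions)
    # Exploit: otherwise follow the current greedy policy.
    return greedy_action_fn(state)
```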
Implementation Steps:
- Examine the getAction method stub in QLearningAgent.
- Implement the epsilon-greedy logic:
  - With probability self.epsilon, return a random action from self.getLegalActions(state).
  - Otherwise, return the action from self.getPolicy(state).
  - Handle the case where there are no legal actions (return None).
- Test your implementation:
  - Run python3 gridworld.py -a q -k 100 and observe Q-value convergence.
  - Run the test cases underneath test_cases/q2.
- Answer in analysis.py:
  - What is your implementation strategy for Question 2 (Epsilon Greedy)? Explain.
With no additional code changes, you should now be able to run a Q-learning crawler robot. This phase tests whether your Q-learning implementation is general enough to work beyond the GridWorld environment.
Running the Crawler:
python3 crawler.py

The crawler is a simulated robot that learns to move forward by adjusting its arm and hand angles. Your Q-learning agent will learn a policy that makes the robot "crawl" forward efficiently.
If this doesn't work, you've probably written some code too specific to the GridWorld problem and you should make it more general to all MDPs.
Interactive Parameters:
Play around with the various learning parameters to see how they affect the agent's policies and actions (the equations after this list show where each parameter enters):
- Step Delay - A parameter of the simulation (visualization speed)
- Learning Rate (α) - How quickly new information overrides old Q-values
- Epsilon (ε) - Exploration rate for epsilon-greedy action selection
- Discount Factor (γ) - How much future rewards are valued relative to immediate rewards
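As a reminder of where these parameters enter, assuming the standard update your agent implements (the Step Delay only affects visualization, not learning), the Q-learning update uses the learning rate and discount:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$$

while epsilon-greedy selection takes a uniformly random legal action with probability $\varepsilon$ and $\arg\max_a Q(s, a)$ otherwise.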
Implementation Steps:
- Run the crawler and observe the learning behavior:
  - Execute python3 crawler.py.
  - Watch the robot learn to crawl forward.
- Experiment with parameters:
  - Adjust the learning rate and observe the effect on convergence speed.
  - Adjust epsilon and observe the effect on exploration vs. exploitation.
  - Adjust the discount factor and observe how it affects long-term planning.
- Answer in analysis.py:
  - What happens when you run python crawler.py? Describe the robot's behavior and learning process.
Implement an approximate Q-learning agent that learns weights for features of states, where many states might share the same features. Write your implementation in the ApproximateQAgent class in qlearningAgents.py, which is a subclass of PacmanQAgent.
Why Approximate Q-Learning?
Standard Q-learning maintains a separate Q-value for every state-action pair. This becomes infeasible when the state space is large (e.g., Pacman with many ghosts, food pellets, etc.). Approximate Q-learning instead represents the Q-function as a linear combination of features:

$$Q(s, a) = \sum_{i=1}^{n} f_i(s, a)\, w_i$$

where:
- $f_i(s, a)$ are feature functions that extract relevant information from state-action pairs
- $w_i$ are learned weights associated with each feature

Weight Update Rule:
The weights are updated using gradient descent on the TD error:

$$w_i \leftarrow w_i + \alpha \cdot \text{correction} \cdot f_i(s, a)$$

where the correction (TD error) is:

$$\text{correction} = \left( r + \gamma \max_{a'} Q(s', a') \right) - Q(s, a)$$

This is the same temporal difference error as in standard Q-learning.
Feature Extractors:
Feature functions are provided in featureExtractors.py. Feature vectors are util.Counter objects (dictionary-like) containing non-zero feature-value pairs; all omitted features have value zero.
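To make the dot product and weight update concrete, here is a minimal standalone sketch. It is not the project's ApproximateQAgent class: feature_fn stands in for whatever extractor produces the feature Counter for a (state, action) pair (see featureExtractors.py for the real classes), and weights, alpha, and discount are illustrative stand-ins for self.weights and the agent's learning parameters.

```python
from collections import Counter

# Standalone sketch of the approximate Q-value and weight update described
# above. Not the project's ApproximateQAgent; all names are illustrative.

weights = Counter()
alpha = 0.2       # learning rate
discount = 0.8    # discount factor


def approx_q_value(feature_fn, state, action):
    # Q(s, a) = sum_i f_i(s, a) * w_i  (dot product of features and weights)
    features = feature_fn(state, action)
    return sum(weights[f] * value for f, value in features.items())


def approx_update(feature_fn, state, action, next_state, next_legal_actions, reward):
    # correction = (r + discount * max_a' Q(s', a')) - Q(s, a)
    next_value = max(
        (approx_q_value(feature_fn, next_state, a) for a in next_legal_actions),
        default=0.0,
    )
    correction = (reward + discount * next_value) - approx_q_value(feature_fn, state, action)
    # w_i <- w_i + alpha * correction * f_i(s, a), for every active feature
    for f, value in feature_fn(state, action).items():
        weights[f] += alpha * correction * value
```

Because only the features that are active for a given (state, action) pair appear in the Counter, each update touches just those weights, which is what makes generalization across states possible.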
Testing Your Implementation:
First, test with the IdentityExtractor (should behave identically to PacmanQAgent):
python3 pacman.py -p ApproximateQAgent -x 2000 -n 2010 -l smallGrid

Then test with the SimpleExtractor for better generalization:
python3 pacman.py -p ApproximateQAgent -a extractor=SimpleExtractor -x 50 -n 60 -l mediumGrid

For larger layouts:
python3 pacman.py -p ApproximateQAgent -a extractor=SimpleExtractor -x 50 -n 60 -l mediumClassic

Implementation Steps:
- Understand the class hierarchy:
  - Read PacmanQAgent to understand how it extends QLearningAgent.
  - Examine the ApproximateQAgent stub and its relationship to PacmanQAgent.
  - Study the feature extractors in featureExtractors.py.
- Implement ApproximateQAgent:
  - Initialize the weight vector (self.weights) as a util.Counter().
  - Override getQValue(state, action) to compute the dot product of features and weights.
  - Override update(state, action, nextState, reward) to update weights using the gradient descent rule.
  - Implement final(state) to handle end-of-episode bookkeeping (call the parent's final method).
- Test and validate:
  - First run with the IdentityExtractor to verify equivalence with basic Q-learning.
  - Run python3 pacman.py -p PacmanQAgent -n 10 -l smallGrid -a numTraining=10.
  - Then test with the SimpleExtractor on larger grids.
- Answer in analysis.py:
  - Run python pacman.py -p PacmanQAgent -n 10 -l smallGrid -a numTraining=10 using your code and report on what is happening. Is Pacman failing or winning? What is your "Average Score" and your "Win Rate"? Justify your observations.
  - What is your implementation strategy for Question 4 (Approximate Q-Learning)? Explain.
Important Reminders:
- ApproximateQAgent is a subclass of QLearningAgent, so it shares methods like getAction.
- Ensure your QLearningAgent methods call getQValue instead of accessing Q-values directly.
- This abstraction allows the overridden getQValue in ApproximateQAgent to provide feature-based Q-values.
