Reinforcement Learning Exercise Chapter 1

This is my set of exercise answers for the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (2014–2015 draft). I will provide my own answers (as I am new to this, they may be very immature), together with a GPT judge's verdict and answer. Just for your reference!

Chapter 1

Exercise 1.1: Self-Play

Q: Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

A: It would learn a different strategy because it would be trying to beat itself. The strategy might be continuously changing, because the “enemy” grows together with the agent, so it can indeed learn a different way of playing. However, I think this might not lead to a better result, because the opponent is limited to an equally simple strategy. If the system is well designed, like AlphaGo Zero, it will work.


Your answer reflects the interesting nature of self-play in reinforcement learning, and it does indeed provide some insights. However, there are some additional complexities that may be considered.

When an RL agent is pitted against itself, both versions of the agent are continuously learning from each other. This can create a more competitive and challenging environment that might lead the agent to discover more refined or sophisticated strategies, as each agent continually adapts to the strategies of the other.

However, there are potential pitfalls. Depending on the learning algorithm and exploration strategy used, self-play might lead to convergence on suboptimal strategies or cycling between different strategies without clear improvement. These risks are particularly evident if the learning algorithm is overly aggressive in adapting to recent experiences, which might lead to overfitting to the particular opponent’s behavior (in this case, itself) without finding a generally strong strategy.

In some cases, self-play has been highly successful, as demonstrated by programs like AlphaGo Zero, where the agent learned to play Go at a superhuman level entirely through self-play. However, these successes often involve careful algorithmic design, tuning, and regularization to avoid the potential pitfalls mentioned above.

In conclusion, self-play can be a powerful tool for developing strong strategies, but it is not without risks and complexities. A successful application of self-play usually requires thoughtful consideration of the learning algorithm, exploration strategy, and other factors that might influence the learning process.
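The dynamic discussed above can be made concrete. Below is a minimal sketch of the book's value-table player trained by self-play, with one table shared by both sides; the representation (9-character board strings, values as the probability that X wins) and all parameter choices are my own assumptions, not the book's code.

```python
import random

# Minimal self-play sketch of the value-table method, assuming boards
# are 9-character strings (row-major) and V estimates P(X wins).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def value(V, board):
    # Default estimates: X has won = 1, O has won = 0, otherwise 0.5.
    if board not in V:
        w = winner(board)
        V[board] = 1.0 if w == 'X' else 0.0 if w == 'O' else 0.5
    return V[board]

def play_game(V, alpha=0.1, epsilon=0.1):
    board, player = ' ' * 9, 'X'
    while winner(board) is None and ' ' in board:
        nexts = [board[:i] + player + board[i + 1:]
                 for i in range(9) if board[i] == ' ']
        if random.random() < epsilon:
            new = random.choice(nexts)   # exploratory move, no backup
        else:
            # X moves toward high values, O toward low values; both
            # sides read and update the same shared table.
            pick = max if player == 'X' else min
            new = pick(nexts, key=lambda b: value(V, b))
            V[board] = value(V, board) + alpha * (value(V, new) - value(V, board))
        board, player = new, 'O' if player == 'X' else 'X'
    return winner(board)

V = {}
for _ in range(5000):
    play_game(V)
```

Because both sides read and update the same table, the "opponent" improves in lockstep with the agent, which is exactly the moving-target dynamic described above.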

Exercise 1.2: Symmetries

Q: Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the reinforcement learning algorithm described above to take advantage of this? In what ways would this improve it? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?


  1. Amendment: Add this prior knowledge to the agent. Design a set of transformations such as mirroring and rotation, and after every move check whether the current position can be transformed into a previously seen one.
  2. Improvement?: Yes, this method greatly shrinks the strategy space, so it makes the optimization easier.
  3. We should still use it.
  4. Yes, they can have the same value, because they are exactly the same case.


You’ve identified the key concept of using symmetries in tic-tac-toe to reduce the state space and potentially speed up learning. Let’s delve into each part of your answer.

  1. Amendment: Utilizing transformations like mirror and rotation to recognize symmetrically equivalent states is a good approach. By mapping these states to a canonical representation, the learning algorithm can treat them as the same state. This can help the agent to generalize more quickly from its experiences.

  2. Improvement: By exploiting symmetries, the agent can indeed reduce the state space, making learning more efficient. It allows the agent to recognize that experiences in one state can apply to other symmetrically equivalent states, thus needing fewer samples to learn a good policy.

  3. Should We Still Use This if the Opponent Doesn’t?: Even if the opponent does not take advantage of symmetries, using them can still be beneficial for the reasons mentioned above. It simplifies the learning task without losing any relevant information.

  4. Should Symmetrically Equivalent Positions Have the Same Value?: While it is tempting to say that symmetrically equivalent positions should have the same value, this might not always be the case if the opponent behaves differently in these positions. If the opponent does not recognize and act identically in symmetric situations, the value of those positions could indeed differ. In playing against an irrational or inconsistent opponent, recognizing symmetries might not always lead to identical value assessments.

Your answer captures the essential idea of leveraging symmetries in tic-tac-toe to enhance the learning process, but considering the opponent’s behavior adds an extra layer of complexity that could affect the equivalence of symmetric states.
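The canonical-representation idea can be sketched in a few lines. Assuming boards are stored as 9-character strings in row-major order (my own convention), the eight symmetries of the square are four rotations and their mirror images, and the lexicographically smallest variant can serve as the shared table key:

```python
def rotate(board):
    # 90-degree clockwise rotation of the 3x3 grid: new(r, c) = old(2 - c, r).
    return ''.join(board[r * 3 + c] for c in range(3) for r in (2, 1, 0))

def reflect(board):
    # Horizontal mirror: swap the left and right columns.
    return ''.join(board[r * 3 + c] for r in range(3) for c in (2, 1, 0))

def canonical(board):
    # Enumerate all 8 symmetries (4 rotations, each optionally mirrored)
    # and pick the lexicographically smallest as the representative.
    variants, b = [], board
    for _ in range(4):
        variants.append(b)
        variants.append(reflect(b))
        b = rotate(b)
    return min(variants)
```

Routing every value-table lookup and update through `canonical(board)` then makes all eight symmetric variants share a single entry. As point 4 warns, this is only safe if the opponent really does treat those variants identically.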

Exercise 1.3: Greedy Play

Q: Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Would it learn to play better, or worse, than a nongreedy player? What problems might occur?

A: If the player is well trained, then since tic-tac-toe is relatively simple, it may already have evaluated every situation during training, so a greedy strategy can ultimately work. If the player is not well trained, its greedy choices largely depend on luck. And during training, a purely greedy player does no exploration, so it can get stuck in a local optimum.


If the agent is already well-trained and has an accurate value function, then greedy play can be optimal for playing tic-tac-toe. However, the challenge arises when greedy play is used during the training phase. A strictly greedy approach can lead to a lack of exploration of other actions, potentially causing the agent to become stuck in a local optimum and never discovering better strategies. A non-greedy strategy, introducing some level of exploration, can prevent this and help the agent discover a globally optimal policy. Your response correctly highlights the nuances of greedy play, recognizing its suitability for a well-trained agent and potential problems during the learning phase.
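This trade-off is commonly handled with an epsilon-greedy rule: play greedily most of the time, but explore with small probability. A minimal sketch, with illustrative names not taken from the book:

```python
import random

def epsilon_greedy(candidates, values, epsilon=0.1):
    # With probability epsilon, try a move the agent may currently
    # rate poorly; otherwise exploit the best-rated successor state.
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda s: values.get(s, 0.5))
```

Setting `epsilon = 0` recovers exactly the purely greedy player of the exercise, which never retries a move it once rated badly and can therefore lock in a local optimum.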

Exercise 1.4: Learning from Exploration

Q: Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time, then the state values would converge to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

A: If we do not learn from exploratory moves, the result might be deterministic. I think learning from exploratory moves can be beneficial anyway. (But I honestly do not understand the difference.)


In reinforcement learning, exploratory moves are actions that are taken to explore different parts of the state space rather than exploiting current knowledge to select the best action. The difference between learning from exploratory moves or not affects how the agent updates its value estimates.

  1. When Learning from Exploratory Moves: The value estimates converge to the probability of winning under the behavior policy actually being followed, exploration included. The estimates account for the fact that the agent will sometimes deviate from its best-rated move, which makes them noisier but true to how the agent really plays.

  2. When Not Learning from Exploratory Moves: The value estimates converge to the probability of winning assuming the agent plays greedily from each state onward. These estimates are better for identifying the optimal policy, but they ignore the cost of the exploratory moves the agent keeps making.

If the agent continues to make exploratory moves, learning from them might be more beneficial. It can help the agent discover unexpected strategies or nuances of the game that lead to better overall play. Learning from all experiences, including exploration, can provide a more comprehensive view of the environment, potentially resulting in more wins.

However, the ideal approach depends on the specific task, the nature of the exploration, and how well the agent can generalize from its exploratory experiences. The choice between learning from exploratory moves or not can be a nuanced decision, with potential trade-offs between robustness and precision.
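The distinction boils down to a single condition in the backup rule. A sketch with illustrative names, assuming the usual temporal-difference update toward the successor state's value:

```python
def td_update(V, s, s_next, was_exploratory, learn_from_exploration,
              alpha=0.1):
    # Skipping the backup after exploratory moves makes V estimate the
    # win probability under subsequent greedy play; keeping it makes V
    # estimate the win probability under the exploring behavior policy.
    if was_exploratory and not learn_from_exploration:
        return
    v_s, v_next = V.get(s, 0.5), V.get(s_next, 0.5)
    V[s] = v_s + alpha * (v_next - v_s)
```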

Exercise 1.5: Other Improvements

Q: Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

A: Possible improvements: 1. Add some prior knowledge to the player, such as a better opening strategy. 2. Use a better-designed opponent. As an alternative solution, a genetic algorithm could work well.


The field of reinforcement learning provides a plethora of techniques and strategies that could be applied to enhance the tic-tac-toe player. Some creative ways might include:

  1. Hierarchical Learning: Breaking down the game into different phases (e.g., opening, mid-game, end-game) and training specialized sub-agents for each phase. This could enable more tailored strategies for different parts of the game.

  2. Meta-Learning: Developing an agent that can learn how to learn more effectively by experimenting with different learning rates, exploration strategies, or other hyperparameters during training. This can lead to more adaptive and robust learning.

  3. Multi-Agent Training: Training multiple agents with different strategies against each other in a round-robin fashion. By doing this, the agent can gain a broader understanding of various strategies and learn to counter a wide array of opponent behaviors.

  4. Model-Based Planning: Creating a model of the opponent’s behavior and using it for look-ahead planning, like Monte Carlo Tree Search. This would enable the agent to anticipate the opponent’s moves and respond more strategically.

  5. Transfer Learning: If the agent has been trained on similar games, leveraging this knowledge to jump-start the learning in tic-tac-toe could save time and lead to more effective strategies.

  6. Interactive Learning: Allowing human players to interact with the agent, providing feedback or challenging it with unique strategies, can also be a way to fine-tune the agent’s performance.

Combining some or all of these strategies, along with continuous monitoring and iterative refinement, could lead to a highly sophisticated tic-tac-toe player, able to adapt to different opponents and situations.
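As a point of comparison for item 4, tic-tac-toe is small enough that look-ahead can be taken to its limit: exhaustive minimax solves the game exactly, with no learning at all. A sketch (my own code, not from the book), with boards as 9-character strings and X moving first:

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    # Game value under perfect play: +1 if X wins, -1 if O wins, 0 draw.
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    if ' ' not in board:
        return 0
    scores = [minimax(board[:i] + player + board[i + 1:],
                      'O' if player == 'X' else 'X')
              for i in range(9) if board[i] == ' ']
    return max(scores) if player == 'X' else min(scores)
```

`minimax(' ' * 9, 'X')` evaluates to 0: with perfect play on both sides, tic-tac-toe is a draw, which puts a ceiling on what any learned player can achieve here.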