Oct 6, 2018·12 min read
For students of the social sciences the Tragedy of the Commons will be a familiar example about how rational individual actions can have negative effects for society as a whole. The original example of the tragedy of the commons is from 1833 where British economist William Forester Lloyd discussed the issues of grazing cattle of public lands. Since then this idea of individuals over utilizing public lands/goods has been applied to different subjects another classic one that I based the game environment I built off of was the over fishing of New England cod fisheries.
1833 Millerite, Showers of Stars 🌟🌟,
New England Cod fishing
The basic premise of the tragedy of the commons is that individuals acting in a rational way will deplete common resources as they maximize their own utility even though it could be argued that it is in their best interest to preserve these common resources for future use. This is because the future is uncertain and individuals are incentivized to maximize their own personal utility. Without any external regulation/action this means they should utilize resources that are available rather than holding back their consumption for a future that is uncertain.
The solution posed in social sciences to avoid this tragedy of the commons is to create regulations or measures to help curb immediate consumption which will help the resource in question survive into the future. Some examples of this are placing the resource in private hands, have fees/permits attached to their use, or have laws/regulations placed on them for government/society to enforce.
This is all well and good, but I am a data scientist and enjoy building AI.
So how does that fit in here? My thought is that data science and deep learning in general is the process of optimization towards some goal. That goal could be maximizing a win percentage or accuracy, which is fine and very satisfying to achieve. However as AI becomes more integrated into our society there will be increasing social consequences to the objectives we choose to optimize towards. To illustrate how choosing what to optimize for is important this post is going to go through how I built an environment to allow the training of reinforcement learning bots to play out the tragedy of the commons and how adjusting their rewards changes the social implications of the games.
Teaching a Bot to Fish
The first step was to build a game environment to act as a simulation for the tragedy of the commons. I based my example off of a version I found online for economics classes.
NO NEUTRAL BUT WINNING!
The game is initialized with two players and some starting number of fish in the fishery, the game ends when the fishery count falls below a certain threshold representing the crash of the population due to overfishing. The winner is the one who ends with the most fish. Each turn players simultaneously declare how many fish they will take, if the total take does not force the population below the threshold each player gets their take and the remaining fish all have one offspring with the maximum never going above the initial number of fish. On the final turn the remaining fish are divided between players based on how many fish they declared to take for that turn. So if I declared 10 fish and you declared 8 fish with 9 fish remaining I would get 5 and you would get 4.
The game itself is a loop that only breaks when the fishery population crashes (drops below the threshold of 8) or when the game hits 100 turns. I added the 100 turn limit to avoid the loop running infinitely.
In the game itself the starting number of fish was 20 and I had two players one “computer player” which would randomly pick fish values depending on the current level of the fishery and a Keras reinforcement learning bot. This Keras bot is the one that will be the focus of the post, and by testing how tweaking its reward structure affects the game’s outcomes.
The game status that is fed into the neural network as context is the current computer score, bots score, total fishery count, past computer take, past bot take for a 5 node input. The output layer of the network is 20 nodes representing different amount of fish to declare in a given turn, 1–20 fish. In order to get a better idea of the training cycle for the bots I think it is important to throw in a bit of a refresher on reinforcement learning in particular deep q-learning.
See tragedy.py for details of the game. The TOC class contains the game function and some other utility functions for running the game and reporting results.
The network is housed in the netlearner.py file
Quick Deep Q-Learning Review
This section is here to layout how the bots are trained and how exactly it is they learn the specific game behavior that they exhibit.
In high level terms the bots are conditioned based on rewards and punishments they receive for outcomes.
Mechanically doing this for a neural network is obviously not quite the same as doing it for a living thing so it is good to lay it out. The basic process is as follows:
the Agent (the network in this case) is given the current state of the game. This could be the pixels of an atari pong game or whatever representation you choose. Here is is the current number of fish, player and computer scores, player and computer most recent takes. My thoughts are that this is the type of context a human player would have while playing.
Agent chooses an action out if its action space. In the Pong example Andrej Karpathy has it as the probability of going up. In this example it means
choosing one of the 20 output nodes corresponding to declaring to take 1–20 fish. An important note here is that there is the idea of exploration vs exploitation. Essentially saying that sometimes an action should be chosen at random rather than simply doing what the Agent thinks is best. This helps the Agent to explore and find additional rewards that it would not have found otherwise if it has simply exploited the rewards it knows.
Action is imputed into the environment, any rewards are collected, and the environment moves onto its next state, frame, or turn. Mechanically I do this by adding the rewards to the card slot combination that the network outputted. For a positive reward that category’s value in the output array increases and the network will see that given that input again that particular category is a beneficial one.
The Agent is updated based on the rewards it received. After the rewards are used to modify the output array, the network is trained on the initial input state with the modified output array as the target. This helps to reinforce good choices, while taking into account bad choices as well.
Rinse and repeat.
Photo from Tsukiji bluefin tuna auction in Tokyo from a 2014 trip. Seems appropriate for a tragedy of the commons writeup
Victory at the Cost of the Common Good
What I found is that standard configuration of rewards create very “greedy” bots.
By standard configuration I mean that I placed rewards on winning, negative reward on losing, and no reward for anything in between. Usually this is done so that bots can learn what behaviors lead to their wins and allow them to maximize their win percentage. This behavior is useful things in like Atari Pong, Cartpole, or other games where the goal is simply victory. But this maximization of winning does not necessarily mean it has good social consequences.
I even experimented with scaling the win and loss rewards based on the number of turns (I used sqrt of the turn count for some tests) to give incentive for winning in longer games but the results remained the same and the bots played a very greedy play style.
So what do I mean by “greedy play style”?… Basically when I train bots with these rewards in place the bots win something like 99% of the time which is cool! Basically the bots have figured out how to maximize their win percentage against the little computer player that I coded up. However they reach this 99% win rate in a single turn on average.
The strategy that the bots fell into is to declare to take larger numbers of fish on the first turn which causes the population of the fishery to crash immediately.
Then because of the final turn conditions where population is divided up based on the number of fish each player declared if the bot declared more it automatically beats the opponent.
Even though the bot could gain additional reward by playing longer and even with with it set to explore on 10% or 20% of its turns it still falls into this same strategy and does not learn to play for longer games.
This is interesting because this style of bot with a common reward structure and maximization towards winning behaves almost exactly the same as agents in the classic tragedy of the commons examples. The bot can earn greater rewards by keeping resource around for a longer period of time, but instead it decides to consume it now to maximize its short term gains since it may or may not be able to win in the future. So while this bot has learned how to win and win well, it also overconsumes and destroys the resource in question.
Now lets mess around with 1 line of code…
2017 photo from Valencia, Spain. I liked the futuristic look of this part of the city.
A Bot Improving the Common Utility
So I laid out the reward structure for the bot in the previous section, now what if I adjust one part of that reward structure. Leaving the rewards for victory and defeat in place I ran multiple tests where the reward for a continued turn and varied across positive values between 1.5, 5, and 10. What I found that it created “altruistic” bots which would play for longer games, but the bot’s winrate suffered as a result.
From a technical point of view I was testing the idea that placing a reward of 1.5 would mean the bot would find it beneficial to itself to learn to play games of a length greater than 2 (since the reward is sqrt of the turn count for winning). I benchmarked how the bots played over 1000 game windows and found that depending on the iteration the bots have winrates between 85–91% and average games lasting between 2–4 turns.
With a reward of 5 the bot starts out with a fairly low winrate of 36% and a turn count of 7 but ended its training with a winrate of 59% and a game length of 11 and some of the test games from that final section of training are games with length 100 with the bot winning with a score of 406 to 401 fish.
At a reward of 10 the bots winrates do not get above 30–40% but their games go to 100 rounds. So they learn behavior that is useful for having the game continue. but not necessarily win. Due to the extreme focus that the bots will have on simply letting the game continue, a lot of the time the strategy that they follow is to simply take 4–5 fish per turn. So if I was trying to optimize these bots to win, this is a hilariously bad way to train them.
However we are not just trying to have the bots win, so lets step back and think about the behavior of the bots!
Shibuya X photo from a trip to Tokyo, I had my family sit in a second story of a Starbucks for an hour so I could take short timelapse photos of the crossing… Things like this are why it is hard to travel with photographers I think.
Certain Victory or Optimizing for the Common Good
In an unregulated version of the tragedy of the commons, players behaving as rational agents should try to maximize their short term gains rather than hoping for an uncertain future. The “greedy” bots mirror this behavior, they go for a near 100% win rate strategy where they win on the first turn.
However if we think about the consumption of resources. Burning all of them immediately is rarely the goal, rather the continued existence of the resource is beneficial to everyone. So if we could structure incentives to allow for sustainable harvesting it would lead to a vastly larger number of fish being collected than the 1 turn example.
From a mechanical/game theory point of view, the sustainable way to play this implementation of the game is to have the two players each take 5 fish per turn for a total take of 10. This drops the fishery count from 20 to 10. Since the fishery count is not below 8 the population doubles and goes from 10 to 20. This means the fishery can be maintained indefinitely.
This balance is hard to maintain in actual practice because there is no communication between players and requires the players involved to forego short term gains (certain victory) over letting all players gather more fish (increasing collective utility). In the real world this may come about with things like regulations/monitoring/cultural norms/privatization.
So while the “greedy” bots optimize for the fast win, the “altruistic” bots with increased rewards for continued play begin to optimize towards behavior other than simply winning. In fact the bots start to behave very similarly to this more sustainable best case gameplay I laid out above.
With the reward at 10 the bots play a safer version of this conservative gameplay and only play take 4–5 fish per turn and basically throw any chance they have of winning. As far as balancing winrate and average number of turns the reward 5 bots maintain a more competitive stance taking between 4–6 fish per turn. This additional competitiveness is reflected in their higher winrate while still trying to not deplete the fishery.
All of the more altruistic bots were optimized to try and win but to do so in longer games. This multi part optimization means that there is some trade off and in this case it means that they win a lower percentage of the time, but if taken in greater context you may be able to argue that that is okay.
In playing longer games the bots sometimes win, but also sometimes fall hilariously far behind for example some of the 100 turn games end with scores of 350–400 to 600+. While this is a fairly staggering loss in terms of victory or defeat, if you consider that the continued existence of fish generates additional utility for the whole community then this is still a very favorable outcome even if the bot does not win as much for itself.
This staggering loss can be a good outcome by comparing how many total fish a simulation generates. A victory by the “greedy” bot has a total of 20 fish get harvested by both players. The bot wins and the resource is gone. The “altruistic” bot that plays to 100 games and may lose, but somewhere between 800–1000 fish are gathered by both players.
As an economist/social scientist this post is a straightforward showing of how the modification of preferences can be modeled using deep learning/reinforcement learning to illustrate ways to get to higher levels of overall social utility.
As a data scientist this post is my roundabout way of commenting that as AI gets more integrated with society there will be more consequences to how we choose to optimize that AI and the tradeoffs we make in building them.
I could likely improve the RL methods that I applied in this post and try to have the bots learn stronger strategies to increase their winrates no matter the number of rounds. As of now the bots with higher rewards for continued turns sacrifice winning and settle into very weak overall strategies. With more experimentation the bots could likely learn how to sustain the numbers of fish, but also take action to close gaps between them and their opponent or increase a lead over their opponent. That can be a project for another time.
Thanks for reading!
Github repo here