Explaining Monte Carlo Tree Search without trees


Rollout algorithms

Suppose you have some multiplayer game, and you want to build an algorithmic way of finding the best move without needing to specify any sort of heuristics. Here we’ll assume the game is ‘‘fully observable,’’ i.e., a game where all you need to know to win is what’s visible on the board (as in chess, checkers, mancala, etc.).

A rollout algorithm (RA) is an easy way of achieving this. I’ll explain a super simple version first. Let’s say you’re at some state (\(s_t\)) in the game. (In a game like checkers or chess, for example, the state specifies every piece’s location on the board.) What you do is simulate a game where you and your opponent take random moves every turn, starting from state \(s_t\) and continuing until the game is over. For each simulation \(i\), keep track of the action you took first, \(a(i)\), and whether the game ended in a loss, draw, or win: \(r(i) \in \{-1, 0, 1\}\). Now do this N times, where N is something large. Then we can tally up the average game outcome among the episodes that began with action \(a\):

\[Q(s_t, a) = \frac{1}{N_a} \sum_{i=1}^N r(i) I(a(i) == a)\]

where \(I(a(i) == a)\) returns 1 if we began simulation i with action \(a\), and zero otherwise, and \(N_a\) is the number of simulations that began with action \(a\). Once we’ve calculated \(Q(s_t, a)\) for each possible action \(a\), our best guess about which action to take is the \(a\) where \(Q(s_t, a)\) is largest.

So basically, what a rollout algorithm does is tell you the best move to make at time \(t\), in terms of the average game outcome after taking that action and then acting randomly. Even though this sounds like a bad way to estimate value, it can actually be quite effective in practice.
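To make this concrete, here is a minimal sketch of the rollout algorithm in Python. The game interface is hypothetical: I assume you can supply functions `legal_actions(state)`, `step(state, action)`, `is_terminal(state)`, and `outcome(state)` (the last returning \(-1\), \(0\), or \(+1\) from our perspective) for your particular game.

```python
import random
from collections import defaultdict

def rollout_value_estimates(state, n_simulations, legal_actions, step, is_terminal, outcome):
    """Estimate Q(s_t, a) by simulating games with random play from `state`.

    `legal_actions`, `step`, `is_terminal`, and `outcome` are assumed to be
    supplied by the game: list the legal moves, apply a move, detect game over,
    and return the outcome r in {-1, 0, +1} from our perspective.
    """
    totals = defaultdict(float)  # sum of outcomes r(i) over episodes starting with each action
    counts = defaultdict(int)    # N_a: number of episodes starting with each action

    for _ in range(n_simulations):
        s = state
        first_action = random.choice(legal_actions(s))
        s = step(s, first_action)
        # Both players act randomly until the game is over.
        while not is_terminal(s):
            s = step(s, random.choice(legal_actions(s)))
        r = outcome(s)
        totals[first_action] += r
        counts[first_action] += 1

    # Q(s_t, a) = R_a / N_a for every first action we happened to try.
    return {a: totals[a] / counts[a] for a in counts}
```

The best first move is then the action whose estimate is largest, e.g. `max(Q, key=Q.get)` if `Q` is the returned dictionary.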

Improving on the rollout algorithm’s weaknesses

There are three weaknesses of the rollout algorithm that I want to point out. First, as we discussed, the rollout algorithm simulates games by assuming that we and our opponents act randomly in the future. Second, the rollout algorithm throws out its simulations every turn, meaning it’s essentially starting from scratch every turn. And finally, because the rollout algorithm simulates episodes by acting randomly instead of methodically, it’s possible we might not discover really good moves just by chance. Below I’ll go through how to improve on these weaknesses, in reverse order.

3. Simulating episodes more methodically

Okay, so I mentioned the rollout algorithm considers actions randomly rather than methodically. To understand what I mean, let’s say we have a game like Chinese checkers where we have many different moves we could take on each turn. Then it’s possible that, just by chance, our N simulations won’t happen to take one of those moves. Because we’re acting randomly (i.e., each simulation chooses its first and later moves at random) and not methodically (e.g., making sure each simulation starts with a move we haven’t tried yet), we may miss valuable moves. Most rollout algorithms address this in a shallow way, by using N simulations per action at time \(t\). In other words, we estimate:

\[Q(s_t, a_j) = \frac{1}{N} \sum_{i=1}^N r_j(i)\]

where \(r_j(i)\) is the outcome of simulated episode i (\(i = 1, \ldots, N\)) beginning with action \(a_j\) from state \(s_t\), and random actions after that point.

But in general, we can imagine trying to be even more methodical about our simulations—e.g., making sure we consider every action throughout the episode and not just at the beginning. I’ll come back to this idea in more detail once we address the other issues.
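As a small illustration of that shallow fix, here is a sketch (using the same hypothetical game interface as before) that spends exactly N random simulations on each legal first move, so no first move is skipped just by chance:

```python
import random

def per_action_rollout_estimates(state, n_per_action, legal_actions, step, is_terminal, outcome):
    """Estimate Q(s_t, a_j) with exactly `n_per_action` random simulations per first move."""
    Q = {}
    for a in legal_actions(state):
        total = 0.0
        for _ in range(n_per_action):
            s = step(state, a)             # take a_j first...
            while not is_terminal(s):      # ...then act randomly until the game ends
                s = step(s, random.choice(legal_actions(s)))
            total += outcome(s)
        Q[a] = total / n_per_action
    return Q
```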

2. Reusing simulations throughout the game

The next weakness with the rollout algorithm is that we are throwing out simulations each turn, even though some of those simulations will still be useful later in the game! For example, in the above section, we created N simulations that began with \((s_t, a_t)\). At our next turn—say, \(s_{t+2}\) if our opponent’s turn was \(s_{t+1}\)—it might be the case that some of those N simulations actually included \(s_{t+2}\), which means we could use the outcomes of those previously simulated episodes on this new turn. Being able to reuse these simulated episodes would be super useful, because typically our N is small enough that having extra simulations would help us get better estimates of \(Q(s_t, a)\).

Okay, so let’s try to make a simple fix: We will try to keep track of \(Q(s, a)\) for every state-action pair \((s,a)\) we encounter somewhere in our simulated episodes. To explain how to do this, suppose we have a simulated episode \((s_t, a_t, s_{t+1}, a_{t+1}, \ldots, s_{t+j}, a_{t+j}, \ldots, s_T )\) that ends with reward \(r\). Then for each state-action pair \((s_{t+j}, a_{t+j})\) in this episode that we have never seen before, we will initialize a new memory:

\[N(s_{t+j}, a_{t+j}) := 0\] \[R(s_{t+j}, a_{t+j}) := 0\]

where \(N\) will keep track of the number of episodes in which we have seen each state-action pair, and \(R\) will keep track of the total outcome (wins minus losses) accumulated over those episodes. Now, we will go through each state-action pair \((s_{t+j}, a_{t+j})\) we saw in our episode, and simply update these memories:

\[N(s_{t+j}, a_{t+j}) \mathrel{+}= 1\] \[R(s_{t+j}, a_{t+j}) \mathrel{+}= r\]

This is helpful because we can now estimate the value of each state-action pair \((s,a)\) in our memory as:

\[Q(s, a) = R(s, a) / N(s, a)\]

Unfortunately, this simple approach will not work for games that have large numbers of possible states and actions, because we will soon run out of memory!

So instead, we need to be more conservative about which state-action pairs we save to memory from turn-to-turn. Instead of adding every new state-action pair we encounter in a simulated episode to our memory, let’s only add the first state-action pair we encounter that we are not already tracking. For example, for the simulated episode \((s_t, a_t, s_{t+1}, a_{t+1}, \ldots, s_{t+j}, a_{t+j}, \ldots, s_T )\) with outcome \(r\), maybe we are already tracking \((s_t, a_t)\), but we do not have an entry for \((s_{t+1}, a_{t+1})\). Then we will add that new state-action pair to our memory, and then update only the memories of \((s_t, a_t)\) and \((s_{t+1}, a_{t+1})\) using \(r\).

If we stopped here, we would have a rollout algorithm that can reuse value estimates across trials. Next, I’ll explain how we can use those remembered value estimates to simulate episodes with something better than randomly chosen actions.
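Here is one way this memory could look in code. It is only a sketch, and it assumes states and actions are hashable so that \((s, a)\) pairs can serve as dictionary keys; the names `N`, `R`, and `tracked` are my own:

```python
from collections import defaultdict

N = defaultdict(int)    # N[(s, a)]: number of remembered episodes that visited (s, a)
R = defaultdict(float)  # R[(s, a)]: total outcome accumulated over those episodes
tracked = set()         # the (s, a) pairs we have decided to keep in memory

def update_memory(episode, r):
    """`episode` is the ordered list of (state, action) pairs from one simulation,
    and `r` is its outcome. Track at most one new pair, then update every
    tracked pair that appeared in the episode."""
    for sa in episode:
        if sa not in tracked:
            tracked.add(sa)   # remember only the first pair we were not yet tracking
            break
    for sa in episode:
        if sa in tracked:
            N[sa] += 1
            R[sa] += r

def Q(s, a):
    """Estimated value of a tracked state-action pair."""
    return R[(s, a)] / N[(s, a)]
```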

1. Simulating moves that are better than random

The approach described above gives us a way of reusing our estimates of \(Q(s,a)\) across turns. This is great, because it opens up the possibility that, when we are at state \(s_t\) in the game, we might already have an estimate of \(Q(s_t, a)\) for every possible action. When this happens, we don’t need to take a random action in the first step of our simulation—instead, we can choose the action that has the current highest value of \(Q(s_t,a)\). We can use the same logic at every step (\(j\)) of our simulated episode: If we already have an estimate of \(Q(s_{t+j}, a)\) for all \(a\), choose:

\[a_{t+j} = \mbox{argmax}_a~ Q(s_{t+j}, a)\]

To explain this process in more detail, let’s suppose that on the current turn we are in state \(s_t\), and we want to simulate an episode. Let’s say our simulation is at step \(j\). Then we check and see if we have an entry \((s_{t+j},a)\) in our memory, for each possible action \(a\). If all available actions have entries \((s_{t+j}, a)\) in our memory, we can choose the action that has the largest \(Q(s_{t+j}, a)\). On the other hand, suppose \((s_{t+j}, a)\) is missing from our memory for some actions. Then we will choose one of those actions, \(a\), and initialize \((s_{t+j}, a)\) in our memory. From this point forward in our simulated episode, we know we won’t encounter another step where we have estimates \((s_{t+j}, a)\) for all \(a\) in our memory, so we will finish off the episode by choosing actions randomly.

Basically, what we have described above is how to simulate an episode in two phases: For as long as we can, we simulate selecting actions according to \(Q(s_{t+j}, a)\). For reasons I’ll explain below, this is called our ‘‘tree policy.’’ But once we reach a state where we don’t have an estimate of \(Q(s_{t+j}, a)\) for each \(a\), we will add \((s_{t+j}, a)\) to our memory, and then act randomly until the episode is over. This ‘‘act randomly’’ behavior is called our ‘‘rollout policy.’’
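Putting the tree policy and rollout policy together, one simulated episode might look like the sketch below, which builds on the hypothetical `tracked`/`update_memory`/`Q` pieces and game interface from earlier:

```python
import random

def simulate_episode(state, legal_actions, step, is_terminal, outcome):
    """Simulate one episode from `state`: follow the tree policy while every action
    at the current state is tracked, expand one new (s, a) pair, then finish
    with the random rollout policy."""
    episode = []
    s = state

    # Phase 1: tree policy (greedy with respect to Q) until we hit an untracked action.
    while not is_terminal(s):
        actions = legal_actions(s)
        untried = [a for a in actions if (s, a) not in tracked]
        if untried:
            a = random.choice(untried)   # this pair will become the episode's one new memory
            episode.append((s, a))
            s = step(s, a)
            break
        a = max(actions, key=lambda act: Q(s, act))
        episode.append((s, a))
        s = step(s, a)

    # Phase 2: rollout policy (random actions) until the game is over.
    while not is_terminal(s):
        a = random.choice(legal_actions(s))
        episode.append((s, a))
        s = step(s, a)

    r = outcome(s)
    update_memory(episode, r)   # adds the one new pair and updates the tracked ones
    return r
```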

0. Putting everything together

We can actually combine the above approach with our previous idea of searching more methodically. Recall that the rollout algorithm simulates episodes by acting totally randomly, and we might be concerned that—just by chance—we haven’t simulated many episodes that began with some action \(a\). A very similar problem can happen if we use the procedure above, because our tree policy selects the ‘‘best’’ action according to \(Q(s_{t+j}, a)\) whenever every action has an entry \(Q(s_{t+j}, a)\) in memory. But remember that \(Q(s_{t+j}, a) = R(s_{t+j}, a)/N(s_{t+j}, a)\), where \(N(s_{t+j}, a)\) is the number of times we’ve simulated taking action \(a\) at state \(s_{t+j}\). If \(N(s_{t+j}, a) = 1\), our estimate of \(Q(s_{t+j}, a)\) is probably not very reliable (as it’s based on a single simulation). So instead of having our tree policy select actions where \(Q(s_{t+j}, a)\) is large, we might want to also consider exploring any actions where \(N(s_{t+j}, a)\) is small! In other words, we might choose our action according to:

\[a_{t+j} = \mbox{argmax}_a~ \alpha Q(s_{t+j}, a) + \beta / \sqrt{N(s_{t+j}, a)}\]

where α, β ≥ 0 are weights that specify how much we ‘‘exploit’’ (i.e., prioritize choosing actions that we think have high value) versus ‘‘explore’’ (i.e., prioritize choosing actions that have not been chosen very often). When \(\alpha = 1\) and \(\beta \propto \sqrt{\ln N(s_{t+j-1}, a_{t+j-1})}\), so that the exploration bonus looks like \(\sqrt{\ln N(s_{t+j-1}, a_{t+j-1}) / N(s_{t+j}, a)}\), this is the Upper Confidence Bound rule used in the UCT flavor of MCTS. When \(\alpha = 0\) and \(\beta > 0\), the value estimates are ignored entirely and the tree policy simply picks whichever action has been tried least often, giving us a rollout algorithm that reuses simulations across turns and explores methodically rather than greedily.
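In code, this only changes the tree policy’s scoring rule. A sketch, with \(\alpha\) and \(\beta\) left as tunable constants (the values below are placeholders, not recommendations):

```python
import math

ALPHA = 1.0   # exploitation weight (placeholder value)
BETA = 1.0    # exploration weight (placeholder value)

def action_score(s, a):
    """Tree-policy score: value estimate plus an exploration bonus that
    shrinks as (s, a) is visited more often."""
    return ALPHA * Q(s, a) + BETA / math.sqrt(N[(s, a)])

# In simulate_episode, the greedy line becomes:
#   a = max(actions, key=lambda act: action_score(s, act))
```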

The procedure I have just described, where we improved on some weaknesses of the rollout algorithm, is what’s called Monte Carlo Tree Search, or MCTS. The ‘‘tree’’ is basically a way of structuring our memories of the different \((N(s, a), R(s, a))\) entries. For example, if our game is deterministic, then every time we take action \(a\) in state \(s\), we will end up in the same state \(s'\). We can then save our memories in a tree, where our memory for \((N(s', a'), R(s', a'))\) is a ‘‘child’’ of \((N(s, a), R(s, a))\) if we can get from \(s\) to \(s'\) via action \(a\). This tree structure is not essential to MCTS—it simply makes finding and updating our memories more efficient.

To summarize MCTS: On each turn, we simulate N episodes, where each episode adds exactly one new memory \((s,a)\). Each episode is simulated in four phases:

  1. Node Selection: If we have a memory of \((s_{t+j}, a)\) for all \(a\), choose the next action using our tree policy
  2. Node Expansion: Once we encounter a state where \((s_{t+j}, a)\) is not in our memory for some \(a\), initialize a memory of \((s_{t+j}, a)\) for one such action
  3. Rollout: After initializing a memory, use our rollout policy to finish simulating the episode, resulting in outcome \(r\)
  4. Backpropagation: Update \(N(s, a), R(s, a)\) for all state-action pairs encountered in our simulated episode that are also in our memory
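
At the top level, then, choosing a move on our turn is just a loop over simulated episodes followed by a greedy pick at the current state. A final sketch, again in terms of the hypothetical pieces defined above:

```python
def choose_move(state, n_episodes, legal_actions, step, is_terminal, outcome):
    """Run MCTS from the current state and return the action with the
    highest estimated value Q(state, a)."""
    for _ in range(n_episodes):
        simulate_episode(state, legal_actions, step, is_terminal, outcome)
    # Only consider actions we actually have estimates for.
    tried = [a for a in legal_actions(state) if (state, a) in tracked]
    return max(tried, key=lambda a: Q(state, a))
```

Because `tracked`, `N`, and `R` persist across calls, later turns automatically reuse the simulations from earlier turns, which is exactly the reuse we wanted in step 2.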