## Thursday, March 18, 2010

Summary: I describe how the TrueSkill algorithm works using concepts you're already familiar with. TrueSkill is used on Xbox Live to rank and match players and it serves as a great way to understand how statistical machine learning is actually applied today. I’ve also created an open source project where I implemented TrueSkill three different times in increasing complexity and capability. In addition, I've created a detailed supplemental math paper that works out equations that I gloss over here. Feel free to jump to sections that look interesting and ignore ones that seem boring. Don't worry if this post seems a bit long, there are lots of pictures.

#### Introduction

It seemed easy enough: I wanted to create a database to track the skill levels of my coworkers in chess and foosball. I already knew that I wasn’t very good at foosball and would bring down better players. I was curious if an algorithm could do a better job at creating well-balanced matches. I also wanted to see if I was improving at chess. I knew I needed to have an easy way to collect results from everyone and then use an algorithm that would keep getting better with more data. I was looking for a way to compress all that data and distill it down to some simple knowledge of how skilled people are. Based on some previous things that I had heard about, this seemed like a good fit for “machine learning.”

But, there’s a problem.

Machine learning is a hot area in Computer Science— but it’s intimidating. Like most subjects, there’s a lot to learn to be an expert in the field. I didn’t need to go very deep; I just needed to understand enough to solve my problem. I found a link to the paper describing the TrueSkill algorithm and I read it several times, but it didn’t make sense. It was only 8 pages long, but it seemed beyond my capability to understand. I felt dumb. Even so, I was too stubborn to give up. Jamie Zawinski said it well:

“Not knowing something doesn’t mean you’re dumb— it just means you don’t know it.”

I learned that the problem isn’t the difficulty of the ideas themselves, but rather that the ideas make too big of a jump from the math that we typically learn in school. This is sad because underneath the apparent complexity lies some beautiful concepts. In hindsight, the algorithm seems relatively simple, but it took me several months to arrive at that conclusion. My hope is that I can short-circuit the haphazard and slow process I went through and take you directly to the beauty of understanding what’s inside the gem that is the TrueSkill algorithm.

#### Skill ≈ Probability of Winning

Skill is tricky to measure. Being good at something takes deliberate practice and sometimes a bit of luck. How do you measure that in a person? You could just ask someone if they’re skilled, but this would only give a rough approximation since people tend to be overconfident in their ability. Perhaps a better question is “what would the units of skill be?” For something like the 100 meter dash, you could just average the number of seconds of several recent sprints. However, for a game like chess, it’s harder because all that’s really important is if you win, lose, or draw.

It might make sense to just tally the total number of wins and losses, but this wouldn’t be fair to people that played a lot (or a little). Slightly better is to record the percent of games that you win. However, this wouldn’t be fair to people that beat up on far worse players or players who got decimated but maybe learned a thing or two. The goal of most games is to win, but if you win too much, then you’re probably not challenging yourself. Ideally, if all players won about half of their games, we’d say things are balanced. In this ideal scenario, everyone would have a near 50% win ratio, making it impossible to compare using that metric.

Finding universal units of skill is too hard, so we’ll just give up and not use any units. The only thing we really care about is roughly who’s better than whom and by how much. One way of doing this is coming up with a scale where each person has a unit-less number expressing their rating that you could use for comparison. If a player has a skill rating much higher than someone else, we’d expect them to win if they played each other.

The key idea is that a single skill number is meaningless. What’s important is how that number compares with others. This is an important point worth repeating: skill only makes sense if it’s relative to something else. We’d like to come up with a system that gives us numbers that are useful for comparing a person’s skill. In particular, we’d like to have a skill rating system that we could use to predict the probability of winning, losing, or drawing in matches based on a numerical rating.

We’ll spend the rest of our time coming up with a system to calculate and update these skill numbers with the assumption that they can be used to determine the probability of an outcome.

#### What Exactly is Probability Anyway?

You can learn about probability if you’re willing to flip a coin— a lot. You flip a few times:

Each flip has a seemingly random outcome. However, “random” usually means that you haven’t looked long enough to see a pattern emerge. If we take the total number of heads and divide it by the total number of flips, we see a very definite pattern emerge:

But you knew that it was going to be a 50-50 chance in the long run. When saying something is random, we often mean it’s bounded within some range.

It turns out that a better metaphor is to think of a bullseye that archers shoot at. Each arrow will land somewhere near that center. It would be extraordinary to see an arrow hit the bullseye exactly. Most of the arrows will seem to be randomly scattered around it. Although “random,” it’s far more likely that arrows will be near the target than, for example, way out in the woods (well, except if I was the archer).

This isn’t a new metaphor; the Greek word στόχος (stochos) refers to a stick set up to aim at. It’s where statisticians get the word stochastic: a fancy, but slightly more correct word than random. The distribution of arrows brings up another key point:

All things are possible, but not all things are probable.

Probability has changed how ordinary people think, a feat that rarely happens in mathematics. The very idea that you could understand anything about future outcomes is such a big leap in thought that it baffled Blaise Pascal, one of the best mathematicians in history.

In the summer of 1654, Pascal exchanged a series of letters with Pierre de Fermat, another brilliant mathematician, concerning an “unfinished game.” Pascal wanted to know how to divide money among gamblers if they have to leave before the game is finished. Splitting the money fairly required some notion of the probability of outcomes if the game would have been played until the end. This problem gave birth to the field of probability and laid the foundation for lots of fun things like life insurance, casino games, and scary financial derivatives.

But probability is more general than predicting the future— it’s a measure of your ignorance of something. It doesn’t matter if the event is set to happen in the future or if it happened months ago. All that matters is that you lack knowledge in something. Just because we lack knowledge doesn’t mean we can’t do anything useful, but we’ll have to do a lot more coin flips to see it.

#### Aggregating Observations

The real magic happens when we aggregate a lot of observations. What would happen if you flipped a coin 1000 times and counted the number of heads? Lots of things are possible, but in my case I got 505 heads. That’s about half, so it’s not surprising. I can graph this by creating a bar chart and put all the possible outcomes (getting 0 to 1000 heads) on the bottom and the total number of times that I got that particular count of heads on the vertical axis. For 1 outcome of 505 total heads it would look like this:

Not too exciting. But what if we did it again? This time I got 518 heads. I can add that to the chart:

Doing it 8 more times gave me 489, 515, 468, 508, 492, 475, 511, and once again, I got 505. The chart now looks like this:

And after a billion times, a total of one trillion flips, I got this:

In all the flips, I never got less than 407 total heads and I never got more than 600. Just for fun, we can zoom in on this region:

As we do more sets of flips, the jagged edges smooth out to give us the famous “bell curve” that you’ve probably seen before. Math guys love to refer to it as a “Gaussian” curve because it was used by the German mathematician Carl Gauss in 1809 to investigate errors in astronomical data. He came up with an exact formula of what to expect if we flipped a coin an infinite number of times (so that we don’t have to). This is such a famous result that you can see the curve and its equation if you look closely at the middle of an old 10 Deutsche Mark banknote bearing Gauss’s face:

Don’t miss the forest from all the flippin’ trees. The curve is showing you the density of all possible outcomes. By density, I mean how tall the curve gets at a certain point. For example, in counting the total number of heads out of 1000 flips, I expected that 500 total heads would be the most popular outcome and indeed it was. I saw 25,224,637 out of a billion sets that had exactly 500 heads. This works out to about 2.52% of all outcomes. In contrast, if we look at the bucket for 450 total heads, I only saw this happen 168,941 times, or roughly 0.016% of the time. This confirms your observation that the curve is denser, that is, taller at the mean of 500 than further away at 450.

This confirms the key point: all things are possible, but outcomes are not all equally probable. There are longshots. Professional athletes panic or ‘choke’. The world’s best chess players have bad days. Additionally, tales about underdogs make us smile— the longer the odds the better. Unexpected outcomes happen, but there’s still a lot of predictability out there.

It’s not just coin flips. The bell curve shows up in lots of places like casino games, to the thickness of tree bark, to the measurements of a person’s IQ. Lots of people have looked at the world and have come up with Gaussian models. It’s easy to think of the world as one big, bell shaped playground.

But the real world isn’t always Gaussian. History books are full of “Black Swan” events. Stock market crashes and the invention of the computer are statistical outliers that Gaussian models tend not to predict well, but these events shock the world and forever change it. This type of reality isn’t covered by the bell curve, what Black Swan author Nassim Teleb calls the “Great Intellectual Fraud.” These events would have such low probability that no one would predict them actually happening. There’s a different view of randomness that is a fascinating playground of Benoît Mandelbrot and his fractals that better explain some of these events, but we will ignore all of this to keep things simple. We’ll acknowledge that the Gaussian view of the world isn’t always right, no more than a map of the world is the actual terrain.

The Gaussian worldview assumes everything will typically be some average value and then treats everything else as increasingly less likely “errors” as you exponentially drift away from the center (Gauss used the curve to measure errors in astronomical data after all). However, it’s not fair to treat real observations from the world as “errors” any more than it is to say that a person is an “error” from the “average human” that is half male and half female. Some of these same problems can come up treating a person as having skill that is Gaussian. Disclaimers aside, we’ll go along with George Box’s view that “all models are wrong, but some models are useful.”

#### Gaussian Basics

Gaussian curves are completely described by two values:

1. The mean (average) value which is often represented by the Greek letter μ (mu)
2. The standard deviation, represented by the Greek letter σ (sigma). This indicates how far apart the data is spread out.

In counting the total number heads in 1000 flips, the mean was 500 and the standard deviation was about 16. In general, 68% of the outcomes will be within ± 1 standard deviation (e.g. 484-516 in the experiment), 95% within 2 standard deviations (e.g. 468-532) and 99.7% within 3 standard deviations (452-548):

An important takeaway is that the bell curve allows for all possibilities, but each possibility is most definitely not equally likely. The bell curve gives us a model to calculate how likely something should be given an average value and a spread. Notice how outcomes sharply become less probable as we drift further away from the mean value.

While we’re looking at the Gaussian curve, it’s important to look at -3σ away from the mean on the left side. As you can see, most of the area under the curve is to the right of this point. I mention this because the TrueSkill algorithm uses the -3σ mark as a (very) conservative estimate for your skill. You’re probably better than this conservative estimate, but you’re most likely not worse than this value. Therefore, it’s a stable number for comparing yourself to others and is useful for use in sorting a leaderboard.

#### 3D Bell Curves: Multivariate Gaussians

A non-intuitive observation is that Gaussian distributions can occur in more than the two dimensions that we’ve seen so far. You can sort of think of a Gaussian in three dimensions as a mountain. Here’s an example:

In this plot, taller regions represent higher probabilities. As you can see, not all things are equally probable. The most probable value is the mean value that is right in the middle and then things sharply decline away from it.

In maps of real mountains, you often see a 2D contour plot where each line represents a different elevation (e.g. every 100 feet):

The closer the lines on the map, the sharper the inclines. You can do something similar for 2D representations of 3D Gaussians. In textbooks, you often just see 2D representation that looks like this:

This is called an “isoprobability contour” plot. It’s just a fancy way of saying “things that have the same probability will be the same color.” Note that it’s still in three dimensions. In this case, the third dimension is color intensity instead of the height you saw on a surface plot earlier. I like to think of contour plots as treasure maps for playing the “you’re getting warmer...” game. In this case, black means “you’re cold,” red means “you’re getting warmer...,” and yellow means “you’re on fire!” which corresponds to the highest probability.

See? Now you understand Gaussians and know that “multivariate Gaussians” aren’t as scary as they sound.

There’s still more to learn, but we’ll pick up what we need along the way. We already have enough tools to do something useful. To warm up, let’s talk about chess because ratings are well-defined there.

In chess, a bright beginner is expected to have a rating around 1000. Keep in mind that ratings have no units; it’s just a number that is only meaningful when compared to someone else’s number. By tradition, a difference of 200 indicates the better ranked player is expected to win 75% of the time. Again, nothing is special about the number 200, it was just chosen to be the difference needed to get a 75% win ratio and effectively defines a “class” of player.

I’ve slowly been practicing and have a rating around 1200. This means that if I play a bright beginner with a rating of 1000, I’m expected to win three out of four games.

We can start to visualize a match between me and bright beginner by drawing two bell curves that have a mean of 1000 and 1200 respectively with both having a standard deviation of 200:

The above graph shows what the ratings represent: they’re an indicator of how we’re expected to perform if we play a game. The most likely performance is exactly what the rating is (the mean value). One non-obvious point is that you can subtract two bell curves and get another bell curve. The new center is the difference of the means and the resulting curve is a bit wider than the previous curves. By taking my skill curve (red) and subtracting the beginner’s curve (blue), you’ll get this resulting curve (purple):

Note that it’s centered at 1200 - 1000 = 200. Although interesting to look on its own, it gives some useful information. This curve is representing all possible game outcomes between me and the beginner. The middle shows that I’m expected to be 200 points better. The far left side shows that there is a tiny chance that the beginner has a game where he plays as if he’s 700 points better than I am. The far right shows that there is a tiny chance that I’ll play as if I’m 1100 points better. The curve actually goes on forever in both ways, but the expected probability for those outcomes is so small that it’s effectively zero.

As a player, you really only care about one very specific point on this curve: zero. Since I have a higher rating, I’m interested in all possible outcomes where the difference is positive. These are the outcomes where I’m expected to outperform the beginner. On the other hand, the beginner is keeping his eye on everything to the left of zero. These are the outcomes where the performance difference is negative, implying that he outperforms me.

We can plug a few numbers into a calculator and see that there is about a 24% probability that the performance difference will be negative, implying the beginner wins, and a 76% chance that the difference will be positive, meaning that I win. This is roughly the 75% that we were expecting for a 200 point difference.

This has been a bit too concrete for my particular match with a beginner. We can generalize it by creating another curve where the horizontal axis represents the difference in player ratings and the vertical axis represents the total probability of winning given that rating difference:

As expected, having two players with equal ratings, and thus a rating difference of 0, implies the odds of winning are 50%. Likewise, if you look at the -200 mark, you see the curve is at the 24% that we calculated earlier. Similarly, +200 is at the 76% mark. This also shows that outcomes on the far left side are quite unlikely. For example, the odds of me winning a game against Magnus Carlsen, who is at the top of the chess leaderboard with a rating of 2813, would be at the -1613 mark (1200 - 2813) on this chart and have a probability near one in a billion. I won’t hold my breath. (Actually, most chess groups use a slightly different curve, but the ideas are the same. See the accompanying math paper for details.)

All of these curves were probabilities of what might happen, not what actually happened. In actuality, let’s say I lost the game by some silly blunder (oops!). The question that the beginner wants to know is how much his rating will go up. It also makes sense that my rating will go down as a punishment for the loss. The harder question is just how much should the ratings change?

By winning, the beginner demonstrated that he was probably better than the 25% winning probability we thought he would have. One way of updating ratings is to imagine that each player bets a certain amount of his rating on each game. The amount of the bet is determined by the probability of the outcome. In addition, we decide how dramatic the ratings change should be for an individual game. If you believe the most recent game should count 100%, then you’d expect my rating to go down a lot and his to go up a lot. The decision of how much the most recent game should count leads to what chess guys call the multiplicative “K-factor.”

The K-Factor is what we multiply a probability by to get the total amount of a rating change. It reflects the maximum possible change in a person’s rating. A reasonable choice of a weight is that the most recent game counts about 7% which leads to a K-factor of 24. New players tend to have more fluctuations than well-established players, so new players might get a K-Factor of 32 while grand masters have a K-factor around 10. Here’s how the K-Factor changes with respect to how much the latest game should count:

Using a K-Factor of 24 means that my rating will now be lowered to 1182 and the beginner’s will rise to 1018. Our curves are now closer together:

Note that our standard deviations never change. Here are the probabilities if we were to play again:

This method is known as the Elo rating system, named after Arpad Elo, the chess enthusiast who created it. It’s relatively simple to implement and most games that calculate skill end here.

#### I Thought You Said You’d Talk About TrueSkill?

Everything so far has just been prerequisites to the main event; the TrueSkill paper assumes you’re already familiar with it. It was all sort of new to me, so it took awhile to get comfortable with the Elo ideas. Although the Elo model will get you far, there are a few notable things it doesn’t handle well:

1. Newbies - In the Elo system, you’re typically assigned a “provisional” rating for the first 20 games. These games tend to have a higher K-factor associated with them in order to let the algorithm determine your skill faster before it’s slowed down by a non-provisional (and smaller) K-factor. We would like an algorithm that converges quickly onto a player’s true skill (get it?) to not waste their time having unbalanced matches. This means the algorithm should start giving reasonable approximations of skill within 5-10 games.
2. Teams - Elo was explicitly designed for two players. Efforts to adapt it to work for multiple people on multiple teams have primarily been unsophisticated hacks. One such approach is to treat teams as individual players that duel against the other players on the opposing teams and then apply the average of the duels. This is the “duelling heuristic” mentioned in the TrueSkill paper. I implemented it in the accompanying project. It’s ok, but seems a bit too hackish and doesn’t converge well.
3. Draws - Elo treats draws as a half win and half loss. This doesn’t seem fair because draws can tell you a lot. Draws imply you were evenly paired whereas a win indicates you’re better, but unsure how much better. Likewise, a loss indicates you did worse, but you don’t really know how much worse. So it seems that a draw is important to explicitly model.

The TrueSkill algorithm generalizes Elo by keeping track of two variables: your average (mean) skill and the system’s uncertainty about that estimate (your standard deviation). It does this instead of relying on a something like a fixed K-factor. Essentially, this gives the algorithm a dynamic k-factor. This addresses the newbie problem because it removes the need to have “provisional” games. In addition, it addresses the other problems in a nice statistical manner. Tracking these two values are so fundamental to the algorithm that Microsoft researchers informally referred to it as the μσ (mu-sigma) system until the marketing guys gave it the name TrueSkill.

We’ll go into the details shortly, but it’s helpful to get a quick visual overview of what TrueSkill does. Let’s say we have Eric, an experienced player that has played a lot and established his rating over time. In addition, we have newbie: Natalia.

Here’s what their skill curves might look like before a game:

And after Natalia wins:

Notice how Natalia’s skill curve becomes narrower and taller (i.e. makes a big update) while Eric’s curve barely moves. This shows that the TrueSkill algorithm thinks that she’s probably better than Eric, but doesn’t how much better. Although TrueSkill is a little more confident about Natalia’s mean after the game (i.e. it’s now taller in the middle), it’s still very uncertain. Looking at her updated bell curve shows that her skill could be between 15 and 50.

The rest of this post will explain how calculations like this occurred and how much more complicated scenarios can occur. But to understand it well enough to implement it, we’ll need to learn a couple of new things.

#### Bayesian Probability

Most basic statistics classes focus on frequencies of events occurring. For example, the probability of getting a red marble when randomly drawing from a jar that has 3 red marbles and 7 blue marbles is 30%. Another example is that the probability of rolling two dice and getting a total of 7 is about 17%. The key idea in both of these examples is that you can count each type of outcome and then compute the frequency directly. Although helpful in calculating your odds at casino games, “frequentist” thinking is not that helpful with many practical applications, like finding your skill in a team.

A different approach is to think of probability as degree of belief in something. The basic idea is that you have some prior belief and then you observe some evidence that updates your belief leaving you with an updated posterior belief. As you might expect, learning about new evidence will typically make you more certain about your belief.

Let’s assume that you’re trying to find a treasure on a map. The treasure could be anywhere on the map, but you have a hunch that it’s probably around the center of the map and increasingly less likely as you move away from the center. We could track the probability of finding the treasure using the 3D multivariate Gaussian we saw earlier:

Now, let’s say that after studying a book about the treasure, you’ve learned that there’s a strong likelihood that treasure is somewhere along the diagonal line on the map. Perhaps this was based on some secret clue. Your clue information doesn’t necessarily mean the treasure will be exactly on that line, but rather that the treasure will most-likely be near it. The likelihood function might look like this in 3D:

We’d like to use our prior information and this new likelihood information to come up with a better posterior guess of the treasure. It turns out that we can just multiply the prior and likelihood to obtain a posterior distribution that looks like this:

This is giving us a smaller and more concentrated area to look at.

If you look at most textbooks, you typically just see this information using 2D isoprobability contour plots that we learned about earlier. Here’s the same information in 2D:

Prior:

Likelihood:

Posterior:

For fun, let’s say we found additional information saying the treasure is along the other diagonal with the following likelihood:

To incorporate this information, we’re able to take our last posterior and make that the prior for the next iteration using the new likelihood information to get this updated posterior:

This is a much more focused estimate than our original belief! We could iterate the procedure and potentially get an even smaller search area.

And that’s basically all there is to it. In TrueSkill, the buried treasure that we look for is a person’s skill. This approach to probability is called “Bayesian” because it was discovered by a Presbyterian minister in the 1700’s named Thomas Bayes who liked to dabble in math.

The central ideas to Bayesian statistics are the prior, the likelihood, and the posterior. There’s detailed math that goes along with this and is in the accompanying paper, but understanding these basic ideas is more important:

“When you understand something, then you can find the math to express that understanding. The math doesn’t provide the understanding.”— Lamport

Bayesian methods have only recently become popular in the computer age because computers can quickly iterate through several tedious rounds of priors and posteriors. Bayesian methods have historically been popular inside of Microsoft Research (where TrueSkill was invented). Way back in 1996, Bill Gates considered Bayesian statistics to be Microsoft Research’s secret sauce.

As we’ll see later on, we can use the Bayesian approach to calculate a person’s skill. In general, it’s highly useful to update your belief based off previous evidence (e.g. your performance in previous games). This usually works out well. However, sometimes “Black Swans” are present. For example, a turkey using Bayesian inference would have a very specific posterior distribution of the kindness of a farmer who feeds it every day for 1000 days only to be surprised by a Thanksgiving event that was so many standard deviations away from the turkey’s mean belief that he never would have saw it coming. Skill has similar potential for a “Thanksgiving” event where an average player beats the best player in the world. We’ll acknowledge that small possibility, but ignore it to simplify things (and give the unlikely winner a great story for the rest of his life).

TrueSkill claims that it is Bayesian, so you can be sure that there is going to be a concept of a prior and a likelihood in it— and there is. We’re getting closer, but we still need to learn a few more details.

#### The Marginalized, but Not Forgotten Distribution

Next we need to learn about “marginal distributions”, often just called “marginals.” Marginals are a way of distilling information to focus on what you care about. Imagine you have a table of sales for each month for the past year. Let’s say that you only care about total sales for the year. You could take out your calculator and add up all the sales in each month to get the total aggregate sales for the year. Since you care about this number and it wasn’t in the original report, you could add it in the margin of the table. That’s roughly where “margin-al” got its name.

Wikipedia has a great illustration on the topic: consider a guy that ignores his mom’s advice and never looks both ways when crossing the street. Even worse, he’s too engrossed in listening to his iPod that he doesn’t look any way, he just always crosses.

What’s the probability of him getting hit by a car at a specific intersection? Let’s simplify things by saying that it just depends on whether the light is red, yellow, or green.

 Light State Red Yellow Green Probability of getting hit given light state 1% 9% 90%

This is helpful, but it doesn’t tell us what we want. We also need to know how long the light stays a given color

 Light color Red Yellow Green % Time in Color 60% 10% 30%

There’s a bunch of probability data here that’s a bit overwhelming. If we join the probabilities together, we’ll have a “joint distribution” that’s just a big complicated system that tells us too much information.

We can start to distill this information down by calculating the probability of getting hit given each light state:

 Red Yellow Green Total Probability of Getting Hit 1%*60% = 0.6% 9%*10% = 0.9% 90%*30% = 27% 28.5%

In the right margin of the table we get the value that really matters to this guy. There’s a 28.5% marginal probability of getting hit if the guy never looks for cars and just always crosses the street. We obtained it by “summing out” the individual components. That is, we simplified the problem by eliminating variables and we eliminated variables by just focusing on the total rather than the parts.

This idea of marginalization is very general. The central question in this article is “computing your skill,” but your skill is complicated. When using Bayesian statistics, we often can’t observe something directly, so we have to come up with a probability distribution that’s more complicated and then “marginalize” it to get the distribution that we really want. We’ll need to marginalize your skill by doing a similar “summing-out” procedure as we did for the reckless guy above.

But before we do that, we need to learn another technique to make calculations simpler.

#### What’s a Factor Graph, and Why Do I Care?

Remember your algebra class when you worked with expressions like this?

Your teacher showed you that you could simplify this by “factor-ing” out w, like this:

We often factor expressions to make them easier to understand and to simplify calculations. Let’s replace the variables above with w=4, x=1, y=2, and z=3.

Let’s say the numbers on our calculator are circles and the operators are squares. We could come up with an “expression tree” to describe the calculation like this:

You can tell how tedious this computation is by counting 11 “buttons” we’d have to push. We could also factor it like this

This “factorization” has a total of 7 buttons, a savings of 4 buttons. It might not seem like much here, but factorizing is a big idea.

We face a similar problem of how to factor things when we’re looking to simplify a complicated probability distribution. We’ll soon see how your skill is composed of several “factors” in a joint distribution. We can simplify computations based on how variables are related to these factors. We’ll break up the joint distribution into a bunch of factors on a graph. This graph that links factors and variables is called a “factor graph.”

The key idea about a factor graph is that we represent the marginal conditional probabilities as variables and then represent each major function of those variables as a “factor.” We’ll take advantage of how the graph “factorizes” and imagine that each factor is a node on a network that’s optimized for efficiency. A key efficiency trick is that factor nodes send “messages” to other nodes. These messages help simplify further marginal computations. The “message passing” is very important and thus will be highlighted with arrows in the upcoming graphs; gray arrows represent messages going “down” the graph and black show messages coming “up” the graph.

The accompanying code and math paper go into details about exactly how this happens, but it’s important to realize the high level idea first. That is, we want to look at all the factors that go into creating the likelihood function for updating a person’s skill based on a game outcome. Representing this information in a factor graph helps us see how things are related.

Now we have all the foundational concepts that we’re ready for the main event: the TrueSkill factor graph!

#### Enough Chess, Let’s Rank Something Harder!

The TrueSkill algorithm is Bayesian because it’s composed of a prior multiplied by a likelihood. I’ve highlighted these two components in the sample factor graph from the TrueSkill paper that looks scary at first glance:

This factor graph shows the outcome of a match that had 3 teams all playing against each other. The first team (on the left) only has one player, but this player was able to defeat both of the other teams. The second team (in the middle) had two players and this team tied the third team (on the right) that had just one player.

In TrueSkill, we just care about a player’s marginal skill. However, as is often the case with Bayesian models, we have to explicitly model other things that impact the variable we care about. We’ll briefly cover each factor (more details are in the code and math paper).

The first factor starts the whole process. It’s where we get a player’s previous skill level from somewhere (e.g. a player database). At this point, we add some uncertainty to your skill’s standard deviation to keep game dynamics interesting and prevent the standard deviation from hitting zero since the rest of algorithm will make it smaller (since the whole point is to learn about you and become more certain).

There is a factor and a variable for each player. Each factor is a function that remembers a player’s previous skill. Each variable node holds the current value of a player’s skill. I say “current” because this is the value that we’ll want to know about after the whole algorithm is completed. Note that the message arrow on the factor only goes one way; we never go back to the prior factor. It just gets things going. However, we will come back to the variable.

But we’re getting ahead of ourselves.

#### Factor #2: How Are You Going To Perform?

Next, we add in beta (β). You can think of beta as the number of skill points to guarantee about an 80% chance of winning. The TrueSkill inventors refer to beta as defining the length of a “skill chain.”

The skill chain is composed of the worst player on the far left and the best player on the far right. Each subsequent person on the skill chain is “beta” points better and has an 80% win probability against the weaker player. This means that a small beta value indicates a high-skill game (e.g. Go) since smaller differences in points lead to the 80%:20% ratio. Likewise, a game based on chance (e.g. Uno) is a low-skill game that would have a higher beta and smaller skill chain.

Factor #3: How is Your Team Going to Perform?

Now we’re ready for one of the most controversial aspects of TrueSkill: computing the performance of a team as a whole. In TrueSkill, we assume the team’s performance is the sum of each team member’s performance. I say that it’s “controversial” because some members of the team probably work harder than others. Additionally, sometimes special dynamics occur that make the sum greater than the parts. However, we’ll fight the urge to make it much more complicated and heed Makridakis’s advice:

“Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones”

One cool thing about this factor is that you can weight each team member’s contribution by the amount of time that they played. For example, if two players are on a team but each player only played half of the time (e.g. a tag team), then we would treat them differently than if these two players played the entire time. This is officially known as “partial play.” Xbox game titles report the percentage of time a player was active in a game under the “X_PROPERTY_PLAYER_PARTIAL_PLAY_PERCENTAGE” property that is recorded for each player (it defaults to 100%). This information is used by TrueSkill to perform a fairer update. I implemented this feature in the accompanying source code.

#### Factor #4: How’d Your Team Compare?

Next, we compare team performances in pairs. We do this by subtracting team performances to come up with pairwise differences:

This is similar to what we did earlier with Elo and subtracting curves to get a new curve.

#### Factor #5: How Should We Interpret the Team Differences?

The bottom of the factor graph contains a comparison factor based on the team performance differences we just calculated:

The comparison depends on whether the pairwise difference was considered a “win” or a “draw.” Obviously, this depends on the rules of the game. It’s important to realize that TrueSkill only cares about these two types of results. TrueSkill doesn’t care if you won by a little or a lot, the only thing that matters is if you won. Additionally, in TrueSkill we imagine that there is a buffer of space called a “draw margin” where performances are equivalent. For example, in Olympic swimming, two swimmers can “draw” because their times are equivalent to 0.01 seconds even though the times differ by several thousandths of a second. In this case, the “draw margin” is relatively small around 0.005 seconds. Draws are very common in chess at the grandmaster level, so the draw margin would be much greater there.

The output of the comparison factor directly relates to how much your skill’s mean and standard deviation will change.

The exact math involved in this factor is complicated, but the core idea is simple:

• Unexpected outcomes (upsets) cause larger updates to make the algorithm more likely to predict the outcome in the future.

The accompanying math paper goes into detail, but conceptually you can think of the performance difference as a number on the bottom (x-axis) of a graph. It represents the difference between the expected winner and the expected loser. A large negative number indicates a big upset (e.g. an underdog won) and a large positive number means the expected person won. The exact update of your skill’s mean will depend on the probability of a draw, but you can get a feel for it by looking at this graph:

Similarly, the update to a skill’s standard deviation (i.e. uncertainty) depends on how expected the outcome was. An expected outcome shrinks the uncertainty by a small amount (e.g. we already knew it was going to happen). Likewise, an unexpected outcome shrinks the standard deviation more because it was new information that we didn’t already have:

One problem with this comparison factor is that we use some fancy math that just makes an approximation (a good approximation, but still an approximation). We’ll refine the approximation in the next step.

#### The Inner Schedule: Iterate, Iterate, Iterate!

We can make a better approximation of the team difference factors by passing around the messages that keep getting updated in the following loop:

After a few iterations of this loop, the changes will be less dramatic and we’ll arrive at stable values for each marginal.

#### Enough Already! Give Me My New Rating!

Once the inner schedule has stabilized the values at the bottom of the factor graph, we can reverse the direction of each factor and propagate messages back up the graph. These reverse messages are represented by black arrows in the graph of each factor. Each player’s new skill rating will be the value of player’s skill marginal variable once messages have reached the top of the factor graph.

By default, we give everyone a “full” skill update which is the result of the above procedure. However, there are times when a game title might want to not make the match outcome count much because of less optimal playing conditions (e.g. there was a lot of network lag during the game). Games can do this with a “partial update” that is just a way to apply only a fraction of the full update. Game titles specify this via the X_PROPERTY_PLAYER_SKILL_UPDATE_WEIGHTING_FACTOR variable. I implemented this feature in the accompanying source code and describe it in the math paper.

#### Results

There are some more details left, but we’ll stop for now. The accompanying math paper and source code fill in most of the missing pieces. One of the best ways to learn the details is to implement TrueSkill yourself. Feel free to create a port of the accompanying project in your favorite language and share it with the world. Writing your own implementation will help solidify all the concepts presented here.

The most rewarding part of implementing the TrueSkill algorithm is to see it work well in practice. My coworkers have commented on how it’s almost “eerily” accurate at computing the right skill for everyone relatively quickly. After several months of playing foosball, the top of the leaderboard (sorted by TrueSkill: the mean minus 3 standard deviations) was very stable. Recently, a very good player started playing and is now the #2 player. Here’s a graph of the most recent changes in TrueSkill for the top 5 (of around 40) foosball players:

(Note: Look how quickly the system detected how good this new #2 player is even though his win ratio is right at 50%)

Another interesting aspect of implementing TrueSkill is that it has raised an awareness of ratings among players. People that otherwise wouldn’t have played together now occasionally play each other because they know they’re similarly matched and will have a good game. One advantage of TrueSkill is that it’s not that big of a deal to lose to a much better player, so it’s still ok to have unbalanced games. In addition, having ratings has been a good way to judge if you’re improving in ability with a new shot technique in foosball or learning more chess theory.

#### Fun Things from Here

The obvious direction to go from here is to add more games to the system and see if TrueSkill handles them equally well. Given that TrueSkill is the default ranking system on Xbox live, this will probably work out well. Another direction is to see if there’s a big difference in TrueSkill based on position in a team (e.g. midfield vs. goalie in foosball). Given TrueSkill’s sound statistics based on ranking and matchmaking, you might even have some success in using it to decide between to several options. You could have each option be a “player” and decide each “match” based on your personal whims of the day. If nothing else, this would be an interesting way to pick your next vacation spot or even your child’s name.

If you broaden the scope of your search to using the ideas that we’ve learned along the way, there’s a lot more applications. Microsoft’s AdPredictor (i.e. the part that delivers relevant ads on Bing) was created by the TrueSkill team and uses similar math, but is a different application.

As for me, it was rewarding to work with an algorithm that has fun social applications as well as picking up machine learning tidbits along the way. It’s too bad all of that didn’t help me hit the top of any of the leaderboards.

Oh well, it’s been a fun journey. I’d love to hear if you dived into the algorithm after reading this and would especially appreciate any updates to my code or other language forks.

• The Math Behind TrueSkill - A math-filled paper that fills in some of the details left out of this post.
• Moserware.Skills Project on GitHub - My full implementation of Elo and TrueSkill in C#. Please feel free to create your own language forks.
• Microsoft's online TrueSkill Calculators - Allows you to play with the algorithm without having to download anything. My implementation matches the results of these calculators.

Special thanks to Ralf Herbrich, Tom Minka, and Thore Graepel on the TrueSkill team at Microsoft Research Cambridge for their help in answering many of my detailed questions about their fascinating algorithm.

pims said...

Fascinating. My GF won't be very happy, but I know what I'll be doing this weekend !

Thanks a lot for taking the time to write this post along side the paper and the code. It helped me refresh some things I learned and introduced new concepts.

Great post. Please keep them coming :)

Jeff Moser said...

pims: Please send her my apologies ;)

Glad you liked the post! Let me know if you need help with anything.

flod.cc said...

hi Jeff,
great presentation-style!

what did you do all the graphics in? the charts, the heat maps, even the stylized people?

the heat maps, charts and flow-diagrams seem python-ish... what are you using?

thanks

Jeff Moser said...

flod.cc: I used...

Bar charts and curves - Excel 2007
Factor graphs - PowerPoint 2007
Surface and contour plots - GNU Plot (details are in the paper)
Faceless people images and chain link - came from OpenClipArt (public domain)

jacobcarpenter said...

Specifically regarding the reference implementation, GameInfo is a very significant type. But it's not clear from the tests when GameInfo should use non-default values.

Is a single set of GameInfo values meant to be calculated once and used for every match? Or does GameInfo start with default values and as more matches are recorded, GameInfo values update accordingly?

jacobcarpenter said...

Also, implicit in my previous question: how are appropriate non-default values for GameInfo instances calculated?

Jeff Moser said...

jacobcarpenter: Great question!

The idea is that GameInfo should be static for all skill calculations. Whenever I tweak GameInfo for a game, I have a method that recomputes the TrueSkill data for all matches of that game (e.g. the database stores the outcomes of all matches, so this is easy).

For foosball, I'm using the default values. For chess, I currently use:

InitialMean = 1200
InitialStandardDeviation = 400
Beta = 200
DynamicsFactor/Tau= 4
Draw Probability = 0.04

These values seemed fair given the types of players and games we have.

If you listen to Ralf Herbrich's Gamefest 2007 talk, you'll see that he discusses some private tools that Xbox title creators have to tweak these values.

My personal advice is that if the game already has a well-established rating number (e.g. Chess), then use the customs of that game to derive these values, otherwise use the defaults with some minor tweaking. The dynamics factor (tau) has a noticable impact on the results. Beta describes how far apart things are spread out.

In the end, I'd recommend listening to the Gamefest talk and then just play around with a lot of data to see what feels right.

jacobcarpenter said...

Slides 17 and 18 from the Gamefest talk give example beta values between 0.4 (high skill game, like golf) and 2.5 (chance game).

Those values are two orders of magnitude smaller than the reference implementation's default beta value of ~4.167. And you mention a beta of 200 for chess...

Why the discrepancy?

jacobcarpenter said...

Sorry the values are an order of magnitude to two orders of magnitude smaller...

Jeff Moser said...

jacobcarpenter: Good question. The short answer is I'm not quite sure and it may be worth asking the TrueSkill guys about.

I had assumed that in the Gamefest talk they were refering to changing beta along with the initial mean and standard deviation. If you do this, then their values might make more sense. I think the general idea holds that Beta is a measure of the 80%:20% level difference. Additionally, I think you want to make pick a default mean and standard deviation such that your initial TrueSkill (e.g. -3 standard deviations) is zero.

Note that when they show the different Betas, the max levels change as well (in theory, there are always an infinite number of levels because it's roughly Gaussian).

I think the GameInfo values I picked are realistic because they match the F# implementation and the values they mention in the paper.

I picked 200 for chess because that number represents the classical spirit of the Harkness/Elo "class" sizes where around 200 points represents an 75%:25% ratio.

Does that help any?

jacobcarpenter said...

Hmmm.

On slide 18, total # of levels times recommended beta value equals ~25. Is there any significance to that number?

The last graph on slide 18 clearly has more than 10 levels, though (looks more like 20), so perhaps I'm still not understanding what they're trying to illustrate.

But I still don't see how their Beta values relate to Beta values for the Moserware.Skills implementation.

If I was modeling a game where there were, say, 20 or so theoretical levels of skill, what would an appropriate beta value be? Or am I still thinking about it entirely wrong?

Jeff Moser said...

jacobcarpenter: I haven't played a lot with tweaking these numbers to see the results and am probably making this more confusing by my lack of expertise, but we'll see if we can figure this out together.

Let's ignore the Gamefest slide for now and focus on section 4.2 of the TrueSkill paper because that is specifically what I designed Moserware.Skill.GameInfo after.

That is, pick some arbitrary initial mean skill level and call it m_0. Now, we want to make sure around 99% of the skills are positive. This implies that we want "0" to be around 3 standard deviations away from the mean. This also implies that the initial standard deviation must be s_0 = m_0 / 3.

By default, we assume Beta is half the initial standard deviation (B = s_0 / 2) and the Dynamics factor (tau) is T = s_0 / 100. I don't think these default for Beta and Tau are magical, they were just probably observed to give reasonable values.

If you arbitrarily pick m_0 to be 25, then that makes s_0 = 25 / 3 = 8.333 and Beta = 8.333 / 2 = 4.167 and Tau = .083. In addition, you have to observe from data what the draw probability is. For chess, I had to observe that we draw about 3.5% - 4.5% of our games, so I picked ours to be 0.04. The draw probability will definitely vary among games (e.g. there are no draws in foosball) and average players. So, just pick something reasonable there and then update it if it doesn't match observed data.

Now, let's say that after you play awhile, you find that people's TrueSkill (e.g. mean - 3 standard deviations) becomes too stable and the leaderboard doesn't fluctuate much. If you wanted to make it so that people rose (and fell) faster, you'd increase Tau/DynamicsFactor.

This leaves Beta.

If you have a high skill game, then the 80%:20% difference is going to appear with just a small fraction of points and thus your skill chain will be long. Likewise, if you have a low-skill game, the 80%:20% difference will only appear with a large number of points so Beta will be larger.

I especially think that Beta is set from just looking at the data to see what makes the grouping of players look reasonable.

For example, if I play with the numbers in our database of chess players, I can see what happens. If I change Beta from 200 to 400 then very experienced players have their standard deviation go up by a factor of 1.5 (e.g. ~38 to ~56). This makes means players spread out more and cause a change to how fast means move. If I change Beta from 200 to 100, then experienced players have their standard deviations move down by a factor of around 0.7 (e.g. ~38 to ~26). This causes players to be closer together. I just chose Beta = 200, because that seemed roughly to fit with tradition.

Jeff Moser said...

Similiar things happen with foosball, but I have more data. For example, with m_0 = 25, s_0 = 8.333, Beta = 4.166 and Tau = 0.083, the top 5 players are what I graphed: #1, #2, #3, #4, and #5. If I cut Beta in half (2.083), I'm declaring that foosball is a higher skilled game than average. The order of the players changes to #2, #1, #3, #4, #5. Effectively, it sees that the new #2 player that I mentioned must have great skill to beat all the other guys and thus he rises faster. In addition, mean scores are closer together and also lower. The standard deviations shrink too.

Likewise, if I double Beta to 8.3333, then the top 5 ranking becomes #1, #4, #3, #2, #6. Effectively, I'm declaring foosball to be less skilled. This makes the new #2 player not as high on the leaderboard (e.g. now 4th place) Additionally, the mean scores increase and become more spread out. Also, the standard deviations rise. The ranking changes depend on how the sequence of game outcomes were chained over a period of time given the generally higher standard deviations.

So, it appears that there is some definite guesswork to be done based on what feels right (and thus the reason Xbox title creators have graphical tools to play with).

Start with the defaults and ask yourself if your game is more skilled (e.g. has lower than average Beta) or less skilled (e.g. has larger than average Beta). Look at the clustering of players and see if it feels right.

Does that help any?

Now, with all that said, if we go back to the Gamefest slides, we can perhaps look at where the data bars drop off to get an idea on how he defined levels? In golf on slide 18, perhaps that is in the range -10 to +50 for 60 levels. For car racing, it looks like it is -10 to +30 for a total of 40 levels. For a chance game, it seems like it's +15 to +25.

What do you think? Any further thoughts on this?

Anonymous said...

This is a great overview of TrueSkill. Great job!!!

Jeff Moser said...

Anonymous: Thanks!

via email:

In the first Natalia/Eric graph, Natalia (who is a new player) is assigned a starting Rating of 25. But where does the system get the sd of 8.33 from?
What I am really asking is... lets say I want to create my a rating
system for my own SuperFunGame. Can I make a blanket statement and
say, "All new players will start with a rating of 500, and a sd of
50".... afterall, in the beginning of your post you state that the
rating is merely just a number - so if I am starting my own rating
system I can start with any number I want. But the question is, after
I have chosen this "starting mean", can I also arbitarily choose my "starting sd" ?

This question is answered in more detail in the above comments, but the general idea is that you want the initial TrueSkill value which is mean - 3 standard deviations to be 0. This means the initial standard deviation should be 1/3rd of the initial mean. SuperFunGame should consider an initial standard deviation of 166.67 (500/3) if the initial mean is 500. Furthermore, the Beta value should be 500/6 = 83.33 and the DynamicsFactor should be 1.67 (500/300).

In the second Natalia/Eric graph, I noticed that Eric's mean reduces but his sd stays the same. Can I thus infer the following, "Eric's mean reduces because he lost a game, however his sd doesnt reduce because he probably has plenty of games under his belt and a single game doesn't affect his sd much". Is this statement correct?

Yes, his standard deviation is relatively stable. You could make this more unstable by increasing the dynamics factor (tau). Play with data with my implementation or the online calculators for details.

What's a message?

It's a way of sending the result of a factor calculation. See the paper for details

What do you do with the result of a message?
It's used for a Gaussian multiplications and divisions (see code)

When do results go back up? Do you just overwrite older values and reiterate?
Players start with initial values. After a match, update the values and those become the prior values for next iteration(e.g. your store them in a database)

Wow! What a phenomenal post, and a wonderful explanation. Thank you for doing all that work.

Nate Parsons said...

Does this require having 2 players or 2 teams, or can it also work for a 4-player free-for-all?

Jeff Moser said...

Adam: Thanks for the kind words!

Nate Parsons: You can have as many teams as you want (e.g. a 4-player free-for-all). The "FourTeamsOfOneNotDrawn" unit test shows exactly how to do this. TrueSkill tends to converge quickly in this scenario.

Nate Parsons said...

Thanks, Jeff. I may try to use your project in the near future to add a ranking system to jsettlers (sourceforge.net/projects/jsettlers2/) and I'm trying to figure out how feasable that is. Do you think I should try to port it to Java, or modify the C# code to run alongside the server somehow? (I'm pretty comfortable in both languages)

Jeff Moser said...

Nate Parsons: Sounds cool! I've heard from another user that my code works well on Mono, so you could possibly use it on any server that runs Mono with some Java<->Mono interop. You might want to try this for an hour or so to see how far you get.

Personally, I think you could port it to Java without any problem since I used very basic classes (e.g. dictionary/map), I didn't take on additional dependencies (this required writing my own matrix class), and tried to make it easy to read. If you do this, I'd encourage you to make the Java port itself its own project (e.g. create a fork on GitHub) so that others could contribute to it independently of your game. You'd be able to port my unit tests over to Java to help you out.

If you do this, you might see areas where my code could be improved or you could add additional unit tests. I'd love to see this and would probably pull in changes into the main line. This could possibly encourage other language forks where we could share further ideas.

Keep me posted and feel free to leave additional comments if you need any help.

Anonymous said...

hi Jeff, great work and will be reading in detail.

If you are comparing two observaed performances, A and B, and A wins, so A > B, the skill attached to both A and B are re-calculated, is that correct? So A increases and B decreases, or is it just A increases?

Jeff Moser said...

Anonymous #2: Your instinct is right. In Elo, the gain of A is exactly the same amount of the loss by B. This means that the sum of the ratings never changes after a game. TrueSkill is similiar: A's mean will increase and B's mean will decrease. However, there isn't a guarantee that the change will be by the same amount as it was in Elo. If A and B had similiar standard deviations at the start, the change will be about the same. You could very well have a net increase (or decrease) in the sum of the ratings.

One surprising thing is that if you have a really low standard deviation and play a game that has very bad match quality (see my accompanying paper for details but this usually means a very unfair match), it could be that your TrueSkill goes down after a win. This occurs because we add a dynamics factor to your skill in Factor #1 before we start. Your mean will still increase, but the resulting TrueSkill = Mean - 3 * StdDev will go down slightly. This is rare, but it's very unexpected when it happens. It sort of makes sense because the game was so bad that the system is now slightly less certain about your skill (from a lower 99% percentile conservative estimation perspective).

Deebster said...

Thanks for a great, and very detailed post. I've still got lots of careful reading to do.

I am considering using this system to rank poker players. One thing I'm not at all clear on is how TrueSkill would cope with a player placing multiple times (due to re-buys). This confuses things, as a player can win and lose in the same game, or place as the worst two players...

Jeff Moser said...

Deebster: TrueSkill works best when there is a fixed outcome of a "match." It seems that in poker, the object is to get the most money/chips, right? If all the players at a table started off with the same amount of money, you could easily turn that into an ordered set of players based on the amount of money/chips people had at the end of the "match." Here, a match could be defined in terms of a specified number of deals or until everyone's money ran out. Then, you could assign a position that is descending based on the amount of money/chips they had at the end of the match. In the event of a tie, you could say that the person that lasted for more deals wins.

Does that help?

Deebster said...

When our games are over, someone has all the money, and we have a simple list of positions. Players can tie, and players can appear more than once.

For example:

1: Alice,
2: Bob,
3: Carol,
=3: Dave,
5: Eve,
6: Eve

where Alice won, Carol and Dave drew, and Eve lost, bought back in, and then lost again.

What I am wanting to model is that Eve (who's normally very good) had such a bad game that she lost twice.

Is it possible to calculate Eve's two skill updates and apply them together in/after the loops? Or would a different approach be needed?

Jeff Moser said...

Deebster:

That's an interesting challenge (e.g. buying back in) that doesn't quite fit TrueSkill's original design. It's sort of a handicap that you're talking about.

Is it safe to assume that Alice, Bob, Carol, Dave, and Eve all start with the same amount of money? This would be the most fair from a TrueSkill perspective. That way, you don't have a "winner take all" total accrued bankroll problem. I'll assume that this is the case.

In terms of allowing people to "buy-in" again, I don't think it makes sense statistically to treat each time a buy-in occurs as the end of another match. This would skew the results of the people that didn't have to buy in again.

One interesting thing you might want to experiment with is to represent a "buy in again" person as playing on a team with a virtual player. The skill levels for the virtual players would have to be calibrated from data, but your scenario would be

1. Alice
2. Bob
3. Carol
3. Dave

TrueSkill would give Eve punishment on her rating because her team's combined rating should have overcame the other people (e.g. Alice, Bob, Carol, and Dave) assuming that the skill for the virtual player is positive. What I'm not sure about is how to treat the changes to the virtual player's rating. I'd assume that once the virtual player for a "buy-in" is calibrated, the same mean and stddev would be used each time.

In theory, if this worked then you could add in another fresh copy of the virtual player to that person's team for each time they did a buy-in.

Would you be willing to use my code (or Microsoft's online TrueSkill calculators) and try out these scenarios with realistic data to give it a gut-check?

Keep me posted! :)

P.S. The "virtual player" concept was suggested by someone else in an email with several people (including the TrueSkill guys) in relation to another game for use in handicaping. It's still in the experimentation phase, so there aren't any published results on it. For example, in chess you could represent a better player (Alice) playing another player (Bob). Since Alice is so good, she agrees to start the game without one of her pawns (e.g. h2). Here, you could represent the game as Alice playing a team of Bob + the "I was spotted a pawn" virtual player that has positive skill. Technically, you could also represent this as Alice + "I started without a pawn" virtual player whose skill was negative and then Bob by himself.

The Sandbag said...

Hi Jeff,
I've had a go at implimeting the team based buy-in. I've tweaked it slightly so that the other members of your team have a rating based upon the position you bought back in and the other players playing, so if you went out 3rd from last and then bought back in you team would consist of you and the 3rd worst player in the game). I've put the code here if you want to have a look, the cmd apps first line is for the log file in the root of the zip and the second is for ouputing a csv of the history of the TrueSkill values.

Jeff Moser said...

The Sandbag: Glad to see you experimenting with the algorithm! I can't imagine that there's been a lot of work with TrueSkill on poker, so we're sort of exploring new ground and might make a few mistakes along the way... but that's a normal part of experimenting :)

My gut feeling is that it's probably fairer to have a constant value for the "buy-in" virtual player (e.g. constant mean and standard deviation). It might make sense to have different values for each new buy-in.

For example, say a single buy-in virtual player is mean=10.0, stddev=2.0 and the second buy-in virtual player is mean=8.0, stddev=4

If Eve bought back in once, her virtual team would be her rating plus the (mu=10, sigma=2.0) virtual player.

If Eve bought back in twice, her virtual team would be her rating plus first buy-in virtual player rating (mu=10, sigma=2.0) plus second buy-in virtual player rating(mu=8.0, stddev=4) (e.g. that is, three separate player on a single team)

To keep things simple, you might start off having all buy-ins having the same constant value. (e.g. Eve rating + 2 * (mu=10, sigma=2.0) for two buy-ins)

There's some theory you could apply to fit the data to get suggested values of what the buy-in virtual player's values should be, but you'll probably have more fun picking values and see what it does in a guess and check fashion. That is, look at the player's rating change after each round, especially if they bought back in. Do the results seem fair?

In general, the more confident you are in what the mean should be, the smaller the sigma/standard deviation should be. You could start off at the defaults for the virtual player (e.g. mu=25, sigma=8.33) and move the mean based off your data and personal findings. The more confident you are on the virtual player mean, the smaller you could make the sigma.

Can you try experimenting with this approach and tell me if you think you're starting to see fair rankings and anything else you observe?

Deebster said...

Hi Jeff,

We're getting some nice-looking numbers, but it's not without flaws.

One issue we're having with using the team-style method is that it doesn't distinguish between the following two games:

1: Alice,
2: Eve
3: Bob,
4: Carol,
5: Dave,
6: Eve **

1: Alice,
2: Eve
3: Bob,
4: Eve **
5: Carol,
6: Dave

This was part of the logic behind setting different μσ values for the virtual players - Eve should get a worse update for initially going out earlier. It's not worked though, as TrueSkill is rewarding Eve for "carrying" a weaker player when she places high.

Another issue is that Dave is now last in the first game as the ** Eve effectively disappears - Dave and Carol have been penalised for Eve's rebuy.

I'm leaning back towards computing two Eves and combining their Trueskill changes as it seems the team method loses too much information.

I've been lazy though; I've not written any code, so I still need to experiment properly.

We'll let you know if any brainwave occurs.

Kieran said...

Fantastic article! If only research papers were like this....

henry said...

Hi Jeff

I am unclear about what happens when two performances are compared with an indicator function, and what happens to the distributions aftrewards; I know you referred to a math paper in yout superb tut. This is the section where you visualise a joint distribution of 2 gaussians. Can you expand a bit?

Jeff Moser said...

Deebster: Keep me posted. It might make sense to have a fixed value based on the position they were at when they bought back in.

Kieran: Thanks! My hope is that this post might be helpful to some so they don't have to spend as much time as I did looking into it.

henry: The indicator function shows up 3 times in the factor graph. The first time is for the Gaussian weighted sum for teams. This is a relatively straightforward process and is described in my math paper that I linked to. The second time is the Gaussian subtraction that is also relatively straightforward and described in the paper as well.

I assume you're referring to the bottom of the factor graph that is essentially a programming "switch" statement based on if the difference is above the draw margin or not. The details for this somewhat complicated process are described here. The basic idea is that you have two Gaussians (one for each team). These individual team Gaussians combine to form a multivariate Gaussian that I showed a sample picture of on page 28 of my paper (the picture was created by the TrueSkill guys). The actual team performance difference is used to indicate where to "chop" this multivariate Gaussian into two pieces (e.g. think of cutting up a smooth mountain with a big knife falling vertically from the sky). Color one side of chopped multivariate Gaussian red and the other blue. Now, imagine that the side that didn't actually occur is flattened (e.g. if the blue team won, the red side would get flattened to zero). You're now left with a blue piece that is a "truncated" multivariate Gaussian. The math of truncated Gaussians is hard to work with, so we approximate this truncated Gaussian by a real Gaussian so that we can keep the math simple. The reward for all this multivariate work is the "w" and "v" functions that is used by the algorithm. These functions are used to see how unexpected the observed outcomes are and thus how much of an update to give. The details of how they behave are given in the graphs in my attached paper.

Does that help?

henry said...

Hello again Jeff, have I understoon this correctly:

"....The actual team performance difference is used to indicate where to "chop" this multivariate Gaussian into two pieces..."

Imagine i'm looking down on your 3D diagram of the performance comparison.

1. On making the 2 team comparison, the 'chop' is ALWAYS made along the diagonal of the square;

2. The mean of each of the gaussians creates a point on X-Y, which shifts the point around the square as means shift. For equally performing teams, the point is in the centre of the square;

3. The winning half of the square will usually contain stronger performances and so the winning gaussian will tend to a larger value.

4. The losing half will usually contain weaker performances and so the losing gaussian will tend to a smaller value.

Jeff Moser said...

henry: I think you have the right idea on all parts.

1. Admittedly, my understanding on this part isn't perfect. I think the draw margin size might make your first point not always exactly on the diagonal, but it'll be close.

2. I believe so.

3. Yes, I believe the expected winner will have a larger size area.

4. Yes, I believe the expected loser will have a small size area.

Gregory Ford said...

What are you using for your database? How is it organized?

Jeff Moser said...

Gregory Ford: I'm using a simple SQLite database with an ASP.NET MVC web front end. The database essentially has tables for "GameInfo" (TrueSkill parameters), "Player", "Team", "PlayerToTeam", "Match", "MatchTeamRank", and "TrueSkillRating" which is the rating for each player by Game after each Match.

Does that help?

Gregory Ford said...

Yes, that helps some.

I've been keeping ratings for board games for my family, using an "extended" Elo formula, and it works fine, but there are some things it doesn't do, like teams. I'd like to replace it, and rather than go to a Glicko system, I'd rather go for something that is (more?) Bayesian. I don't remember how I first found the TrueSkill stuff, but I recently went back looking for it and came across your blog and article. Thanks for both.

On the implementation end, I'm currently keeping my records in an Excel spreadsheet. I was looking at your code, and noting that neither your article nor your code addressed record management.

I'm in way over my head. It's been 14 years since I've done some serious programming, and that was FORTRAN. I know object-oriented programming exists, and have read enough to form an idea why. (To make programmers think in a more structured way. I think.) But I've never done much OOP coding. And my database experience is mostly using lookup commands in Excel.

Thanks for all the leg-work you've done in deciphering the TrueSkill paper. Even with my second major in math, I was still left swimming after reading it.

Nate Parsons said...

Gregory: Hmm...I'm pretty sure Jeff doesn't address that part because it's very specific to how exactly the data is arranged and stored. The permutations are so vast, it wouldn't really be practical to provide a service that can handle data in any format. As the creator of JSkills, I've been thinking about this issue, too, but it never occurred to me that data might be stored in excel.

Perhaps as a workaround, you could use one of Microsoft's calculators? It wouldn't be as easy as a custom-coded solution, because you'd have to set it up every time and it's limited to 8 players, but maybe it'll work for you. http://research.microsoft.com/en-us/projects/trueskill/calculators.aspx

Jeff Moser said...

Gregory Ford: I'd recommend using Excel and the Microsoft TrueSkill Calculators since it'd probably be the most comfortable. They even have an Excel calculator for simple teams.

Let me know if you need any additional help.

Nate Parsons: You're right. I didn't include any data storage code because there are lots of options for storage once you have the core algorithm. It probably wouldn't hurt to ship an example that uses something like CSV files or SQLite.

Gregory Ford said...

I've toyed with the online calculators, and they would be okay for my past uses, but they won't extend to some of the things I want to do. I have some "co-operative" board games, (namely Shadows Over Camelot and Middle-Earth Quest,) and would like to create ratings for those. The former has the possibility for one (or two) players to be traitors. The latter is one player against the other (1-3) player(s). I think I would need to use the partial play calculations to do that. I could probably do those by hand, but it would get tedious.

Second, I'd like to play with scale, change the beta, all the fun stuff we can do when we know the parameter choices are arbitrary.

Which of the on-line calculators was Excel based? I see a form-based one and an AJAX based one. (Forgive me for not knowing what AJAX is.)

Jeff has posted a .dll of his project. I think I can actually use that in Excel. I know my dad has compiled some routines for some of his Excel sheets.

Jeff Moser said...

Gregory Ford: Microsoft's Excel calculator is here, but it's only for simple two-team games and doesn't allow you to tweak the parameters. "AJAX" effectively means that it updates automatically without you needing to press an "update" button (e.g. it updates as you move the sliders).

I'm not familiar with the board games you mentioned, but keep in mind that partial play typically makes sense for the case where a certain player only played a certain percentage of the time (e.g. imagine a Halo competition with people coming and going during a match). You'd have to be careful to make sure what you wanted to do with it made sense. In some handicapping scenarios, it makes more sense to add a positive or negative "virtual player" to a team. I described this in previous comments.

You can see how to use the code in my unit tests. For example, see this one. The only difference is that where I used "calculator", you'd probably use "TrueSkillCalculator."

You could call my code from Visual Basic for Applications (VBA) that is built into Excel. One problem is that I used common .NET types (e.g. a Dictionary) that VBA doesn't natively support. Alternatively, you could download the free Visual C# Express Edition and use my code natively in a similiar way to what I do for my unit tests. You'd just need to pick out a storage mechanism.

Given your Excel and math talents, you might consider rewriting my implementation in Excel macros or VBA. It's a great learning experience to understand the details (with the option to use my implementation and paper for some hints). A native Excel port would be a great help to people wanting to just experiment with the algorithm. Would you be up for that?

Gregory Ford said...

Using Excel is really my default option. The only problem with VBA is that I've heard it will not be included with future releases of Microsoft Office. That would make for some "forward compatibility" problems. The "cost" is that I have to do the hard work of understanding and rewriting the formulae, and then how to make that look neat in an Excel context.

Thanks for the link to the free C# studio. That makes the option of using your code doable. In this case, I would have to come up to speed with C#, and learn how to use a database like SQL. Again, doable, but a different set of costs.

At this point, I think an Excel implementation would be more "Gregory-friendly." I don't know how quickly it will get done, but this is something I want.

I'd love to discuss the partial-play options for board games, but I'd rather do that on a forum rather than in the comments section. It will take more back-and-forth to settle it all.

Anonymous said...

Jeff thank you for this fantastic article and code work, I have one question. In your example where Eric has a mu-sigma of 29.82,1.25 and Natalia 33.00,5.97. How would you go about calculating the chances of the result (win, draw, loose) of their next match with each other?

Jeff Moser said...

Gregory Ford: The accompanying math paper might help with the details. Feel free to email me if you need extended back/forth clarifications (address on right side of page). Comments are helpful here if possible because others can read them and benefit.

Anonymous #3: The details of the win, lose, and draw probabilities are a bit complicated because they occur in three dimensions and deal with non-constant standard deviations (e.g. sigmas) along with Beta. Additionally, you need to do pairwise comparisons for each team combination. You can see an example of what it looks like visually on page 28 of the math paper.

Instead of answering your question directly, TrueSkill prefers to use "Match Quality" as a rough gauge of what you're looking for. I discuss this in more detail on page 34 of the paper (along with equations and derivations). The idea is that you can compute a number that is the probability of getting a draw for all participants. That is, the probability that match are well-balanced skill-wise.

For example, with Experienced Eric (mean=30, stddev=1.25) vs Newbie Natalia (mean=25, stddev=8.33) and a Beta of 4.167 has a match quality around 50.92%. After the game, Eric at (mean=29.82, stddev=1.25) vs. Natalia (mean=33,stddev=5.97) and same Beta would have a match quality around 64.77%. Do you see how Natalia's higher mean and especially lower sigma caused the system to think the "teams" were much more balanced after the game?

Does that help?

Anonymous said...

Hi Jeff,

So is the match quality the percentage chance of a draw occurring? and if so is it based on the Mu-Sigma values going into the calculation or the Mu-Sigma values that come out of the calculation?

Jeff Moser said...

Anonymous #4: It's slightly more complicated given that it's a calculus limit as the draw margin approaches zero (as discussed on page 34 of my paper), but it's quite reasonable to just refer to it as the probability of a draw occuring among the teams participating.

The match quality is always based on the mu & sigma values before the observed outcome is known (e.g. the values going in) because the match will affect the mu and sigma. If you calculate it after, you'd get TrueSkill's prediction for the next game.

You might find the online TrueSkill calculators to be helpful in experimenting with mu and sigma inputs (although they don't allow you to tweak the Beta value like my code does). Additionally, you could setup a spreadsheet using the equations from the paper.

Anonymous said...

Any concerns about Microsoft holding the patent on the TrueSkill system and if they came after you for using some variant of it?

Jeff Moser said...

Anonymous #5: Great question! Someone in the audience asked a similiar question in Bishop's Turing lecture. I think that given that Microsoft links to this post, I'm probably ok.

Good companies typically hold patents in case they get into patent wars with another company or if there is an egregious violation. I think the spirit of their position is that it's fine to use it for non-commercial use. I don't think they'd play the patent card unless someone like Sony or Nintendo tried to use it for their online matching and ranking. In short, if you use it for something that has no possible impact on XBox Live's revenue, then you're probably ok, otherwise no.

However, I'm not a lawyer, so if you're in doubt, check with one.

Anonymous said...

Fascinating and well written post! Thanks a lot! It seems TrueSkill is able to solve my problem of getting better picture about the skill of a player having few games played.
Have a question though:
How would you get a probability of player A winning player B for example from following information:
A (2 095.868/86.621) - B (1 981.954/45.556) match=0.756
You can consider this as a chess game.
Any suggestions?

Jeff Moser said...

Anonymous #6: The answer to the probability question will depend on your choice of beta (described in the post and paper). TrueSkill's match quality metric was created specifically to match people up to have close games (e.g. a high chance of drawing) rather than calculating probability of winning. They're seemingly similar, but look at different things.

TrueSkill is a more generalized version of Elo because it stores both the mean and the standard deviation. Because of this, the math is a bit more complicated. You can get a feel for what I'm talking about by looking at page 34 of my math derivations that accompany this post.

If you reduce the complexity by ignoring the individual standard deviation values and draw margin along with using a beta of around 200 like I used for the Elo example, you're back to 2D math with a simple curve and would get that player A has about a 66% probability of winning against player B.

If you plug in your mean and standard deviation values into the match quality algorithm in the TrueSkill paper (which is derived on page 37 of my paper), you'll get a match quality of around 88% if you use a beta of 200. A rough interpretation is an 88% probability of drawing with the implication that player A is better.

This sort of makes sense that the probability is high of drawing if you compare each players TrueSkill value of Mean-3*StdDev which is 1836 and 1845 respectively. They're both quite close and it's hard to really say who's better (A is higher but is more uncertain than B)

In the end, I'd strongly recommend focusing on the match quality than notions of win probability and seek to have games that have a high match quality.

Tobias said...

Great post and the maths behind it offers even more insight, thanks for that!

I just have a few questions that hopefully you can answer.

1) In a two team game we will have a 1D Gaussian representing each team comprising the means of the players and variance, beta and tau. We then subtract one gaussian from the other to get another 1D gaussian representing the team difference. This will head in to variable d. Connected to the bottom of d is the update function indicator. Would I be right in saying that we choose the appropriate update function indicator based on the game outcome rather than the prior team difference?
That is to say if one team was significantly better than the other team, such that t1 - t2 > e, but the actual outcome of the match was a draw, then we would choose the update function to be the |t1-t2|<=e?

2) The marginal associated with d is obtained by multiplying the two incoming messages, i.e the team difference Gaussian by the indicator function Gaussian? Is this how that 2D Gaussian was obtained that is to be chopped in two?
I presume that the image produced is x ~ N([mu1 mu2], [var1 var2]) where mu is the mean of one team and var is the variance of that team.

3) The message leaving d to return back to the teams will then be a new 1D Gaussian with mean and variance which approximates the chopped Guassian at d?

4) In the update equations for the new marginal d in the Trueskill paper we have two computations c and d. c is defined as the precision of the marginal of variable d minus the precision of the message from the indicator function. However as the precision of the marginal of d is the sum of the two incoming precisions, does that mean that c is merely the precision of the incoming message coming from the team differences?

Thanks for taking the time for answering questions! I hope I am not way off the mark with my assumptions.

Jeff Moser said...

Tobias:

1. Correct. The graph represents the observed outcome, not necessarily expected outcome. If this were not the case, TrueSkill couldn't "learn" skills.

2. Marginals can be calculated by the product of messages per factor graph rules. See equation 3.4 of the TrueSkill paper or page 22 of my paper. The 2D Gaussian is shown on page 28 of my paper. I think you're mostly on the right track. The idea is that you "chop" the 2D Gaussian by setting the heights on the losing side of the chop to all zeros. You approximate this with another Gaussian using the fancy "W" and "V" functions that depend on if you were within the draw margin or not in the observed outcome.

3. Yes. Messages are always 1D Gaussian (e.g. height on the y-axis). The details are on page 29 of my paper.

4. I believe you're on the right track. If you look at my GaussianWithinFactor class, you'll see that "messageFromVariable" is oldMarginal/oldMessage and that c = messageFromVariable.Precision and d = messageFromVariable.PrecisionMean.

Does that help?

Y_Less said...

There is a very slight mistake in your factor graphs, you have for example:

N(p4;s4;B4^2)

Whereas in the original paper B^2 is a constant, not set per player, so it shouldn't be B4^2, jsut B^2.

Jeff Moser said...

Y_Less: Good eye! You're absolutely right that it should be constant for all players. I'll make a note to change it.

Y_Less said...

Wait, ignore that, there seem to be a few differences between your graph and the original - the one I just mentioned and the +T^2 in the prior update. Could you possibly expand on where these came from (I will admit I've not finished reading yet, just spotted those as I've been looking at the paper for a while in the same boat as it seems you were originally).

Jeff Moser said...

Y_Less: Note that on page 3 of the original TrueSkill paper it says "If the skills are expected to vary over time, a Gaussian dynamics factor can be introduced which leads to an additive variance component of \gamma^2 in the subsequent prior." In my factor graph picture, I was trying to make this explicit.

In later talks and further dialog, the TrueSkill team referred to this parameter as "tau" rather than "gamma", so I used the updated term/symbol of "tau."

Does that help?

Y_Less said...

That did help very much, yes - I also discovered I had a very slightly older copy of the paper (the technical report rather than the NIPS paper) which didn't say that anywhere. Could explain one problem I was having.

Keilin said...

Similar to your comment earlier about the probability of a game outcome:

Is there an easy way to calculate the probability of a result given two players and Beta and everything needed?

Game quality is good for matchmaking, but I'd like to run some tests to see how accurate the implementation is and I can't see how to use game quality for that.

Jeff Moser said...

Keilin: The probability of team A winning over team B is approximately GaussianDistribution.CumulativeTo(teamAWinNumerator / denominator)

where teamAWinNumerator = teamAMeanSum - teamBMeanSum - drawMargin

and

denominator = Math.Sqrt(totalPlayers * (beta * beta) + teamASigmaSquaredSum + teamBSigmaSquaredSum)

Anonymous said...

Thank you very much. Seriously! From the dark deep corners of my heart. You have no idea how much this helped me. God! I still can't believe you took all that effort to painstakingly explain everything.

Jeff Moser said...

Anonymous #7: Thanks for the nice feedback! I'm glad that you found it to be helpful.

Rigor said...

hi,

i just started reading ur post, some of my online gaming friends want to improve our current ladder webpage and are planning to rate also 2v2 or even 3v3 games. so far our 1v1 ladder webpage runs smoothly but there is plenty of room for improvements: http://ladder.subversiva.org/

maybe u have a constructive idea how to refresh our old ELO system with a new one? if so, id ask u kindly to drop me a message via ur blog or forum (just that i wont check ur page every day ;) - here the link: http://www.wesnoth.org/forum/viewtopic.php?f=15&t=32037&p=464858#p464858).

Jeff Moser said...

Rigor: I would encourage you to use TrueSkill any time you need to go beyond 1v1 because it was specifically designed for this scenario (e.g. pulling apart individual performance from team play). Elo wasn't designed for more than 1v1 and you'd have to perform hacks to make it do such (like the duelling heuristic idea mentioned in the paper).

You're welcome to use my C# or PHP TrueSkill implementations to do the calculations. If you go this route, the leaderboard is then sorted by TrueSkill (mean - 3 * stdDev).

John P said...

Hi Jeff,

Firstly let me say thank you for the effort you have put in to this project.

I have been ranking 400 high school basketball teams using Microsoft Research's online true skill calculator and the workload is very high. So seeing your C# implementation was an answer to my prayers but I have installed a copy of Visual C# Express and downloaded your Skills project but it does not want to work and gives 80+ errors. Would it not be possible to provide some code that is a bit more stand alone to provide similar functionality to the online calculator. Here's hoping.

Jeff Moser said...

John P: Sounds like an interesting idea. Are you just tracking the skill of a team as a whole or are you breaking things down into players? (TrueSkill could do both).

I assume you're doing it by team. If this is the case, you could make a significant simplification by using the two player formula mentioned on the TrueSkill website and implemented in my much simpler two player calculator.

You should be able to create a spreadsheet where one sheet is a list of teams and their current mean and standard deviation, another sheet could be the list of games in the format (Date, Home Team, Away Team, Home Score, Away Score) if you track that detail or just something like (Date, Winning Team, Losing Team).
Then you could write the Excel macro to implement the calculation and you would have a nice tool to do exactly what you want. You could even have a "Leaderboard" sheet using the TrueSkill calculation.

Does this seem like something you could write? How are you tracking and storing things now?

What exactly do you use the data for? Have you gleened any interesting insight from it?

I look forward to hearing more. It sounds like a neat project.

Anonymous said...

Hi, Jeff,

I can understand your codes and AdPredictor paper, even so, I find that it's difficult to understand the TrueSkill paper and your tutorial. I just cannot understand the meaning of some notations. It will be highly appreciated if you can give some hints.

1) In the first row of table 1 in TrueSkill paper, I just cannot understand what's the meaning of pi_new_x and tau_new_x, should it be pi_new_(f->x) and tau_x_(f->x)? If so, where do pi_x and tau_x come from? Based on my understanding of factor graph, a message u_(f->x) should be sent from the prior factor to variable x, such that u_(f->x) = N(x; m, v*v), where m and v can be the default user skills and variances, or learned previously.

2) In the second row, assume that y is the skill node and x is the performance node, again I cannot understand what's the meaning of pi_y, tau_y, pi_(f->y) and tau_(f->y), and whether pi_(f->y) and tau_(f->y) are parameters for messages from the prior factor to y, or for messages from likelihood factor to y.Based on my understanding of factor graph, here a message u_(f->x) should be send from the factor node to variable x according to rule 3.2 in TrueSkill paper, assume the message from prior factor to y is defined as in 1), then u_(f->x) will be exactly the same as the final result derieved in page 24 in your tutorial.

Cosmo said...

Awesome article here.

I'm running a ranking system, it is currently using ELO but we were going to rewrite it a bit.

It uses race results, for example
1. Bob (1:02:24)
2. Jim (1:09:40)
3. Larry (1:34:34)
4. Marie (2:07:23)
5. David (forfeit)
5. Andrew (forfeit)

It is an open system so "matchmaking" really never happens; basically a race gets started and whoever wants to join, does. My question to you:

Do you think it is worth it to try to actually use the times in the calculation? One negative of this is that winning by a couple seconds isn't as exciting as before, if the times end up evening out the point loss for a close race.

Also, do you think ELO is sufficient for this kind of thing or should we be using the standard deviation that TrueSkill uses?

It looks interesting although sometimes people forfeit suddenly because something else comes up and they don't have time.

Also because calculated "matchmaking" virtually does not happen, it seems a bit excessive.

For multiple forfeits we just consider it a 0.5 win per forfeited opponent.

Was also considering using K-factor changes at 1800 and 2200, with no special new player shift. Do you think I should abandon this idea and instead try to incorporate the varying standard deviation?

We noticed the problem where new players sometimes get sorted awkwardly, as well as good players seeking easy targets. Some of the problem here is the whole matchmaking thing, but due to the nature of what we are doing it's not feasible to do random pairing / random groups.

Also, we like to keep the mean exactly 1000 at all times in our system, to avoid some sort of rating inflation. Not sure how necessary this is.

Just curious as to your thoughts on what a good rating system would be that uses solely time-based results.

Jeff Moser said...

Cosmo:

Your game sounds like Project Gotham Racing on XBox that the TrueSkill guys have worked with. For this reason, I think it's a good fit.

I don't think it's worthwhile to use the actual times in the rating update calculation but rather stick to the ordinal placing results (e.g. 1,2,3,4,5,5). The TrueSkill guys debated back and forth on whether they should add additional game info (e.g. time, points differences, etc) but finally concluded that doing so adds much more complexity without much benefit.

I think TrueSkill is better for your situation because:

1. Elo wasn't designed for a single competition with more than 2 people.
2. You wouldn't have to worry about using a dynamic K-factor since uncertainty is handled by tracking the standard deviation.
3. Elo doesn't explicitly handle draws.

If players forfeited due to a bad connection or something else, you could just adjust the TrueSkill "partial update" value for that player. However, this could possibly be exploited by cheaters, so maybe it's best to somewhat punish them for dropping out.

You're welcome to use my C# or PHP code to try it on your game outcomes and see how the data looks. Let me know how it goes.

Thanks for stopping by!

Computer Guru said...

I just wanted to thank you for taking the time to post this. Extremely enlightening, and a wonderful refresher for some probability mathematics that I had long forgotten.

developmentality said...

I took a probability course in college, but I probably learned more from this post than that whole semester course. Your presentation skills are truly inspired. The contour plots are extremely useful.

I have to admit when you started going into the 'factor' graphs, I started losing you. Maybe a slightly more gradual transition into that. I followed all the math up to that point.

FlipScript said...

Hi Jeff,

Awesome, awesome post. I've read it three times now, and have been going through the source as you explain things.

No question here (yet). I just wanted you to know that we are going to be using your TwoPlayerTrueSkillCalculator to do something... a little different.

We have a massive number of visual designs that can be combined in a multitude of ways. We assign them an initial "quality score", but it's more art than science.

We want these scores to update over time, based on what users like, and we've really made some great progress using your classes!

Basically, when a user picks one design from two or more choices, that is considered a "win" for the design that was selected, and a "loss" for the one(s) that were not.

We track mu-sigma in a database, and let all of the stats update automatically based on user preferences.

The fascinating thing is that the users don't even know that they are "ranking" the characters in the designs, but the best designs appear to be bubbling to the top (as we had hoped).

Anyway, fantastic work on this post, and the code. We really appreciate your generosity on this, and if see anything on the FlipScript.com web site that you like, drop me an email and I'll have it sent out to you. After all, you did help create it. :-)

Yaroslav said...

Can you say what you used to generate the graphics on that page?

YokoZar said...

I have to take issue with the assumption that team skill is simply the sum of the individual skills. Consider some example games:

1) Which team can reach a basket of a certain increasing height first? The winner of this game is simply whichever team has the tallest player. Here skill = height, and team skill is clearly just the max of the member's skills. More broadly, this could apply to any game where "teams" consist of an amalgam of players working individually; a math contest that gives a point to the first team to have an individual solve a problem on his own, for instance.

2) Consider a game with an exponential economy. Here a player's skill would be how fast he can multiply himself: simply summing members won't work to approximate teamplay. (Imagine we earn the team points by iteratively multiplying the team's score by our own personal skill number). Even if the game is not purely exponential, but rather functions as some combination of addition and multiplication, the additive skill assumption would be doomed to increasing inaccuracy, especially in games where one team is similar skill and another is a good and bad player paired together (n^2 > (n+1)(n-1))

3) Some games have genuinely good balance mechanisms for teams of uneven sizes. Here the addition metric is clearly off -- what matters instead is the average skill of team members. If it's a multiplicative game, then geometric mean would be more appropriate.

If you studied economics, you may be familiar with the idea of indifference curves: it's the class of curves such that an individual's preference is "indifferent" between various numbers of one good and another. I believe this is actually the same problem: the goods in question are different player's skills, and our goal is to find two "indifferent" teams (ie matched). Purely additive curves are just one kind of indifference curve, however they're widely believed to be too simplistic an assumption as they don't capture concepts like diminishing returns.

What do you think? Have you considered allowing developers to test different assumptions about formulas for "team skill", perhaps based on their own feelings about how they've balanced their game? It seems like it should be very easy to quantify which formula works best for a given game with simple experimentation -- Good + Bad skill should not consistently beat (or lose to) equal + equal skill, even when they have the same "team skill".

Jeff Moser said...

Computer Guru: Glad you liked it. Thanks for stopping by!

developmentality: I also liked seeing things visually. It helped me grok things faster. Sorry about losing you at factor graphs. They're a specialized idea. The basic idea is breaking big things down into message-passing pieces.

FlipScript: Interesting approach. You might just assign everything a default score (e.g. mu=25,sigma=8.333) and let the system automatically adapt rather than worry about a highly subjective metric. Also, you might try and use the factor graph based TrueSkillCalculator rather than the two-player one. This will allow you to treat a scenario of 3 or more options as a "win" with everyone else "drawing" for second place which might possibly reflect the reality better.

Thanks for your offer of stuff from your site. Just hearing that my code is being used is enough of a reward for me.

Yaroslav: I discussed that above.

YokoZar: There are lots of possibilities that might yield better results for a specific game type, but the nice part about TrueSkill is that it works really well for the general case. You could do something like what you're talking about by tweaking the "partial play" parameter, but then you're going into scenarios that the algorithm wasn't specifically designed for and encounter challenges.

Martin said...

Great stuff Jeff! A lot to take in in one sitting! I'll be rereading this a couple of times!

Have you compared the predictive accuracy of the TrueSkill algorithm to the Glicko2 method for 1 player teams with no draws? I'd be interested to know how it compares and under what conditions one is more accurate than the other.

I've written my own implementation of the Glicko 2 algorithm and applied this to a dataset of ATP tennis matches over the past 50 years, you can see the outcome here. I'm looking forward to applying TrueSkill to my dataset (when I find the time!) and I'll let you know what I find

once again, thanks for putting the effort in on our behalves!

Martin

John P said...

Hi Jeff,

I posted a question above regarding the ranking of my son's high school basketball team. Your reply inspired me to stick with it and I am now pleased to say that I have your classes working now and I am very pleased with the results.

My question this time is regarding the Sigma values. At the end of the season the bottom three teams are demoted into the league below and the top three teams in that league are promoted as replacements.

To start with I assigned the promoted teams 25, 8.333 Mu and Sigma values but found this to not be as accurate as I would have liked. Instead I take the average Mu value of the three demoted teams and assign this value as the initial Mu value of the promoted teams. This has proved to be far more accurate.

My problem is that I do not know what Sigma value to give the promoted teams. Any suggestions?

Jeff Moser said...

Martin: I haven't compared TrueSkill with Glicko2 myself. Feel free to post a comment with your results! You might be interested in the recently completed Kaggle chess competition. A lot of different algorithms (like Glicko2) were tried. A modified version of TrueSkill got 2nd place.

John P: Welcome back! Great to hear that you've had success with TrueSkill on your son's high school basketball team.

Interesting application of TrueSkill to the different "leagues" with promoted and demoted teams. It seems like the best solution would be to have a single consistent rating system scale for all teams regardless of their league. The only problem is that the graphs for each league are all self-connected (they probably don't play with teams outside their own league by definition). Thus, it seems like you have a slight "seeding" problem. You might try to set some initial mean for each league (e.g. the top league has mu_0 = 25, second best league has mu_0 = 20, etc) but all other parameters stay the same between leagues. This might make it easy to transition between leagues without having to fiddle with their mu's. However, because the promoted/demoted teams are unknown in their new graphs, you might just add a in 1-2 standard deviations of their new league's sigma (e.g. calculate the stddev all the sigma's for a given league and then add that in to the incoming teams's sigma). Then, let TrueSkill sort out the new teams. This is just my initial guess. Let me know how it works out!

boros1124 said...

Very good post! I recently bought a C # book. I think is very good. The only downside to these books, that require such little things in it.
http://www.konyv-konyvek.hu/book_images/01b/130441301b.jpg

Sergio said...

Incredible post!
As you say in your introduction, it is very difficult for a beginner to get into this kind of subjects.
I like very much how you built your post. I was actually looking for a python implementation of the Elo rating system, but this is much better to fully understand it. Many thanks.
One small remark, can you explain, as usual in simple terms, how do you deal with the draw, it is still quite confusing in my head.

And, if not to much to ask, i could not fully understand your example of dynamic standard deviation (sigma).

I didn't understood how you make the update on the player's sigma. In your example the weaker player(A) (with high sigma) beat the stronger opponent (B)and:

- A's skill becomes higher and Sigma is now lower;
- B's skill becomes lower (not much) and the Sigma remains the same.

Apologies for the poor English...

Timmy said...

Is there a way to calculate win probability between two players or some way to monte carlo it? I know it's intended, but it would provide a good visualization

Timmy said...

i mean i see ur other post, but want to generalize it for more than 2 ppl, thanks

Jeff Moser said...

Sergio: A draw typically means that you were well matched. That is, you both performed at about the same level, so the system can typically lower your standard deviation more (e.g. it becomes more confident). This makes some intuitive sense because if you win against someone, the system is still unsure of how good you are. Likewise, if you lose, the system is unsure just how bad you are. However, when you draw, the system knows much better about your skill.

In terms of updating sigma: imagine you've had a friend for 10 years and he is always a very nice and friendly person and then one day he's not super nice. If you had to rate his "niceness", your 10 years of experience would overwhelmingly indicate with high confidence that he's nice, so one not great day has little impact on your perception (unless it was really bad!). Alternatively, if you just met someone and they're mean, you might update your perception faster (but it could change the next day to nice if they're really nice).

Ratings in TrueSkill are similar. The more the system "learns" about you, the less your ratings tend to move and the lower the standard deviation becomes. Likewise, the mean changes less as your standard deviation goes down.

Does that help?

George B. said...

Jeff, consider yourself gifted with a talent to simplify! Your site rocks and depth:quality of your articles even more.

Greetings and many thanks from the Netherlands, George

Sean said...

Source code is great. Got an entire ranking system working in only a few hours from a database of poker game rankings my friends and I have kept.

I know poker isn't the normal type of game that TrueSkill is meant to rank. So it has produced results that are somewhat skewed.

Basically I rank a game as such:
1 Jeff (gets 66% of $$) 2 Sean (gets 22% of$$)
3 Dave (gets 11% of )
4 John
4 Jack
4 Mike

John, Jack and Mike are tied for last because they all simply "lost".. order or losing doesn't matter.

The current problem is that players who have gotten 3rd many many times with little payout have higher rankings that players who get several 1st and a few 4th. I am wondering how I can reflect the payout into the ranking update.

Thanks again for the awesome source and any advice!

Cheers,

Sean

Jeff Moser said...

George B: Thanks!
Sean: Interesting problem. You might be able to improve things by increasing the dynamics factor to allow for quicker changes, but it probably won't be the best long-term solution.

With TrueSkill, the system learns over time, so a consistent 3rd place finisher probably has a low sigma which helps out his/her TrueSkill more than a player that's only played a few times and has a higher mean but lower sigma. This should work itself out over time and reflects the Bayesian uncertainty about the player's strength. Can you give a rough distribution of how many games people have played where this oddity shows up?

Holger said...

Great work. I'm just toying with the PHP port for the results of our foosball group (where we only play doubles), and I'm getting quite a good picture already after only a total of 15 recorded games. Some of it seems strange at first sight, like a player with a 9-5 record being ranked behind one with 4-6 and another with 4-9, but seeing that he got most of his victories just because of his strong partner, I'd say that's actually quite accurate with regard to the playing abilities.

You mentioned above that you use the default values for your foosball round, including a default draw probability of 0.1. Does that mean you have draws in your round? We always play best-of-3 matches, so there are no draws; should I set the probability to 0 then to get (even) more accurate results?

Jeff Moser said...

Holgar: Each foosball game had 4 people in it (two on each side: midfield and goalie). The winner of the "game" was the first team to reach 10 goals. In any given session, there would likely be 3 games. We'd record each game in the session separately.

You're right that in foosball, you can't really tie unless you abandon a game early due to time pressure. If you are sure you won't tie, then you can set the draw probability to zero for slightly better results.

Maccara said...

Thank you!

I'm planning making an independent implementation for Mathematica and your explanations and paper help me greatly to dissect the terse original paper.

Currently I'm avoiding your code to make this truly an independent implementation, but I may end up using that eventually to verify the results, or if I run into trouble.
(I'm also avoiding it because Mathematica efficient usage is fundamentally different from typical procedural programming paradigms and I'm also doing this to improve my Mathematica programming skills, so I don't want to be "polluted" by a ready solution)

I'm hoping I can implement TrueSkill Through Time eventually also. Any plans on addressing that paper in detail, or incorporating it into your code?

David said...

Hey there! First off, thanks for the post and the beautifully written source code. I'm putting together my own project and you've helped me tons.

I play Halo 3 myself and am trying to model my friends' skill in free-for-all matches. I'm doing this partly to understand TrueSkill better, since that's what ranks us in online multiplayer. I'd really like to include predictive qualities too, though, so I was wondering if there's a way to getting each players' odds of nabbing first. I've seen your responses to Keilin, but I don't have a great grasp of Gaussian concepts yet, and I don't understand if this can be done for multiple people (or how it would be done).

On another note: both Microsoft's Trueskill calculator and your online implemenation diverge/fail with large values of mu (especially for extremely unbalanced games). I understand Trueskill is not meant to model such encounters, but I'd love to understand why it works that way.

Thanks for any help, Jeff! Trust me, you've already helped a lot!

-David

Jeff Moser said...

Maccara: I don't have plans to implement TrueSkill Through Time for the foreseeable future just because I've since moved on to different areas of research.

David: It should be possible, just replace "team" with the individual person's mean and sigma and use totalplayers=2 for comparing two players.

How do you define large? 50? I used TrueSkill for mu = ~1200 in chess, you just have to appropriately use proportional values of sigma_0 and beta per the ratios outlined in the code and it should be fine.

Hope that helps.

David said...

Thanks for the response! My problem, though, is expanding beyond two players. I've got upwards of 8 friends in each free-for-all match. For now, I'm calculating the odds experimentally by pulling random numbers from their normal distributions and comparing them (a few thousand times). Not exactly elegant, but the best I can do. :) Do you know how to do it mathematically?

Anonymous said...

Hi,

is there a way to account for different scores at the end of a game? E.g. one might win low with 1:0 or high with 10:0, which should result in a higher skill adjustment..

Kjartan said...

Now one question: I noticed that when using your implementation with all the default values unchanged and pitching two players against each other for 100,000 matches while letting the same player always win leads to the standard deviation of both players actually to start increasing in the long run after a quick initial dip.

I would have expected it to converge to zero with the number of matches.

Setting the dynamics factor to zero makes it converge, but _extremely_ slowly.

This is somewhat counter intuitive. Do you have any explanation for this?

Sombra Negra said...

Implementation in Excel looks promising. Does anybody have already made such implementation? I am interested on a 2 teams with 5 players each and track individual players skills.

Anonymous said...

Thank you for all explanations!

I am wondering if this algorithm can be used to make a ranking for tennis players. There will be 100 players and about 1300 matches between them. Can you please tell me what modifications must be made and if it is possible? Thank you

Jeff Moser said...

David: The one with the higest TrueSkill is expected to come in first. You can compute the odds of winning against others in a pairwise fashion.

Anonymous #9: TrueSkill specifically doesn't care about score by design. The designers didn't want to worry about complexities like score. It'll work itself out without needing to know score.

Kjartan: The dynamics factor adds in some uncertainty to the outcome because there is always the chance that you learned something new since the last game.

Sombra Negra: The Microsoft calculators can give you a start.

Anonymous #10: I used this for tennis without any modifications. The standard mu/sigma work great.

Kyera said...

I'm a little late to the party, but this is a great analysis of this data.

I've been trying to pull everything out so I can throw it into a spreadsheet (don't ask), but once I got to the ErrorFunctionCumulativeTo section I think my head exploded. ;)

Either way, I'll keep messing with it -- the 'easy out' for what I'm using it for is just to keep using Microsoft's Online Calcs, but where's the fun in that?

Ivan Veselov said...

Hi Jeff!

Thanks for your wonderful article, I'm enjoying it so much!

I'm reading the article in parallel with your supplementary math document and the code on GitHub and I've got several questions regarding Elo model:

What is the connection between Gaussian Elo model described in "True Skill: a Bayesian Skill rating system" and the FIDE Elo formula used in your code? i.e. 1/(1 + 10 ^ ratingDifference / (2 * beta))

I can't see any direct correspondence :( Moreover, on the wikipedia page about Elo rating, this formula is explained in connection with logistic function instead of Gaussian:

"If Player A has true strength RA and Player B has true strength RB, the exact formula (using the logistic curve) for the expected score of Player A is: ..."

In the same time they say that FIDE is still using Gaussian instead of logistic (as originally suggested by Elo). This is slightly confusing...

I believe it's possible to derive this formula with powers of ten by approximating logistic probability function somehow, but I don't know the details, perhaps they are given in the original Elo book, but it is not available in public domain.

Another idea: perhaps it should be possible to calculate the difference on logistic curve (in the same way as you calculated convolutions on Gaussians in the supplementary paper) and then simplify the result got to this "powers of ten" formula...

Do you have any thoughts or clarifications on this? I would be very grateful!

Thanks a lot!

Ivan Veselov said...

Following my previous comment: I was curious how different are the rating changes calculated using Gaussian formula and FIDE formula, so I've implemented both and made a surface plot:

http://dl.dropbox.com/u/818369/img/gaussian_vs_fide_elo.png

x-axis is first player rating, y-axis is the second player rating, and z-axis is the difference between rating update value in Gaussian formula and FIDE formula (after multiplying by K).

It seems the difference is zero if opponents ratings are equal (so both formulas produce the same result then) and overall the difference is small enough (less than 0.5 points), so this formula with power of tens seems to be a good enough approximation of gaussian model. But the surface itself is weird enough :)

I'd like to experiment with logistic curve as well, but I'm not sure if it's legal just to change the probability function from gaussian to logistic and use the same parameters for them.

The code I used to play and test is on GitHub

Jeff Moser said...

Kyera: The good news is that I only used the ErrorFunctionCumulativeTo to build up to the Gaussian/Normal distribution calculation code that Excel already has, so you can just use those Excel functions directly.

Ivan Veselov: Good thoughts. I think the logistic function might fit the data of real chess players slightly better, but the Gaussian function is easier to understand and it's Gaussians that are used by TrueSkill.

mvbaffa said...

Hi Jeff,

I am looking for a Rating system to rate individual players and teams in sports like basketball, soccer etc.

The performance of these players must be evaluated based on scouts numbers and not only on win and loss.

How could we adapt and use TrueSkill in these environments

Nate Parsons said...

mvbaffa: if you don't have time to read the attachments, then I would respectfully suggest that I don't think you have time to build your system.

In any case, the only input to TrueSkill is win/loss/draw, but you are free to combine the TrueSkill rating with anything else you choose, whatever these 'scouts numbers' happen to be.

mvbaffa said...

Hi Nate,

Thanks for your answer. And yes, you are right, I do not have, at this moment, much time to build my system.

I am just at the beggining and studing the avaiable alternatives.

I beleived that Jeff, or anyone else, including you, could help me with the experience you have in TrueSkill, to show me a more clear path. Something like a Shortcut.

If this is not possible or if I am asking too much I understand, no problem.

I respectfully thank you.

Anonymous said...

Hello Jeff,

First off, thanks for the great code! I love the unit tests; they make it really easy to start using your library!

I do have a question for you. Do you know of anyway to optimize the library for large players (50-200) each on their own team? I am looking to use the PHPSkills library to calculate trueskill for many individuals. These players compete against each other and as such are all on their own team.

When I start to have 50+ Player/Teams, the FactorGraphTrueSkillCalculator begins to take a very, very long time to compute. I was wondering if you had any suggestions for optimizing for large sets of teams of a 50, 100, 200 player match.

I found one variable in Moserware\Skills\TrueSkill\Layers\IteratedTeamDifferencesInnerLayer.php called initialMaxDelta set to value 0.0001.

If I change that value to 0.1 (arbitrarily) the computations seem to take no time at all now, and the smaller matches (with only 2-10 player/teams) seem to have near similar rankings as before (plus or minus 0.00005 mu and sigma).

Can you offer any insight into this variable or other suggestions for speeding up large calculations?

Thank you

Jassim said...

Jeff,

I am wondering how matching percentage is calculated? Is this the common area between two bell curves upon the total area(2).

sammy said...

i've been struggling a lot with the trueSkill paper ! thank u alot for the code and the math behind it :)

Andrew van der Westhuizen said...

This is a lot of effort and would just like to take the time to congratulate you in the write-up. Was recently linked it and now feel like fiddling with something new...

Kimberly Stedman said...

Hey, you did an incredible job on this, man! Tank you for your hard work. Fyi, if you blogged more, I'd read it.

Michael said...

Great article!

I may be missing something obvious here, but which update do you use in the event of a draw? The winner update formula?

Thanks!

Jeff Moser said...

Michael: There are two cases: if you're below the draw margin (epsilon) you use the draw case, otherwise you use the exceeded draw margin case (i.e. someone won). The code and paper go into this in more detail.

Luigi said...

Hello Jeff,

very impressive work. Congratulations.

I have a question regarding your statement "At this point, we add some uncertainty to your skill’s standard deviation to keep game dynamics interesting and prevent the standard deviation from hitting zero since the rest of algorithm will make it smaller (since the whole point is to learn about you and become more certain)."

I wonder what happen during the calculation if you do not add "some uncertainty to your skill’s standard deviation". Does it risk in some rare case to become negative after the update procedure?

Thanks

Jeff Moser said...

Luigi: It shouldn't become negative but rather asymptotically approach zero.

Anonymous said...

Hi Jeff,

Great post. Like others have commented, the depths you have gone to produce this work is truly astounding.

I love everything about Trueskill, however, the implementation I have in mind may not work due to the inability to produce win probabilities. I know a couple of people have asked this question, but I am not clear on how to proceed.

For example, if I were to start using TrueSkill to rate greyhounds. If I were to go to the track and wish to place a value bet i.e (odds greater than the true chance of winning), I need to derive the win probabilities for each dog. Is this simply not possible with TruSkill ?

Any type of example to get me going on this would be greatly appreciated.

Stanko Horvat said...

Incredible work! Thank you Jeff!
I have just one question regarding you formula for winning probability. So you said:

"The probability of team A winning over team B is approximately GaussianDistribution.CumulativeTo(teamAWinNumerator / denominator)

where teamAWinNumerator = teamAMeanSum - teamBMeanSum - drawMargin

and

denominator = Math.Sqrt(totalPlayers * (beta * beta) + teamASigmaSquaredSum + teamBSigmaSquaredSum)"

How to compute teamASigmaSquaredSum and teamBSigmaSquaredSum.

I've been struggling with math with this.

Sorry for a newbie question :) and keep up a good work.

Thanks for all again!

Scott Craig said...

In order to work out the probability of a team winning I wrote the following code. Seems to work ok, let me know if you spot any bugs!

Some caveats. The routine only expects one player teams (i.e. Free for all) and the possibility of a draw is 0.

The AllPermutations() extension method is courtesy of http://blogs.msdn.com/b/updownroundnround/archive/2009/12/14/generating-all-permutations-c-enumerator-implementation.aspx

private static double CalculateWinProbabilityForTeam(Team teamToCalculateWinProbabilityFor, IEnumerable> teams)
{
var permutations = teams.AllPermutations();
List> permutationProbs = new List>();

foreach (var permutation in permutations)
{
var permList = permutation.ToList();

// The first team in this collection is the key for the dictionary containing all permutations.
Team tupleTeam = new Team(permList[0].First().Key, permList[0].First().Value);
double cumalitiveProb = 0;

for (int x = 1; x < permutation.Count(); x++)
{
var winNumerator = permList[x - 1].First().Value.Mean - permList[x].First().Value.Mean;
var denominator = Math.Sqrt(2 * (GameInfo.DefaultGameInfo.Beta * GameInfo.DefaultGameInfo.Beta) + (Math.Pow(permList[x - 1].First().Value.StandardDeviation, 2)) + Math.Pow(permList[x - 1].First().Value.StandardDeviation, 2));
var winProb = GaussianDistribution.CumulativeTo(winNumerator / denominator);

// Work out the conditional probability.
if (x == 1)
{
cumalitiveProb = winProb;
}
else
{
cumalitiveProb = cumalitiveProb * winProb;
}
}

// Add the prob for ths permutation to the collection.
}

// Now work out the win prob for team in question.
var probForMyTeam = permutationProbs.Where(s => s.Item1.AsDictionary().First().Key == teamToCalculateWinProbabilityFor.AsDictionary().First().Key).Select(s => s.Item2).Sum();
var probForAll = permutationProbs.Select(s => s.Item2).Sum();

return probForMyTeam / probForAll;
}

Dan P said...

Jeff, this is awesome!! Thank you.

Your post was very well written and understandable for such heavy material (jeez factor graphs!?).

Thanks for publishing!

I'm looking to implement this system in some kind of competitive league... interested in helping?

Don said...

Great article. Haven't looked at the code yet, but believe it will be very useful for me.

A question: How about considering an individual player as if it were a team? Some individual sports have different skillsets coming into play depending on factors like the track or the weather of the day. I am thinking about e.g.. cycling , where you have mountain stages, flat stages and time trials. The chances of spinter winning a mountain stage are small, just like a climber is unlikely to win a mass sprint. Same in car racing: some rally riders are much better in snow or rain, or on very winding tracks. This means that in some sports one overall Trueskill rating for all riders is not very useful to predict their chances of winning a given stage/race. I was thinking about solving that by considering a rider as if it were a virtual team of players, with each "player" being a specific skill set: e.g "climbing", "sprinting", "cornering"... Then use different "player" weightings, depending on the characteristics of upcoming stage/race. Am I on the right track, or do you see a better way of tackling this?
By the way also in team sports like football we could add a virtual player "rain", which comes in action only when a match is played in rain. Would like to hear your thoughts.
Thanks.

Jeff Moser said...

Stanko Horvat: teamASigmaSquaredSum is the sum of all the variances of each player. For examples (player_1_on_team_a_sigma)^2 + (player_2_on_team_a_sigma)^2 + ...

Scott Craig: Thanks!

Dan P: Sounds interesting, but I'll have to pass due to other time commitments.

Don: I think that a "virtual player" teammate could be helpful if the conditions tend to affect a large swaths of players in a similar way. Another approach would be to treat the conditions as a separate skill entirely. Thus, you'd have a separate mu/sigma/ranking for "climbing", "sprinting", etc. It'd be more to track, but it could lead to more intuitive results.

Anonymous said...

this funciton from
namespace Moserware.Numerics
class GaussianDistribution

public static double CumulativeTo(double x, double mean, double standardDeviation)
{
double invsqrt2 = -0.707106781186547524400844362104;
double result = ErrorFunctionCumulativeTo(invsqrt2*x);
return 0.5*result;
}

after read the formula from wiki,

i think it shoube be this:
double result = InverseErrorFunctionCumulativeTo(invsqrt2*x);

Anonymous said...

sorry for my mistake

i should be this below:

double result = ErrorFunctionCumulativeTo(invsqrt2*x);
return 0.5*(1-result);

arunachalam said...

Great work. Indeed it was a nice presentation overall.

Got a question for you. Im looking to use your code for my problem. Pleas let me know how i can use it.

Problem Statement:
There are some N players in the pool. Two teams with each 5 players are selected from the pool. Result for around 15000 games are given. All i wanted to create a model from these records and predict the outcome of a game given two team with 5 players each.

All i understood from your explanation is that i need to find the true skill of team by summing up the individual skills.

Can you help me out in using your code and solving my problem?

Jeff Moser said...

Anonymous: Yep, it looks like I didn't make the code work for non standard normal cases. I'll make a note to fix.

arunachalam: You can use TrueSkill for this scenario. For each game, make a note of the players on each team and then get the result of the match. Use TrueSkill to calculate the after-game skill levels for each. You could use Match Quality as well as probablity of winning (mentioned above in comments) from the TrueSkill data.

Anders said...
This comment has been removed by the author.
Anders said...

Hi Jeff
I'm very interested in your work with the foosball rating!.
I'm a noobie with the programming stuff. But I would love to have a program like that to rate the weekly tournament in the local club. How do I get a program like that.?
Kind regards
Anders

Unknown said...

First off, I'm using the PHP fork of the code, but I'm assuming it's functionally very similar. And now for my question...

I'm getting into using TrueSkill to create balanced teams for some games that we play at work.

The way I'm doing this is by creating all possible team combinations and running each of the match ups through the calculateMatchQuality method of the TwoTeamTrueSkillCalculator and pulling the highest quality matches and using that one.

The issue I'm having, is that all the matches seem to have the same ~44% quality rating, regardless of the teams. Sometimes both the better players are on one team, and sometimes they seem moderately balanced. But they all come out with the same quality.

My question is this: Is there something I'm missing? Is there some other number I should be using to test the balance of the teams?

Also... is there a built in way to create balanced teams? Or is there some other method that might be better than brute forcing my way through all possible combinations?

Anywho... Thanks for deciphering the paper for the rest of us. =)

Marko said...

Hey Jeff, great article, did learn a lot.

However, I have a couple of questions that I can't seem to find an answer to anywhere else.

First one is, can TrueSkill be modified to work correctly with distributions other than normal? Some games don't really follow the Gaussian curve, but instead more like a Inverse-chi-squared distribution or Pareto distribution. Is it possible to use TrueSkill for these distributions?

Second question comes from reading a paper on Glicko2. Specifically with part about rating period. This is a quote from the paper:

The Glicko-2 system works best when the number of games in a rating period is moderate to large, say an average of at least 10-15 games per player in a rating period.

Seeing as TrueSkill borrows some concepts from Glicko, I thought this is similar in TrueSkill as well. Now my question is, could TrueSkill be forced to work in real time with a lot of games, updated in shorts period of time? For example, imagine a ladder system with hundreds of players all playing 1v1 games at the same time. How often would the ladder ranking be updated and would players be able to see their new rating and rank after each game?

Cheers,
Marko.

Gareth Williams said...

I had a lot of trouble understanding the update equations in the TrueSkill paper at first. I couldn't quite figure out what it means when it says

pi(new)_f->x = a(pi_y - pi_f->y)

I got a bit confused at first about the notation here, and then once I understood the notation I got a bit confused about the formula.

The notation comes about because all messages being passed around are Gaussian, so each message can be characterised by a pi and tau.

Equally, each marginal is also a Gaussian.

So the formula above means that the pi of the message being passed from f->x depends on the pi of the marginal for y and the pi of the message being passed from f->y.

Once I figured out what it meant, I then had trouble understanding why the formula took this form. It looks kindof recursive, with pi_f->x depeneding on pi_f->y and pi_f->y depending on pi_f->x.

However if you consider the sentence from the Factor Graphs tutorial paper: "We can compute [the marginal for a variable] as the product of the two messages that were passed (in opposite directions) over any single edge incident on x_i"

This means that the marginal for y is the product of the message that is passed from f->y (which == the message passed from x->f) and the message passed from y->f.

Rearranging, we get

msg_f->y = marg(y) / msg_y->f

=> msg_x->f = marg(y) / msg_y->f (1)

And equation (1), in combination with the properties of pi and tau, finally let me understand the update equations in Table 1. Because the message is the quotient of two other messages, the pi (and tau) is the difference of the two pis for the messages in the quotient.

Hope that helps someone out there trying to understand the TrueSkill paper.

Gareth

Andrew Ames said...

I noticed in the source code the DrawMargin class assumes there are exactly 2 teams with one player on each team (1v1). Should the number of teams and the number of players per team be passed into the GetDrawMarginFromDrawProbability function?

Andrew Ames said...

Regarding my previous question: I think I figured it out. GetDrawMarginFromDrawProbability is only ever used when doing pairwise comparisons between 2 gaussians (a pair of means and standard deviations. So, n1 and n2 being both 1 makes sense.