Monday, November 21, 2011

Life, Death, and Splitting Secrets

(Summary: I created a program to help back up important data like your master password in case something happens to you. By splitting your secret into pieces, it provides a circuit breaker against a single point of failure. I’m giving it away as a free open source program with the hope that others might find it useful in addressing this aspect of our lives. Feel free to use the program and follow along with just the screenshots below or read all sections of this post if you want more context.)

Background

I just couldn’t do it.

Grandma and Jeff

My grandma died at this time last year from a stroke. She was a great woman. I still miss her. In that emotional last week, I was reminded of great memories with her and the fragility of life. I was also reminded about important documents that I still didn’t have.

When something happens to you, be it death or incapacitation, there are some important steps that need to occur that can be greatly assisted by legal documents. For example:

  1. An advance health care directive (aka “Living Will”) specifies what actions should (or shouldn’t) be taken with regards to your healthcare if you’re no longer able to make decisions for yourself.
  2. A durable power of attorney allows you to designate someone to legally act as you if you become incapacitated.
  3. A last will and testament allows you to legally assign caregivers for minor children as well as designate where you'd like your possessions to go.

My grandma had these documents, and they helped reduce stress and anxiety during that difficult time. We knew what she would have wanted, and these documents helped legally enforce it.

I had assumed that these documents were expensive and time-consuming to create. Furthermore, as a guy in my 20’s, death still seems like a distant rumor. As a Christian, I’m not overly concerned about death itself, but my grandma’s death reminded me that these documents are not really for me, but rather the people I’d leave behind. I knew that if something happened to me, I’d potentially be leaving behind a mess, and that concern of irresponsibility compelled me to investigate what I could do.

It turns out that creating these documents is essentially a matter of filling out a form template. I bought a program that made it about as easy as preparing taxes online. In most cases, you just need disinterested third parties, such as friends or coworkers, to witness you signing them to make them fully legal. At most, you might have to get them notarized or filed in your county for a small fee.

One of the steps involved in filling out the “Information for Caregivers and Survivors” document is to list “Secured Places and Passwords.” It’s a helpful section that your executor can turn to if something happens to you in order to do things like unlock your cell phone or access your online accounts. Sure, your survivors might be able to use legal force to get access without it, but only after months of sending official documentation. That’s a lot of hassle to put someone through. Also, it’s very likely that a lot of important things will be missed and no one would ever know they existed.

It’s probably rational to just write your passwords down and put them in a safe which your executor knows the location of and can access in a timely manner. Alternatively, you could pay for an attorney or a third-party service and leave your password list with them. However, this seemed like it would cause a maintenance problem, especially as I might add or update my passwords frequently. These options would also force me to trust someone I haven’t known for a long time. Most importantly, the thought of writing down my passwords on a piece of paper, even if it was in a relatively safe place, went against every fiber of my security being.

I just couldn’t do it.

DISCLAIMER: The above simple approaches are probably fine and have worked for a lot of people over the years. If you’re comfortable with these basic approaches, by all means use them and ignore this post. These simpler approaches have fewer moving parts and are easy to understand. However, if you want a little more security, or need to liven up this process with a little spy-novel-esque fun, read on.

The Modern Password & Encryption Problem

As an online citizen, you don’t want to be that person. You know, the one whose password was so easy to guess that his email account was broken into and who “wrote” to you saying that he decided to go to Europe on a whim this past weekend but now needs you to wire him money right now and he’ll explain everything later: that guy.

You’ve learned that passwords like “thunder”, “thunder56”, and even “L0u|>Thund3r” are terrible because they’re easily guessed. You now know that the most important aspect of a password is its length combined with basic padding and character variation such as “/* Thunder is coming! */”, “I hear <em>thunder</em>!”, or “1.big.BOOM@thunder.mil”.
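To get a feel for why length dominates, here’s a rough back-of-the-envelope sketch (my own illustration, not part of any tool mentioned here). It estimates brute-force strength as length × log2(alphabet size), which ignores dictionary and pattern attacks, so treat the numbers as optimistic upper bounds:

    using System;

    class PasswordEntropySketch
    {
        // Rough brute-force entropy in bits: length * log2(alphabet size).
        // Real attackers use smarter guessing, so this is only an upper bound.
        static double Bits(int length, int alphabetSize)
        {
            return length * Math.Log(alphabetSize, 2);
        }

        static void Main()
        {
            Console.WriteLine(Bits(7, 26));   // a lowercase word like "thunder": ~33 bits
            Console.WriteLine(Bits(12, 72));  // a short mix of cases, digits, and symbols: ~74 bits
            Console.WriteLine(Bits(24, 95));  // a long padded phrase over printable ASCII: ~158 bits
        }
    }

Every extra character multiplies the search space, which is why a long, memorable phrase beats a short, cryptic one.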

In fact, you’re probably clever enough that you don’t create or remember most of your passwords anymore. You use a password manager like LastPass or KeePass to automatically generate and store unique and completely random passwords for all of your accounts. This has simplified your life so that you only have to remember your “master password” that will get you into where you keep all the rest of your usernames and passwords.

Skeleton Key

You also understand that your email account credentials are a “skeleton key” for almost everything else due to the widespread use of simple password reset emails. For this very reason, you probably realize that it’s critical to protect your email login with “two-factor” authentication. That is, your email account should at least be protected by:

  1. Something you know (your password) and
  2. Something you have (your cellphone), that creates or receives a one-time use code when you want to login.

On top of all of this, you try your best to follow the trusty advice that your passwords should be ones that nobody could guess and you never ever write them down.

But what if something happens to you? If you’ve done everything “right,” then your master password and all your second factor details go with you.

And then there are your encrypted files. Maybe you’re keeping a private journal for your children to read when they grow up. Perhaps you’re living in some spy novel life where you’re worried that people will take you out to prevent something you know from being discovered. Wherever you fall on the spectrum, what do you do with such encrypted data?

Modern encryption is a bit scary because it’s so good. If you use a decent encryption program with a good password/key, then it’s very likely that no one, not even a major government, could decrypt the file even after hundreds of years. Encryption is great for keeping prying eyes out, but it can also lock out the very survivors you want to have access to your data. The thought of something being lost forever might make you almost yearn for the days when you just put everything into a good safe that’s rated by how many minutes it might slow somebody down.

On a much lighter note, the “something” that happens to you doesn’t have to be so grim. Maybe you had a really relaxing three-week vacation and now you can’t remember the exact keyboard combination of your password. Given that your brain has to recreate a memory each time you recall it, it’s possible that you could stress yourself out so much trying to remember your password that you effectively “forget” it. What do you do then?

When you put all your eggs into a password manager basket, you really want to watch that basket. Fortunately, creating a basic plan isn’t that hard.

A Proposed Solution

Example nuclear launch keys

Let’s borrow an ancient yet incredibly useful idea: if it’s really important to get your facts right about something, be sure to have at least two or three witnesses. This is especially true concerning matters of life and death but it also comes up when protecting really valuable things.

By the 20th century, this “two-man rule” was implemented in hardware to protect nuclear missiles from being launched by a lone rogue person without proper authorization. The main vault at Fort Knox is locked by multiple combinations such that no single person is entrusted with all of them. On the Internet, the master key for protecting the new secure domain name system (DNSSEC) is split among 7 people from 6 different countries such that at least 5 people are needed to reconstruct it in the event of an Internet catastrophe.

If this idea is good enough for protecting nuclear weapons, the Fort Knox vault, and one of the most critical security aspects on the Internet, it’s probably good enough for your password list. Besides, it can make a somewhat uncomfortable process a little more fun.

Let’s start with a simple example. Let’s say that your master password is “1.big.BOOM@thunder.mil”. You could just write it out on a piece of paper and then use scissors to cut it up. This would work if you wanted to split it among 2 people, but it has some notable downsides:

  1. It doesn’t work if you want redundancy (i.e. any 2 of 3 people being able to reconstruct it)
  2. Each piece would tell you something about the password and thus has value on its own. Ideally, we’d like the pieces to be worthless unless a threshold of people came together.
  3. It doesn’t really work for more complicated scenarios like requiring 5 of 7 people.

Fortunately, some clever math can fix these issues and give you this ability for free. I created a program called SecretSplitter to automate all of this to hopefully make the whole process painless.

Let’s say you want to require at least 2 witnesses to agree that something happened to you before your secret is available. You also want to build in redundancy such that any pair of people can find out your password. For this scenario, you can use the default settings and press the “split” button:

Specifying message

You’ll get this list of split pieces:

List of message shares

Notice that each piece is twice as long as your original message (about twice the size of a package tracking number). This is by design.

Now comes the hard part: you have to select three people you trust. You should have high confidence in anyone you’d entrust with a secret piece. It’s easy to get caught up in gee-whiz cryptography and miss fundamentals: you ultimately have to trust something, especially with important matters. SecretSplitter provides a trust circuit breaker just in case (because even well-meaning people can lose important things). The splitting process adds a bit of complexity, but so do real circuit breakers. If you trust no one, then you can’t have anyone help you if something happens.

For demonstration purposes, let’s say you trust 3 people.

You now have to distribute these secret pieces. You could do all sorts of clever things like send letters to people that will be delivered far in the future or read them over the phone. However, distributing them in person is a pretty good option:

Creating an envelope with a share

It can make the upcoming holiday table discussions even more fun:

Handing over the envelope with the secret piece

Let’s pretend that something happened to you. Two of the three family members that you gave pieces to would come together and agree that “something” indeed has happened to you. What happens now?

Two opened envelopes with secret shares

Well, either you included a note with each secret piece or you emailed them previously with instructions that they’d just need to download and run this small program. The pair comes together at a laptop and they each type their piece in quickly and then press “Recover”:

Typing in secret shares with a typo

Oops... they typed so quickly that they mixed up one of the digits. The program tells them where to look:

Warning about typo

They fix the typo and press recover again:

Fixed the typo

And immediately they see:

Recovered message

Password recovered! They could now use this master password to log into your password manager where you’ve stored further details.

This “message” approach is useful if you have a small amount of data such as a password that you could write on a piece of paper. One downside is that each piece is twice the size of the text message. If your message gets much larger, it will no longer be feasible to type it in manually.

One alternative approach is to bundle together all of your important files into a zip file:

Example of a compressed file contents

To split this file, you’d click the “Create” tab and then find the file, set the number of shares and click “Save”:

Splitting up a file

You’ll then be told:

MessageBox asking you to save the file

And then you pick where to save the encrypted file:

Save file dialog

Finally, you’ll see this screen:

Split file pieces

This creates a slightly more complicated scenario because you now have two things to distribute: the secret pieces and the encrypted file with all your data. The encrypted file doesn’t have to be secret at all. You can safely email it to people that have a secret piece:

Sending the fun email

Now, if something happens to you, they’d run the program, type in two shares, and press “Recover”:

Entering in file shares

It’ll then tell them:

Specify encrypted file MessageBox

They’d then go to their email and search for the email from you that includes your encrypted file:

Searching email

Then they’d find the single message (or the latest one if you sent out updates) and download your encrypted attachment:

Found email

They’d then go back to the program to open it up:

Opening the file

and then they’d see a message to be careful where they saved it:

Will you keep the data safe?

and then they’d save it:

Save decrypted

They'd then be asked if they want to open the decrypted file, which they’d say “Yes”:

Open decrypted file?

Now they can see everything:

Example of a compressed file contents

It might sound complicated, but if you’re familiar with the process, it might only take a minute. If you’re not tech savvy, have never done it before, and type slowly, it might take 30 minutes. In either case, it’s faster than having to drive to your home and search around for a folder, and it contains everything you wanted people to know (especially when things are time sensitive).

That’s it! Your master password and important data are now backed up. The risk is distributed: if any one piece is compromised (e.g. it gets lost or misplaced), you can have everyone else destroy their secret piece and nothing will be leaked. Also, the program has an advanced feature that lets you save the file encryption key. This feature allows you to send out updated encrypted files that can be decrypted with the pieces you’ve already established in person.

SecretSplitter implements a “(t,n) threshold cryptosystem,” which can be thought of as a mathematical generalization of the physical two-man rule. The idea is that you split up a secret into pieces (called “shares”) and require at least a threshold of “t” shares to be present in order to recover the secret. If you have fewer than “t” shares, you gain no information about the secret. Whatever threshold you use, it’s really important that each “shareholder” knows how many shares are required for recovery.

You can be quite creative in setting the threshold and distributing shares. For example, you can trust your spouse more by giving her more shares than anyone else. The key idea is that a share is an atomic unit of trust. You can give more than one unit of trust to a person, but you can never give less.

Another important practical concern is that you should consider adding redundancy to any threshold system. This is easily achieved by creating more shares than the threshold number. The reason is that if you’re going out of your way to use a threshold system, then you probably want to make sure you have a backup plan in case one or more of the shares are unavailable.

IMPORTANT LEGAL NOTE: It’s tempting to keep everything, including the important directives and your will, in only electronic form (even when they’re signed). Unfortunately, most states require the original signed documents to be considered legal and most courts will not accept a copy. For this reason, you should still have the paper originals somewhere such as a fireproof safe. However, be careful where you put the originals: although it might sound convenient to put them in a bank safety deposit box, there’s usually a rather long waiting period before a bank can legally provide access to your box to a survivor, so don’t put any time sensitive items there. My recommendation at the current time would be to include copies of the signed originals in your encrypted file and also include detailed instructions on where the originals are located and how to access them.

How It Works

Given the sensitive nature of the data being protected, I wanted to make sure I understood every part of the mathematics involved and literally every bit of the encrypted file. You’re more than welcome to just use the program without fully understanding the details, but I encourage people to verify my math and code if you’re able and curious.

To get started, recall that computers work with bits: 1’s and 0’s that can represent anything. For example, the most popular way of encoding text will encode “thunder” in binary as

01110100 01101000 01110101 01101110 01100100 01100101 01110010

We can write this more efficiently using hexadecimal notation as: 74 68 75 6E 64 65 72. We can also treat this entire sequence of bits as a single 55 bit number whose decimal representation just happens to be 32,765,950,870,971,762. In fact, any piece of data can be converted to a single number.
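If you’d like to check that number yourself, here’s a small C# sketch using .NET’s BigInteger (purely an illustration of the bits-to-number idea, not how SecretSplitter stores data internally):

    using System;
    using System.Numerics;
    using System.Text;

    class TextAsNumberSketch
    {
        static void Main()
        {
            byte[] bytes = Encoding.ASCII.GetBytes("thunder");

            // Hex view: 74-68-75-6E-64-65-72
            Console.WriteLine(BitConverter.ToString(bytes));

            // Treat the whole byte sequence as one big number.
            // BigInteger expects little-endian bytes, so reverse them and
            // leave a trailing zero byte to keep the value positive.
            byte[] littleEndian = new byte[bytes.Length + 1];
            for (int i = 0; i < bytes.Length; i++)
                littleEndian[i] = bytes[bytes.Length - 1 - i];

            Console.WriteLine(new BigInteger(littleEndian));  // 32765950870971762
        }
    }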

Now that we have a single number, let’s go back to your algebra class and remember the equation for a line:  y=mx+b.

Line showing intercept

In this equation, “b” is the “y-intercept”, which is where the line crosses the y-axis. The “m” value is the slope and represents how steep the line is (i.e. its “grade” if it were a hill).

This is all the core math you need to understand splitting secrets. In our particular case, our secret message is always represented by the y-intercept (i.e. “b” in y=mx+b). We want to create a line that will go through this point. Recall that a line could go through this point at any angle. The slope (i.e. “m” in y=mx+b) will direct us where it goes. For things to work securely, the slope must be a random number.

Although we use large numbers in practice for security reasons, let’s keep it simple here. Let’s say our secret number is “7” and our random slope is “3.” These choices generate this line:

y=3x+7

With this equation, we can generate an infinite number of points on the line. For example, we can pick the first three points: (1, 10), (2, 13), and (3, 16):

3 points

You can see that if you had any two of these points, you could find the y-intercept.

It’s critical to realize that having just one of these points gives us no useful information about the line. However, if we also had any other point on the line, we could use a ruler to draw a straight line to the y-intercept and thus reveal the secret (we could also work it out algebraically). Each point represents a secret piece or “share” and has a unique “x” and “y” value.
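Here’s that picture as a minimal C# sketch using plain integer arithmetic (the names are mine and this is illustration only; a real scheme picks the slope randomly and works in a finite field, as described below):

    using System;

    class LineSharingSketch
    {
        // A "share" is just a point (x, slope * x + secret) on the line y = m*x + b.
        static (int X, int Y) CreateShare(int secret, int slope, int x)
        {
            return (x, slope * x + secret);
        }

        // Recover the y-intercept (the secret) from any two distinct points.
        // Exact here because the slope and points are integers.
        static int RecoverSecret((int X, int Y) p1, (int X, int Y) p2)
        {
            int slope = (p2.Y - p1.Y) / (p2.X - p1.X);
            return p1.Y - slope * p1.X;   // b = y - m*x
        }

        static void Main()
        {
            const int secret = 7;
            const int slope = 3;   // must be chosen randomly in a real scheme

            var share1 = CreateShare(secret, slope, 1);   // (1, 10)
            var share2 = CreateShare(secret, slope, 2);   // (2, 13)
            var share3 = CreateShare(secret, slope, 3);   // (3, 16)

            Console.WriteLine(RecoverSecret(share1, share3));   // 7
            Console.WriteLine(RecoverSecret(share2, share3));   // 7
        }
    }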

The mathematically fascinating part about this idea is that a line is just a simple polynomial (curve) and this technique works for polynomials of arbitrarily large degrees. For example, a second degree polynomial is a parabola that requires 3 unique points to completely define it (one more than a line). Its equation is of the form y=ax^2 + bx + c. In our case “c” is the y-intercept and “a” and “b” are random as in y = 2x^2 + 3x + 7:

Given this equation, we can generate as many “shares” as we’d like: (1,12), (2,21), (3,34), (4,51), etc.

Keep in mind that a parabola requires three points to uniquely define it. If you just had two points, as in (1,12) and (2,21), you could create an infinite number of parabolas going through these points and thus have infinite choices for what the y-intercept (i.e. your secret) could be:

6 parabolas going through the same two points

However, a third point will define the parabola and its y-intercept exactly:

Unique parabola

You’ve just learned that splitting a secret that requires three people is just a matter of creating a parabola. Requiring more people is just a matter of creating a higher-degree polynomial such as a cubic or quartic polynomial. If you understand this basic idea, the rest is just details:

  1. Instead of using numbers, we translate the data to a big polynomial with binary coefficients.
  2. Instead of using middle school algebra, we use a “finite field.” This helps keep results about the same size as the input and adds some security.

Don’t be intimidated by these changes. The core ideas are the same as the basic case. The only noticeable difference is that you have to think of operations like multiplication and division in a more abstract way. For details, check out my source code’s use of Horner’s scheme for evaluating polynomials, peasant multiplication, irreducible polynomials with the fewest terms, Lagrange polynomial interpolation to find the y-intercept, and using Euclidean inverses for division.
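To make a couple of those details concrete, here’s a small sketch of evaluating a share polynomial over GF(2^8) with peasant multiplication and Horner’s scheme. I’m using the well-known AES irreducible polynomial (x^8 + x^4 + x^3 + x + 1) purely for illustration; ssss and SecretSplitter work in a larger field with their own irreducible polynomial, so this shows the idea rather than the exact code:

    using System;

    class FiniteFieldSketch
    {
        // Multiply two elements of GF(2^8) using "peasant" (shift-and-XOR) multiplication,
        // reducing by the irreducible polynomial x^8 + x^4 + x^3 + x + 1.
        static byte Multiply(byte a, byte b)
        {
            byte product = 0;
            for (int i = 0; i < 8; i++)
            {
                if ((b & 1) != 0)
                    product ^= a;            // "addition" in GF(2^n) is XOR
                bool overflow = (a & 0x80) != 0;
                a <<= 1;
                if (overflow)
                    a ^= 0x1B;               // reduce modulo the irreducible polynomial
                b >>= 1;
            }
            return product;
        }

        // Evaluate a polynomial at x using Horner's scheme.
        // coefficients[0] is the constant term (the secret); the rest should be random.
        static byte Evaluate(byte[] coefficients, byte x)
        {
            byte result = 0;
            for (int i = coefficients.Length - 1; i >= 0; i--)
                result = (byte)(Multiply(result, x) ^ coefficients[i]);
            return result;
        }

        static void Main()
        {
            byte[] polynomial = { 0x42, 0x37 };   // secret byte 0x42 plus one "random" coefficient
            for (byte x = 1; x <= 3; x++)
                Console.WriteLine($"share {x}: {Evaluate(polynomial, x):X2}");
        }
    }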

Again, it probably sounds more complicated than it really is. At its core, it’s simple. This technique is formally known as a Shamir Secret Sharing Scheme and it was discovered in the 1970’s.

I didn’t want to invent anything new unless I felt I absolutely had to. There was already a good tool called “ssss-split” that generates shares in a way similar to what I wanted. It adds a special twist by scrambling the resulting y-intercept point, which provides an extra layer of protection. Since this program was already the de facto standard, I wanted to be fully compatible with it. To make sure I was compatible, I had to copy its method of “diffusing” (i.e. scrambling) the bits using the public domain XTEA algorithm. However, to ensure complete fidelity, I had to look at the source code. The only problem was that it was originally released under the GNU General Public License (GPL) and it used a GPL library for working with large numbers. My goal was to make my implementation as open as I could, so I asked the author if I could look at his code to derive my own implementation that I’d release under the more permissive MIT license, and he graciously allowed me to do this.

To prove the compatibility, you can use the ssss-split demo page and paste the results into SecretSplitter and it’ll work just fine. In addition, I created command line programs from scratch that are fully compatible with ssss-split and ssss-combine.

After some basic usability testing, I decided to make one small adjustment. The “ssss-split” command allows you to attach a prefix that it ignores. I wanted to add a special prefix that would indicate what type of share it was (i.e. a message or a file) as well as a simple checksum, because with all those digits it’s easy to mistype one.

Now, you can understand all the pieces of the long share:

Share components

In theory, you could “encrypt” a large file directly using this technique. In practice, it doesn’t work well because each share would be huge and not something you’d be able to write down by hand or say over the phone, even using the phonetic alphabet.

For lots of data, we use a hybrid approach: encrypt the file using standard file encryption with a random key and then split the small “key” into pieces.
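Here’s the hybrid idea in miniature using plain .NET AES (not the OpenPGP format SecretSplitter actually writes, and the file names are made up). The point is that only the small random key needs to be split into shares; the bulky encrypted file can be stored or emailed anywhere:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class HybridEncryptionSketch
    {
        static void Main()
        {
            using (Aes aes = Aes.Create())
            {
                aes.KeySize = 256;
                byte[] key = aes.Key;    // this small random value is what you'd split into shares
                byte[] iv = aes.IV;

                byte[] plaintext = File.ReadAllBytes("important-stuff.zip");   // hypothetical input

                using (FileStream output = File.Create("important-stuff.zip.encrypted"))
                {
                    output.Write(iv, 0, iv.Length);   // the IV isn't secret; store it with the ciphertext
                    using (ICryptoTransform encryptor = aes.CreateEncryptor())
                    using (var cryptoStream = new CryptoStream(output, encryptor, CryptoStreamMode.Write))
                    {
                        cryptoStream.Write(plaintext, 0, plaintext.Length);
                    }
                }

                // Note: unlike the real OpenPGP output, this sketch adds no integrity protection.
                Console.WriteLine("Key to split: " + Convert.ToBase64String(key));
            }
        }
    }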

For file encryption, I again didn’t want to invent anything new. I decided to use the OpenPGP Message Format, the same format used by PGP and GNU Privacy Guard (GPG). I didn’t want to have to worry about licensing restrictions or including a third-party library, so I wrote my own implementation from scratch that did exactly what I wanted. I read RFC4880 and started sketching out what I needed to do. A few bug fixes later and I had a working implementation that was able to interoperate with GPG. To simplify my implementation, I only support a limited subset of features:

  1. I always use AES with a 256-bit key for encryption, even if users select a smaller effective key size. This means that users can pick any size key they want and thus balance security and share length. I picked AES because it’s strong and understandable with stick figures.
  2. The actual file encryption key is always a hashed, salted, and stretched version of the reconstructed share text (a minimal key-stretching sketch follows this list).
  3. The encrypted file has an integrity protection packet to detect if the file has been modified and ensure it was decrypted correctly.
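For item 2, here’s a minimal sketch of “hashed, salted, and stretched” using .NET’s PBKDF2 implementation (Rfc2898DeriveBytes). The real implementation follows OpenPGP’s salted-and-iterated S2K rules, so the hash and iteration details differ, and the share text below is just a placeholder:

    using System;
    using System.Security.Cryptography;

    class KeyStretchingSketch
    {
        static void Main()
        {
            string reconstructedShareText = "example-share-text";   // hypothetical reconstructed secret

            byte[] salt = new byte[16];
            using (var rng = RandomNumberGenerator.Create())
            {
                rng.GetBytes(salt);   // the salt isn't secret; it's stored alongside the encrypted data
            }

            // "Stretching" means running many iterations so that guessing is expensive.
            using (var kdf = new Rfc2898DeriveBytes(reconstructedShareText, salt, 100000))
            {
                byte[] fileEncryptionKey = kdf.GetBytes(32);   // 256-bit AES key
                Console.WriteLine(Convert.ToBase64String(fileEncryptionKey));
            }
        }
    }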

Since I used common formats, you can verify the correctness of the generated files using a Linux shell. You can also create files using the shell and have them interoperate with SecretSplitter. I included a sample of how to do this with the source code.

Help Wanted / Future Possibilities

SecretSplitter still looks and feels like a prototype. There are lots of possible improvements that could be made:

  1. Secret splitting is a relatively complicated idea. In Cryptography Engineering, the authors write “secret sharing schemes are rarely used because they are too complex. They are complex to implement, but more importantly, they are complex to administrate and operate.”

    Although I tried to simplify the user experience for broad use, it could still use further usability enhancements.
  2. I wrote it in C# for the .net platform because that is what I’m most familiar with (and it has some built-in powerful primitives like BigIntegers, AES, and hash functions). I suspect that an HTML5 version using JavaScript, a nice interface, and coming from a trusted domain would get much broader usage. In addition, since this is a problem that affects everyone, having great internationalization support would be a nice touch. It also would be nice to have a polished look with a good logo and other graphics.
  3. You could use more elaborate secret sharing schemes than what I implemented in SecretSplitter. I considered these, but ultimately wanted to use a technique that was already compatible with widely deployed tools. I also considered enhancing shares with two-factor support or using existing public key infrastructure, but decided that added too much complexity. Perhaps it’s possible to incorporate these in a good design.
  4. It’d be neat if this scheme or something similar to it was integrated into LastPass and KeePass as a core feature.
  5. Obviously the shares themselves are long. I tried making them shorter but the downsides outweighed the upsides. Perhaps it could be better. Also, a compelling graphically designed share card might make it more fun for broader use. The long length is somewhat of a safety mechanism that prevents people from memorizing with a quick glance. Also, it discourages overhasty use much like freezing a credit card.
  6. I kept the codes in a format that would be easy to write as well as read over the phone. I used a simple character set that avoids ambiguities like “O” vs “0”. One additional strategy could be to embed the share as a QR code or something similar. I didn’t pursue this approach in favor of simplicity, but this could be an option.
  7. Really paranoid people might want to back up their encrypted file to paper. This is possible, but I’m not sure if it should belong inside the program itself.
  8. It’d be good to have suggestions on how to exchange shares or perhaps borrow ideas from PGP key signing parties. I suspect that if secret splitting were to become popular, then “web of trust” scenarios would naturally occur (i.e. “I’ll hold your secret share if you hold mine”).
  9. It’d be fun to compile a list of non-obvious uses for SecretSplitter to share with others. For example, it could make for interesting scavenger hunt clues.

If you’d like to donate your time to any of the above ideas, I’d encourage you to just give it a go. You don’t have to ask for my permission but it would be nice if you posted your results somewhere or left a comment to this post. You can use my code for whatever purpose you’d like. My only hope is that you might get some benefit out of it.

Conclusion

SecretSplitter is just a tool that gives another option for backing up very sensitive information by splitting it up into pieces. It’s not a full solution. By relying on people I trust instead of a third-party company, it helped me remove one excuse I had for not preparing somewhat unpleasant but important documents that we should all probably have. I still don’t have this all figured out, but writing SecretSplitter helped me get started.

If you’re young, don’t have any minor children, and don’t care at all what happens to your stuff, then you could run some mental actuarial model and convince yourself that the probability of you or your survivors needing these documents or a password recovery procedure anytime soon is low. But you’re not given any guarantees.

At the very least, it’s a good idea to make sure all of your financial assets and life insurance policies have a named beneficiary and perhaps at least one alternate. You can also declare things like organ donor preferences on your driver’s license instead of making declarations in other documents. It’s also a good idea to have an “ICE” entry in your cell phone. However, going the extra step and making very basic final documents doesn’t require that much more work. Besides, once you have baseline documents, keeping them fresh is just a matter of occasional updates due to life events.

The increasing digitization of our lives means that more personal things will only be stored digitally. From our journals to email to videos to health records, all of this will eventually only exist digitally and likely hidden behind passwords. This future needs some safety net for backing up sensitive things in a safe and accessible way.

Not everything needs to be backed up. There are lots of files, usernames, and passwords that don’t really matter. Don’t include those. SecretSplitter was built with the assumption that everything that really mattered could be stored in a file small enough to email to others. This helps you focus and pare down to what really matters.

It’s also good to have a healthy dose of common sense. Instead of holding out a secret until after your death, maybe you should get that resolved today. You’ll probably live better. My general view is that these final “secrets” should be mostly boring by just containing account details and credentials.

Finally, on a more personal level, I think it’s healthy to be reminded of our own mortality at least once every year or so. It’s a helpful reminder of what a gift every day is, and it helps us focus on what we do rather than worry about things that don’t matter.

If a little bit of fancy math can help you sleep better at night, well then, I’d consider it a success.

Special thanks to B. Poettering for creating the original ssss program and allowing me to clone its format.

Tuesday, October 26, 2010

Notes from porting C# code to PHP

(Summary: I ported my TrueSkill implementation from C# to PHP and posted it on GitHub. It was my first real encounter with PHP and I learned a few things.)

I braced for the worst.

After years of hearing negative things about PHP, I had been led to believe that touching it would rot my brain. Ok, maybe that’s a bit much, but its reputation had me believing it was full of problems. Even the cool kids had issues with PHP. But I thought that it couldn’t be too bad because there was that one website that gets a few hits using a dialect of it. When Kaggle offered to sponsor a port of my TrueSkill C# code to PHP, I thought I’d finally have my first real encounter with PHP.


<?php echo "Disclaimer:"; ?>

To make the port quick, I kept most of the design and class structure from my C# implementation. This led to a less-than-optimal result since PHP really isn’t object-oriented. I didn’t do a deep dive on redesigning it in the native PHP way. I stuck with the philosophy that you can write quasi-C# in any language. Also, I didn’t use any of the web and database features that motivate most people to choose PHP in the first place. In other words, I didn’t cater to PHP’s specialty, so my reflections are probably an unfair and biased comparison as I was not using PHP the way it was intended. I expect that I missed tons of great things about PHP.

Personal disclaimers aside, even PHP book authors don’t claim that it’s the nicest language. Instead, they highlight the language’s popularity. I sort of got the feeling that people mainly choose PHP over languages like C# because of its current popularity and the perception that it has a lower upfront cost, especially among cash-strapped startups. Matt Doyle, author of Beginning PHP 5.3, wrote the following while comparing PHP to other languages:

“Many would argue that C# is a nicer, better-organized language to program in than PHP, although C# is arguably harder to learn. Another advantage of ASP.NET is that C# is a compiled language, which generally means it runs faster than PHP’s interpreted scripts (although PHP compilers are available).” - p.5

He continued:

“ASP and ASP.NET have a couple of other disadvantages compared to PHP. First of all, they have a commercial license, which can mean spending additional money on server software, and hosting is often more expensive as a result. Secondly, ASP and ASP.NET are fairly heavily tied to the Windows platform, whereas the other technologies in this list are much more cross-platform.” - p.5

Next, he hinted that Ruby might eventually end PHP’s reign:

“Like Python, Ruby is another general-purpose language that has gained a lot of traction with Web developers in recent years. This is largely due to the excellent Ruby on Rails application framework, which uses the Model-View-Controller (MVC) pattern, along with Ruby’s extensive object-oriented programming features, to make it easy to build a complete Web application very quickly. As with Python, Ruby is fast becoming a popular choice among Web developers, but for now, PHP is much more popular.” - p.6

and then elaborated on why PHP might be popular today:

“[T]his middle ground partly explains the popularity of PHP. The fact that you don’t need to learn a framework or import tons of libraries to do basic Web tasks makes the language easy to learn and use. On the other hand, if you need the extra functionality of libraries and frameworks, they’re there for you.” - p.7

Fair enough. However, to really understand the language, I needed to dive in personally and experience it firsthand. I took notes during the dive about some of the things that stuck out.

The Good Parts

  • It’s relatively easy to learn and get started with PHP. As a C# developer, I was able to pick up PHP in a few hours after a brief overview of the syntax from a book. Also, PHP has some decent online help.
  • PHP is available on almost all web hosts these days at no extra charge (in contrast with ASP.NET hosting). I can’t emphasize this enough because it’s a reason why I would still consider writing a small website in it.
  • I was pleasantly surprised to have unit test support with PHPUnit. This made me feel at home and made it easier to develop and debug code.
  • It’s very easy and reasonable to create a website in PHP using techniques like Model-View-Controller (MVC) designs that separate the view from the actual database model. The language doesn’t seem to pose any hindrance to this.
  • PHP has a “static” keyword that is sort of like a static version of a “this” reference. This was useful in creating a quasi-static “subclass” of my “Range” class for validating player and team sizes. This feature is formally known as late static binding.

The “When in Rome...” Parts

  • Class names use PascalCase while functions tend to use lowerCamelCase like Java whereas C# tends to use PascalCase for both. In addition, .NET in general seems to have more universally accepted naming conventions than PHP has.
  • PHP variables have a ‘$’ prefix which makes variables stick out:

    function increment($someNumber)
    {
        $result = $someNumber + 1;
        return $result;
    }
    This convention was probably copied from Perl’s scalar variable sigil. This makes sense because PHP was originally a set of Perl scripts intended to be a simpler Perl.
  • You access class members and functions using an arrow operator (“->”) like C++ instead of the C#/Java dot notation (“.”). That is, in PHP you say “$someClass->someMethod()” instead of “someClass.someMethod()”.
  • The arguments in a “foreach” statement are reversed from what C# uses. In PHP, you write:

    foreach($allItems as $currentItem) { ... }
    instead of the C# way:

    foreach(var currentItem in allItems) { ... }
    One advantage to the PHP way is its special syntax that makes iterating through key/value pairs in a map easier:

    foreach($someArray as $key => $value) { ... }
    vs. the C# way of something like this:

    foreach(var pair in someDictionary)
    {
        // use pair.Key and pair.Value
    }
  • The “=>” operator in PHP denotes a map entry as in

    $numbers = array(1 => 'one', 2 => 'two', ...)
    In C#, the arrow “=>” is instead used for a lightweight lambda expression syntax:

    x => x * x
    To define the rough equivalent of the PHP array, you’d have to write this in C#

    var numbers = new Dictionary<int, string>{ {1, "one" }, {2, "two"} };
    On the one hand, the PHP notation for maps is cleaner, but it comes at the cost of having no lightweight lambda syntax (more on that later).
  • PHP has some “magical methods” such as “__construct” and “__toString” for the equivalent of C#’s constructor and ToString functionality. I like C#’s approach here, but I’m biased.

The “Ok, I guess” Parts

  • The free NetBeans IDE for PHP is pretty decent for writing PHP code. Using it in conjunction with PHP’s XDebug debugger functionality is a must. After my initial attempts at writing code with a basic notepad, I found NetBeans to be a very capable editor. My only real complaint with it is that I had some occasional cases where the editor would lock up and the debugger wouldn’t support things like watching variables. That said, it’s still good for being a free editor.
  • By default, PHP passes function arguments by value (arrays, in particular, get copied), whereas in C# an object argument refers to the same object the caller passed in. This probably caused the most difficulty with the port. Complicating things further is that PHP references are not like references in other languages. For example, using references usually incurs a performance penalty since extra work is required.
  • You can’t import types via namespaces alone like you can in C# (and Java for that matter). In PHP, you have to import each type manually:

    use Moserware\Skills\FactorGraphs\ScheduleLoop;
    use Moserware\Skills\FactorGraphs\ScheduleSequence;
    use Moserware\Skills\FactorGraphs\ScheduleStep;
    use Moserware\Skills\FactorGraphs\Variable;
    whereas in C# you can just say:

    using Moserware.Skills.FactorGraphs;
    PHP’s way makes things explicit and I can see that viewpoint, but it was a bit of a surprising requirement given how PHP usually required less syntax.
  • PHP lacks support for C#-like generics. On the one hand, I missed the generic type safety and performance benefits, but on the other hand it forced me to redesign some classes to not have an army of angle brackets (e.g. compare this class in C# to its PHP equivalent).
  • You have to manually call your parent class’s constructor in PHP if you want that feature:

    class BaseClass
    {
        function __construct() { ... }
    }

    class DerivedClass extends BaseClass
    {
        function __construct()
        {
            // this line is optional, but if you omit it, the BaseClass constructor will *not* be called
            parent::__construct();
        }
    }
    This gives you more flexibility, but it doesn’t enforce C#-like assumptions that your parent class’s constructor was called.
  • PHP doesn’t seem to have the concept of an implicit “$this” inside of a class. This forces you to always qualify class member variables with $this:

    class SomeClass
    {
        private $_someLocalVariable;

        function someMethod()
        {
            $someMethodVariable = $this->_someLocalVariable + 1;
            ...
        }
    }
    I put this in the “OK” category because some C# developers prefer to always be explicit on specifying “this” as well.
  • PHP allows you to specify the type of some (but not all kinds) of the arguments of a function:

    function myFunction(SomeClass $someClass, array $someArray, $someString) { ... }
    This is called “type hinting.” It seems that it is designed for enforcing API contracts instead of general IDE help as it actually causes a decrease in performance.
  • PHP doesn’t have the concept of LINQ, but it does support some similar functional-like concepts like array_map and array_reduce.
  • PHP has support for anonymous functions by using the “function($arg1, ...) {}” syntax. This is sort of reminiscent of how C# did the same thing in version 2.0 where you had to type out “delegate.” C# 3.0 simplified this with a lighter weight version (e.g. “x => x * x”). I’ve found that this seemingly tiny change “isn’t about doing the same thing faster, it allows me to work in a completely different manner” by employing functional concepts without thinking. It’s sort of a shame PHP didn’t elevate this concept with concise syntax. When C#’s lambda syntax was introduced in 3.0, it made me want to use lambdas much more often. PHP’s lack of something similar strongly discourages the functional style, a lesson that the C++ guys have recently learned.
  • Item 4 of the PHP license states:
    Products derived from this software may not be called “PHP”, nor may “PHP” appear in their name, without prior written permission from group@php.net. You may indicate that your software works in conjunction with PHP by saying “Foo for PHP” instead of calling it “PHP Foo” or “phpfoo”
    This explains why you see carefully worded names like “HipHop for PHP” rather than something like “php2cpp.” This technically doesn’t stop you from having a project with the PHP name in it (e.g. PHPUnit) so long as the official PHP code is not included in it. However, it’s clear that the PHP group is trying to clean up its name from tarnished projects like PHP-Nuke. I understand their frustration, but this leads to an official preference for names like Zope and Smarty that seem to be less clear on what the project actually does. This position would be like Microsoft declaring that you couldn’t use the “#” suffix or the “Implementation Running On .Net (Iron)” prefix in your project name (but maybe that would lead to more creativity?).

The Frustrating Parts

  • As someone who’s primarily worked with a statically typed language for the past 15 years, I prefer upfront compiler errors and warnings that C# offers and agree with Anders Hejlsberg’s philosophy:
    “I think one of the reasons that languages like Ruby for example (or Python) are becoming popular is really in many ways in spite of the fact that they are not typed... but because of the fact that they [have] very good metaprogramming support. I don’t see a lot of downsides to static typing other than the fact that it may not be practical to put in place, and it is harder to put in place and therefore takes longer for us to get there with static typing, but once you do have static typing. I mean, gosh, you know, like hey -- the compiler is going to report the errors before the space shuttle flies instead of whilst it’s flying, that’s a good thing!”
    But more dynamic languages like PHP have their supporters. For example, Douglas Crockford raves about JavaScript’s dynamic aspects:
    “I found over the years of working with JavaScript... I used to be of the religion that said ‘Yeah, absolutely brutally strong type systems. Figure it all out at compile time.’ I've now been converted to the other camp. I've found that the expressive power of JavaScript is so great. I've not found that I've lost anything in giving up the early protection [of statically compiled code]”
    I still haven’t seen where Crockford is coming from given my recent work with PHP. Personally, I think that given C# 4.0’s optional support of dynamic objects, the lines between the two worlds are grayer and that with C# you get the best of both worlds, but I’m probably biased here.
  • You don’t have to declare variables in PHP. This reduces some coding “ceremony” to get to the essence of your code, but I think it removes a shock absorber/circuit-breaker that can be built into the language. This “feature” turned my typo into a bug and led to a runtime error. Fortunately, options like E_NOTICE can catch these, but it caught me off guard. Thankfully, NetBeans’ auto-completion saved me from most of these types of errors.
  • PHP has built-in support for associative arrays, but you can’t use objects as keys or else you’ll get an “Illegal Offset Type” error. Because my C# API heavily relied on this ability and I didn’t want to redesign the structure, I created my own hashmap that supports object keys. This omission tended to reinforce the belief that PHP is not really object oriented. That said, I’m probably missing something and did it wrong.
  • PHP doesn’t support operator overloading. This made my GaussianDistribution and Matrix classes a little harder to work with by having to invent explicit names for the operators.
  • PHP lacks support for a C#-like property syntax. Having to write getters and setters made me feel like I was back programming in Java again.
  • My code ran slower in PHP. To be fair, most of the performance problem was in my horribly naive matrix implementation, which a better implementation could improve. Regardless, it seems that larger sites deal with PHP’s performance problem by writing critical parts in compiled languages like C/C++ or by using caching layers such as memcached. One interesting observation is that the performance issue isn't really with the Zend Engine per se but rather with the semantics of the PHP language itself. Haiping Zhao on the HipHop for PHP team gave a good overview of the issue:
    “Around the time that we started the [HipHop for PHP] project, we absolutely looked into the Zend Engine. The first question you ask is 'The Zend Engine must be terribly implemented. That’s why it’s slow, right?' So we looked into the Zend Engine and tried different places, we looked at the hash functions to see if it’s sufficient and look some of the profiles the Zend Engine has and different parts of the Zend Engine. You finally realize that the Zend Engine is pretty compact. It just does what it promises. If you have that kind of semantics you just cannot avoid the dynamic function table, you cannot avoid the variable table, you just cannot avoid a lot of the things that they built... that’s the point that [you realize] PHP can also be called C++Script because the syntax is so similar then you ask yourself, 'What is the difference between the speed of these two different languages and those are the items that are... different like the dynamic symbol lookup (it’s not present in C++), the weak typing is not present in C++, everything else is pretty much the same. The Zend Engine is very close to C implementation. The layer is very very thin. I don’t think we can blame the Zend Engine for the slowness PHP has.”
    That said, I don’t think that performance alone would stop me from using PHP. It’s good enough for most things. Furthermore, I'm sure optimizers could use tricks like what the DLR and V8 use to squeak out more performance. However, I think that in practice, there is a case of diminishing returns where I/O (and not CPU time) typically become the limiting factor.

Parting Thoughts

Despite my brief encounter, I feel that I learned quite a bit and feel comfortable around PHP code now. I think my quick ramp-up highlights a core value of PHP: its simplicity. I did miss C#-like compiler warnings and type safety, but maybe that’s my own personal acquired taste. Although PHP does have some dubious features, it’s not nearly as bad as some people make it out to be. I think that its simplicity makes it a very respectable choice for the type of things it was originally designed to do like web templates. Although I still wouldn’t pick PHP as my first choice as a general purpose web programming language, I can now look at its features in a much more balanced way.

P.S. I’d love to hear suggestions on how to improve my implementation and learn where I did something wrong. Please feel free to use my PHP TrueSkill code and submit pull requests. As always, feel free to fork the code and port it to another language like Nate Parsons did with his JSkills Java port.

Thursday, March 18, 2010

Computing Your Skill

Summary: I describe how the TrueSkill algorithm works using concepts you're already familiar with. TrueSkill is used on Xbox Live to rank and match players and it serves as a great way to understand how statistical machine learning is actually applied today. I’ve also created an open source project where I implemented TrueSkill three different times in increasing complexity and capability. In addition, I've created a detailed supplemental math paper that works out equations that I gloss over here. Feel free to jump to sections that look interesting and ignore ones that seem boring. Don't worry if this post seems a bit long, there are lots of pictures.

Introduction

It seemed easy enough: I wanted to create a database to track the skill levels of my coworkers in chess and foosball. I already knew that I wasn’t very good at foosball and would bring down better players. I was curious if an algorithm could do a better job at creating well-balanced matches. I also wanted to see if I was improving at chess. I knew I needed to have an easy way to collect results from everyone and then use an algorithm that would keep getting better with more data. I was looking for a way to compress all that data and distill it down to some simple knowledge of how skilled people are. Based on some previous things that I had heard about, this seemed like a good fit for “machine learning.”

But, there’s a problem.

Machine learning is a hot area in Computer Science— but it’s intimidating. Like most subjects, there’s a lot to learn to be an expert in the field. I didn’t need to go very deep; I just needed to understand enough to solve my problem. I found a link to the paper describing the TrueSkill algorithm and I read it several times, but it didn’t make sense. It was only 8 pages long, but it seemed beyond my capability to understand. I felt dumb. Even so, I was too stubborn to give up. Jamie Zawinski said it well:


“Not knowing something doesn’t mean you’re dumb— it just means you don’t know it.”

I learned that the problem isn’t the difficulty of the ideas themselves, but rather that the ideas make too big of a jump from the math that we typically learn in school. This is sad because underneath the apparent complexity lies some beautiful concepts. In hindsight, the algorithm seems relatively simple, but it took me several months to arrive at that conclusion. My hope is that I can short-circuit the haphazard and slow process I went through and take you directly to the beauty of understanding what’s inside the gem that is the TrueSkill algorithm.

Skill ≈ Probability of Winning

Women runners in the 100 meter dash.

Skill is tricky to measure. Being good at something takes deliberate practice and sometimes a bit of luck. How do you measure that in a person? You could just ask someone if they’re skilled, but this would only give a rough approximation since people tend to be overconfident in their ability. Perhaps a better question is “what would the units of skill be?” For something like the 100 meter dash, you could just average the number of seconds of several recent sprints. However, for a game like chess, it’s harder because all that’s really important is if you win, lose, or draw.

It might make sense to just tally the total number of wins and losses, but this wouldn’t be fair to people that played a lot (or a little). Slightly better is to record the percent of games that you win. However, this wouldn’t be fair to people that beat up on far worse players or players who got decimated but maybe learned a thing or two. The goal of most games is to win, but if you win too much, then you’re probably not challenging yourself. Ideally, if all players won about half of their games, we’d say things are balanced. In this ideal scenario, everyone would have a near 50% win ratio, making it impossible to compare using that metric.

Finding universal units of skill is too hard, so we’ll just give up and not use any units. The only thing we really care about is roughly who’s better than whom and by how much. One way of doing this is coming up with a scale where each person has a unit-less number expressing their rating that you could use for comparison. If a player has a skill rating much higher than someone else, we’d expect them to win if they played each other.

The key idea is that a single skill number is meaningless. What’s important is how that number compares with others. This is an important point worth repeating: skill only makes sense if it’s relative to something else. We’d like to come up with a system that gives us numbers that are useful for comparing a person’s skill. In particular, we’d like to have a skill rating system that we could use to predict the probability of winning, losing, or drawing in matches based on a numerical rating.

We’ll spend the rest of our time coming up with a system to calculate and update these skill numbers with the assumption that they can be used to determine the probability of an outcome.

What Exactly is Probability Anyway?

You can learn about probability if you’re willing to flip a coin— a lot. You flip a few times:

Heads Heads Tails

Heads, heads, tails!

Each flip has a seemingly random outcome. However, “random” usually means that you haven’t looked long enough to see a pattern emerge. If we take the total number of heads and divide it by the total number of flips, we see a very definite pattern emerge:

But you knew that it was going to be a 50-50 chance in the long run. When saying something is random, we often mean it’s bounded within some range.

It turns out that a better metaphor is to think of a bullseye that archers shoot at. Each arrow will land somewhere near that center. It would be extraordinary to see an arrow hit the bullseye exactly. Most of the arrows will seem to be randomly scattered around it. Although “random,” it’s far more likely that arrows will be near the target than, for example, way out in the woods (well, except if I was the archer).

This isn’t a new metaphor; the Greek word στόχος (stochos) refers to a stick set up to aim at. It’s where statisticians get the word stochastic: a fancy, but slightly more correct word than random. The distribution of arrows brings up another key point:

All things are possible, but not all things are probable.

Probability has changed how ordinary people think, a feat that rarely happens in mathematics. The very idea that you could understand anything about future outcomes is such a big leap in thought that it baffled Blaise Pascal, one of the best mathematicians in history.

In the summer of 1654, Pascal exchanged a series of letters with Pierre de Fermat, another brilliant mathematician, concerning an “unfinished game.” Pascal wanted to know how to divide money among gamblers if they have to leave before the game is finished. Splitting the money fairly required some notion of the probability of outcomes if the game would have been played until the end. This problem gave birth to the field of probability and laid the foundation for lots of fun things like life insurance, casino games, and scary financial derivatives.

But probability is more general than predicting the future— it’s a measure of your ignorance of something. It doesn’t matter if the event is set to happen in the future or if it happened months ago. All that matters is that you lack knowledge in something. Just because we lack knowledge doesn’t mean we can’t do anything useful, but we’ll have to do a lot more coin flips to see it.

Aggregating Observations

The real magic happens when we aggregate a lot of observations. What would happen if you flipped a coin 1000 times and counted the number of heads? Lots of things are possible, but in my case I got 505 heads. That’s about half, so it’s not surprising. I can graph this by creating a bar chart and put all the possible outcomes (getting 0 to 1000 heads) on the bottom and the total number of times that I got that particular count of heads on the vertical axis. For 1 outcome of 505 total heads it would look like this:

Not too exciting. But what if we did it again? This time I got 518 heads. I can add that to the chart:

Doing it 8 more times gave me 489, 515, 468, 508, 492, 475, 511, and once again, I got 505. The chart now looks like this:

And after a billion times, a total of one trillion flips, I got this:

In all the flips, I never got less than 407 total heads and I never got more than 600. Just for fun, we can zoom in on this region:

As we do more sets of flips, the jagged edges smooth out to give us the famous “bell curve” that you’ve probably seen before. Math guys love to refer to it as a “Gaussian” curve because it was used by the German mathematician Carl Gauss in 1809 to investigate errors in astronomical data. He came up with an exact formula for what to expect if we flipped a coin an infinite number of times (so that we don’t have to). This is such a famous result that you can see the curve and its equation if you look closely at the middle of an old 10 Deutsche Mark banknote bearing Gauss’s face:

Don’t miss the forest for all the flippin’ trees. The curve is showing you the density of all possible outcomes. By density, I mean how tall the curve gets at a certain point. For example, in counting the total number of heads out of 1000 flips, I expected that 500 total heads would be the most popular outcome and indeed it was. I saw 25,224,637 out of a billion sets that had exactly 500 heads. This works out to about 2.52% of all outcomes. In contrast, if we look at the bucket for 450 total heads, I only saw this happen 168,941 times, or roughly 0.017% of the time. This confirms your observation that the curve is denser, that is, taller at the mean of 500 than further away at 450.
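
If you don’t have a billion coin-flip sessions to spare, you can get the same densities straight from the binomial formula. Here’s a small sketch (not part of the original experiment) that computes the exact probability of seeing a given number of heads in 1000 fair flips; it lands right around 2.5% for 500 heads and a tiny fraction of a percent for 450.

    using System;

    // Exact binomial probability of k heads in n fair flips, computed in log space
    // so the huge binomial coefficients don't overflow a double.
    static double ProbabilityOfHeads(int n, int k, double p = 0.5)
    {
        double logChoose = 0.0;
        for (int i = 1; i <= k; i++)
            logChoose += Math.Log(n - k + i) - Math.Log(i);
        return Math.Exp(logChoose + k * Math.Log(p) + (n - k) * Math.Log(1 - p));
    }

    Console.WriteLine($"P(exactly 500 heads) = {ProbabilityOfHeads(1000, 500):P3}");   // ~2.5%
    Console.WriteLine($"P(exactly 450 heads) = {ProbabilityOfHeads(1000, 450):P3}");   // ~0.017%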

This confirms the key point: all things are possible, but outcomes are not all equally probable. There are longshots. Professional athletes panic or ‘choke’. The world’s best chess players have bad days. Additionally, tales about underdogs make us smile— the longer the odds the better. Unexpected outcomes happen, but there’s still a lot of predictability out there.

It’s not just coin flips. The bell curve shows up in lots of places, from casino games to the thickness of tree bark to the measurements of a person’s IQ. Lots of people have looked at the world and have come up with Gaussian models. It’s easy to think of the world as one big, bell-shaped playground.

But the real world isn’t always Gaussian. History books are full of “Black Swan” events. Stock market crashes and the invention of the computer are statistical outliers that Gaussian models tend not to predict well, but these events shock the world and forever change it. This type of reality isn’t covered by the bell curve, which Black Swan author Nassim Taleb calls the “Great Intellectual Fraud.” These events would have such low probability that no one would predict them actually happening. There’s a different view of randomness, the fascinating playground of Benoît Mandelbrot and his fractals, that better explains some of these events, but we will ignore all of this to keep things simple. We’ll acknowledge that the Gaussian view of the world isn’t always right, any more than a map of the world is the actual terrain.

The Gaussian worldview assumes everything will typically be some average value and then treats everything else as increasingly unlikely “errors” whose probability drops off sharply as you drift away from the center (Gauss used the curve to measure errors in astronomical data after all). However, it’s not fair to treat real observations from the world as “errors” any more than it is to say that a person is an “error” from the “average human” that is half male and half female. Some of these same problems come up when treating a person’s skill as Gaussian. Disclaimers aside, we’ll go along with George Box’s view that “all models are wrong, but some models are useful.”

Gaussian Basics

Gaussian curves are completely described by two values:

  1. The mean (average) value which is often represented by the Greek letter μ (mu)
  2. The standard deviation, represented by the Greek letter σ (sigma). This indicates how spread out the data is.

In counting the total number of heads in 1000 flips, the mean was 500 and the standard deviation was about 16. In general, 68% of the outcomes will be within ± 1 standard deviation (e.g. 484-516 in the experiment), 95% within 2 standard deviations (e.g. 468-532), and 99.7% within 3 standard deviations (e.g. 452-548):
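
Where does that “about 16” come from? For counting heads in n fair flips, the standard deviation is √(n·p·(1−p)). Here’s a quick sketch using nothing beyond that textbook formula:

    using System;

    int n = 1000;
    double p = 0.5;

    double mean = n * p;                          // 500
    double stdDev = Math.Sqrt(n * p * (1 - p));   // ~15.8, i.e. "about 16"

    Console.WriteLine($"mean = {mean}, standard deviation = {stdDev:F1}");
    Console.WriteLine($"68%   of sessions: {mean - stdDev:F0} to {mean + stdDev:F0} heads");
    Console.WriteLine($"95%   of sessions: {mean - 2 * stdDev:F0} to {mean + 2 * stdDev:F0} heads");
    Console.WriteLine($"99.7% of sessions: {mean - 3 * stdDev:F0} to {mean + 3 * stdDev:F0} heads");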

An important takeaway is that the bell curve allows for all possibilities, but each possibility is most definitely not equally likely. The bell curve gives us a model to calculate how likely something should be given an average value and a spread. Notice how outcomes sharply become less probable as we drift further away from the mean value.

While we’re looking at the Gaussian curve, it’s important to look at the point 3σ below the mean, on the left side. As you can see, most of the area under the curve is to the right of this point. I mention this because the TrueSkill algorithm uses the -3σ mark as a (very) conservative estimate for your skill. You’re probably better than this conservative estimate, but you’re most likely not worse than it. Therefore, it’s a stable number for comparing yourself to others and is useful for sorting a leaderboard.
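
In code, that leaderboard number is just the mean minus three standard deviations. Here’s a tiny sketch (the names and numbers are made up for illustration):

    using System;
    using System.Linq;

    // Hypothetical (mu, sigma) pairs; how TrueSkill produces them comes later in the post.
    var players = new[]
    {
        (Name: "Alice", Mu: 30.0, Sigma: 1.5),
        (Name: "Bob",   Mu: 32.0, Sigma: 6.0),   // higher mean, but we're far less sure about it
        (Name: "Carol", Mu: 25.0, Sigma: 2.0)
    };

    // Sort by the conservative estimate: mu - 3*sigma.
    foreach (var player in players.OrderByDescending(p => p.Mu - 3 * p.Sigma))
        Console.WriteLine($"{player.Name}: conservative rating = {player.Mu - 3 * player.Sigma:F1}");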

3D Bell Curves: Multivariate Gaussians

A non-intuitive observation is that Gaussian distributions can occur in more than the two dimensions that we’ve seen so far. You can sort of think of a Gaussian in three dimensions as a mountain. Here’s an example:

In this plot, taller regions represent higher probabilities. As you can see, not all things are equally probable. The most probable value is the mean value that is right in the middle and then things sharply decline away from it.

In maps of real mountains, you often see a 2D contour plot where each line represents a different elevation (e.g. every 100 feet):

The closer the lines on the map, the sharper the inclines. You can do something similar for 2D representations of 3D Gaussians. In textbooks, you often just see a 2D representation that looks like this:

This is called an “isoprobability contour” plot. It’s just a fancy way of saying “things that have the same probability will be the same color.” Note that it’s still in three dimensions. In this case, the third dimension is color intensity instead of the height you saw on a surface plot earlier. I like to think of contour plots as treasure maps for playing the “you’re getting warmer...” game. In this case, black means “you’re cold,” red means “you’re getting warmer...,” and yellow means “you’re on fire!” which corresponds to the highest probability.

See? Now you understand Gaussians and know that “multivariate Gaussians” aren’t as scary as they sound.

Let’s Talk About Chess

There’s still more to learn, but we’ll pick up what we need along the way. We already have enough tools to do something useful. To warm up, let’s talk about chess because ratings are well-defined there.

In chess, a bright beginner is expected to have a rating around 1000. Keep in mind that ratings have no units; it’s just a number that is only meaningful when compared to someone else’s number. By tradition, a difference of 200 indicates that the better-rated player is expected to win 75% of the time. Again, nothing is special about the number 200; it was just chosen to be the difference needed to get a 75% win ratio and effectively defines a “class” of player.

I’ve slowly been practicing and have a rating around 1200. This means that if I play a bright beginner with a rating of 1000, I’m expected to win three out of four games.

We can start to visualize a match between me and bright beginner by drawing two bell curves that have a mean of 1000 and 1200 respectively with both having a standard deviation of 200:

The above graph shows what the ratings represent: they’re an indicator of how we’re expected to perform if we play a game. The most likely performance is exactly what the rating is (the mean value). One non-obvious point is that you can subtract two bell curves and get another bell curve. The new center is the difference of the means and the resulting curve is a bit wider than the previous curves. By taking my skill curve (red) and subtracting the beginner’s curve (blue), you’ll get this resulting curve (purple):

Note that it’s centered at 1200 - 1000 = 200. Although interesting to look at on its own, it gives some useful information. This curve represents all possible game outcomes between me and the beginner. The middle shows that I’m expected to be 200 points better. The far left side shows that there is a tiny chance that the beginner has a game where he plays as if he’s 700 points better than I am. The far right shows that there is a tiny chance that I’ll play as if I’m 1100 points better. The curve actually goes on forever in both directions, but the expected probability for those outcomes is so small that it’s effectively zero.

As a player, you really only care about one very specific point on this curve: zero. Since I have a higher rating, I’m interested in all possible outcomes where the difference is positive. These are the outcomes where I’m expected to outperform the beginner. On the other hand, the beginner is keeping his eye on everything to the left of zero. These are the outcomes where the performance difference is negative, implying that he outperforms me.

We can plug a few numbers into a calculator and see that there is about a 24% probability that the performance difference will be negative, implying the beginner wins, and a 76% chance that the difference will be positive, meaning that I win. This is roughly the 75% that we were expecting for a 200 point difference.
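
You don’t even need the calculator; you can estimate the same number by sampling the two performance curves directly. Here’s a quick Monte Carlo sketch, assuming both of our performances behave like the Gaussians described above (means of 1200 and 1000, standard deviation of 200), using the Box-Muller transform to generate the samples:

    using System;

    var random = new Random();

    // Draw one sample from a Gaussian using the Box-Muller transform.
    double SampleGaussian(double mean, double stdDev)
    {
        double u1 = 1.0 - random.NextDouble();   // avoid log(0)
        double u2 = random.NextDouble();
        return mean + stdDev * Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
    }

    int myWins = 0;
    const int games = 1_000_000;
    for (int i = 0; i < games; i++)
    {
        double myPerformance = SampleGaussian(1200, 200);
        double beginnerPerformance = SampleGaussian(1000, 200);
        if (myPerformance > beginnerPerformance) myWins++;
    }

    // Prints roughly 76%, matching the calculation above.
    Console.WriteLine($"I win about {(double)myWins / games:P1} of the simulated games");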

So far this has been specific to my particular match with a beginner. We can generalize it by creating another curve where the horizontal axis represents the difference in player ratings and the vertical axis represents the total probability of winning given that rating difference:

As expected, having two players with equal ratings, and thus a rating difference of 0, implies the odds of winning are 50%. Likewise, if you look at the -200 mark, you see the curve is at the 24% that we calculated earlier. Similarly, +200 is at the 76% mark. This also shows that outcomes on the far left side are quite unlikely. For example, the odds of me winning a game against Magnus Carlsen, who is at the top of the chess leaderboard with a rating of 2813, would be at the -1613 mark (1200 - 2813) on this chart and have a probability near one in a billion. I won’t hold my breath. (Actually, most chess groups use a slightly different curve, but the ideas are the same. See the accompanying math paper for details.)

All of these curves were probabilities of what might happen, not what actually happened. In actuality, let’s say I lost the game by some silly blunder (oops!). The question the beginner wants answered is how much his rating will go up. It also makes sense that my rating will go down as a punishment for the loss. The harder question is just how much the ratings should change.

By winning, the beginner demonstrated that he was probably better than the 25% winning probability we thought he would have. One way of updating ratings is to imagine that each player bets a certain amount of his rating on each game. The amount of the bet is determined by the probability of the outcome. In addition, we decide how dramatic the ratings change should be for an individual game. If you believe the most recent game should count 100%, then you’d expect my rating to go down a lot and his to go up a lot. The decision of how much the most recent game should count leads to what chess guys call the multiplicative “K-factor.”

The K-factor is what we multiply a probability by to get the total amount of a rating change. It reflects the maximum possible change in a person’s rating. A reasonable choice of weight is that the most recent game counts about 7%, which leads to a K-factor of 24. New players tend to have more fluctuations than well-established players, so new players might get a K-factor of 32 while grandmasters have a K-factor around 10. Here’s how the K-factor changes with respect to how much the latest game should count:

Using a K-Factor of 24 means that my rating will now be lowered to 1182 and the beginner’s will rise to 1018. Our curves are now closer together:

Note that our standard deviations never change. Here are the probabilities if we were to play again:

This method is known as the Elo rating system, named after Arpad Elo, the chess enthusiast who created it. It’s relatively simple to implement, and most games that calculate skill stop here.
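
Here’s what that betting procedure looks like as code. This is just a minimal Elo sketch (not the accompanying project’s implementation), and it uses the logistic expected-score curve that most chess groups prefer, which gives essentially the same numbers as the Gaussian curve above: the beginner climbs to about 1018 and I drop to about 1182.

    using System;

    // Expected score of player A against player B under the standard logistic Elo curve.
    static double ExpectedScore(double ratingA, double ratingB)
        => 1.0 / (1.0 + Math.Pow(10.0, (ratingB - ratingA) / 400.0));

    // New ratings after A beats B: each player effectively bets K points scaled by
    // how surprising the result was.
    static (double NewWinner, double NewLoser) UpdateAfterWin(double winner, double loser, double kFactor = 24)
    {
        double expected = ExpectedScore(winner, loser);
        double change = kFactor * (1.0 - expected);   // a big upset means a big change
        return (winner + change, loser - change);
    }

    var (beginner, me) = UpdateAfterWin(winner: 1000, loser: 1200);
    Console.WriteLine($"Beginner: 1000 -> {beginner:F0}, Me: 1200 -> {me:F0}");   // ~1018 and ~1182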

I Thought You Said You’d Talk About TrueSkill?

Everything so far has just been a prerequisite to the main event; the TrueSkill paper assumes you’re already familiar with it all. It was all sort of new to me, so it took a while to get comfortable with the Elo ideas. Although the Elo model will get you far, there are a few notable things it doesn’t handle well:

  1. Newbies - In the Elo system, you’re typically assigned a “provisional” rating for the first 20 games. These games tend to have a higher K-factor associated with them in order to let the algorithm determine your skill faster before it’s slowed down by a non-provisional (and smaller) K-factor. We would like an algorithm that converges quickly onto a player’s true skill (get it?) so that we don’t waste their time with unbalanced matches. This means the algorithm should start giving reasonable approximations of skill within 5-10 games.
  2. Teams - Elo was explicitly designed for two players. Efforts to adapt it to work for multiple people on multiple teams have primarily been unsophisticated hacks. One such approach is to treat teams as individual players that duel against the other players on the opposing teams and then apply the average of the duels. This is the “duelling heuristic” mentioned in the TrueSkill paper. I implemented it in the accompanying project. It’s ok, but seems a bit too hackish and doesn’t converge well.
  3. Draws - Elo treats draws as a half win and half loss. This doesn’t seem fair because draws can tell you a lot. A draw implies you were evenly paired, whereas a win indicates you’re better, but not how much better. Likewise, a loss indicates you did worse, but not how much worse. So it seems important to model draws explicitly.

The TrueSkill algorithm generalizes Elo by keeping track of two variables: your average (mean) skill and the system’s uncertainty about that estimate (your standard deviation). It does this instead of relying on something like a fixed K-factor. Essentially, this gives the algorithm a dynamic K-factor. This addresses the newbie problem because it removes the need to have “provisional” games. In addition, it addresses the other problems in a nice statistical manner. Tracking these two values is so fundamental to the algorithm that Microsoft researchers informally referred to it as the μσ (mu-sigma) system until the marketing guys gave it the name TrueSkill.
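
For reference, the commonly cited defaults (they’re in the TrueSkill paper and the Xbox Live documentation, so double-check there before relying on them) start every new player at a mean of 25 with a standard deviation of 25/3. That puts the conservative leaderboard number at exactly 0 on a roughly 0 to 50 scale:

    using System;

    // Commonly cited TrueSkill defaults.
    const double DefaultMu = 25.0;
    const double DefaultSigma = 25.0 / 3.0;     // ~8.33
    double beta = DefaultSigma / 2.0;           // performance variability; one "skill chain" step
    double tau = DefaultSigma / 100.0;          // dynamics factor added before each game
    double drawProbability = 0.10;              // default chance of a draw

    // A brand new player's conservative rating: 25 - 3 * (25 / 3) = 0.
    double conservativeRating = DefaultMu - 3 * DefaultSigma;

    Console.WriteLine($"New player: mu={DefaultMu}, sigma={DefaultSigma:F2}, leaderboard rating={conservativeRating}");
    Console.WriteLine($"beta={beta:F2}, tau={tau:F3}, default draw probability={drawProbability:P0}");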

We’ll go into the details shortly, but it’s helpful to get a quick visual overview of what TrueSkill does. Let’s say we have Eric, an experienced player who has played a lot and established his rating over time. In addition, we have a newbie: Natalia.

Here’s what their skill curves might look like before a game:

And after Natalia wins:

Notice how Natalia’s skill curve becomes narrower and taller (i.e. makes a big update) while Eric’s curve barely moves. This shows that the TrueSkill algorithm thinks that she’s probably better than Eric, but doesn’t know how much better. Although TrueSkill is a little more confident about Natalia’s mean after the game (i.e. it’s now taller in the middle), it’s still very uncertain. Looking at her updated bell curve shows that her skill could be between 15 and 50.

The rest of this post will explain how calculations like this are made and how the algorithm handles much more complicated scenarios. But to understand it well enough to implement it, we’ll need to learn a couple of new things.

Bayesian Probability

Most basic statistics classes focus on frequencies of events occurring. For example, the probability of getting a red marble when randomly drawing from a jar that has 3 red marbles and 7 blue marbles is 30%. Another example is that the probability of rolling two dice and getting a total of 7 is about 17%. The key idea in both of these examples is that you can count each type of outcome and then compute the frequency directly. Although helpful in calculating your odds at casino games, “frequentist” thinking is not that helpful with many practical applications, like finding your skill in a team.
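
Those frequencies really are just counting. Here’s a throwaway sketch that enumerates the outcomes:

    using System;

    // 3 red marbles out of 10 total.
    Console.WriteLine($"Red marble: {3.0 / 10.0:P0}");

    // Count the ways two dice can total 7 out of all 36 possible rolls.
    int sevens = 0, total = 0;
    for (int firstDie = 1; firstDie <= 6; firstDie++)
        for (int secondDie = 1; secondDie <= 6; secondDie++)
        {
            total++;
            if (firstDie + secondDie == 7) sevens++;
        }

    Console.WriteLine($"Total of 7: {sevens}/{total} = {(double)sevens / total:P1}");   // 6/36, about 17%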

A different approach is to think of probability as degree of belief in something. The basic idea is that you have some prior belief and then you observe some evidence that updates your belief leaving you with an updated posterior belief. As you might expect, learning about new evidence will typically make you more certain about your belief.

Let’s assume that you’re trying to find a treasure on a map. The treasure could be anywhere on the map, but you have a hunch that it’s probably around the center of the map and increasingly less likely as you move away from the center. We could track the probability of finding the treasure using the 3D multivariate Gaussian we saw earlier:

Now, let’s say that after studying a book about the treasure, you’ve learned that there’s a strong likelihood that the treasure is somewhere along the diagonal line on the map. Perhaps this was based on some secret clue. Your clue information doesn’t necessarily mean the treasure will be exactly on that line, but rather that the treasure will most likely be near it. The likelihood function might look like this in 3D:

We’d like to use our prior information and this new likelihood information to come up with a better posterior guess of the treasure. It turns out that we can just multiply the prior and likelihood to obtain a posterior distribution that looks like this:

This is giving us a smaller and more concentrated area to look at.

If you look at most textbooks, you typically just see this information using 2D isoprobability contour plots that we learned about earlier. Here’s the same information in 2D:

Prior:

Likelihood:

Posterior:

For fun, let’s say we found additional information saying the treasure is along the other diagonal with the following likelihood:

To incorporate this information, we can take our last posterior and make it the prior for the next iteration, using the new likelihood information to get this updated posterior:

This is a much more focused estimate than our original belief! We could iterate the procedure and potentially get an even smaller search area.

And that’s basically all there is to it. In TrueSkill, the buried treasure that we look for is a person’s skill. This approach to probability is called “Bayesian” because it was discovered by a Presbyterian minister in the 1700’s named Thomas Bayes who liked to dabble in math.
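
The multiply-the-prior-by-the-likelihood step is especially easy in one dimension, which is all we need for a single player’s skill curve: the product of two Gaussians is another Gaussian whose precision (1/σ²) is the sum of the two precisions and whose mean is the precision-weighted average of the two means. Here’s a sketch of that update with made-up prior and “clue” numbers:

    using System;

    // Multiply a Gaussian prior by a Gaussian likelihood; the result is another Gaussian.
    static (double Mean, double StdDev) MultiplyGaussians(
        double priorMean, double priorStdDev, double likelihoodMean, double likelihoodStdDev)
    {
        double priorPrecision = 1.0 / (priorStdDev * priorStdDev);
        double likelihoodPrecision = 1.0 / (likelihoodStdDev * likelihoodStdDev);

        double posteriorPrecision = priorPrecision + likelihoodPrecision;   // always more certain
        double posteriorMean = (priorMean * priorPrecision + likelihoodMean * likelihoodPrecision)
                               / posteriorPrecision;
        return (posteriorMean, Math.Sqrt(1.0 / posteriorPrecision));
    }

    // A vague prior belief, then two pieces of "clue" evidence in a row.
    var belief = (Mean: 25.0, StdDev: 8.0);
    belief = MultiplyGaussians(belief.Mean, belief.StdDev, 32.0, 4.0);
    belief = MultiplyGaussians(belief.Mean, belief.StdDev, 29.0, 4.0);

    Console.WriteLine($"Posterior: mean = {belief.Mean:F1}, standard deviation = {belief.StdDev:F1}");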

The central ideas to Bayesian statistics are the prior, the likelihood, and the posterior. There’s detailed math that goes along with this and is in the accompanying paper, but understanding these basic ideas is more important:

“When you understand something, then you can find the math to express that understanding. The math doesn’t provide the understanding.”— Lamport

Bayesian methods have only recently become popular in the computer age because computers can quickly iterate through several tedious rounds of priors and posteriors. Bayesian methods have historically been popular inside of Microsoft Research (where TrueSkill was invented). Way back in 1996, Bill Gates considered Bayesian statistics to be Microsoft Research’s secret sauce.

As we’ll see later on, we can use the Bayesian approach to calculate a person’s skill. In general, it’s highly useful to update your belief based off previous evidence (e.g. your performance in previous games). This usually works out well. However, sometimes “Black Swans” are present. For example, a turkey using Bayesian inference would have a very specific posterior distribution of the kindness of a farmer who feeds it every day for 1000 days, only to be surprised by a Thanksgiving event that was so many standard deviations away from the turkey’s mean belief that he never would have seen it coming. Skill has similar potential for a “Thanksgiving” event where an average player beats the best player in the world. We’ll acknowledge that small possibility, but ignore it to simplify things (and give the unlikely winner a great story for the rest of his life).

TrueSkill claims that it is Bayesian, so you can be sure that there is going to be a concept of a prior and a likelihood in it— and there is. We’re getting closer, but we still need to learn a few more details.

The Marginalized, but Not Forgotten Distribution

Next we need to learn about “marginal distributions”, often just called “marginals.” Marginals are a way of distilling information to focus on what you care about. Imagine you have a table of sales for each month for the past year. Let’s say that you only care about total sales for the year. You could take out your calculator and add up all the sales in each month to get the total aggregate sales for the year. Since you care about this number and it wasn’t in the original report, you could add it in the margin of the table. That’s roughly where “margin-al” got its name.

Wikipedia has a great illustration on the topic: consider a guy who ignores his mom’s advice and never looks both ways when crossing the street. Even worse, he’s so engrossed in listening to his iPod that he doesn’t look any way; he just always crosses.

What’s the probability of him getting hit by a car at a specific intersection? Let’s simplify things by saying that it just depends on whether the light is red, yellow, or green.

  Light state                                   Red    Yellow   Green
  Probability of getting hit given that state   1%     9%       90%

This is helpful, but it doesn’t tell us what we want. We also need to know how long the light stays a given color:

  Light color         Red    Yellow   Green
  % of time in color  60%    10%      30%

There’s a bunch of probability data here that’s a bit overwhelming. If we join the probabilities together, we’ll have a “joint distribution,” which is just a big complicated system that tells us more than we really want to know.

We can start to distill this information down by calculating, for each light state, the probability of the light being in that state and him getting hit:

                               Red               Yellow            Green             Total
  Probability of getting hit   1% * 60% = 0.6%   9% * 10% = 0.9%   90% * 30% = 27%   28.5%

In the right margin of the table we get the value that really matters to this guy. There’s a 28.5% marginal probability of getting hit if the guy never looks for cars and just always crosses the street. We obtained it by “summing out” the individual components. That is, we simplified the problem by eliminating variables and we eliminated variables by just focusing on the total rather than the parts.
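
The margin column is literally just a dot product of the two tables. In code:

    using System;
    using System.Linq;

    double[] probabilityOfHitGivenLight = { 0.01, 0.09, 0.90 };   // red, yellow, green
    double[] fractionOfTimeInLight      = { 0.60, 0.10, 0.30 };

    // "Sum out" the light state to get the marginal probability of getting hit.
    double marginalProbabilityOfHit = probabilityOfHitGivenLight
        .Zip(fractionOfTimeInLight, (hit, time) => hit * time)
        .Sum();

    Console.WriteLine($"Marginal probability of getting hit: {marginalProbabilityOfHit:P1}");   // 28.5%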

This idea of marginalization is very general. The central question in this article is “computing your skill,” but your skill is complicated. When using Bayesian statistics, we often can’t observe something directly, so we have to come up with a probability distribution that’s more complicated and then “marginalize” it to get the distribution that we really want. We’ll need to marginalize your skill by doing a similar “summing-out” procedure as we did for the reckless guy above.

But before we do that, we need to learn another technique to make calculations simpler.

What’s a Factor Graph, and Why Do I Care?

Remember your algebra class when you worked with expressions like wx + wy + wz?

Your teacher showed you that you could simplify this by “factor-ing” out the w to get w(x + y + z).

We often factor expressions to make them easier to understand and to simplify calculations. Let’s replace the variables above with w=4, x=1, y=2, and z=3.

Let’s say the numbers on our calculator are circles and the operators are squares. We could come up with an “expression tree” to describe the calculation like this:

You can tell how tedious this computation is by counting the 11 “buttons” we’d have to push. We could also factor it like this:

This “factorization” has a total of 7 buttons, a savings of 4 buttons. It might not seem like much here, but factorizing is a big idea.

We face a similar problem of how to factor things when we’re looking to simplify a complicated probability distribution. We’ll soon see how your skill is composed of several “factors” in a joint distribution. We can simplify computations based on how variables are related to these factors. We’ll break up the joint distribution into a bunch of factors on a graph. This graph that links factors and variables is called a “factor graph.”

The key idea about a factor graph is that we represent the marginal conditional probabilities as variables and then represent each major function of those variables as a “factor.” We’ll take advantage of how the graph “factorizes” and imagine that each factor is a node on a network that’s optimized for efficiency. A key efficiency trick is that factor nodes send “messages” to other nodes. These messages help simplify further marginal computations. The “message passing” is very important and thus will be highlighted with arrows in the upcoming graphs; gray arrows represent messages going “down” the graph and black arrows show messages coming “up” the graph.

The accompanying code and math paper go into details about exactly how this happens, but it’s important to realize the high level idea first. That is, we want to look at all the factors that go into creating the likelihood function for updating a person’s skill based on a game outcome. Representing this information in a factor graph helps us see how things are related.

Now that we have all the foundational concepts, we’re ready for the main event: the TrueSkill factor graph!

Enough Chess, Let’s Rank Something Harder!

The TrueSkill algorithm is Bayesian because it’s composed of a prior multiplied by a likelihood. I’ve highlighted these two components in the sample factor graph from the TrueSkill paper that looks scary at first glance:

This factor graph shows the outcome of a match that had 3 teams all playing against each other. The first team (on the left) only has one player, but this player was able to defeat both of the other teams. The second team (in the middle) had two players and this team tied the third team (on the right) that had just one player.

In TrueSkill, we just care about a player’s marginal skill. However, as is often the case with Bayesian models, we have to explicitly model other things that impact the variable we care about. We’ll briefly cover each factor (more details are in the code and math paper).

Factor #1: What Do We Already Know About Your Skill?

The first factor starts the whole process. It’s where we get a player’s previous skill level from somewhere (e.g. a player database). At this point, we add some uncertainty to your skill’s standard deviation to keep game dynamics interesting and to prevent the standard deviation from hitting zero, since the rest of the algorithm only makes it smaller (the whole point is to learn about you and become more certain).

There is a factor and a variable for each player. Each factor is a function that remembers a player’s previous skill. Each variable node holds the current value of a player’s skill. I say “current” because this is the value that we’ll want to know about after the whole algorithm is completed. Note that the message arrow on the factor only goes one way; we never go back to the prior factor. It just gets things going. However, we will come back to the variable.
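
That “add some uncertainty” step is tiny in code. Here’s a sketch assuming the additive dynamics factor τ described in the paper; variances add, so the new standard deviation is just the square root of the sum of squares:

    using System;

    // Before each game, widen the player's skill belief slightly so it can keep adapting.
    static double AddDynamicsFactor(double sigma, double tau) =>
        Math.Sqrt(sigma * sigma + tau * tau);

    double sigma = 1.2;            // a well-established player we're already quite certain about
    double tau = 25.0 / 300.0;     // the commonly cited default dynamics factor

    Console.WriteLine($"sigma before: {sigma:F4}, after: {AddDynamicsFactor(sigma, tau):F4}");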

But we’re getting ahead of ourselves.

Factor #2: How Are You Going To Perform?

Next, we add in beta (β). You can think of beta as the number of skill points a player needs above an opponent to have about an 80% chance of winning. The TrueSkill inventors refer to beta as defining the length of a “skill chain.”

The skill chain is composed of the worst player on the far left and the best player on the far right. Each subsequent person on the skill chain is “beta” points better and has an 80% win probability against the weaker player. This means that a small beta value indicates a high-skill game (e.g. Go) since smaller differences in points lead to the 80%:20% ratio. Likewise, a game based on chance (e.g. Uno) is a low-skill game that would have a higher beta and smaller skill chain.

Factor #3: How is Your Team Going to Perform?

Now we’re ready for one of the most controversial aspects of TrueSkill: computing the performance of a team as a whole. In TrueSkill, we assume the team’s performance is the sum of each team member’s performance. I say that it’s “controversial” because some members of the team probably work harder than others. Additionally, sometimes special dynamics occur that make the sum greater than the parts. However, we’ll fight the urge to make it much more complicated and heed Makridakis’s advice:

“Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones”

One cool thing about this factor is that you can weight each team member’s contribution by the amount of time that they played. For example, if two players are on a team but each player only played half of the time (e.g. a tag team), then we would treat them differently than if these two players played the entire time. This is officially known as “partial play.” Xbox game titles report the percentage of time a player was active in a game under the “X_PROPERTY_PLAYER_PARTIAL_PLAY_PERCENTAGE” property that is recorded for each player (it defaults to 100%). This information is used by TrueSkill to perform a fairer update. I implemented this feature in the accompanying source code.
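
Because a weighted sum of independent Gaussians is itself a Gaussian, this factor boils down to a couple of sums. Here’s a sketch under that additive model, with partial-play weights expressed as the fraction of time played (the player numbers are made up; the exact treatment is in the accompanying paper and code):

    using System;
    using System.Linq;

    // Each player's performance is modeled as N(mu, sigma^2 + beta^2); the team's
    // performance is the (partial-play weighted) sum of its players' performances.
    static (double Mean, double Variance) TeamPerformance(
        (double Mu, double Sigma, double Weight)[] players, double beta)
    {
        double mean = players.Sum(p => p.Weight * p.Mu);
        double variance = players.Sum(p => p.Weight * p.Weight * (p.Sigma * p.Sigma + beta * beta));
        return (mean, variance);
    }

    double beta = 25.0 / 6.0;
    var team = new[]
    {
        (Mu: 28.0, Sigma: 2.0, Weight: 1.0),   // played the whole game
        (Mu: 22.0, Sigma: 5.0, Weight: 0.5)    // tag-teamed in for half the game
    };

    var (teamMean, teamVariance) = TeamPerformance(team, beta);
    Console.WriteLine($"Team performance ~ N({teamMean:F1}, {Math.Sqrt(teamVariance):F1}^2)");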

Factor #4: How’d Your Team Compare?

Next, we compare team performances in pairs. We do this by subtracting team performances to come up with pairwise differences:

This is similar to what we did earlier with Elo and subtracting curves to get a new curve.

Factor #5: How Should We Interpret the Team Differences?

The bottom of the factor graph contains a comparison factor based on the team performance differences we just calculated:

The comparison depends on whether the pairwise difference was considered a “win” or a “draw.” Obviously, this depends on the rules of the game. It’s important to realize that TrueSkill only cares about these two types of results. TrueSkill doesn’t care if you won by a little or a lot; the only thing that matters is if you won. Additionally, in TrueSkill we imagine that there is a buffer of space called a “draw margin” where performances are considered equivalent. For example, in Olympic swimming, two swimmers can “draw” because times are only recorded to a hundredth of a second, even though their actual times differ by several thousandths of a second. In this case, the “draw margin” is relatively small, around 0.005 seconds. Draws are very common in chess at the grandmaster level, so the draw margin would be much greater there.

The output of the comparison factor directly relates to how much your skill’s mean and standard deviation will change.

The exact math involved in this factor is complicated, but the core idea is simple:

  • Expected outcomes cause small updates because the algorithm already had a good guess of your skill.
  • Unexpected outcomes (upsets) cause larger updates to make the algorithm more likely to predict the outcome in the future.

The accompanying math paper goes into detail, but conceptually you can think of the performance difference as a number on the bottom (x-axis) of a graph. It represents the difference between the expected winner and the expected loser. A large negative number indicates a big upset (e.g. an underdog won) and a large positive number means the expected winner won. The exact update of your skill’s mean will depend on the probability of a draw, but you can get a feel for it by looking at this graph:

Similarly, the update to a skill’s standard deviation (i.e. uncertainty) depends on how expected the outcome was. An expected outcome shrinks the uncertainty by a small amount (e.g. we already knew it was going to happen). Likewise, an unexpected outcome shrinks the standard deviation more because it was new information that we didn’t already have:

One problem with this comparison factor is that we use some fancy math that just makes an approximation (a good approximation, but still an approximation). We’ll refine the approximation in the next step.
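
To make the mean and standard deviation updates above concrete, here’s a minimal sketch of the simplest possible case: exactly two players, a clear winner, no draw margin, and no dynamics factor. It follows the two-player win/loss update formulas from the TrueSkill paper (the v and w correction functions); the full multi-team, draw-aware version is what the factor graph machinery and the accompanying code handle.

    using System;

    // Standard normal density and cumulative distribution. The CDF uses the classic
    // Abramowitz & Stegun 7.1.26 approximation of erf (accurate to about 1e-7).
    static double NormalPdf(double x) => Math.Exp(-x * x / 2.0) / Math.Sqrt(2.0 * Math.PI);

    static double NormalCdf(double x)
    {
        double sign = x < 0 ? -1.0 : 1.0;
        x = Math.Abs(x) / Math.Sqrt(2.0);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t);
        return 0.5 * (1.0 + sign * (1.0 - poly * Math.Exp(-x * x)));
    }

    // The "how surprising was this win?" corrections for the mean (v) and the variance (w).
    static double V(double t) => NormalPdf(t) / NormalCdf(t);
    static double W(double t) { double v = V(t); return v * (v + t); }

    // One update for winner vs. loser (two players, no draw margin, no dynamics factor).
    static ((double Mu, double Sigma) Winner, (double Mu, double Sigma) Loser) Update(
        (double Mu, double Sigma) winner, (double Mu, double Sigma) loser, double beta)
    {
        double c = Math.Sqrt(2 * beta * beta + winner.Sigma * winner.Sigma + loser.Sigma * loser.Sigma);
        double t = (winner.Mu - loser.Mu) / c;   // how expected the actual result was

        double winnerMu = winner.Mu + winner.Sigma * winner.Sigma / c * V(t);
        double loserMu  = loser.Mu  - loser.Sigma  * loser.Sigma  / c * V(t);

        double winnerSigma = winner.Sigma * Math.Sqrt(1 - winner.Sigma * winner.Sigma / (c * c) * W(t));
        double loserSigma  = loser.Sigma  * Math.Sqrt(1 - loser.Sigma  * loser.Sigma  / (c * c) * W(t));

        return ((winnerMu, winnerSigma), (loserMu, loserSigma));
    }

    // An upset like Natalia (an uncertain newbie) beating Eric (a well-established player):
    var natalia = (Mu: 25.0, Sigma: 25.0 / 3.0);
    var eric    = (Mu: 30.0, Sigma: 1.5);
    var (newNatalia, newEric) = Update(winner: natalia, loser: eric, beta: 25.0 / 6.0);

    // Natalia's mean jumps and her uncertainty shrinks a lot; Eric barely moves.
    Console.WriteLine($"Natalia: {natalia.Mu:F1} ± {natalia.Sigma:F1} -> {newNatalia.Mu:F1} ± {newNatalia.Sigma:F1}");
    Console.WriteLine($"Eric:    {eric.Mu:F1} ± {eric.Sigma:F1} -> {newEric.Mu:F1} ± {newEric.Sigma:F1}");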

The Inner Schedule: Iterate, Iterate, Iterate!

We can make a better approximation of the team difference factors by passing around the messages that keep getting updated in the following loop:

After a few iterations of this loop, the changes will be less dramatic and we’ll arrive at stable values for each marginal.

Enough Already! Give Me My New Rating!

Once the inner schedule has stabilized the values at the bottom of the factor graph, we can reverse the direction of each factor and propagate messages back up the graph. These reverse messages are represented by black arrows in the graph of each factor. Each player’s new skill rating will be the value of the player’s skill marginal variable once messages have reached the top of the factor graph.

By default, we give everyone a “full” skill update, which is the result of the above procedure. However, there are times when a game title might not want the match outcome to count as much because of less-than-optimal playing conditions (e.g. there was a lot of network lag during the game). Games can do this with a “partial update,” which is just a way to apply only a fraction of the full update. Game titles specify this via the X_PROPERTY_PLAYER_SKILL_UPDATE_WEIGHTING_FACTOR variable. I implemented this feature in the accompanying source code and describe it in the math paper.

Results

There are some more details left, but we’ll stop for now. The accompanying math paper and source code fill in most of the missing pieces. One of the best ways to learn the details is to implement TrueSkill yourself. Feel free to create a port of the accompanying project in your favorite language and share it with the world. Writing your own implementation will help solidify all the concepts presented here.

The most rewarding part of implementing the TrueSkill algorithm is to see it work well in practice. My coworkers have commented on how it’s almost “eerily” accurate at computing the right skill for everyone relatively quickly. After several months of playing foosball, the top of the leaderboard (sorted by TrueSkill: the mean minus 3 standard deviations) was very stable. Recently, a very good player started playing and is now the #2 player. Here’s a graph of the most recent changes in TrueSkill for the top 5 (of around 40) foosball players:

(Note: Look how quickly the system detected how good this new #2 player is even though his win ratio is right at 50%)

Another interesting aspect of implementing TrueSkill is that it has raised an awareness of ratings among players. People that otherwise wouldn’t have played together now occasionally play each other because they know they’re similarly matched and will have a good game. One advantage of TrueSkill is that it’s not that big of a deal to lose to a much better player, so it’s still ok to have unbalanced games. In addition, having ratings has been a good way to judge if you’re improving in ability with a new shot technique in foosball or learning more chess theory.

Fun Things from Here

The obvious direction to go from here is to add more games to the system and see if TrueSkill handles them equally well. Given that TrueSkill is the default ranking system on Xbox Live, this will probably work out well. Another direction is to see if there’s a big difference in TrueSkill based on position in a team (e.g. midfield vs. goalie in foosball). Given TrueSkill’s sound statistical approach to ranking and matchmaking, you might even have some success in using it to decide between several options. You could have each option be a “player” and decide each “match” based on your personal whims of the day. If nothing else, this would be an interesting way to pick your next vacation spot or even your child’s name.

If you broaden the scope of your search to using the ideas that we’ve learned along the way, there are a lot more applications. Microsoft’s AdPredictor (i.e. the part that delivers relevant ads on Bing) was created by the TrueSkill team and uses similar math, but is a different application.

As for me, it was rewarding to work with an algorithm that has fun social applications and to pick up machine learning tidbits along the way. It’s too bad all of that didn’t help me hit the top of any of the leaderboards.

Oh well, it’s been a fun journey. I’d love to hear if you dived into the algorithm after reading this and would especially appreciate any updates to my code or other language forks.

Links:

  • The Math Behind TrueSkill - A math-filled paper that fills in some of the details left out of this post.
  • Moserware.Skills Project on GitHub - My full implementation of Elo and TrueSkill in C#. Please feel free to create your own language forks.
  • Microsoft's online TrueSkill Calculators - Allows you to play with the algorithm without having to download anything. My implementation matches the results of these calculators.

Special thanks to Ralf Herbrich, Tom Minka, and Thore Graepel on the TrueSkill team at Microsoft Research Cambridge for their help in answering many of my detailed questions about their fascinating algorithm.