Fixing “On Pace” Numbers

Suppose I tell you that a baseball team has just started the season 10-0, and you know literally nothing else about the team. What is a reasonable expectation for the number of games this team will win? Even if you don’t know the answer offhand, you probably know that it is not “162.” Tom Tango has been taking to Twitter recently to mock these “on-pace” numbers, and for good reason: saying our hypothetical team is “on pace” for 162 wins tells you nothing about how it will actually finish. So how do we fix it? I’m going to proceed the way a Bayesian statistician might, but mostly explaining the logic at each step rather than working through any complicated math. Follow along if you want to see how a statistician thinks.

First question: why is it so hard to believe that this team might go 162-0? After all, if I told you the UConn women’s basketball team started their season 10-0, it might not be too much of a stretch to believe they could go undefeated. Or if I told you it hasn’t rained in the Atacama Desert in ten days, you might believe me when I say it won’t rain for the next 150 or so. What’s different about baseball, then? Well, we might start with the fact that it’s never happened before. In fact, no team has ever really come close. Thanks to Sean Lahman’s database we can pull historical winning percentages for every team since 1961 and make a histogram of them:

[Figure: histogram of winning percentages for every team-season since 1961]
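To make this concrete, here’s a minimal Python sketch of that pull, assuming a local copy of Teams.csv from the Lahman database (its columns include yearID, W, and L); the full code for this post is linked at the end:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the Teams table from the Lahman database (assumes a local CSV copy).
teams = pd.read_csv("Teams.csv")
teams = teams[teams["yearID"] >= 1961]

# Winning percentage for every team-season since 1961 (ties excluded).
win_pct = teams["W"] / (teams["W"] + teams["L"])

plt.hist(win_pct, bins=30, density=True, edgecolor="black")
plt.xlabel("Winning percentage")
plt.ylabel("Density")
plt.title("Team winning percentages, 1961-present")
plt.show()
```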


If you’re expecting this team’s win percentage to fall anywhere outside the 30%-70% range, you’re kidding yourself. In fact, getting a winning percentage of more than 60% is no small feat. This is what Bayesians call the “prior” — knowledge we bring to the problem before the problem itself is posed. If someone asked you “how will my team do this season?” without giving you any other information, you’d put them somewhere in this range.

Mathematically, we can put numbers on this histogram. It looks a bit like a Beta distribution, and we can estimate that distribution’s parameters from the sample mean and variance (the method of moments). We get alpha = beta = 24.5, and doing this gives a pretty nice fit:

[Figure: Beta distribution fit overlaid on the winning-percentage histogram]
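The method-of-moments fit is only a couple of lines in code. This sketch assumes the win_pct series from the snippet above:

```python
# Method-of-moments fit: for a Beta(alpha, beta) distribution,
#   mean = alpha / (alpha + beta)
#   var  = mean * (1 - mean) / (alpha + beta + 1)
# Solving those two equations for alpha and beta:
mu = win_pct.mean()
var = win_pct.var()

nu = mu * (1 - mu) / var - 1   # nu = alpha + beta
alpha = mu * nu
beta = (1 - mu) * nu
print(alpha, beta)             # comes out near 24.5, 24.5 on this data
```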


So where are we with this problem? If we had no knowledge of anything, we’d put this team at .500, with a 95% credible interval of about .360 – .640. But we have to apply the knowledge that this team started 10-0. How do we do that? In statistics, we call this “updating our prior.” We combine the 10-0 start with the knowledge of how teams performed historically using a beta-binomial model. It sounds complicated, and the math behind it is, to some extent. But the end result is easy! It turns out that if the prior follows a Beta distribution with parameters alpha and beta, the “posterior” (the result of combining our observed data with the prior) follows a Beta distribution with parameters alpha + S and beta + F, where S is the number of observed successes and F is the number of observed failures. In our case that means alpha = 24.5 + 10 = 34.5 and beta = 24.5 + 0 = 24.5. We can plot the new (posterior) distribution along with the old (prior) one:

[Figure: prior and posterior Beta distributions]
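Here’s a sketch of that update and plot using scipy’s Beta distribution, again assuming the parameters we just fit:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta as beta_dist

# Conjugate update: a Beta(a, b) prior with S successes and F failures
# observed gives a Beta(a + S, b + F) posterior.
a_prior, b_prior = 24.5, 24.5
S, F = 10, 0                                  # the 10-0 start
a_post, b_post = a_prior + S, b_prior + F     # 34.5 and 24.5

x = np.linspace(0.2, 0.8, 500)
plt.plot(x, beta_dist.pdf(x, a_prior, b_prior), label="Prior")
plt.plot(x, beta_dist.pdf(x, a_post, b_post), label="Posterior after 10-0")
plt.xlabel("True-talent winning percentage")
plt.ylabel("Density")
plt.legend()
plt.show()
```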


You can see that the new distribution has shifted to the right, as it should. It’s also a bit tighter than the prior, reflecting the new information we have; in particular, the probability that this is a true-talent .400 team is now much lower. And our best estimate for their win percentage? That would be 58.4%, good for a 94-95 win season*. Essentially, we’re just adding 24.5 wins and 24.5 losses to their ledger and reading the win percentage from there. If you prefer, it may make more sense to bank those 10 wins and apply the 58.4% to the remaining 152 games, giving 98-99 wins. Maybe that seems a little high to you, but it’s sure as heck more reasonable than saying they’re on pace to win 162 games.

* This is the mean; the mode would actually be slightly higher.
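The arithmetic behind those two projections, as a quick sketch:

```python
# Posterior mean and the two season projections from the text.
a_post, b_post = 34.5, 24.5
post_mean = a_post / (a_post + b_post)   # 34.5 / 59 = 0.584...

straight_pace = 162 * post_mean          # ~94.7 wins over a full season
banked = 10 + 152 * post_mean            # ~98.9: keep the 10 wins in hand,
                                         # apply 58.4% to the 152 games left
print(round(post_mean, 3), round(straight_pace, 1), round(banked, 1))
```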

This is not a fully Bayesian framework, but it’s a fairly easy and common way of dealing with small sample sizes. The general approach applies to all sorts of problems: not just wins and losses, but small samples of any kind. The math is fairly accessible as well, since getting the parameters of the Beta distribution is as easy as finding the mean and variance and plugging them into a formula. Try it out the next time you’re confronted with such a problem.

Code to generate the data and plots for this post is available on my GitHub.





The Kudzu Kid does not believe anyone actually reads these author bios.

4 Comments
a eskpert
6 years ago

This is a really cool use of conjugate priors. I wish my instructor used this example in undergrad.

Tangotiger
6 years ago

{clap clap clap}
Every Stats 101 class should have this article as one of its lessons.

tz
6 years ago
Reply to  Tangotiger

This should also be re-posted every year on its anniversary as a public service announcement. Don’t think I’ve ever seen this topic explained as succinctly.