The Outcome Machine: Predicting At Bats Before They Happen

A player comes up to the plate. He’s a very good hitter; he’s hitting .300 on the year and has 40 home runs. On the mound stands a pitcher, also very good. The pitcher is a Cy Young candidate, and his ERA sits barely over 2.00. He leads the league in strikeouts and issues very few walks.

After a 10-pitch battle, the pitcher is the one to crack and the batter slaps a hanging curveball into the gap for a double. The batter has won. His batting average for the at bat is a very nice 1.000. Same for his OBP. His slugging percentage? 2.000. Fantastic. If he did this every time, he’d be MVP, no question, every year. The pitcher, meanwhile, has a WHIP for the at bat of #DIV/0!. Hasn’t even recorded a single out. His ERA is the same. He’s not doing too great. But let’s be fair. We’ll give him the benefit of the doubt, since we know he’s a good pitcher – we’ll pretend he recorded one out before this happened. Now his WHIP is 3.000. Yeesh – ugly. If he keeps pitching like this, his ERA will climb, too, since double after double after double is sure to drive every previous runner home.

Now, obviously, this is a bit ridiculous. Not every at bat is the same. The hitter won’t double every single at bat, and the pitcher won’t allow a double every time either. Baseball is a game of random variation, skill, luck, quality of opponents and teammates, and a whole bunch of other elements. In our scenario, all those elements came together to result in a two-bagger. But, like we said, you can’t expect that to happen every single time just because it happens once.

So… how do we predict what will happen in an at bat? Any person well-versed in baseball research knows that past performance against a specific batter or pitcher means little in terms of how the next at bat will turn out, at least not until you get a meaningful number of plate appearances – and even then it’s not the best tool.

Of course, if we knew the result of every at bat before it happened, it would take most of the fun out of watching. But we’re never going to be able to do that, and so we might as well try to predict as best we can. And so I have come up with a methodology for doing so that I think is very accurate and reliable, and this post is meant to present it to you.

To claim full credit for the inspiration behind this idea would be wrong; FanGraphs author and baseball-statistics aficionado Steve Staude wrote an article back in June 2013 aiming to predict the probability of a strikeout given both the batter’s and the pitcher’s strikeout rates, which led me to this topic. In that article he found a very consistent (and good) model that predicted strikeouts:

Expected Matchup K% = B x P / (0.84 x B x P + 0.16)
Where B = the batter’s historical K% against the handedness of the pitcher; and P = the pitcher’s historical K% against the handedness of the batter
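
As a quick illustration, Steve's formula can be written as a small function. This is only a sketch: the function and argument names are mine, and it assumes both rates are given as decimals (0.20 rather than 20).

```python
def expected_matchup_k_rate(batter_k, pitcher_k):
    """Expected matchup K% from the formula quoted above.

    batter_k:  batter's historical K% vs. the pitcher's handedness (decimal)
    pitcher_k: pitcher's historical K% vs. the batter's handedness (decimal)
    """
    product = batter_k * pitcher_k
    return product / (0.84 * product + 0.16)


# Example: a batter and a pitcher who each strike out in 20% of plate appearances.
example_rate = expected_matchup_k_rate(0.20, 0.20)
print(round(example_rate, 4))
```

Raising either input raises the result, and a rate of zero on either side gives an expected K% of zero.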

He then followed that up with another article that provided an interactive tool that you could play around with to get the expected K% for a matchup of your choosing and introduced a few new formulas (mostly suggested in the comments of his first article) to provide different perspectives. It’s all very interesting stuff.

But all that gets us is K%. Which, you know, is great, and strikeouts are probably one of the most important and indicative raw numbers to know for a matchup. But that doesn’t tell us about any other stats. So as a means of following up on what he’s done (something he mentioned in the article but I have not seen any evidence of) and also as a way to find the probability of each outcome for every type of matchup (a daunting task), I did my own research.

My methodology was very similar. I took all players and plate appearances from 2003-2013 (Steve’s dataset was 2002-2012; also, I got the data all from retrosheet.org via baseballheatmaps.com – both truly indispensable resources) and for each player found their K%, BB%, 1B%, 2B%, 3B%, HR%, HBP%, and BABIP during that time. This means that a player like, say, Derek Jeter will only have his 2003-2013 stats included, not any from before 2003. I further refined that by separating each player’s numbers into vs. righty and vs. lefty numbers (Steve, in another article, proved that handedness matchups were important). I did this for both batters and pitchers. Then, for each statistic, I grouped the numbers for the batters and the numbers for the pitchers, and found the percentage of plate appearances involving a batter and a pitcher with the two grouped numbers that ended in the result in question. That’s kind of a mouthful, so let me provide an example:

[Image 1: table of sample K% groupings, with columns batter, pitcher, total, and count(*)]

These are my results for strikeout percentage (numbers here are expressed as decimals out of 1, not percentages out of 100). Total means the total proportion of plate appearances with those parameters that ended in a strikeout, while batter and pitcher mean the K% of the batter and pitcher, respectively. Count(*) measures exactly how many instances of that exact matchup there were in my data. Another important point to note – this is by no means all of the combinations that exist; in fact, for strikeouts, there were over 2,000, far more than the 20 shown here. I did have to remove many of those since there were too few observations to make meaningful assumptions…

[Image 2: groupings removed for having too few observations]

…but I was still left with a good amount of data to work with (strikeout percentage gave me just over 400 groupings, which was plenty for my task). I went through this process for each of the rate stats that I laid out above.
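
The grouping step described above can be sketched roughly as follows. This is not the actual query behind the tables; the input layout, the bucket width of 0.01, rounding to the nearest bucket, and the `min_count` cutoff are all assumptions made for illustration.

```python
from collections import defaultdict


def group_matchups(plate_appearances, bucket=0.01, min_count=1000):
    """Group plate appearances by rounded batter and pitcher rates.

    plate_appearances: iterable of (batter_rate, pitcher_rate, ended_in_event),
    where the rates are decimals and ended_in_event is True when the plate
    appearance ended in the outcome being studied (a strikeout, say).
    """
    tallies = defaultdict(lambda: [0, 0])  # bucket key -> [events, plate appearances]
    for bat_rate, pit_rate, ended_in_event in plate_appearances:
        key = (round(bat_rate / bucket), round(pit_rate / bucket))
        tallies[key][0] += 1 if ended_in_event else 0
        tallies[key][1] += 1

    groups = []
    for (bat_idx, pit_idx), (events, count) in sorted(tallies.items()):
        if count < min_count:  # hypothetical cutoff for "too few observations"
            continue
        groups.append({
            "batter": round(bat_idx * bucket, 4),
            "pitcher": round(pit_idx * bucket, 4),
            "total": events / count,
            "count": count,
        })
    return groups


# Tiny made-up sample: three plate appearances land in the (0.16, 0.20) group.
sample = [(0.164, 0.201, True), (0.158, 0.199, False),
          (0.162, 0.203, False), (0.250, 0.100, True)]
groups = group_matchups(sample, min_count=2)
```

Each returned row mirrors the columns in the tables: the grouped batter and pitcher rates, the proportion of plate appearances that ended in the event, and the number of observations.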

My next step was to come up with a model that fit these data – in other words, estimate the total K% from the batter and pitcher K%. I did this by running a multiple regression in R, but I encountered some problems with the linearity of the data. For example, here are the results of my regression for BB% plotted against the real values for BB%:

[Image 3: scatterplot of regression-predicted BB% against actual BB%]

It looks pretty good – and the r^2 of the regression line was .9653, which is excellent – but it appears to be a little bit curved. To counter that I ran a regression with the dependent variable being the natural logarithm of the total BB%, and the independent variables being the natural logarithms of the batter’s and pitcher’s BB%. After running the regression, here is what I got:

[Image 4: scatterplot of predicted against actual BB% after the logarithmic transformation]

The scatterplot is much more linear, and the r^2 increased to .988. This means that ln(total) = ln(bat)*coefficient + ln(pit)*coefficient + intercept, where the batter and pitcher terms each have their own coefficient. So if we raise e to the power of both sides, we get total = e^(ln(bat)*coefficient + ln(pit)*coefficient + intercept). This formula, with different coefficients and intercepts for each stat, fits each of K%, BB%, 1B%, 3B%, HR%, and HBP% remarkably well; for some reason, both 2B% and BABIP did not need to be “linearized” like this and were fitted better by a plain linear regression without any logarithm doctoring.
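
For readers who want to see the log-log fit in action, here is a minimal sketch in plain Python. The regressions in this post were run in R; this is not that code. It fits ordinary least squares through the normal equations, ignores any weighting by group size, and checks itself on synthetic data built from made-up coefficients.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_log_log(bat, pit, total):
    """Regress ln(total) on ln(bat) and ln(pit).

    Returns (batter coefficient, pitcher coefficient, intercept).
    """
    rows = [[math.log(b), math.log(p), 1.0] for b, p in zip(bat, pit)]
    y = [math.log(t) for t in total]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Synthetic check: build totals from known (made-up) coefficients, then recover them.
bat, pit, total = [], [], []
for b in [0.10, 0.15, 0.20, 0.25]:
    for p in [0.12, 0.18, 0.24]:
        bat.append(b)
        pit.append(p)
        total.append(math.exp(0.9 * math.log(b) + 0.95 * math.log(p) + 1.5))

coef_bat, coef_pit, intercept = fit_log_log(bat, pit, total)
```

Because the synthetic totals follow the log-log form exactly, the fit should return the coefficients that generated them; real grouped data would leave some residual scatter.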

Here are the regression equations, along with the r^2, for each of the stats:

Stat | Regression equation | r^2
K% | e^(.9427*ln(bat) + .9254*ln(pit) + 1.5268) | 0.9887
BB% | e^(.906*ln(bat) + .8644*ln(pit) + 1.9975) | 0.9880
1B% | e^(1.01*ln(bat) + 1.017*ln(pit) + 1.943) | 0.9312
2B% | .9206*bat + .95779*pit - .03968 | 0.7315
3B% | e^(.8435*ln(bat) + .8698*ln(pit) + 3.8809) | 0.7739
HR% | e^(.9576*ln(bat) + .9268*ln(pit) + 3.2129) | 0.8474
HBP% | e^(.8761*ln(bat) + .7623*ln(pit) + 2.995) | 0.8963
BABIP | 1.0403*bat + .9135*pit - .2573 | 0.9655
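
To make the equations concrete, here is a small sketch that plugs a batter rate and a pitcher rate into them. The coefficients are copied from the table above; the dictionary layout and function name are my own, and the inputs are assumed to be decimals greater than zero (the logarithm is undefined at zero).

```python
import math

# stat -> (batter coefficient, pitcher coefficient, intercept, uses logarithms)
COEFFICIENTS = {
    "K%":    (0.9427, 0.9254, 1.5268, True),
    "BB%":   (0.906, 0.8644, 1.9975, True),
    "1B%":   (1.01, 1.017, 1.943, True),
    "2B%":   (0.9206, 0.95779, -0.03968, False),
    "3B%":   (0.8435, 0.8698, 3.8809, True),
    "HR%":   (0.9576, 0.9268, 3.2129, True),
    "HBP%":  (0.8761, 0.7623, 2.995, True),
    "BABIP": (1.0403, 0.9135, -0.2573, False),
}


def predict_matchup(stat, bat, pit):
    """Predicted matchup rate for one stat, given batter and pitcher rates."""
    coef_bat, coef_pit, intercept, uses_logs = COEFFICIENTS[stat]
    if uses_logs:
        # e^(coef_bat*ln(bat) + coef_pit*ln(pit) + intercept)
        return math.exp(coef_bat * math.log(bat) + coef_pit * math.log(pit) + intercept)
    # 2B% and BABIP use the plain linear form
    return coef_bat * bat + coef_pit * pit + intercept


# Example: a batter and a pitcher who each see singles in 15% of plate appearances.
single_rate = predict_matchup("1B%", 0.15, 0.15)
babip = predict_matchup("BABIP", 0.300, 0.300)
```

For the logarithmic stats, the inputs to ln are simply the batter's and pitcher's rates; for 2B% and BABIP, the rates are used directly.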

The first thing that should jump out to you (or at least one of the first) is the extremely high correlation for BABIP. It totally blew my mind to think that you can find the probability, with 96% accuracy, that a batted ball will fall for a hit, given the batter’s BABIP and pitcher’s BABIP.

Another immediate observation: K%, BB%, and HBP% generally have higher correlations than 1B%, 2B%, 3B%, and HR%. This is likely due to the increased luck and randomness that a batted ball is subjected to; for example, a triple needs to have two things happen to become a triple (being put in play and falling in an area where the batter will get exactly three bases), whereas a strikeout only needs one thing to happen – the batter needs to strike out. Overall, I was very satisfied with these results, since the correlations were overall higher than I expected.

Now comes the good part – putting it all together. We have all the inputs we need to calculate many commonly-used batting stats: AVG, OBP, SLG, OPS, and wOBA. So once we input the batter and pitcher numbers, we should be able to calculate those stats with high accuracy. I developed a tool to do just that:

For a full explanation of the tool and how to use it, head over to my (new and shiny!) blog. I encourage you to go play around with this to see the different results.
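
For anyone who would rather script the last step, here is a rough sketch of how per-plate-appearance outcome rates could be turned into AVG, OBP, SLG, and OPS. It is not the spreadsheet's logic. It assumes every plate appearance that is not a walk or hit-by-pitch counts as an at bat (no sacrifices), and it leaves out wOBA because the linear weights change from season to season.

```python
def slash_line(rates):
    """Convert per-plate-appearance outcome rates into AVG, OBP, SLG, and OPS.

    rates: dict with decimal rates for "BB", "HBP", "1B", "2B", "3B", and "HR".
    """
    hits = rates["1B"] + rates["2B"] + rates["3B"] + rates["HR"]
    total_bases = rates["1B"] + 2 * rates["2B"] + 3 * rates["3B"] + 4 * rates["HR"]
    at_bats = 1.0 - rates["BB"] - rates["HBP"]  # assumption: no sacrifice flies or bunts
    avg = hits / at_bats
    obp = hits + rates["BB"] + rates["HBP"]  # per plate appearance, so no denominator
    slg = total_bases / at_bats
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Made-up matchup rates, purely for illustration.
line = slash_line({"BB": 0.08, "HBP": 0.01, "1B": 0.15,
                   "2B": 0.045, "3B": 0.005, "HR": 0.03})
```

The inputs here would be the matchup rates predicted by the regression equations, one per outcome.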

One last thing: it is important to note that I made one big assumption in doing this research that isn’t exactly true and may throw the results off a little bit. The regressions I ran were based on results for players over their whole career (or at least the part from 2003 to 2013), which isn’t a great reflection of true talent level. In the long run, I think the results will still hold because there were so many data points, but in using the interactive spreadsheet, your inputs should be whatever you think is the correct reflection of a player’s true talent level (which is why I would suggest using projection systems; I think those are the best determinations of talent), and that will almost certainly not be career numbers.





Jonah is a baseball analyst and Red Sox fan. He would like it if you followed him on Twitter @japemstein, but can't really do anything about it if you don't.

30 Comments
gump
9 years ago

this is awesome. makes me wonder what ootp/bb mogul are using in their engines

Steve
9 years ago

So does this outperform the standard odds ratio/log5 method?

tz
9 years ago

Really cool stuff.

For the inputs, should we use the expected performance based upon the applicable vs. lefty/vs. righty split? It seems that we should, given that the model was calibrated on that basis.

Matt
9 years ago

Interesting. Where did you learn statistics and modeling?

Chad
9 years ago

This was delicious. Very meaty analysis– in all seriousness, great work. I’m interested why you used R rather than Python or perhaps another language

fivetwentyone
9 years ago
Reply to  Jonah Pemstein

scipy , in particular scipy.stats
http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html

also scikit-learn,
http://scikit-learn.org/stable/

I prefer Python and I know it better, but in my experience getting regressions up and running in R is easier than in Python

Joe
9 years ago

R is the standard language for statistical computing especially in research. I don’t see why it’s that weird.

Nate
9 years ago

Very cool stuff, and as a Strat and OOBP gamer, I wondered what the actual combination of pitcher and batter tendencies did (as opposed to the “pick either pitcher card or batter card” implementation of strat, especially).

Nitpicky point:

“It totally blew my mind to think that you can find the probability, with 96% accuracy, that a batted ball will fall for a hit, given the batter’s BABIP and pitcher’s BABIP.”

I am not sure that “96% of the variability in resultant BABIP is explained by variability in the pitcher and batter BABIP according to this model” is the same thing as saying that you can predict the BABIP with 96% accuracy. The latter makes it sound as if you would be exactly correct 96% of the time, when I think the better interpretation is that the measured BABIP will center to the BABIP predicted very tightly (with only 4% of the variability left to “noise” it up). I.e., the error bars would be tiny.

E.g., If I got an r^2 of .5, I wouldn’t think that meant I was going to predict an output variable with 50% accuracy. Apologies if that’s overly pedantic; I think it is an amazing point, and amazing further (given the year to year variability in pitcher BABIP) that the model hews so close, and that there isn’t something else (team BABIP?) that would be required for that robust of a model.

MGL
9 years ago

This is good stuff, but, you will find that one of the terms in the proper regression should be the league average rates for each of the components. You don’t include that and it works with the assumption that the LA rates are fixed for your entire time period, which isn’t true of course.

You will find that if you did a regression for various time periods in which the LA’s were different, you would come up with different coefficients. So it is a little dangerous to use one formula for a player in a certain environment when the regression formula is based on another environment, especially if the rates are quite different in the 2 different environments.

For example, plug in a 2014 league average batter versus a league average pitcher in K rate into your formula. My guess is that it won’t result in a LA result, which it should of course, because your coefficients are based on the average K rate over your entire sample which is quite a bit lower than it is now.

David
9 years ago

Sorry to be a downer but I think there’s a bit of a problem here. The R^2 is so high because the “total” and “batter” columns are mechanically related to one another. The BABIP (or whatever) for the “group” is computed off the same data as the batter BABIP. So, if a plate appearance totally at random results in a hit, this will mechanically link the two together. You point out that you’re using “true talent” by using full career stats and I agree this is an issue, but it’s more than that. You’re using the outcome data to define the quality of the pitcher and batter. When you calculate batter and pitcher BABIP, I believe that leaving out the outcome for the plate appearance in question will fix this.

This is a bigger

(also true for “pitcher”)

David
9 years ago
Reply to  Jonah Pemstein

Is the sample the same for measuring the outcomes and measuring the quality of the batter/pitcher? That is the key. If it is, then there is a problem. You’re guaranteed to have very high R^2.

Here’s a test of the problem. Change the batter and pitcher quality numbers to be the average in that group (rather than the limit of the bucket…0.164739 or whatever it is instead of 0.16). Then run your regression but weight it by the number of observations in each group. You’ll get an R^2 of exactly 1, for all of the statistics…even without controlling for pitcher quality.

Again, I don’t enjoy raining on anybody’s parade but this is a well-known problem that occurs when one runs a regression where the independent variable is a group average of the outcome. Sorry!

David
9 years ago
Reply to  David

Here’s another version. Imagine you just regressed a dummy variable for whether the plate appearance ended with a K on that batter’s K rate. You have to get a coefficient of exactly 1, since regression is computing conditional averages. This is concerning because it’s not measuring any real relationship, just the statistical definition of what regression computes. But the point is that this flawed regression is only slightly different from what you’re doing other than the K rate hasn’t been rounded and the outcome hasn’t been grouped.

Also, I misspoke on one point above…you won’t get an R^2 of 1; you’ll get a coefficient of 1…but it’s still a problem of fitting a variable to a mathematical transformation of the same variable.

In any case, the conclusion is not difficult. Just compute talent levels on different data than the outcome data and the problem goes away.

Derek R-C
9 years ago

You did it. You created baseball engineering

Derek R-C
9 years ago

I noticed 2B% had the lowest R^2 and also did not follow your ln equation. I am curious as to why?

tz
9 years ago
Reply to  Jonah Pemstein

I noticed that 2B% and 3B% also are the only two stats where the coefficient for the pitcher’s rate is higher than the coefficient for the hitter’s rate (1B% is basically a tie). So in addition to the points you mentioned, I’m thinking that 2B% and 3B% might also be more dependent upon the combination of the pitcher’s home park and the outfield defense behind him.

Johnny
9 years ago

Interesting, but to see if your result is any good it would be best to train the model on a different set of data than the one you test on, e.g. train it on data from 2003 to 2010, and then test it on data from 2011 onward (never test on the same data you’re training the model with; you’ll be prone to overfitting the data).

MGL
9 years ago
Reply to  Johnny

Correct.

Kyle
9 years ago

japem, this is awesome stuff. Not being great at math, can you show me an example for 1B%? I am just confused why some stats have ln, and some dont. What would be the input for ln? Thanks for your help!

Kyle
9 years ago

japem, do you still have the spreadsheet to use for this? I see that you created one, but it says I cannot download it. Thanks so much!

Kyle Schutz
9 years ago

For some reason, I am not seeing the spreadsheet. I dont know if its my computer or what. Any possible way you can email me the spreadsheet? I really appreciate the help clearing that up. Its still a bit fuzzy, but that certainly helped. Thanks again.

Email is schutzk21@gmail.com