A player comes up to the plate. He’s a very good hitter; he’s hitting .300 on the year and has 40 home runs. On the mound stands a pitcher, also very good. The pitcher is a Cy Young candidate, and his ERA sits barely over 2.00. He leads the league in strikeouts and issues very few walks.
After a 10-pitch battle, the pitcher is the one to crack and the batter slaps a hanging curveball into the gap for a double. The batter has won. His batting average for the at bat is a very nice 1.000. Same for his OBP. His slugging percentage? 2.000. Fantastic. If he did this every time, he’d be MVP, no question, every year. The pitcher, meanwhile, has a WHIP for the at bat of #DIV/0!. Hasn’t even recorded a single out. His ERA is the same. He’s not doing too great. But let’s be fair. We’ll give him the benefit of the doubt, since we know he’s a good pitcher – we’ll pretend he recorded one out before this happened. Now his WHIP is 3.000. Yeesh – ugly. If he keeps pitching like this, his ERA will climb, too, since double after double after double is sure to drive every previous runner home.
Now, obviously, this is a bit ridiculous. Not every at bat is the same. The hitter won’t double every single at bat, and the pitcher won’t allow a double every time either. Baseball is a game of random variation, skill, luck, quality of opponents and teammates, and a whole bunch of other elements. In our scenario, all those elements came together to result in a two-bagger. But, like we said, you can’t expect that to happen every single time just because it happens once.
So how do we predict what will happen in an at bat? Anyone well versed in baseball research knows that past performance against a specific batter or pitcher says little about how the next at bat will turn out, at least until you have a meaningful number of plate appearances, and even then it’s not the best tool.
Of course, if we knew the result of every at bat before it happened, it would take most of the fun out of watching. But we’re never going to be able to do that, and so we might as well try to predict as best we can. And so I have come up with a methodology for doing so that I think is very accurate and reliable, and this post is meant to present it to you.
To claim full credit for the inspiration behind this idea would be wrong; FanGraphs author and baseball-statistics aficionado Steve Staude wrote an article back in June 2013 aiming to predict the probability of a strikeout given both the batter’s and the pitcher’s strikeout rates, which led me to this topic. In that article he found a very consistent (and good) model that predicted strikeouts:
Expected Matchup K% = (B × P) / (0.84 × B × P + 0.16)
Where B = the batter’s historical K% against the handedness of the pitcher; and P = the pitcher’s historical K% against the handedness of the batter
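To make the formula concrete, here is a minimal Python sketch of it. The function name and the example inputs are my own placeholders, not anything from Steve's article; the rates are decimals out of 1.

```python
def expected_matchup_k(b: float, p: float) -> float:
    """Expected matchup K% from the formula quoted above.

    b: batter's K% vs. the pitcher's handedness (decimal, e.g. 0.20)
    p: pitcher's K% vs. the batter's handedness (decimal, e.g. 0.25)
    """
    return (b * p) / (0.84 * b * p + 0.16)


# Hypothetical matchup: a 20% strikeout batter facing a 25% strikeout pitcher.
print(round(expected_matchup_k(0.20, 0.25), 4))
```

Note that the formula is symmetric in B and P, so swapping the batter's and pitcher's rates gives the same expected K%.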
He then followed that up with another article that provided an interactive tool that you could play around with to get the expected K% for a matchup of your choosing and introduced a few new formulas (mostly suggested in the comments of his first article) to provide different perspectives. It’s all very interesting stuff.
But all that gets us is K%. Which, you know, is great, and strikeouts are probably one of the most important and indicative raw numbers to know for a matchup. But that doesn’t tell us about any other stats. So as a means of following up on what he’s done (something he mentioned in the article but I have not seen any evidence of) and also as a way to find the probability of each outcome for every type of matchup (a daunting task), I did my own research.
My methodology was very similar. I took all players and plate appearances from 2003-2013 (Steve’s dataset was 2002-2012; also, I got all of the data from retrosheet.org via baseballheatmaps.com, both truly indispensable resources) and for each player found their K%, BB%, 1B%, 2B%, 3B%, HR%, HBP%, and BABIP during that time. This means that a player like, say, Derek Jeter will only have his 2003-2013 stats included, not any from before 2003. I further refined that by separating each player’s numbers into vs. righty and vs. lefty numbers (Steve, in another article, showed that handedness matchups were important). I did this for both batters and pitchers. Then, for each statistic, I grouped plate appearances by the batter’s rate and the pitcher’s rate, and for each pairing of the two found the percentage of plate appearances that ended in the result in question. That’s kind of a mouthful, so let me provide an example:
These are my results for strikeout percentage (numbers here are expressed as decimals out of 1, not percentages out of 100). Total means the total proportion of plate appearances with those parameters that ended in a strikeout, while batter and pitcher mean the K% of the batter and pitcher, respectively. Count(*) measures exactly how many instances of that exact matchup there were in my data. Another important point to note – this is by no means all of the combinations that exist; in fact, for strikeouts, there were over 2,000, far more than the 20 shown here. I did have to remove many of those since there were too few observations to make meaningful assumptions…
…but I was still left with a good amount of data to work with (strikeout percentage gave me just over 400 groupings, which was plenty for my task). I went through this process for each of the rate stats that I laid out above.
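The grouping step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the rounding precision used to form groups, and the minimum-count cutoff are all hypothetical, since the exact bin width and threshold aren't stated here.

```python
from collections import defaultdict


def group_matchups(plate_appearances, min_count=100, precision=2):
    """Group plate appearances by (batter rate, pitcher rate).

    plate_appearances: iterable of (batter_rate, pitcher_rate, outcome),
    where outcome is 1 if the PA ended in the event (e.g. a strikeout), else 0.
    Returns {(batter_rate, pitcher_rate): (observed_rate, count)} for groups
    with at least min_count plate appearances.
    precision and min_count are assumed values, chosen for illustration.
    """
    counts = defaultdict(int)
    events = defaultdict(int)
    for bat, pit, outcome in plate_appearances:
        key = (round(bat, precision), round(pit, precision))
        counts[key] += 1
        events[key] += outcome
    return {
        key: (events[key] / n, n)
        for key, n in counts.items()
        if n >= min_count  # drop groupings with too few observations
    }


# Toy data, not real plate appearances.
toy = [
    (0.201, 0.249, 1),
    (0.199, 0.251, 0),
    (0.204, 0.250, 0),
    (0.204, 0.250, 1),
    (0.300, 0.100, 1),  # a lone matchup that the cutoff removes
]
print(group_matchups(toy, min_count=2))
```

Each surviving key corresponds to one row of the kind described above: batter rate, pitcher rate, observed total rate, and the count.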
My next step was to come up with a model that fit these data – in other words, estimate the total K% from the batter and pitcher K%. I did this by running a multiple regression in R, but I encountered some problems with the linearity of the data. For example, here are the results of my regression for BB% plotted against the real values for BB%:
It looks pretty good – and the r^2 of the regression line was .9653, which is excellent – but it appears to be a little bit curved. To counter that I ran a regression with the dependent variable being the natural logarithm of the total BB%, and the independent variables being the natural logarithms of the batter’s and pitcher’s BB%. After running the regression, here is what I got:
The scatterplot is much more linear, and the r^2 increased to .988. This means that ln(total) = ln(bat)*coefficient + ln(pit)*coefficient + intercept, where bat and pit each get their own coefficient. So if we exponentiate both sides (raise e to the power of each side), we get total = e^(ln(bat)*coefficient + ln(pit)*coefficient + intercept). This formula, obviously with different coefficients and intercepts, fits each of K%, BB%, 1B%, 3B%, HR%, and HBP% remarkably well; for some reason, both 2B% and BABIP did not need to be “linearized” like this and were fitted better by a simple regression without any logarithm doctoring.
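For readers who want to see the log-log fit mechanically, here is a rough Python sketch using only the standard library. I ran my regressions in R; this version is an unweighted ordinary least squares fit via the normal equations, and the function names and the synthetic, noise-free data are assumptions made purely for illustration.

```python
import math


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_log_log(bat, pit, total):
    """Fit ln(total) = a*ln(bat) + b*ln(pit) + c and return (a, b, c)."""
    rows = [(math.log(x), math.log(y), 1.0) for x, y in zip(bat, pit)]
    target = [math.log(t) for t in total]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(3)]
    return tuple(solve_3x3(xtx, xty))


def predict_log_log(bat, pit, a, b, c):
    """Back-transform: total = e^(a*ln(bat) + b*ln(pit) + c)."""
    return math.exp(a * math.log(bat) + b * math.log(pit) + c)


# Synthetic check: build noise-free totals from known coefficients, then refit.
grid = [0.10, 0.15, 0.20, 0.25, 0.30]
bat = [x for x in grid for _ in grid]
pit = [y for _ in grid for y in grid]
total = [predict_log_log(x, y, 0.9427, 0.9254, 1.5268) for x, y in zip(bat, pit)]
a, b, c = fit_log_log(bat, pit, total)
print(round(a, 4), round(b, 4), round(c, 4))
```

Because the toy totals are generated without noise, the fit should recover the coefficients it was built from almost exactly; real grouped data would not be this clean.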
Here are the regression equations, along with the r^2, for each of the stats:
|Stat|Regression equation|r^2|
|---|---|---|
|K%|e^(.9427*ln(bat) + .9254*ln(pit) + 1.5268)|0.9887|
|BB%|e^(.906*ln(bat) + .8644*ln(pit) + 1.9975)|0.9880|
|1B%|e^(1.01*ln(bat) + 1.017*ln(pit) + 1.943)|0.9312|
|2B%|.9206*bat + .95779*pit - .03968|0.7315|
|3B%|e^(.8435*ln(bat) + .8698*ln(pit) + 3.8809)|0.7739|
|HR%|e^(.9576*ln(bat) + .9268*ln(pit) + 3.2129)|0.8474|
|HBP%|e^(.8761*ln(bat) + .7623*ln(pit) + 2.995)|0.8963|
|BABIP|1.0403*bat + .9135*pit - .2573|0.9655|
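If you'd rather plug numbers into these equations in code, the following Python sketch encodes them directly. The coefficients are copied from the equations above; the dictionary layout, the function name, and the example inputs are hypothetical choices of mine.

```python
import math

# (model kind, batter coefficient, pitcher coefficient, intercept),
# copied from the regression equations listed above.
MODELS = {
    "K%":    ("log",    0.9427, 0.9254,  1.5268),
    "BB%":   ("log",    0.906,  0.8644,  1.9975),
    "1B%":   ("log",    1.01,   1.017,   1.943),
    "2B%":   ("linear", 0.9206, 0.95779, -0.03968),
    "3B%":   ("log",    0.8435, 0.8698,  3.8809),
    "HR%":   ("log",    0.9576, 0.9268,  3.2129),
    "HBP%":  ("log",    0.8761, 0.7623,  2.995),
    "BABIP": ("linear", 1.0403, 0.9135,  -0.2573),
}


def matchup_rate(stat, bat, pit):
    """Expected matchup rate for one stat, given batter and pitcher rates.

    bat and pit are decimals out of 1 (0.20 means 20%).
    """
    kind, a, b, c = MODELS[stat]
    if kind == "log":
        return math.exp(a * math.log(bat) + b * math.log(pit) + c)
    return a * bat + b * pit + c


# Hypothetical inputs, for illustration only.
print(matchup_rate("K%", 0.20, 0.25))
print(matchup_rate("2B%", 0.05, 0.04))
```

One caveat with the logarithmic models: they are undefined when either input rate is exactly 0, so a player with no triples or hit-by-pitches on record needs a small nonzero estimate instead.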
The first thing that should jump out at you (or at least one of the first) is the extremely high r^2 for BABIP. It totally blew my mind that the batter’s BABIP and the pitcher’s BABIP alone explain about 97% of the variation in how often a batted ball fell for a hit across these matchup groupings.
Another immediate observation: K%, BB%, and HBP% generally have higher r^2 values than 1B%, 2B%, 3B%, and HR%. This is likely due to the increased luck and randomness that a batted ball is subjected to; for example, a triple needs two things to happen (the ball being put in play and then falling in an area where the batter will get exactly three bases), whereas a strikeout needs only one: the batter needs to strike out. I was very satisfied with these results, since the r^2 values were higher overall than I expected.
Now comes the good part – putting it all together. We have all the inputs we need to calculate many commonly-used batting stats: AVG, OBP, SLG, OPS, and wOBA. So once we input the batter and pitcher numbers, we should be able to calculate those stats with high accuracy. I developed a tool to do just that:
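As a rough idea of how the per-plate-appearance rates turn into a slash line, here is a small Python sketch. It is not the tool itself and makes simplifying assumptions that I want to flag loudly: every rate is per plate appearance, and sacrifice flies, sacrifice bunts, and other rare outcomes are ignored, so at bats are just plate appearances minus walks and hit-by-pitches. The function name and the sample rates are made up. wOBA is left out because its weights change from season to season.

```python
def slash_line(rates):
    """Approximate AVG, OBP, SLG, and OPS from per-PA outcome rates.

    rates: dict with keys "BB%", "HBP%", "1B%", "2B%", "3B%", "HR%",
    each a decimal share of plate appearances.
    Assumption: sacrifices and other rare events are ignored, so
    AB per PA = 1 - BB% - HBP%.
    """
    hits = rates["1B%"] + rates["2B%"] + rates["3B%"] + rates["HR%"]
    total_bases = (
        rates["1B%"] + 2 * rates["2B%"] + 3 * rates["3B%"] + 4 * rates["HR%"]
    )
    at_bats = 1.0 - rates["BB%"] - rates["HBP%"]
    avg = hits / at_bats
    obp = hits + rates["BB%"] + rates["HBP%"]  # per PA, so no division needed
    slg = total_bases / at_bats
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Made-up matchup rates, for illustration only.
example = {"BB%": 0.08, "HBP%": 0.01, "1B%": 0.15,
           "2B%": 0.045, "3B%": 0.005, "HR%": 0.03}
for name, value in slash_line(example).items():
    print(name, round(value, 3))
```

K% and BABIP aren't needed for these four stats when the hit rates are supplied directly, which is why they don't appear in the sketch.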
For a full explanation of the tool and how to use it, head over to my (new and shiny!) blog. I encourage you to go play around with this to see the different results.
One last thing: I made one big assumption in doing this research that isn’t exactly true and may throw the results off a little bit. The regressions I ran were based on players’ results over their whole careers (or at least the part between 2003 and 2013), which isn’t a great reflection of true talent level at any given time. In the long run, I think the results will still hold because there were so many data points. But when you use the interactive spreadsheet, your inputs should be whatever you think best reflects a player’s true talent level, and that will almost certainly not be career numbers. This is why I would suggest using projection systems; I think those are the best estimates of talent.
Jonah is a baseball analyst and Red Sox fan. He would like it if you followed him on Twitter @japemstein, but can't really do anything about it if you don't.