There are 12 “states” of the count in baseball: 0-0, 0-1, 0-2, 1-0, 1-1, 1-2, 2-0, 2-1, 2-2, 3-0, 3-1, 3-2. In addition, there are 3 “states” in which a plate appearance can end: strikeout, walk, and ball in play. Because each pitch moves the plate appearance from one of these states to another, MLB plate appearances lend themselves wonderfully to analysis with Markov chains.
Every pitch thrown in MLB can be classified as a swinging strike, called strike, ball, foul, or ball in play. Each of these classifications has a defined effect in each count. For example, a swinging strike in an 0-1 count leads to an 0-2 count, and a foul in a 2-2 count leads to another 2-2 count.
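The rules in the paragraph above can be sketched as a small lookup function. This is only an illustration: the outcome labels and the `next_state` name are made up for this example, and it ignores edge cases such as a foul bunt with two strikes.

```python
def next_state(count, outcome):
    """Return the state that follows `outcome` when thrown in `count`.

    `count` is a (balls, strikes) tuple. Finished plate appearances are
    the strings "K" (strikeout), "BB" (walk), and "BIP" (ball in play).
    """
    balls, strikes = count
    if outcome == "in_play":
        return "BIP"
    if outcome == "ball":
        return "BB" if balls == 3 else (balls + 1, strikes)
    if outcome in ("called_strike", "swinging_strike"):
        return "K" if strikes == 2 else (balls, strikes + 1)
    if outcome == "foul":
        # A foul adds a strike unless the batter already has two.
        return (balls, strikes) if strikes == 2 else (balls, strikes + 1)
    raise ValueError(f"unknown outcome: {outcome!r}")


print(next_state((0, 1), "swinging_strike"))  # (0, 2)
print(next_state((2, 2), "foul"))             # (2, 2)
```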
Using PITCHf/x plate discipline statistics and a little algebra, it is possible to calculate the chance of each of these occurrences on any given pitch. Called strikes, swinging strikes, and balls are easy enough to calculate, but it gets tricky with fouls and balls in play. Both have the same requirements: the batter must swing and must make contact. To separate fouls from balls in play, then, we need to find how many pitches a pitcher allowed to be contacted, and then subtract the number of pitches that were put into play. That second number is easily found, since every batter faced by a pitcher either strikes out, walks, is hit by a pitch, or puts the ball in play, so balls in play are simply TBF minus K, BB, and HBP.
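One way the algebra might look is sketched below. It assumes every taken pitch in the zone is called a strike and every taken pitch outside it is a ball, which is a simplification. The function names, the `fair_share` parameter, and the sample numbers are hypothetical and are not taken from any player's stat line; the total pitch count is also an assumed input.

```python
def per_pitch_probs(zone, z_swing, o_swing, z_contact, o_contact, fair_share):
    """Split one pitch into five outcome probabilities.

    All inputs are fractions between 0 and 1. `fair_share` is the share of
    contacted pitches that are put in play rather than fouled off.
    """
    swings = zone * z_swing + (1 - zone) * o_swing
    contact = zone * z_swing * z_contact + (1 - zone) * o_swing * o_contact
    return {
        "called_strike": zone * (1 - z_swing),   # in the zone, no swing
        "ball": (1 - zone) * (1 - o_swing),      # out of the zone, no swing
        "swinging_strike": swings - contact,     # swing without contact
        "foul": contact * (1 - fair_share),
        "in_play": contact * fair_share,
    }


def fair_share_from_totals(tbf, k, bb, hbp, pitches, contact_per_pitch):
    """Balls in play divided by contacted pitches."""
    balls_in_play = tbf - k - bb - hbp
    contacted = pitches * contact_per_pitch
    return balls_in_play / contacted


# Hypothetical inputs, chosen only to show the arithmetic.
probs = per_pitch_probs(0.50, 0.65, 0.30, 0.85, 0.60, fair_share=0.45)
for outcome, value in probs.items():
    print(f"{outcome:16s}{value:.3f}")
```

The five probabilities always sum to 1, which is a quick sanity check on the inputs.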
Unfortunately for the Markov process, major league players do not act randomly. In different counts, pitchers are more or less likely to throw the ball in the zone, and hitters are more or less likely to swing. This must be accounted for or the simulation will bear only a passing resemblance to the game actually played on the field. Using BaseballSavant, I found the rate at which pitchers throw in and out of the zone in every count, and then created an index stat like wRC+, where 100 is average and 110 is 10% above average. For example, 3-0 counts have a Zone index of 129, and 0-2 counts have a Zone index of just 62. I did the same thing for Z-swing% and O-swing%. One caveat is that the Zone% numbers I got on BaseballSavant do not match those found in the PITCHf/x plate discipline stats. However, since these index stats are all RELATIVE to league average, it should not make a difference.
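A minimal sketch of the index idea follows. The two Zone indexes are the ones quoted above; the player's overall Zone% is hypothetical, and scaling a player's overall rate by the index (capped between 0 and 1) is an assumption about how the index would be applied, not a confirmed detail of the method.

```python
def count_index(rate_in_count, overall_rate):
    """Index stat: 100 is league average, 110 is 10% above average."""
    return 100 * rate_in_count / overall_rate


def apply_index(player_rate, index):
    """Scale a player's overall rate by a count's index, capped to [0, 1]."""
    return min(1.0, max(0.0, player_rate * index / 100))


# Zone indexes for two counts, as quoted in the text.
ZONE_INDEX = {(3, 0): 129, (0, 2): 62}

player_zone = 0.50  # hypothetical overall Zone% for some pitcher
for count, index in ZONE_INDEX.items():
    print(count, round(apply_index(player_zone, index), 3))
```

With these inputs, a pitcher who throws half his pitches in the zone overall would be estimated at about 64.5% in 3-0 counts and 31% in 0-2 counts.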
Once we have all this data for a pitcher, we can use a Markov chain to essentially simulate an infinite number of plate appearances for him. Every plate appearance starts at 0-0. By knowing the chances of all the per-pitch results, we can estimate how many 1-0 and 0-1 counts the pitcher would get into, and how many times the pitch would be put into play. From 1-0, we can estimate how many counts become 2-0 or 1-1 or balls in play, and from 0-1, we can estimate how many become 0-2 or 1-1 or balls in play. Simulating in this way, every plate appearance will eventually lead to a strikeout, walk, or ball in play.
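The count-by-count bookkeeping described above can be sketched as follows. To keep the example short it uses one set of per-pitch probabilities for every count, whereas the method in this article adjusts them by count. The `simulate_pa` name and the sample probabilities are hypothetical.

```python
def simulate_pa(p):
    """Return P(K), P(BB), P(BIP) for a plate appearance starting at 0-0.

    `p` maps "called_strike", "swinging_strike", "ball", "foul", and
    "in_play" to per-pitch probabilities that sum to 1.
    """
    strike = p["called_strike"] + p["swinging_strike"]
    reach = {(b, s): 0.0 for b in range(4) for s in range(3)}
    reach[(0, 0)] = 1.0  # every plate appearance starts at 0-0
    done = {"K": 0.0, "BB": 0.0, "BIP": 0.0}

    # Visit counts in order of balls + strikes, so each count is processed
    # only after every count that can lead to it.
    for total in range(6):
        for b in range(4):
            s = total - b
            if not 0 <= s <= 2:
                continue
            visits = reach[(b, s)]
            if s == 2:
                # Two-strike fouls return to the same count; dividing by
                # (1 - foul) sums that geometric series of repeat visits.
                visits /= 1 - p["foul"]
            done["BIP"] += visits * p["in_play"]
            if b == 3:
                done["BB"] += visits * p["ball"]
            else:
                reach[(b + 1, s)] += visits * p["ball"]
            if s == 2:
                done["K"] += visits * strike
            else:
                # Before two strikes, a foul counts the same as a strike.
                reach[(b, s + 1)] += visits * (strike + p["foul"])
    return done


# Hypothetical per-pitch probabilities (they sum to 1).
sample = {"called_strike": 0.17, "swinging_strike": 0.10,
          "ball": 0.36, "foul": 0.18, "in_play": 0.19}
result = simulate_pa(sample)
print({end: round(prob, 3) for end, prob in result.items()})
```

The three results sum to 1, reflecting that every plate appearance eventually ends in one of the three final states.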
For every pitcher who qualified for the ERA title in 2014, I imported his Zone%, Z-swing%, O-swing%, Z-contact%, O-contact%, TBF, K, BB, and HBP (the last 4 only to calculate fair/foul%). Using these, I created a transition matrix for each pitcher that shows the probabilities of moving to any state of the count from any other given count. For example, here is Clayton Kershaw’s 2014 transition matrix.
The left column represents the count before a given pitch is thrown. The top row represents the count after that pitch has been thrown. The intersection of any column and row is the chance of that particular transition occurring. So, for 2014 Kershaw, there was a 54.6% chance that he would get ahead of a batter 0-1, a 34.4% chance he would fall behind 1-0, and an 11% chance the batter would put the first pitch into play. Since the transition matrix shows the probabilities associated with throwing one pitch, raising the matrix to the second power simulates throwing 2 pitches. Similarly, finding the limit of the matrix simulates throwing an infinite number of pitches, after which a plate appearance is certain to be over. This is why the limit of Kershaw’s matrix (shown below) only has non-zero probabilities in the last 3 columns; after an infinite number of pitches, a plate appearance will have finally reached a conclusion of a strikeout, walk, or ball in play.
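The matrix arithmetic can be sketched in plain Python. This is an illustration rather than a reconstruction of Kershaw's matrix: the per-pitch probabilities are hypothetical and held constant across counts, and the limit is approximated by squaring the matrix seven times (128 pitches), after which the leftover probability in unfinished counts is negligible.

```python
# 12 counts in the order listed in the article, then the 3 final states.
STATES = [(b, s) for b in range(4) for s in range(3)] + ["K", "BB", "BIP"]
INDEX = {state: i for i, state in enumerate(STATES)}


def build_matrix(p):
    """Rows are the count before a pitch, columns the state after it."""
    n = len(STATES)
    m = [[0.0] * n for _ in range(n)]
    strike = p["called_strike"] + p["swinging_strike"]
    for b in range(4):
        for s in range(3):
            row = m[INDEX[(b, s)]]
            row[INDEX["BIP"]] += p["in_play"]
            row[INDEX["BB" if b == 3 else (b + 1, s)]] += p["ball"]
            row[INDEX["K" if s == 2 else (b, s + 1)]] += strike
            # A two-strike foul leaves the count unchanged.
            row[INDEX[(b, s) if s == 2 else (b, s + 1)]] += p["foul"]
    for end in ("K", "BB", "BIP"):
        m[INDEX[end]][INDEX[end]] = 1.0  # a finished PA stays finished
    return m


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def approx_limit(m, squarings=7):
    """Square the matrix repeatedly: 2**squarings pitches in total."""
    for _ in range(squarings):
        m = matmul(m, m)
    return m


# Hypothetical per-pitch probabilities (they sum to 1).
p = {"called_strike": 0.17, "swinging_strike": 0.10,
     "ball": 0.36, "foul": 0.18, "in_play": 0.19}
T = build_matrix(p)
two_pitches = matmul(T, T)   # the matrix raised to the second power
final = approx_limit(T)
top = final[INDEX[(0, 0)]]   # the 0-0 row
print({end: round(top[INDEX[end]], 3) for end in ("K", "BB", "BIP")})
```

In the result, the first 12 columns of every row are effectively zero, and the last 3 columns of the 0-0 row hold the estimated strikeout, walk, and ball-in-play rates.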
Now, to predict Kershaw’s K% and BB%, we need only look at the top row, since all plate appearances begin with an 0-0 count. Starting from 0-0, we estimate Kershaw has a 28.5% chance to strike out any given batter and a 4.1% chance to walk him. Kershaw in 2014 actually had a 31.9% strikeout rate and a 4.1% walk rate.
This method produces a strong r-squared of .86 when plotting expected K% (xK%) against actual K%. Unfortunately, r-squared drops to .54 when plotting expected BB% (xBB%) against actual BB%.
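For readers who want to run the same check on their own numbers, r-squared can be computed as the squared Pearson correlation between expected and actual rates. The article does not say how its figure was computed, so this is an assumption, and the values below are toy numbers, not player data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Toy values only: think of xs as expected rates and ys as actual rates.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 4.0]
print(round(r_squared(xs, ys), 2))  # 0.64
```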
I then imported the same statistics for batters, because there really is no reason why this method should not work equally well for both pitchers and hitters. It actually seems to work better as a whole on batters, with an r-squared of .81 for batters’ strikeouts and .77 for batters’ walks.
If there are any players in particular you’re interested in, I have included the full list of all qualified pitchers and position players, with both their expected and actual strikeout and walk rates.
One advantage of this method over the many regression-based estimates that use plate discipline stats is that it can be further tailored to each player. The reason is that ZONE+, ZSWING+, and OSWING+ are all league-average indexes, and some players’ talents are just not captured by league averages. For example, Dustin Pedroia’s expected strikeout rate is nowhere near his actual strikeout rate. Presumably, Pedroia has swing tendencies in certain counts that are markedly different from the average hitter’s. By examining these swing tendencies, it is likely possible to predict Pedroia’s yearly strikeout rates with much greater accuracy, as those tendencies are probably part of his approach at the plate year after year. Still, as preliminary research into this area, I think these results as a whole are very promising.