Author: Captain Tenneal

Using Markov Chains to Predict K% and BB%

May 22, 2015

There are 12 “states” of the count in baseball: 0-0, 0-1, 0-2, 1-0, 1-1, 1-2, 2-0, 2-1, 2-2, 3-0, 3-1, 3-2. In addition, there are 3 “states” in which a plate appearance can end: strikeout, walk, and ball in play. Because the state after each pitch depends only on the current count and the result of that pitch, MLB plate appearances lend themselves wonderfully to analysis with Markov chains.

Every pitch thrown in MLB can be classified as a swinging strike, called strike, ball, foul, or ball in play. Each of these classifications has a defined effect in each count. For example, a swinging strike in an 0-1 count leads to an 0-2 count, and a foul in a 2-2 count leads to another 2-2 count.
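As a rough illustration of those rules, here is a minimal sketch of the count transitions in Python. The function name and the outcome labels are my own, not anything taken from PITCHf/x, and edge cases such as hit batsmen or two-strike foul bunts are ignored.

```python
def next_state(balls, strikes, outcome):
    """Return the state after one pitch: a (balls, strikes) tuple, or 'K', 'BB', 'IP'."""
    if outcome == "in_play":
        return "IP"
    if outcome == "ball":
        # Ball four ends the plate appearance with a walk.
        return "BB" if balls == 3 else (balls + 1, strikes)
    if outcome in ("called_strike", "swinging_strike"):
        # Strike three ends the plate appearance with a strikeout.
        return "K" if strikes == 2 else (balls, strikes + 1)
    if outcome == "foul":
        # A foul adds a strike unless the batter already has two.
        return (balls, strikes) if strikes == 2 else (balls, strikes + 1)
    raise ValueError(f"unknown outcome: {outcome}")


# The two examples from the paragraph above.
print(next_state(0, 1, "swinging_strike"))  # 0-1 becomes 0-2
print(next_state(2, 2, "foul"))             # 2-2 stays 2-2
```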

Using PITCHf/x plate discipline statistics and a little algebra, it is possible to calculate the chance of each of these occurrences on any given pitch. Called strikes, swinging strikes, and balls are easy enough to calculate, but it gets tricky with fouls and balls in play. Both have the same requirements: the batter must swing and must make contact. To separate fouls from balls in play, then, we need to find how many pitches a pitcher allowed to be contacted, and then subtract the number of pitches that were put into play; what remains are the fouls. The number of balls in play is easily found, since every batter faced by a pitcher either strikes out, reaches on a walk or hit-by-pitch, or puts the ball in play.
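The algebra is not spelled out above, so the following Python sketch is only one plausible reading of it. The formulas follow from the usual definitions of Zone%, Z-swing%, O-swing%, Z-contact%, and O-contact%; the function names, the `contacted_pitches` input, and the sample numbers are assumptions made for illustration.

```python
def fair_share(tbf, k, bb, hbp, contacted_pitches):
    """Fraction of contacted pitches that were put in play (assumed definition)."""
    balls_in_play = tbf - k - bb - hbp
    return balls_in_play / contacted_pitches


def pitch_probabilities(zone, z_swing, o_swing, z_contact, o_contact, fair):
    """Split one pitch into five outcomes; all inputs are rates between 0 and 1."""
    out_zone = 1.0 - zone
    called_strike = zone * (1.0 - z_swing)   # in the zone, no swing
    ball = out_zone * (1.0 - o_swing)        # out of the zone, no swing
    swinging_strike = (zone * z_swing * (1.0 - z_contact)
                       + out_zone * o_swing * (1.0 - o_contact))
    contact = zone * z_swing * z_contact + out_zone * o_swing * o_contact
    return {
        "called_strike": called_strike,
        "swinging_strike": swinging_strike,
        "ball": ball,
        "foul": contact * (1.0 - fair),
        "in_play": contact * fair,
    }


# Hypothetical inputs, not any real pitcher's line.
fair = fair_share(tbf=800, k=200, bb=40, hbp=5, contacted_pitches=1500)
probs = pitch_probabilities(0.45, 0.65, 0.30, 0.87, 0.65, fair)
for outcome, p in probs.items():
    print(f"{outcome}: {p:.3f}")
```

The five probabilities sum to 1 by construction, which is a quick sanity check on whatever inputs are used.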

Unfortunately for the Markov process, major league players do not act randomly. In different counts, pitchers are more or less likely to throw the ball in the zone, and hitters are more or less likely to swing. This must be accounted for or the simulation will bear only a passing resemblance to the game actually played on the field. Using BaseballSavant, I found the rate at which pitchers throw in and out of the zone on every count, and then created an index stat like wRC+, where 100 is average and 110 is 10% more than average. For example, 3-0 counts have a Zone index of 129, and 0-2 counts have a Zone index of just 62. I did the same thing for Z-swing% and O-swing%. One caveat is that the Zone% numbers I got on BaseballSavant do not match those found in the PITCHf/x plate discipline stats. However, since these index stats are all RELATIVE to league average, it should not make a difference.

Count	ZONE+	ZSWING+	OSWING+
0-0	110	61	53
0-1	88	112	98
0-2	62	131	117
1-0	113	91	82
1-1	99	119	115
1-2	75	134	135
2-0	121	91	80
2-1	115	123	120
2-2	95	137	152
3-0	129	18	19
3-1	128	114	106
3-2	122	139	169
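To show how index values like these might be used, here is a small Python sketch that scales a player's overall rates by the index for a given count. Multiplying by index/100 and capping the result at 1 is my assumption about how the adjustment is applied; the sample rates are hypothetical.

```python
# Index values copied from the table above: (ZONE+, ZSWING+, OSWING+).
COUNT_INDEX = {
    "0-0": (110, 61, 53),
    "0-1": (88, 112, 98),
    "0-2": (62, 131, 117),
    "1-0": (113, 91, 82),
    "1-1": (99, 119, 115),
    "1-2": (75, 134, 135),
    "2-0": (121, 91, 80),
    "2-1": (115, 123, 120),
    "2-2": (95, 137, 152),
    "3-0": (129, 18, 19),
    "3-1": (128, 114, 106),
    "3-2": (122, 139, 169),
}


def scale(rate, index):
    """Scale an overall rate by an index where 100 is average; cap at 1.0."""
    return min(1.0, rate * index / 100.0)


def rates_in_count(count, zone, z_swing, o_swing):
    """Return (Zone%, Z-swing%, O-swing%) adjusted for the given count."""
    zone_plus, z_swing_plus, o_swing_plus = COUNT_INDEX[count]
    return (scale(zone, zone_plus),
            scale(z_swing, z_swing_plus),
            scale(o_swing, o_swing_plus))


# Hypothetical overall rates for one pitcher, adjusted for a 3-0 count.
print(rates_in_count("3-0", zone=0.45, z_swing=0.65, o_swing=0.30))
```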

Once we have all this data for a pitcher, we can use a Markov chain to essentially simulate an infinite number of plate appearances for him. Every plate appearance starts at 0-0. By knowing the chances of all the per-pitch results, we can estimate how many 1-0 and 0-1 counts the pitcher would get into, and how many times the pitch would be put into play. From 1-0, we can estimate how many counts become 2-0 or 1-1 or balls in play, and from 0-1, we can estimate how many become 0-2 or 1-1 or balls in play. Simulating in this way, every plate appearance will eventually lead to a strikeout, walk, or ball in play.
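The pitch-by-pitch bookkeeping described above can be sketched in Python as follows. For brevity this toy version uses one hypothetical set of per-pitch probabilities in every count, which is a simplification; the actual method adjusts them count by count.

```python
# Hypothetical per-pitch probabilities (they sum to 1); not a real pitcher.
PITCH = {"called_strike": 0.17, "swinging_strike": 0.10,
         "ball": 0.36, "foul": 0.18, "in_play": 0.19}


def step(dist, p):
    """Advance a probability distribution over states by one pitch."""
    nxt = {}

    def add(state, mass):
        nxt[state] = nxt.get(state, 0.0) + mass

    for state, mass in dist.items():
        if state in ("K", "BB", "IP"):
            add(state, mass)  # finished plate appearances stay finished
            continue
        balls, strikes = state
        strike = p["called_strike"] + p["swinging_strike"]
        add("IP", mass * p["in_play"])
        add("BB" if balls == 3 else (balls + 1, strikes), mass * p["ball"])
        add("K" if strikes == 2 else (balls, strikes + 1), mass * strike)
        # A foul with two strikes leaves the count unchanged.
        add((balls, strikes) if strikes == 2 else (balls, strikes + 1),
            mass * p["foul"])
    return nxt


dist = {(0, 0): 1.0}  # every plate appearance starts at 0-0
for _ in range(50):
    dist = step(dist, PITCH)
finished = dist["K"] + dist["BB"] + dist["IP"]
print(f"K={dist['K']:.3f} BB={dist['BB']:.3f} IP={dist['IP']:.3f} "
      f"finished={finished:.6f}")
```

After enough pitches essentially all of the probability sits in the three end states, which is the behavior the paragraph describes.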

For every pitcher who qualified for the ERA title in 2014, I imported his Zone%, Z-swing%, O-swing%, Z-contact%, O-contact%, TBF, K, BB, and HBP (the last 4 only to calculate fair/foul%). Using these, I created a transition matrix for each pitcher that shows the probabilities of moving to any state of the count from any other given count. For example, here is Clayton Kershaw’s 2014 transition matrix.

From	0-0	0-1	0-2	1-0	1-1	1-2	2-0	2-1	2-2	3-0	3-1	3-2	K	BB	IP
0-0	0	0.546	0	0.344	0	0	0	0	0	0	0	0	0	0	0.110
0-1	0	0	0.471	0	0.350	0	0	0	0	0	0	0	0	0	0.180
0-2	0	0	0.207	0	0	0.395	0	0	0	0	0	0	0.221	0	0.177
1-0	0	0	0	0	0.542	0	0.290	0	0	0	0	0	0	0	0.168
1-1	0	0	0	0	0	0.509	0	0.283	0	0	0	0	0	0	0.208
1-2	0	0	0	0	0	0.240	0	0	0.317	0	0	0	0.238	0	0.204
2-0	0	0	0	0	0	0	0	0.564	0	0.260	0	0	0	0	0.175
2-1	0	0	0	0	0	0	0	0	0.541	0	0.225	0	0	0	0.234
2-2	0	0	0	0	0	0	0	0	0.283	0	0	0.231	0.246	0	0.241
3-0	0	0	0	0	0	0	0	0	0	0	0.664	0	0	0.298	0.038
3-1	0	0	0	0	0	0	0	0	0	0	0	0.567	0	0.203	0.229
3-2	0	0	0	0	0	0	0	0	0	0	0	0.332	0.242	0.144	0.282
K	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
BB	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
IP	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1

The left column represents the count before a given pitch is thrown. The top row represents the count after that pitch has been thrown. The last three columns are the end states: strikeout (K), walk (BB), and ball in play (IP, not innings pitched). The intersection of any column and row is the chance of that particular transition occurring. So, for 2014 Kershaw, there was a 54.6% chance that he would get ahead of a batter 0-1, a 34.4% chance he would fall behind 1-0, and an 11% chance the batter would put the first pitch into play. Since the transition matrix shows the probabilities associated with throwing one pitch, raising the matrix to the second power simulates throwing 2 pitches. Similarly, finding the limit of the matrix simulates throwing an infinite number of pitches, after which a plate appearance is certain to be over. This is why the limit of Kershaw’s matrix (shown below) only has non-zero probabilities in the last 3 columns; after an infinite number of pitches, a plate appearance will have finally reached a conclusion of a strikeout, walk, or ball in play.

From	0-0	0-1	0-2	1-0	1-1	1-2	2-0	2-1	2-2	3-0	3-1	3-2	K	BB	IP
0-0	0	0	0	0	0	0	0	0	0	0	0	0	0.285	0.041	0.674
0-1	0	0	0	0	0	0	0	0	0	0	0	0	0.369	0.023	0.608
0-2	0	0	0	0	0	0	0	0	0	0	0	0	0.530	0.014	0.455
1-0	0	0	0	0	0	0	0	0	0	0	0	0	0.243	0.082	0.675
1-1	0	0	0	0	0	0	0	0	0	0	0	0	0.341	0.046	0.613
1-2	0	0	0	0	0	0	0	0	0	0	0	0	0.505	0.029	0.466
2-0	0	0	0	0	0	0	0	0	0	0	0	0	0.202	0.197	0.602
2-1	0	0	0	0	0	0	0	0	0	0	0	0	0.295	0.111	0.594
2-2	0	0	0	0	0	0	0	0	0	0	0	0	0.459	0.069	0.471
3-0	0	0	0	0	0	0	0	0	0	0	0	0	0.136	0.515	0.349
3-1	0	0	0	0	0	0	0	0	0	0	0	0	0.205	0.326	0.469
3-2	0	0	0	0	0	0	0	0	0	0	0	0	0.362	0.216	0.422
K	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
BB	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
IP	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
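For anyone who wants to check the limit matrix, here is a minimal Python sketch that rebuilds Kershaw's one-pitch matrix from the values printed earlier and squares it repeatedly. Repeated squaring is simply one convenient way to approximate the limit and is not necessarily how the numbers above were produced; the helper names are my own.

```python
STATES = ["0-0", "0-1", "0-2", "1-0", "1-1", "1-2", "2-0", "2-1", "2-2",
          "3-0", "3-1", "3-2", "K", "BB", "IP"]

# Non-zero entries of Kershaw's 2014 one-pitch matrix, as printed above.
ONE_PITCH = {
    "0-0": {"0-1": 0.546, "1-0": 0.344, "IP": 0.110},
    "0-1": {"0-2": 0.471, "1-1": 0.350, "IP": 0.180},
    "0-2": {"0-2": 0.207, "1-2": 0.395, "K": 0.221, "IP": 0.177},
    "1-0": {"1-1": 0.542, "2-0": 0.290, "IP": 0.168},
    "1-1": {"1-2": 0.509, "2-1": 0.283, "IP": 0.208},
    "1-2": {"1-2": 0.240, "2-2": 0.317, "K": 0.238, "IP": 0.204},
    "2-0": {"2-1": 0.564, "3-0": 0.260, "IP": 0.175},
    "2-1": {"2-2": 0.541, "3-1": 0.225, "IP": 0.234},
    "2-2": {"2-2": 0.283, "3-2": 0.231, "K": 0.246, "IP": 0.241},
    "3-0": {"3-1": 0.664, "BB": 0.298, "IP": 0.038},
    "3-1": {"3-2": 0.567, "BB": 0.203, "IP": 0.229},
    "3-2": {"3-2": 0.332, "K": 0.242, "BB": 0.144, "IP": 0.282},
    "K": {"K": 1.0}, "BB": {"BB": 1.0}, "IP": {"IP": 1.0},
}


def to_matrix(rows):
    """Expand the sparse row dictionaries into a dense 15x15 matrix."""
    index = {state: i for i, state in enumerate(STATES)}
    matrix = [[0.0] * len(STATES) for _ in STATES]
    for src, targets in rows.items():
        for dst, p in targets.items():
            matrix[index[src]][index[dst]] = p
    return matrix


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def approximate_limit(matrix, squarings=6):
    """Square the matrix repeatedly; 6 squarings simulate 64 pitches."""
    for _ in range(squarings):
        matrix = matmul(matrix, matrix)
    return matrix


final = approximate_limit(to_matrix(ONE_PITCH))
k_rate, bb_rate, ip_rate = final[0][-3:]  # top row = starting from 0-0
print(f"xK%={k_rate:.3f} xBB%={bb_rate:.3f} in play={ip_rate:.3f}")
```

Because the printed one-pitch probabilities are rounded to three decimals, the result should agree with the top row of the limit matrix only up to small rounding differences.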

Now, to predict Kershaw’s K% and BB%, we need only look at the top row, since all plate appearances begin with an 0-0 count. Starting from 0-0, we estimate that Kershaw has a 28.5% chance to strike out any given batter and a 4.1% chance to walk him. In 2014, Kershaw actually had a 31.9% strikeout rate and a 4.1% walk rate.

For pitchers, this method produces a strong r-squared of .86 when plotting xK% vs. actual K%. Unfortunately, r-squared drops to .54 when plotting xBB% vs. actual BB%.
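For readers who want to run the same kind of check, here is a short Python sketch of r-squared for a simple one-variable fit, computed as the squared Pearson correlation. The five sample pairs are the first five pitchers in the table below (Hughes, Kershaw, Price, Sale, Zimmermann); a five-player subset will not reproduce the full-sample figures, and the function name is my own.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. r-squared of a simple linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# xK% and actual 2014 K% for a small subset of pitchers.
expected_k = [19.1, 28.5, 25.6, 31.3, 23.2]
actual_k = [21.8, 31.9, 26.9, 30.4, 22.8]
print(f"r-squared on this subset: {r_squared(expected_k, actual_k):.2f}")
```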

I then imported the same statistics for batters, because there really is no reason why this method should not work equally well for both pitchers and hitters. It actually seems to work better as a whole on batters, with an r-squared of .81 for batters’ strikeouts and .77 for batters’ walks.

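If there are any players in particular you’re interested in, I have included the full list of all qualified pitchers and position players, with both their expected and actual strikeout and walk rates. Pitchers are listed first (Hughes through Gallardo), followed by position players (starting with McCutchen).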

Player	xK%	2014 K%	xBB%	2014 BB%
Hughes	19.1	21.8	1.8	1.9
Kershaw	28.5	31.9	4.1	4.1
Price	25.6	26.9	3.7	3.8
Sale	31.3	30.4	6.1	5.7
Zimmermann	23.2	22.8	3.3	3.6
Scherzer	28.5	27.9	6.4	7
Bumgarner	24.9	25.1	5.1	4.9
Lackey	22	19.7	3.7	5.6
Kluber	28.1	28.3	4.9	5.4
Strasburg	26.8	27.9	5.6	5
Samardzija	24.4	23	4.8	4.9
Hamels	24.8	23.9	5.6	7.1
McCarthy	20.1	20.9	4.4	3.9
Cueto	23.6	25.2	6.8	6.8
Wood	24.9	24.5	6.3	6.5
Kennedy	24.8	24.5	7.5	8.3
Greinke	25.4	25.2	6.4	5.2
Odorizzi	24.1	24.2	9	8.2
Hutchison	23.1	23.4	7.2	7.6

Teheran	21.4	21	5.2	5.8
Harang	20.8	18.4	6	8.1
Eovaldi	17.9	16.6	4.9	5
Felix	26	27.2	6.8	5
Dickey	21.9	18.9	6.3	8.1
Fat Bartolo	17.4	17.8	4	3.5
Kazmir	20.9	21.1	6.5	6.4
Wainwright	21.4	19.9	5.1	5.6
Wheeler	25.2	23.6	9.8	9.9
Ventura	21.2	20.3	6.9	8.8
Fister	17.8	14.8	4.4	3.6
Chen	17.8	17.6	5.9	4.5
Norris	20	20.2	8.4	7.6
Lester	22.6	24.9	8.1	5.4
Richards	24.9	24.2	8	7.5
Porcello	18.3	15.4	4.6	4.9
Shields	20.8	19.2	6.5	4.7
Lewis	18.8	17.5	5.7	6.3
Simon	18.3	15.5	5.5	6.8
Iwakuma	18.6	21.7	5.5	3

Lynn	20.3	20.9	8.6	8.3
Wood	18.3	18.7	8.4	9.7
Hammel	22.5	22.1	7.7	6.2
Noesi	18.9	16.8	6.3	7.6
Verlander	18.5	17.8	6.8	7.3
Miller	17.9	16.6	6.9	9.6
Young	18.2	15.7	7.5	8.7
Koehler	20.3	19.1	6.6	8.8
Archer	22.9	21	8.1	8.8
Roark	19	17.3	5.9	4.9
Haren	19.1	18.7	7.6	4.6
Peavy	18.4	18.5	7.4	7.4
Ross	25.2	24	9	8.9
Niese	17.9	17.6	5.3	5.7
Tillman	17.5	17.2	7.9	7.6
Cobb	22.6	21.9	8.4	6.9
Danks	19.4	15.1	7.1	8.7
Garza	18.3	18.5	7.4	7.4
Santana	22.1	21.9	7.3	7.7
Quintana	20.4	21.4	9.1	6.3

Alvarez	15.7	14.4	3.9	4.3
Liriano	27.3	25.3	10.8	11.7
Volquez	20.5	17.3	7.1	8.8
Guthrie	16.1	14.4	6.3	5.7
Buchholz	18.7	17.9	7.3	7.3
Gray	20.5	20.4	7.7	8.2
Burnett	21.5	20.3	8.8	10.3
Collmenter	16.4	16	6.9	5.4
Vargas	19.2	16.2	6.9	5.2
Lohse	17.4	17.3	6.9	5.5
de la Rosa	19.2	18.1	10	8.7
Leake	16.7	18.2	6.9	5.5
Vogelsong	17.9	19.4	9.6	7.4
Cosart	18	15	8.4	9.5
Weaver	19.2	19	8.2	7.3
Hudson	16.1	15.2	5.8	4.3
Feldman	15.6	14	8.3	6.5
Kuroda	17.3	17.8	8	4.3
Hernandez	17.4	14.5	9	10.1
Buehrle	14.7	13.9	6.2	5.4

Keuchel	18.5	18.1	8.3	5.9
Peralta	16.5	18.4	9.5	7.3
Elias	19.5	20.6	11.1	9.2
Miley	18.3	21.1	10.3	8.7
Kendrick	15.1	14	7.3	6.6
Wilson	21	19.8	13.1	11.2
Gibson	15.8	14.1	8.5	7.5
Stults	14.5	14.5	8.7	5.9
Gallardo	16.3	17.9	11.1	6.6
McCutchen	19.8	17.7	11.7	13
V-Mart	13.9	6.6	8.3	10.9
Abreu	23.7	21.1	6.5	8.2
Stanton	27.6	26.6	12.9	14.7
Trout	27.9	26.1	12.4	11.8
Bautista	19.8	14.3	12	15.5
Rizzo	23.2	18.8	9.5	11.9
E5	20.3	15.1	9.5	11.4
Brantley	10.9	8.3	8	7.7
Cabrera	17.6	17.1	7.1	8.8
Beltre	16.6	12.1	6.7	9.3

Puig	17.5	19.4	10.4	10.5
Werth	24	18	11.4	13.2
Freeman	18.7	20.5	12.3	12.7
Morneau	11.8	10.9	5.7	6.2
Posey	15.5	11.4	7.5	7.8
Cruz	22	20.6	7	8.1
Kemp	24.8	24.2	7.9	8.7
Ortiz	16.7	15.8	11.1	12.5
Lucroy	18.3	10.8	6.4	10.1
Gomez	19.4	21.9	7.1	7.3
Harrison	17.9	14.7	3.8	4
Upton	27	26.7	8.2	9.4
Altuve	9	7.5	3.3	5.1
Han-Ram	16.2	16.4	9.9	10.9
Duda	25.3	22.7	11.5	11.6
Rendon	17.9	15.2	8.7	8.5
Cano	12.2	10.2	6.6	9.2
Holliday	14.3	15	9.5	11.1
Marte	25.2	24	6.3	6.1
Smith	20	16.7	11.6	13.2

LaRoche	19.8	18.4	12.5	14
Walker	15.2	15.4	9	7.9
Cabrera	13.5	10.8	7.1	6.9
Santana	22.6	18.8	14.2	17.1
Gonzalez	19.3	17	6.1	8.5
Donaldson	19.9	18.7	10.5	10.9
Frazier	22	21.1	8.4	7.9
Fowler	20.8	21.4	13.2	13.1
Seager	18.7	18	9.7	8
Gordon	22.9	19.6	9.9	10.1
Carter	32.4	31.8	9.1	9.8
Peralta	19	17.8	8.7	9.2
Valbuena	24.7	20.7	8.8	11.9
Span	14.3	9.7	5.6	7.5
Calhoun	19.7	19.4	6.3	7.1
Castro	18	17.6	7.3	6.2
Yelich	22.9	20.8	10.9	10.6
Pence	20.8	18.4	8.6	7.3
Jones	20	19.5	5.2	2.8
Gomes	23	23.2	5.6	4.6

Eaton	20.7	15.4	5.3	8
Pujols	14.7	10.2	5.7	6.9
Braun	19.9	19.5	6	7.1
Chisenhall	20.2	18.6	5.2	7.3
Dozier	25.9	18.2	8.6	12.6
Moss	27.8	26.4	9.7	11.6
Blackmon	16.3	14.8	5.7	4.8
Carpenter	25.1	15.7	9.9	13.4
Ozuna	27.8	26.8	6.8	6.7
Adams	19	20.2	5.6	4.6
Hunter	16	15.2	4.6	3.9
Ramirez	13.9	14.1	4.7	4
Dunn	30.9	31.1	14.1	13.9
Zobrist	17.6	12.8	9.4	11.5
Gardner	25.6	21.1	9.6	8.8
Plouffe	19.7	18.7	9.3	9.1
Davis	21.6	22.2	7.6	5.8
Gillaspie	14.9	15.4	6.7	7.1
Byrd	29.4	29	4.3	5.5
Heyward	18	15.1	9.7	10.3

Desmond	27.4	28.2	6.9	7.1
Kendrick	19.9	16.3	5.7	7.1
Ellsbury	14	14.6	7.9	7.7
Cespedes	20.6	19.8	5.4	5.4
Markakis	16.1	11.8	8.1	8.7
Utley	15.8	12.8	8.5	8
Suzuki	15.9	9.1	6.8	6.8
Prado	18.2	14	6.9	4.5
Murphy	13.4	13.4	6.4	6.1
Sandoval	12.1	13.3	4.9	6.1
Mauer	23.5	18.5	9.2	11.6
Choo	26.7	24.8	9.9	11
Reyes	12.7	11.1	5.6	5.8
Granderson	25.3	21.6	10.1	12.1
Aoki	11	8.9	8	7.8
Rollins	21.4	16.4	8.2	10.5
McGehee	16	14.8	8.5	9.7
Kinsler	11.3	10.9	5.7	4
Loney	12.7	12.3	7.2	6.3
Pedroia	19.3	12.3	6	8.4

Solarte	14.6	10.8	7.9	9.9
Teixeira	24.2	21.5	10.3	11.4
Longoria	20.3	19	6.2	8.1
Jones	20.4	21.2	8.9	8.4
Headley	21.9	23	10.9	9.6
Navarro	18	14.6	5.9	6.2
Ramirez	13.2	12.3	4.8	3.7
Crisp	18.3	12.3	8.9	12.3
Freese	24.9	24.3	7.3	7.4
Hosmer	17.4	17	7.7	6.4
Jennings	22.2	19.9	8.3	8.7
Gordon	20.5	16.5	4.5	4.8
Butler	17.3	15.9	5.6	6.8
de Aza	24.6	22.5	6.4	7.4
Crawford	24.8	22.9	7.4	10.5
Rios	18.7	17.9	7.1	4.4
Wright	18.7	19.3	7	7.2
Davis	34.1	33	10	11.4
Aybar	11	9.7	4.7	5.6
Cabrera	16.7	17.5	7.2	8

Montero	19.5	17.3	7.5	10
Castellanos	23.9	24.2	6.7	6.2
Escobar	14.8	13.4	4.6	3.7
Martin	20.5	19.6	5.5	6.7
Howard	30.1	29.3	9.8	10.3
McCann	16.9	14.3	7	5.9
Ackley	19.9	16.6	5.8	5.9
Revere	15.1	7.8	3.9	2.1
Perez	14	14	3.4	3.6
Hardy	24.8	18.3	5	5.1
Viciedo	20.2	21.7	6.5	5.7
Lowrie	13.6	14	7.3	9
Mercer	19.5	16	5.5	6.3
Escobar	10.9	11.3	8.9	8.1
Parra	14.8	17.4	7.3	5.6
Bogaerts	26.3	23.2	6.8	6.6
Jackson	23.4	22	8	7.2
LeMahieu	16.5	18	6	6.1
Castro	27	29.5	8.1	6.6
Andrus	18.5	14	8.5	6.7

Hechavarria	13.2	15	4	4.5
Hill	17.7	17	7.4	5.2
Kipnis	22.3	18	7.6	9
Johnson	26	26	3.9	3.8
Bruce	26.1	27.3	8.5	8.1
Hamilton	20.4	19.1	6	5.6
Brown	15.2	17.8	8.4	6.6
Infante	14.6	11.8	6.2	5.7
Jeter	12.1	13.7	5.5	5.5
Upton	29.3	29.7	7.4	9.8
Simmons	11.1	10.4	5.1	5.6
Segura	15.6	12.6	4.3	5
Craig	21.6	22.4	7.2	6.9
Dominguez	21.9	20.6	5.2	4.8
Cozart	15.3	14.5	5.3	4.6

One advantage of this method over any of the many regression-based estimates using plate discipline stats is that it can be further tailored to each player. The reason for this is that ZONE+, ZSWING+, and OSWING+ are all league-average indexes, and some players’ talents are just not captured by league averages. For example, Dustin Pedroia’s expected strikeout rate (19.3%) is nowhere near his actual strikeout rate (12.3%). Presumably, Pedroia has swing tendencies in certain counts that are markedly different from the average hitter’s. By examining these swing tendencies, it is likely possible to predict Pedroia’s yearly strikeout rates with much greater accuracy, as those tendencies are probably part of his approach at the plate year after year. Still, as preliminary research into this area, I think these results as a whole are very promising.
