Looking at xK% and xBB% Using StatCast Zones

by Joe Wilkey

August 17, 2021

Mike Podhorzer has recently been publishing articles regarding his xK%, primarily using Baseball-Reference’s strike rates. I have been trying for a few years now to come up with my own using StatCast numbers, and you’d be surprised how little data is required to get a very close xK% and xBB% for both hitters and pitchers.

I love Mike’s work, but that xK% equation is a bit unwieldy, including seven inputs, two of which have a correlation of -.829. It also uses roughly 63% of the pitches seen by hitters by accounting for all strikes, in addition to three other variables on top of that. I wanted to come up with something simpler that didn’t require so much input and wasn’t so overly constrained. After all, if we throw so much data in there that we basically know the outcome before we start, what good is it going forward? To that end, I dove into StatCast to see what might work.

To start with, StatCast has four defined “attack zones”: Heart, Shadow, Chase, and Waste. I wanted to use these since they’re more specific than the In Zone/Out of Zone split in standard plate discipline metrics. Next, I broke the results of those four zones into four outcomes for each zone: In Play, Whiff, Foul, and Take. I omitted intentional balls and intentional walks from the model since those are not necessarily “skill” based, at least in terms of plate discipline. They do get added back in later after all the calculations are complete.

After this breakdown, I have 16 zone/outcome combinations. I examined all batters who saw 5,000 total pitches from 2016-20 (N=255) and all pitchers who had 5,000 pitches from 2016-20 (N=181). For an idea of frequency for each combination, Shadow/Take is the most frequent at around 20.2% in the hitter set, and Waste/In Play is the least frequent at 0.02%. As I mentioned, I removed IBB from the PA count and BB count, since those PA are basically guaranteed to be walks, Miguel Cabrera antics notwithstanding.
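As a rough illustration of the breakdown above, the 16 zone/outcome combinations and their per-pitch rates could be tallied like this. It is only a sketch: the `combo_rates` helper and the four-pitch toy input are made up for the example, and real input would be one (zone, outcome) pair per pitch with intentional balls already removed.

```python
from collections import Counter
from itertools import product

ZONES = ("Heart", "Shadow", "Chase", "Waste")
OUTCOMES = ("In Play", "Whiff", "Foul", "Take")
COMBOS = list(product(ZONES, OUTCOMES))  # 4 zones x 4 outcomes = 16 combinations

def combo_rates(pitches):
    """Return each zone/outcome combination's share of all pitches (hypothetical helper)."""
    counts = Counter(pitches)
    total = len(pitches)
    return {combo: counts[combo] / total for combo in COMBOS}

# Toy input, NOT real data: four pitches tagged by zone and outcome.
toy_pitches = [
    ("Shadow", "Take"),
    ("Shadow", "Take"),
    ("Heart", "In Play"),
    ("Shadow", "Whiff"),
]
rates = combo_rates(toy_pitches)
print(len(COMBOS), rates[("Shadow", "Take")])
```

Each rate is a share of total pitches, which is how S/W%, S/IP%, and C/T% are expressed in the tables further down (for example, 39 Shadow/Whiffs on 1,331 pitches is 2.93%).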

I started with the zone/outcome combination that had the highest correlation to the rate in question and ran a linear regression with that combination. I then compared each remaining combination with the residual, that is, the difference between the xRate and the actual rate. Once I found a set of zone/outcome combinations such that adding another combination to the set did not appreciably increase the accuracy (a judgment call, to be sure), I used Principal Component Analysis (h/t Lucas Kelly) to try to sharpen the pencil in terms of covariance. The exception was xBB% for pitchers, where the two combinations were not as tied together; it is an interesting beast to tame. It turns out both rates for hitters and pitchers can be somewhat-well-defined to pretty-well-defined by two zone/outcome combinations each.
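The first two steps of that selection process can be sketched in a few lines. This is a simplified stand-in, not the actual workflow: the helper names and the `combo_a`/`combo_b`/`combo_c` toy values are invented for illustration, it stops after two picks, and it leaves out the PCA step entirely.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def pick_two(features, target):
    """Pick the input most correlated with the target, then the input
    most correlated with what the first regression leaves unexplained."""
    first = max(features, key=lambda k: abs(pearson(features[k], target)))
    slope, intercept = fit_line(features[first], target)
    residual = [t - (slope * v + intercept) for v, t in zip(features[first], target)]
    rest = [k for k in features if k != first]
    second = max(rest, key=lambda k: abs(pearson(features[k], residual)))
    return first, second

# Toy numbers, NOT real player data: the target is built from combo_a and combo_b only.
features = {
    "combo_a": [3.0, 5.0, 6.0, 7.0, 9.0, 4.0],
    "combo_b": [11.0, 8.0, 7.5, 8.0, 6.0, 9.0],
    "combo_c": [1.0, 3.0, 2.0, 3.0, 1.0, 2.0],
}
target = [2.0 * a - 1.5 * b + 20.0 for a, b in zip(features["combo_a"], features["combo_b"])]
print(pick_two(features, target))
```

In the real process the stopping point was a judgment call on when another combination stopped adding accuracy; the sketch simply stops at two because that is where all four models ended up.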

For both hitters and pitchers, using Shadow/Whiff% and Shadow/In Play% provided a very good approximation for K%, with an adjusted R^2 of 0.834 for hitters and 0.800 for pitchers over the 2016-20 sample. These two combinations account for roughly 14% of pitches seen, or less than one out of every seven pitches. Even better, if you look at all players in 2021 with at least 50 PA/TBF, the adjusted R^2 using the xK% equation from the 2016-20 data is 0.645 for hitters and 0.644 for pitchers, so it works pretty well for smaller sample sizes in an independent set as well. Also, if you use the model on the league as a whole, the xK% is very close to the overall league K%. The equation for hitters is:

xK% = 1.89353*S/W% – 2.01758*S/IP% + 26.935%

and for pitchers:

xK% = 2.32362*S/W% – 2.19556*S/IP% + 26.954%
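For anyone who wants to plug in their own numbers, the two xK% equations above translate directly into code. This is a minimal sketch with function names of my own choosing, and it assumes inputs and outputs are in percentage points (6.24 means 6.24%). The check at the bottom uses the 2021 league-average S/W% (6.24) and S/IP% (7.56) from the tables later in this article.

```python
def xk_pct_hitter(sw_pct, sip_pct):
    """xK% for hitters from Shadow/Whiff% and Shadow/In Play% (percentage points)."""
    return 1.89353 * sw_pct - 2.01758 * sip_pct + 26.935

def xk_pct_pitcher(sw_pct, sip_pct):
    """xK% for pitchers from Shadow/Whiff% and Shadow/In Play% (percentage points)."""
    return 2.32362 * sw_pct - 2.19556 * sip_pct + 26.954

# League-average inputs: S/W% = 6.24, S/IP% = 7.56
print(round(xk_pct_hitter(6.24, 7.56), 2))
print(round(xk_pct_pitcher(6.24, 7.56), 2))
```

The results land at roughly 23.5% for hitters and 24.85% for pitchers, in line with the league-average xK% values in the tables (23.49% and 24.85%). The small gap on the hitter side is expected, since these functions skip the intentional-walk adjustment.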

For hitters, using Chase/Take% and Shadow/In Play% (again) approximated BB% well, with an adjusted R^2 of 0.816. These two combinations account for roughly 26% of pitches seen. Applying the 2016-20 model to the 2021 hitters gave an adjusted R^2 of 0.577, again a pretty good result for smaller sample sizes in an independent set. Again, using the model on the league as a whole, the xBB% is very close to the overall league BB%. The equation for hitters is:

xBB% = 0.91872*C/T% – 0.96350*S/IP% + 0.086%

Pitchers are a little trickier, but I ended up using the same combinations as hitters for simplicity’s sake. They seemed to result in the best bang for the buck and were actually not as correlated with each other as some other options. The adjusted R^2 of 0.530 is not nearly as good as the other models, and the 2021 adjusted R^2 of 0.391 was also weaker. I would love to delve into this more, but to me, it seems like walk rates are more defined by hitters than pitchers. This is on my list of things to investigate going forward, as I would be interested to see if including a “batters faced” component would be beneficial. I’m not as confident in this model, but here it is nonetheless:

xBB% = 0.96329*C/T% – 1.05340*S/IP% – 0.744%

I do find it particularly interesting that the constant terms for the xK% models are in the 27% range, whereas the xBB% models have constants of less than 1% in magnitude. If you think about an extreme case where S/W%, S/IP%, and C/T% are all zero, K% goes to 27% and BB% goes to roughly zero. This makes sense to me. It only takes three strikes to strike out but you have to have four balls to walk, so you would think you’d get to Ks before BBs, all other things being equal.
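The all-zero case is easy to check by putting the four equations side by side. The layout below (a `MODELS` dictionary and a `predict` helper) is just one hypothetical way to organize them; the numbers are copied from the equations above, with everything in percentage points.

```python
# (coefficient on first input, coefficient on S/IP%, constant term)
MODELS = {
    "xK% hitters":   (1.89353, -2.01758, 26.935),  # first input: S/W%
    "xK% pitchers":  (2.32362, -2.19556, 26.954),  # first input: S/W%
    "xBB% hitters":  (0.91872, -0.96350, 0.086),   # first input: C/T%
    "xBB% pitchers": (0.96329, -1.05340, -0.744),  # first input: C/T%
}

def predict(model, first_pct, sip_pct):
    """Evaluate one of the four linear models at the given inputs."""
    a, b, c = MODELS[model]
    return a * first_pct + b * sip_pct + c

# With every input at zero, only the constant term is left.
for name in MODELS:
    print(name, round(predict(name, 0.0, 0.0), 3))
```

Note that the pitcher xBB% constant is slightly negative, so the all-zero case is a thought experiment rather than a reachable rate. At the 2021 league-average inputs (C/T% of 17.38, S/IP% of 7.56) the two xBB% models return roughly 8.8% for hitters and 8.0% for pitchers.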

Now that we’ve done this exercise, let’s look at leaders and laggards for these in the 2021 season. All of these models have IBB removed from PA and BB, the xStat model run, and then the IBB are added back into PA and BB. All data is through the All-Star Break. Let’s look at the hitters’ Ks first:
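The IBB bookkeeping can be sketched as follows for hitters. The exact arithmetic is an assumption on my part about how the steps fit together: the modeled rate is applied to PA minus IBB, intentional walks are added back to the walk total only (an IBB can never be a strikeout), and the displayed rate is the expected count divided by full PA. The function names are hypothetical. With those assumptions, the sketch reproduces the Isiah Kiner-Falefa and Dansby Swanson rows in the tables below.

```python
def xk_hitter(pa, ibb, pitches, sw, sip):
    """Return (xK, xK%) for a hitter from raw counts."""
    sw_pct = 100.0 * sw / pitches      # Shadow/Whiff%
    sip_pct = 100.0 * sip / pitches    # Shadow/In Play%
    rate = 1.89353 * sw_pct - 2.01758 * sip_pct + 26.935
    xk = rate / 100.0 * (pa - ibb)     # model applies to non-IBB PA only
    return xk, 100.0 * xk / pa         # rate over all PA, IBB included

def xbb_hitter(pa, ibb, pitches, sip, ct):
    """Return (xBB, xBB%) for a hitter from raw counts."""
    sip_pct = 100.0 * sip / pitches    # Shadow/In Play%
    ct_pct = 100.0 * ct / pitches      # Chase/Take%
    rate = 0.91872 * ct_pct - 0.96350 * sip_pct + 0.086
    xbb = rate / 100.0 * (pa - ibb) + ibb   # add the intentional walks back
    return xbb, 100.0 * xbb / pa

# Isiah Kiner-Falefa: PA 384, IBB 1, 1331 pitches, 39 S/W, 152 S/IP
xk, xk_pct = xk_hitter(384, 1, 1331, 39, 152)
# Dansby Swanson: PA 362, IBB 4, 1453 pitches, 100 S/IP, 258 C/T
xbb, xbb_pct = xbb_hitter(362, 4, 1453, 100, 258)
print(round(xk, 1), round(xk_pct, 2), round(xbb, 1), round(xbb_pct, 2))
```

The pitcher versions would follow the same pattern with TBF in place of PA and the pitcher coefficients.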

Positive Regressors (more Ks than anticipated):

xK Positive Regressors – Hitters

Name	PA	IBB	Pitches	S/W	S/IP	S/W%	S/IP%	K%	xK%	K	xK	xK-K
Isiah Kiner-Falefa	384	1	1331	39	152	2.93%	11.42%	15.89%	9.42%	61	36.2	-24.8
Jared Walsh	354	4	1376	82	115	5.96%	8.36%	27.68%	21.12%	98	74.7	-23.3
Brett Phillips	187	0	822	49	44	5.96%	5.35%	39.04%	27.42%	73	51.3	-21.7
Adam Duvall	290	0	1122	77	90	6.86%	8.02%	30.69%	23.75%	89	68.9	-20.1
Yordan Alvarez	315	2	1304	59	95	4.52%	7.29%	26.35%	20.67%	83	65.1	-17.9
Steven Duggar	186	2	768	37	45	4.82%	5.86%	32.80%	23.98%	61	44.6	-16.4
Joey Wendle	277	3	1067	48	97	4.50%	9.09%	22.38%	16.93%	62	46.9	-15.1
JaCoby Jones	105	0	396	36	36	9.09%	9.09%	40.00%	25.81%	42	27.1	-14.9
Gio Urshela	314	0	1221	75	111	6.14%	9.09%	24.84%	20.22%	78	63.5	-14.5
Alec Bohm	329	0	1334	71	101	5.32%	7.57%	26.14%	21.74%	86	71.5	-14.5
League Average						6.24%	7.56%	23.76%	23.49%

These are the 10 biggest differences in terms of Ks, not K%, so there is no PA minimum here. It’s interesting to see an array of different profiles here, with actual K% ranging from 16% to 40% and xK% ranging from 9% to 27%. This seems to suggest that this could identify underperformers across an array of different approaches. Now for the overperformers:

Negative Regressors (fewer Ks than anticipated):

xK Negative Regressors – Hitters

Name	PA	IBB	Pitches	S/W	S/IP	S/W%	S/IP%	K%	xK%	K	xK	xK-K
Vladimir Guerrero Jr.	374	5	1387	102	85	7.35%	6.13%	17.65%	28.11%	66	105.1	39.1
Ozzie Albies	371	1	1398	112	112	8.01%	8.01%	17.79%	25.87%	66	96.0	30.0
Brandon Crawford	302	3	1195	107	92	8.95%	7.70%	21.19%	28.08%	64	84.8	20.8
Carlos Santana	375	3	1537	90	133	5.86%	8.65%	14.93%	20.40%	56	76.5	20.5
Max Muncy	319	4	1312	60	72	4.57%	5.49%	17.87%	24.21%	57	77.2	20.2
Juan Soto	332	7	1367	58	87	4.24%	6.36%	15.66%	21.66%	52	71.9	19.9
Cesar Hernandez	371	0	1461	92	83	6.30%	5.68%	22.10%	27.40%	82	101.6	19.6
Charlie Blackmon	327	1	1184	58	98	4.90%	8.28%	13.46%	19.45%	44	63.6	19.6
Chris Taylor	343	1	1418	111	75	7.83%	5.29%	25.66%	31.00%	88	106.3	18.3
Nelson Cruz	318	6	1163	90	101	7.74%	8.68%	17.92%	23.61%	57	75.1	18.1
League Average						6.24%	7.56%	23.76%	23.49%

This group is almost all players with better-than-average (that is, lower) K% who are expected to be in the general league-average range. It could be that I’m missing something here, but pitchers may change their approach to these players in the second half to account for something. Perhaps this is something to revisit at the end of the season. Next up, pitcher K underperformers:

Positive Regressors (fewer Ks than anticipated):

xK Positive Regressors – Pitchers

Name	TBF	IBB	Pitches	S/W	S/IP	S/W%	S/IP%	K%	xK%	K	xK	xK-K
Luis Castillo	449	2	1767	135	136	7.64%	7.70%	21.38%	27.68%	96	124.3	28.3
Sandy Alcantara	476	2	1760	133	142	7.56%	8.07%	21.43%	26.69%	102	127.0	25.0
John Gant	323	2	1305	72	92	5.52%	7.05%	16.72%	24.15%	54	78.0	24.0
Griffin Canning	277	0	1080	95	83	8.80%	7.69%	22.38%	30.52%	62	84.5	22.5
Zach Davies	405	0	1539	91	145	5.91%	9.42%	14.57%	20.01%	59	81.0	22.0
Shane McClanahan	252	0	987	99	60	10.03%	6.08%	28.17%	36.91%	71	93.0	22.0
Antonio Senzatela	414	0	1491	74	121	4.96%	8.12%	15.70%	20.67%	65	85.6	20.6
Patrick Sandoval	232	0	941	89	63	9.46%	6.70%	25.43%	34.23%	59	79.4	20.4
Johan Oviedo	244	1	914	59	73	6.46%	7.99%	16.39%	24.32%	40	59.3	19.3
Matt Shoemaker	284	1	1064	62	97	5.83%	9.12%	14.08%	20.41%	40	58.0	18.0
League Average						6.24%	7.56%	23.76%	24.85%

Just like the hitter underperformers, there is a wide range here. The double-edged sword with this model is it will adjust in real time to spin rate changes, but it’s more looking to strip out sequencing rather than pitch behavior. Now for the pitcher K overperformers:

Negative Regressors (more Ks than anticipated):

xK Negative Regressors – Pitchers

Name	TBF	IBB	Pitches	S/W	S/IP	S/W%	S/IP%	K%	xK%	K	xK	xK-K
Jose Berrios	440	0	1725	78	153	4.52%	8.87%	25.91%	17.99%	114	79.1	-34.9
Sonny Gray	265	1	1033	55	76	5.32%	7.36%	30.19%	23.09%	80	61.2	-18.8
Matt Barnes	143	2	602	39	28	6.48%	4.65%	44.06%	31.35%	63	44.8	-18.2
James Karinchak	160	0	719	50	38	6.95%	5.29%	42.50%	31.51%	68	50.4	-17.6
Jacob deGrom	324	0	1226	141	78	11.50%	6.36%	45.06%	39.71%	146	128.7	-17.3
Freddy Peralta	385	0	1615	114	90	7.06%	5.57%	35.06%	31.12%	135	119.8	-15.2
Brandon Woodruff	432	0	1695	109	119	6.43%	7.02%	29.86%	26.48%	129	114.4	-14.6
Aaron Nola	427	1	1677	119	132	7.10%	7.87%	29.51%	26.10%	126	111.4	-14.6
Corbin Burnes	345	0	1361	112	80	8.23%	5.88%	37.10%	33.17%	128	114.4	-13.6
Julio Urias	429	3	1579	109	131	6.90%	8.30%	27.74%	24.61%	119	105.6	-13.4
League Average						6.24%	7.56%	23.76%	24.85%

Some interesting stuff in this table. A couple of relievers show up, and before we start saying that the model is missing something for them, the top five xK% pitchers with at least 100 TBF are Josh Hader (44.96%), Liam Hendriks (41.73%), Devin Williams (39.82%), Raisel Iglesias (39.74%), and Sam Howard (39.72%). They are a combined 16 Ks below their expected Ks (298 xKs combined). Also, holy guacamole Jacob deGrom. I know the model says he should have fewer Ks, but it merely takes him to superhuman rather than ultrahuman. On to BB underperformers for hitters:

Positive Regressors (fewer BBs than anticipated):

xBB Positive Regressors – Hitters

Name	PA	IBB	Pitches	S/IP	C/T	S/IP%	C/T%	BB%	xBB%	BB	xBB	xBB-BB
Nick Solak	354	0	1405	109	259	7.76%	18.43%	5.37%	9.55%	19	33.8	14.8
Dansby Swanson	362	4	1453	100	258	6.88%	17.76%	7.18%	10.77%	26	39.0	13.0
Eugenio Suarez	359	0	1461	105	285	7.19%	19.51%	8.36%	11.08%	30	39.8	9.8
Randy Arozarena	357	2	1438	80	264	5.56%	18.36%	9.52%	12.09%	34	43.2	9.2
Yordan Alvarez	315	2	1304	95	239	7.29%	18.33%	7.62%	10.48%	24	33.0	9.0
Teoscar Hernandez	293	1	1103	92	204	8.34%	18.50%	6.48%	9.35%	19	27.4	8.4
Eli White	166	0	683	36	124	5.27%	18.16%	6.63%	11.69%	11	19.4	8.4
Kevin Kiermaier	209	1	778	52	136	6.68%	17.48%	6.22%	10.14%	13	21.2	8.2
Eric Haase	171	0	695	46	127	6.62%	18.27%	5.85%	10.50%	10	17.9	7.9
Garrett Hampson	301	1	1186	80	190	6.75%	16.02%	5.98%	8.61%	18	25.9	7.9
League Average						7.56%	17.38%	8.92%	8.76%

All but one of these players are expected to have an above-average walk rate per the model. Yordan Alvarez shows up both here and on the K% underperformers list, so he could be in for a good second half if these models are in any way accurate. Now for the overperformers:

Negative Regressors (more BBs than anticipated):

xBB Negative Regressors – Hitters

Name	PA	IBB	Pitches	S/IP	C/T	S/IP%	C/T%	BB%	xBB%	BB	xBB	xBB-BB
Yasmani Grandal	246	0	1147	56	235	4.88%	20.49%	24.39%	14.20%	60	34.9	-25.1
Yandy Diaz	323	2	1262	108	243	8.56%	19.26%	16.10%	10.09%	52	32.6	-19.4
Carlos Santana	375	3	1537	133	309	8.65%	20.10%	15.73%	10.94%	59	41.0	-18.0
Jose Altuve	368	2	1338	136	240	10.16%	17.94%	11.68%	7.28%	43	26.8	-16.2
Joey Gallo	351	4	1474	75	323	5.09%	21.91%	20.51%	16.28%	72	57.1	-14.9
Garrett Cooper	238	0	955	80	145	8.38%	15.18%	12.18%	5.96%	29	14.2	-14.8
J.T. Realmuto	276	2	1097	92	171	8.39%	15.59%	11.96%	7.01%	33	19.3	-13.7
Leury Garcia	270	0	949	87	136	9.17%	14.33%	9.26%	4.42%	25	11.9	-13.1
Carson Kelly	187	1	773	56	129	7.24%	16.69%	14.97%	8.93%	28	16.7	-11.3
Jake Fraley	149	1	678	34	139	5.01%	20.50%	22.15%	14.67%	33	21.9	-11.1
League Average						7.56%	17.38%	8.92%	8.76%

There are a lot of eye-popping walk rates here, and six out of the 10 are still expected to have above-average walk rates. Similar to the xK% overperformers, it will be interesting to see the second halves for some of these players. I have not dug into it, but I would bet pitchers are “avoiding” throwing these players strikes and they’re just generally taking a lot. Carlos Santana is the anti-Yordan Alvarez, with more walks than expected and fewer strikeouts. My wife, who is a statistics whiz, has suggested adding age or career PAs to the model, so perhaps that is something that could be included going forward and may explain both Alvarez and Santana. I’ll toss the pitcher BB tables in here too, just for posterity, even though I’m not as confident in this model:

Positive Regressors (more BBs than anticipated):

xBB Positive Regressors – Pitchers

Name	TBF	IBB	Pitches	S/IP	C/T	S/IP%	C/T%	BB%	xBB%	BB	xBB	xBB-BB
John Gant	323	2	1305	92	236	7.05%	18.08%	16.41%	9.81%	53	31.7	-21.3
Yusei Kikuchi	392	0	1489	122	223	8.19%	14.98%	8.67%	5.05%	34	19.8	-14.2
Justin Dunn	218	0	870	65	141	7.47%	16.21%	13.30%	7.00%	29	15.3	-13.7
Jake Brentz	169	1	666	48	104	7.21%	15.62%	15.38%	7.26%	26	12.3	-13.7
Carlos Martinez	363	2	1281	115	212	8.98%	16.55%	9.92%	6.26%	36	22.7	-13.3
Junior Guerra	182	1	732	59	133	8.06%	18.17%	15.93%	8.77%	29	16.0	-13.0
Triston McKenzie	212	0	895	52	182	5.81%	20.34%	18.87%	12.72%	40	27.0	-13.0
Alex Reyes	176	1	717	42	130	5.86%	18.13%	18.18%	11.06%	32	19.5	-12.5
Kyle Hendricks	445	1	1560	169	223	10.83%	14.29%	4.49%	1.84%	20	8.2	-11.8
Dan Winkler	137	0	561	45	82	8.02%	14.62%	13.14%	4.89%	18	6.7	-11.3
League Average						7.56%	17.38%	8.92%	8.03%

Negative Regressors (fewer BBs than anticipated):

xBB Negative Regressors – Pitchers

Name	TBF	IBB	Pitches	S/IP	C/T	S/IP%	C/T%	BB%	xBB%	BB	xBB	xBB-BB
Jacob deGrom	324	0	1226	78	204	6.36%	16.64%	3.40%	8.58%	11	27.8	16.8
Corbin Burnes	345	0	1361	80	226	5.88%	16.61%	4.35%	9.06%	15	31.3	16.3
Zack Greinke	459	0	1714	160	339	9.33%	19.78%	5.01%	8.47%	23	38.9	15.9
Max Scherzer	377	0	1597	90	270	5.64%	16.91%	5.84%	9.61%	22	36.2	14.2
Antonio Senzatela	414	0	1491	121	275	8.12%	18.44%	5.07%	8.47%	21	35.1	14.1
Zack Wheeler	471	1	1806	132	311	7.31%	17.22%	5.52%	8.34%	26	39.3	13.3
Eduardo Rodriguez	385	0	1523	123	282	8.08%	18.52%	5.45%	8.58%	21	33.1	12.1
Carlos Rodon	360	1	1511	70	247	4.63%	16.35%	7.22%	10.37%	26	37.3	11.3
Zach Eflin	421	2	1536	140	253	9.11%	16.47%	3.33%	5.97%	14	25.1	11.1
Adam Wainwright	438	3	1659	128	306	7.72%	18.44%	7.08%	9.52%	31	41.7	10.7
League Average						7.56%	17.38%	8.92%	8.03%

John Gant shows up on both “good” tables while deGrom and Corbin Burnes show up on both “bad” tables. Antonio Senzatela shows up on +Ks and +BBs, which is an interesting combo, although the fewer balls in play in Colorado the better. Again, it appears an age/experience factor may be showing up here too, something to look into going forward.

Over the course of a partial season, you can see that there can be some sizeable effects from applying this model to something like xwOBA. It’s something that I feel is missing from the xwOBA calculations and can be added in to really try to capture a more detailed plate discipline picture to get a better feel for a player’s expected performance. I think further evaluation could yield a more predictive (rather than descriptive) model, using age/experience and/or using a cluster analysis to identify “types” of hitters rather than simply looking at individual zone/outcome combinations.

4 Comments

newsense

3 years ago

This is a thoughtful attempt, but I can’t help but think that treating all players’ rates as determined by these factors, so that any deviance is likely to regress, is a big oversimplification. Some players will be different based upon their approach and skill set. Some of that may relate to age but much more likely comes into play.

Joe Wilkey

Reply to newsense

But this is true of any model, right? FIP/xFIP definitely have this issue, as does BABIP and LOB%, which aren’t really models, but are generally considered to be “luck” factors.

I’ve been developing this over the course of a few years, and there are definitely players who are usually consistently over/under where they “should” be. Zack Greinke seems to constantly “beat” his xBB%, as does Trevor Bauer. Luis Castillo seems to underperform his xK% relatively consistently.

Models are intended to capture the bulk of cases. Yes, there will always be outliers, even consistent outliers. But at a certain point, further “refining” of the model can be a bad thing. I feel like this can give a good start on identifying players who could improve/decline in their plate discipline metrics going forward.

millerf94

Great post, I enjoyed reading it. I personally have been working on the plate discipline side of the xwOBA, and xBB is something I definitely want to include in the equation.
My question is, is it really okay to construct xK and xBB model from something with Inplay% ? Since K – BB – Inplay are the main (and maybe only?) event outcomes that could happen per PA. I think it is natural to have a high negative correlation between K – BB – Inplay, since if one happens the other two do not.

Reply to millerf94

Your point is definitely valid, and I was wary of that all the way along, but I only included InPlay% from a particular zone, which accounts for less than half (~45%) of the total pitches put into play over the time frame, and less than 8% of all pitches over the time frame, so it’s not like I’m accounting for 100% of the balls in play.

Furthermore, I would venture to say that putting the ball in play is a skill, especially in the Shadow zone, defined by StatCast as the region within one baseball diameter of the outside of the strike zone. The same names pop up a lot at the top and bottom of this list when you break it down by year, so they are earning both their low K and low BB rates.
