Looking at xK% and xBB% Using Statcast Zones

Mike Podhorzer has recently been publishing articles on his xK%, which relies primarily on Baseball-Reference’s strike rates. I have been trying for a few years now to come up with my own version using Statcast numbers, and you’d be surprised how little data is required to get a very close xK% and xBB% for both hitters and pitchers.

I love Mike’s work, but that xK% equation is a bit unwieldy: it has seven inputs, two of which have a correlation of -.829 with each other. By accounting for all strikes, it also uses roughly 63% of the pitches seen by hitters, with three other variables on top of that. I wanted to come up with something simpler that didn’t require so much input and wasn’t so overly constrained. After all, if we throw so much data in there that we basically know the outcome before we start, what good is it going forward? To that end, I dove into Statcast to see what might work.

To start with, Statcast has four defined “attack zones”: Heart, Shadow, Chase, and Waste. I wanted to use these since they’re more specific than the In Zone/Out of Zone split in standard plate discipline metrics. Next, I broke the results in each zone into four outcomes: In Play, Whiff, Foul, and Take. I have omitted intentional balls and intentional walks from the model since those are not necessarily “skill” based, at least in terms of plate discipline. They do get added back in later after all the calculations are complete.

After this breakdown, I have 16 zone/outcome combinations. I examined all batters who saw at least 5,000 total pitches from 2016-20 (N=255) and all pitchers who threw at least 5,000 pitches from 2016-20 (N=181). For an idea of how frequent each combination is, Shadow/Take is the most common at around 20.2% of pitches in the hitter set, and Waste/In Play is the least common at 0.02%. As I mentioned, I removed IBB from the PA count and BB count, since those PA are basically guaranteed to be walks, Miguel Cabrera antics notwithstanding.
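As a rough illustration of the breakdown above, here is a minimal Python sketch that tallies the 16 zone/outcome shares from pitch-level data. The input format (a list of `(zone, outcome)` pairs with intentional balls already dropped) and the function name `zone_outcome_rates` are assumptions made for this example, not the actual Statcast export layout.

```python
from collections import Counter

ZONES = ("Heart", "Shadow", "Chase", "Waste")
OUTCOMES = ("In Play", "Whiff", "Foul", "Take")


def zone_outcome_rates(pitches):
    """Return {(zone, outcome): share of all pitches} for the 16 combinations.

    `pitches` is an iterable of (zone, outcome) pairs. This layout is a
    hypothetical input format; intentional balls are assumed to be removed.
    """
    counts = Counter(pitches)
    unknown = [k for k in counts if k[0] not in ZONES or k[1] not in OUTCOMES]
    if unknown:
        raise ValueError(f"unrecognized zone/outcome labels: {unknown}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pitches supplied")
    # Combinations that never occur get a share of 0.0.
    return {(z, o): counts[(z, o)] / total for z in ZONES for o in OUTCOMES}


# Toy data: 10 made-up pitches for one player.
sample = (
    [("Shadow", "Take")] * 4
    + [("Shadow", "Whiff")] * 1
    + [("Heart", "In Play")] * 3
    + [("Chase", "Take")] * 2
)
rates = zone_outcome_rates(sample)
```

Each share is expressed against all pitches seen, which matches how the S/W%, S/IP%, and C/T% columns in the tables later in the article divide by the Pitches column.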

I started with the zone/outcome combination that had the highest correlation to the rate in question and ran a linear regression with that combination. I then compared each remaining combination with the residual, i.e., the difference between the xRate and the actual rate. Once I found a set of zone/outcome combinations such that adding another combination did not appreciably increase the accuracy (a judgment call, to be sure), I used Principal Component Analysis (h/t Lucas Kelly) to try to sharpen the pencil in terms of covariance. The exception was xBB% for pitchers, where the two combinations were not as tied together; that one is an interesting beast to tame. It turns out both rates, for hitters and pitchers alike, can be somewhat-well-defined to pretty-well-defined by two zone/outcome combinations each.
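The selection loop described above can be sketched roughly as follows. This is a simplified stand-in, not the exact procedure used here: it is a forward stagewise fit (each new combination is regressed on the current residual instead of refitting all terms jointly), it skips the PCA step entirely, and the `min_gain` stopping threshold is a hypothetical replacement for the judgment call.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def fit_line(x, y):
    """One-variable least squares: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def r_squared(actual, predicted):
    m = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def forward_stagewise(features, target, min_gain=0.01):
    """Greedily add the feature most correlated with the current residual.

    Stops when the best remaining feature improves R^2 by less than
    `min_gain` (an assumed threshold, not a value from the article).
    """
    predicted = [mean(target)] * len(target)
    chosen, score = [], 0.0
    remaining = dict(features)
    while remaining:
        residual = [t - p for t, p in zip(target, predicted)]
        name = max(remaining, key=lambda k: abs(pearson(remaining[k], residual)))
        x = remaining.pop(name)
        if pearson(x, residual) == 0.0:
            break  # nothing left to explain, or the feature is constant
        slope, intercept = fit_line(x, residual)
        trial = [p + slope * v + intercept for p, v in zip(predicted, x)]
        new_score = r_squared(target, trial)
        if new_score - score < min_gain:
            break
        predicted, score = trial, new_score
        chosen.append(name)
    return chosen, score


# Toy data: the target is built from "a" and "b"; "c" is a distractor.
features = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 2, 1, 2, 1],
    "c": [3, 3, 4, 3, 3, 4],
}
target = [2 * a - b + 5 for a, b in zip(features["a"], features["b"])]
chosen, score = forward_stagewise(features, target, min_gain=0.005)
```

On real data one would refit the chosen combinations together in a single multiple regression, which is how equations with two slopes and one constant like the ones below are normally produced.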

For both hitters and pitchers, using Shadow/Whiff% and Shadow/In Play% provided a very good approximation for K%, with an adjusted R^2 of 0.834 for hitters and 0.800 for pitchers over the 2016-20 sample. These two combinations account for roughly 14% of pitches seen, or less than one out of every seven pitches. Even better, if you look at all players in 2021 with at least 50 PA/TBF, the adjusted R^2 using the xK% equation from the 2016-20 data is 0.645 for hitters and 0.644 for pitchers, so it works pretty well for smaller sample sizes in an independent set as well. Also, if you use the model on the league as a whole, the xK% is very close to the overall league K%. The equation for hitters, where S/W% is Shadow/Whiff% and S/IP% is Shadow/In Play% (each as a percentage of pitches seen), is:

xK% = 1.89353*S/W% – 2.01758*S/IP% + 26.935%

and for pitchers:

xK% = 2.32362*S/W% – 2.19556*S/IP% + 26.954%
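For anyone who wants to plug numbers in, here is a minimal Python sketch of the two xK% equations above. The coefficients are copied from the equations; the function name `xk_pct` and the `role` argument are just illustrative choices. Inputs and output are in percentage points.

```python
# (slope on S/W%, slope on S/IP%, constant), all in percentage points,
# copied from the two xK% equations above.
XK_COEFFS = {
    "hitter": (1.89353, -2.01758, 26.935),
    "pitcher": (2.32362, -2.19556, 26.954),
}


def xk_pct(shadow_whiff_pct, shadow_in_play_pct, role="hitter"):
    """Expected K% from Shadow/Whiff% and Shadow/In Play% (percent of pitches)."""
    sw_coef, sip_coef, constant = XK_COEFFS[role]
    return sw_coef * shadow_whiff_pct + sip_coef * shadow_in_play_pct + constant


# League-average 2021 inputs as listed in the tables later in the article.
league_hitter = xk_pct(6.24, 7.56, "hitter")
league_pitcher = xk_pct(6.24, 7.56, "pitcher")
```

With the rounded league-average inputs from the 2021 tables (6.24% S/W%, 7.56% S/IP%), these land within a couple hundredths of a point of the 23.49% and 24.85% league xK% values shown there.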

For hitters, using Chase/Take% and Shadow/In Play% (again) approximated BB% well, with an adjusted R^2 of 0.816. These two combinations account for roughly 26% of pitches seen. Applying the 2016-20 model to the 2021 hitters gave an adjusted R^2 of 0.577, again a pretty good result for smaller sample sizes in an independent set. And again, using the model on the league as a whole, the xBB% is very close to the overall league BB%. The equation for hitters, where C/T% is Chase/Take% as a percentage of pitches seen, is:

xBB% = 0.91872*C/T% – 0.96350*S/IP% + 0.086%
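The adjusted R^2 figures quoted for these models can be computed from actual and predicted rates with the standard formula, sketched below in Python. How the 2021 out-of-sample values were calculated is not spelled out in the text, so treat this as the textbook definition, with `n_predictors=2` assumed for the two zone/outcome inputs.

```python
def r_squared(actual, predicted):
    """Plain R^2: 1 - SS_res / SS_tot."""
    m = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(actual, predicted, n_predictors=2):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(actual)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    r2 = r_squared(actual, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up numbers purely to exercise the functions.
toy_actual = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
toy_adj = adjusted_r_squared(toy_actual, toy_predicted, n_predictors=2)
```

With hundreds of players and only two predictors, the adjustment is small, so adjusted R^2 and plain R^2 will be nearly identical for the samples discussed here.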

Pitchers are a little trickier, but I ended up using the same combinations as hitters for simplicity’s sake. They seemed to give the best bang for the buck and were actually not as correlated with each other as some other options. The adjusted R^2 of 0.530 is not nearly as good as the other models, and the 2021 adjusted R^2 of 0.391 was also weaker. I would love to delve into this more, but to me, it seems like walk rates are defined more by hitters than by pitchers. This is on my list of things to investigate going forward, as I would be interested to see if including a “batters faced” component would be beneficial. I’m not as confident in this model, but here it is nonetheless:

xBB% = 0.96329*C/T% – 1.05340*S/IP% – 0.744%
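The two xBB% equations can be sketched the same way. As before, the coefficients are copied from the equations above and everything is in percentage points; the name `xbb_pct` and the `role` argument are illustrative only.

```python
# (slope on C/T%, slope on S/IP%, constant), all in percentage points,
# copied from the two xBB% equations above.
XBB_COEFFS = {
    "hitter": (0.91872, -0.96350, 0.086),
    "pitcher": (0.96329, -1.05340, -0.744),
}


def xbb_pct(chase_take_pct, shadow_in_play_pct, role="hitter"):
    """Expected BB% from Chase/Take% and Shadow/In Play% (percent of pitches)."""
    ct_coef, sip_coef, constant = XBB_COEFFS[role]
    return ct_coef * chase_take_pct + sip_coef * shadow_in_play_pct + constant


# League-average 2021 inputs as listed in the tables later in the article.
league_hitter_bb = xbb_pct(17.38, 7.56, "hitter")
league_pitcher_bb = xbb_pct(17.38, 7.56, "pitcher")
```

Using the rounded league-average inputs from the 2021 tables (17.38% C/T%, 7.56% S/IP%), the results come out close to the 8.76% and 8.03% league xBB% values listed there.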

I do find it particularly interesting that the constant terms in the xK% models are in the 27% range, whereas the constants in the xBB% models are less than 1% in magnitude. If you think about an extreme case where S/W%, S/IP%, and C/T% are all zero, K% goes to 27% and BB% goes to roughly zero. This makes sense to me. It only takes three strikes to strike out but you have to have four balls to walk, so you would think you’d get to Ks before BBs, all other things being equal.

Now that we’ve done this exercise, let’s look at leaders and laggards for these in the 2021 season. All of these tables follow the same steps: IBB are removed from PA and BB, the xStat model is run, and then the IBB are added back into PA and BB. All data is through the All-Star Break. Let’s look at the hitters’ Ks first:

Positive Regressors (more Ks than anticipated):

xK Positive Regressors – Hitters
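| Name | PA | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Isiah Kiner-Falefa | 384 | 1 | 1331 | 39 | 152 | 2.93% | 11.42% | 15.89% | 9.42% | 61 | 36.2 | -24.8 |
| Jared Walsh | 354 | 4 | 1376 | 82 | 115 | 5.96% | 8.36% | 27.68% | 21.12% | 98 | 74.7 | -23.3 |
| Brett Phillips | 187 | 0 | 822 | 49 | 44 | 5.96% | 5.35% | 39.04% | 27.42% | 73 | 51.3 | -21.7 |
| Adam Duvall | 290 | 0 | 1122 | 77 | 90 | 6.86% | 8.02% | 30.69% | 23.75% | 89 | 68.9 | -20.1 |
| Yordan Alvarez | 315 | 2 | 1304 | 59 | 95 | 4.52% | 7.29% | 26.35% | 20.67% | 83 | 65.1 | -17.9 |
| Steven Duggar | 186 | 2 | 768 | 37 | 45 | 4.82% | 5.86% | 32.80% | 23.98% | 61 | 44.6 | -16.4 |
| Joey Wendle | 277 | 3 | 1067 | 48 | 97 | 4.50% | 9.09% | 22.38% | 16.93% | 62 | 46.9 | -15.1 |
| JaCoby Jones | 105 | 0 | 396 | 36 | 36 | 9.09% | 9.09% | 40.00% | 25.81% | 42 | 27.1 | -14.9 |
| Gio Urshela | 314 | 0 | 1221 | 75 | 111 | 6.14% | 9.09% | 24.84% | 20.22% | 78 | 63.5 | -14.5 |
| Alec Bohm | 329 | 0 | 1334 | 71 | 101 | 5.32% | 7.57% | 26.14% | 21.74% | 86 | 71.5 | -14.5 |
| League Average | | | | | | 6.24% | 7.56% | 23.76% | 23.49% | | | |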
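The xK, xK%, and xK-K columns in the hitter table above can be reproduced from the raw counts with a short Python sketch. It assumes the Pitches column is the denominator for S/W% and S/IP%, that the model rate is applied to PA minus IBB, and that the displayed xK% is xK divided by full PA; the table values are consistent with that reading, but the function name and layout are illustrative.

```python
def expected_strikeouts(pa, ibb, pitches, shadow_whiffs, shadow_in_play, strikeouts):
    """Hitter xK from raw counts, with IBB removed before the model is run.

    Assumes rates are per pitch and that xK% is reported over full PA.
    """
    sw_pct = 100 * shadow_whiffs / pitches
    sip_pct = 100 * shadow_in_play / pitches
    # Hitter xK% equation from the article (percentage points).
    rate = 1.89353 * sw_pct - 2.01758 * sip_pct + 26.935
    xk = rate / 100 * (pa - ibb)  # IBB plate appearances cannot end in a K
    return {"xK": xk, "xK%": 100 * xk / pa, "xK-K": xk - strikeouts}


# Isiah Kiner-Falefa's row: PA, IBB, Pitches, S/W, S/IP, K.
row = expected_strikeouts(384, 1, 1331, 39, 152, 61)
```

For Kiner-Falefa this gives roughly 36.2 expected strikeouts, an xK% of about 9.42%, and a gap of about -24.8, matching his row.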

These are the 10 biggest differences in terms of Ks, not K%, so there is no PA minimum here. It’s interesting to see an array of different profiles here, with actual K% ranging from 16% to 40% and xK% ranging from 9% to 27%. This seems to suggest that the model could identify underperformers across an array of different approaches. Now for the overperformers:

Negative Regressors (fewer Ks than anticipated):

xK Negative Regressors – Hitters
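| Name | PA | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vladimir Guerrero Jr. | 374 | 5 | 1387 | 102 | 85 | 7.35% | 6.13% | 17.65% | 28.11% | 66 | 105.1 | 39.1 |
| Ozzie Albies | 371 | 1 | 1398 | 112 | 112 | 8.01% | 8.01% | 17.79% | 25.87% | 66 | 96.0 | 30.0 |
| Brandon Crawford | 302 | 3 | 1195 | 107 | 92 | 8.95% | 7.70% | 21.19% | 28.08% | 64 | 84.8 | 20.8 |
| Carlos Santana | 375 | 3 | 1537 | 90 | 133 | 5.86% | 8.65% | 14.93% | 20.40% | 56 | 76.5 | 20.5 |
| Max Muncy | 319 | 4 | 1312 | 60 | 72 | 4.57% | 5.49% | 17.87% | 24.21% | 57 | 77.2 | 20.2 |
| Juan Soto | 332 | 7 | 1367 | 58 | 87 | 4.24% | 6.36% | 15.66% | 21.66% | 52 | 71.9 | 19.9 |
| Cesar Hernandez | 371 | 0 | 1461 | 92 | 83 | 6.30% | 5.68% | 22.10% | 27.40% | 82 | 101.6 | 19.6 |
| Charlie Blackmon | 327 | 1 | 1184 | 58 | 98 | 4.90% | 8.28% | 13.46% | 19.45% | 44 | 63.6 | 19.6 |
| Chris Taylor | 343 | 1 | 1418 | 111 | 75 | 7.83% | 5.29% | 25.66% | 31.00% | 88 | 106.3 | 18.3 |
| Nelson Cruz | 318 | 6 | 1163 | 90 | 101 | 7.74% | 8.68% | 17.92% | 23.61% | 57 | 75.1 | 18.1 |
| League Average | | | | | | 6.24% | 7.56% | 23.76% | 23.49% | | | |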

This group is almost entirely players with better-than-average (lower) K% whom the model expects to be in the general range of league average. It could be that I’m missing something here, but pitchers may also change their approach to these players in the second half. Perhaps this is something to revisit at the end of the season. Next up, pitcher K underperformers:

Positive Regressors (fewer Ks than anticipated):

xK Positive Regressors – Pitchers
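| Name | TBF | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Luis Castillo | 449 | 2 | 1767 | 135 | 136 | 7.64% | 7.70% | 21.38% | 27.68% | 96 | 124.3 | 28.3 |
| Sandy Alcantara | 476 | 2 | 1760 | 133 | 142 | 7.56% | 8.07% | 21.43% | 26.69% | 102 | 127.0 | 25.0 |
| John Gant | 323 | 2 | 1305 | 72 | 92 | 5.52% | 7.05% | 16.72% | 24.15% | 54 | 78.0 | 24.0 |
| Griffin Canning | 277 | 0 | 1080 | 95 | 83 | 8.80% | 7.69% | 22.38% | 30.52% | 62 | 84.5 | 22.5 |
| Zach Davies | 405 | 0 | 1539 | 91 | 145 | 5.91% | 9.42% | 14.57% | 20.01% | 59 | 81.0 | 22.0 |
| Shane McClanahan | 252 | 0 | 987 | 99 | 60 | 10.03% | 6.08% | 28.17% | 36.91% | 71 | 93.0 | 22.0 |
| Antonio Senzatela | 414 | 0 | 1491 | 74 | 121 | 4.96% | 8.12% | 15.70% | 20.67% | 65 | 85.6 | 20.6 |
| Patrick Sandoval | 232 | 0 | 941 | 89 | 63 | 9.46% | 6.70% | 25.43% | 34.23% | 59 | 79.4 | 20.4 |
| Johan Oviedo | 244 | 1 | 914 | 59 | 73 | 6.46% | 7.99% | 16.39% | 24.32% | 40 | 59.3 | 19.3 |
| Matt Shoemaker | 284 | 1 | 1064 | 62 | 97 | 5.83% | 9.12% | 14.08% | 20.41% | 40 | 58.0 | 18.0 |
| League Average | | | | | | 6.24% | 7.56% | 23.76% | 24.85% | | | |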

Just like the hitter underperformers, there is a wide range here. The double-edged sword with this model is that it will adjust in real time to things like spin rate changes, but it is aimed more at stripping out sequencing than at describing pitch behavior. Now for the pitcher K overperformers:

Negative Regressors (more Ks than anticipated):

xK Negative Regressors – Pitchers
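| Name | TBF | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jose Berrios | 440 | 0 | 1725 | 78 | 153 | 4.52% | 8.87% | 25.91% | 17.99% | 114 | 79.1 | -34.9 |
| Sonny Gray | 265 | 1 | 1033 | 55 | 76 | 5.32% | 7.36% | 30.19% | 23.09% | 80 | 61.2 | -18.8 |
| Matt Barnes | 143 | 2 | 602 | 39 | 28 | 6.48% | 4.65% | 44.06% | 31.35% | 63 | 44.8 | -18.2 |
| James Karinchak | 160 | 0 | 719 | 50 | 38 | 6.95% | 5.29% | 42.50% | 31.51% | 68 | 50.4 | -17.6 |
| Jacob deGrom | 324 | 0 | 1226 | 141 | 78 | 11.50% | 6.36% | 45.06% | 39.71% | 146 | 128.7 | -17.3 |
| Freddy Peralta | 385 | 0 | 1615 | 114 | 90 | 7.06% | 5.57% | 35.06% | 31.12% | 135 | 119.8 | -15.2 |
| Brandon Woodruff | 432 | 0 | 1695 | 109 | 119 | 6.43% | 7.02% | 29.86% | 26.48% | 129 | 114.4 | -14.6 |
| Aaron Nola | 427 | 1 | 1677 | 119 | 132 | 7.10% | 7.87% | 29.51% | 26.10% | 126 | 111.4 | -14.6 |
| Corbin Burnes | 345 | 0 | 1361 | 112 | 80 | 8.23% | 5.88% | 37.10% | 33.17% | 128 | 114.4 | -13.6 |
| Julio Urias | 429 | 3 | 1579 | 109 | 131 | 6.90% | 8.30% | 27.74% | 24.61% | 119 | 105.6 | -13.4 |
| League Average | | | | | | 6.24% | 7.56% | 23.76% | 24.85% | | | |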

Some interesting stuff in this table. A couple of relievers show up, but before we start saying that the model is missing something for them, consider that the top five xK% pitchers with at least 100 TBF are Josh Hader (44.96%), Liam Hendriks (41.73%), Devin Williams (39.82%), Raisel Iglesias (39.74%), and Sam Howard (39.72%). They are a combined 16 Ks below their expected Ks (298 xKs combined). Also, holy guacamole, Jacob deGrom. I know the model says he should have fewer Ks, but it merely takes him to superhuman rather than ultrahuman. On to BB underperformers for hitters:

Positive Regressors (fewer BBs than anticipated):

xBB Positive Regressors – Hitters
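| Name | PA | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nick Solak | 354 | 0 | 1405 | 109 | 259 | 7.76% | 18.43% | 5.37% | 9.55% | 19 | 33.8 | 14.8 |
| Dansby Swanson | 362 | 4 | 1453 | 100 | 258 | 6.88% | 17.76% | 7.18% | 10.77% | 26 | 39.0 | 13.0 |
| Eugenio Suarez | 359 | 0 | 1461 | 105 | 285 | 7.19% | 19.51% | 8.36% | 11.08% | 30 | 39.8 | 9.8 |
| Randy Arozarena | 357 | 2 | 1438 | 80 | 264 | 5.56% | 18.36% | 9.52% | 12.09% | 34 | 43.2 | 9.2 |
| Yordan Alvarez | 315 | 2 | 1304 | 95 | 239 | 7.29% | 18.33% | 7.62% | 10.48% | 24 | 33.0 | 9.0 |
| Teoscar Hernandez | 293 | 1 | 1103 | 92 | 204 | 8.34% | 18.50% | 6.48% | 9.35% | 19 | 27.4 | 8.4 |
| Eli White | 166 | 0 | 683 | 36 | 124 | 5.27% | 18.16% | 6.63% | 11.69% | 11 | 19.4 | 8.4 |
| Kevin Kiermaier | 209 | 1 | 778 | 52 | 136 | 6.68% | 17.48% | 6.22% | 10.14% | 13 | 21.2 | 8.2 |
| Eric Haase | 171 | 0 | 695 | 46 | 127 | 6.62% | 18.27% | 5.85% | 10.50% | 10 | 17.9 | 7.9 |
| Garrett Hampson | 301 | 1 | 1186 | 80 | 190 | 6.75% | 16.02% | 5.98% | 8.61% | 18 | 25.9 | 7.9 |
| League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.76% | | | |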
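The walk columns in the hitter table above work slightly differently from the strikeout ones, because the intentional walks get added back into the expected total. A minimal Python sketch under the same assumptions as before (per-pitch rates, model applied to PA minus IBB, xBB% reported over full PA; the function name is illustrative):

```python
def expected_walks(pa, ibb, pitches, chase_takes, shadow_in_play, walks):
    """Hitter xBB from raw counts: model on non-IBB PA, then IBB added back."""
    ct_pct = 100 * chase_takes / pitches
    sip_pct = 100 * shadow_in_play / pitches
    # Hitter xBB% equation from the article (percentage points).
    rate = 0.91872 * ct_pct - 0.96350 * sip_pct + 0.086
    xbb = rate / 100 * (pa - ibb) + ibb  # every IBB counts as a walk
    return {"xBB": xbb, "xBB%": 100 * xbb / pa, "xBB-BB": xbb - walks}


# Dansby Swanson's row: PA, IBB, Pitches, C/T, S/IP, BB.
swanson = expected_walks(362, 4, 1453, 258, 100, 26)
```

For Swanson this comes out to roughly 39.0 expected walks, an xBB% of about 10.77%, and a gap of about 13.0, in line with his row.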

All but one of these players are expected to have an above-average walk rate per the model. Yordan Alvarez shows up both here and on the K% underperformers list, so he could be in for a good second half if these models are in any way accurate. Now for the overperformers:

Negative Regressors (more BBs than anticipated):

xBB Negative Regressors – Hitters
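| Name | PA | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yasmani Grandal | 246 | 0 | 1147 | 56 | 235 | 4.88% | 20.49% | 24.39% | 14.20% | 60 | 34.9 | -25.1 |
| Yandy Diaz | 323 | 2 | 1262 | 108 | 243 | 8.56% | 19.26% | 16.10% | 10.09% | 52 | 32.6 | -19.4 |
| Carlos Santana | 375 | 3 | 1537 | 133 | 309 | 8.65% | 20.10% | 15.73% | 10.94% | 59 | 41.0 | -18.0 |
| Jose Altuve | 368 | 2 | 1338 | 136 | 240 | 10.16% | 17.94% | 11.68% | 7.28% | 43 | 26.8 | -16.2 |
| Joey Gallo | 351 | 4 | 1474 | 75 | 323 | 5.09% | 21.91% | 20.51% | 16.28% | 72 | 57.1 | -14.9 |
| Garrett Cooper | 238 | 0 | 955 | 80 | 145 | 8.38% | 15.18% | 12.18% | 5.96% | 29 | 14.2 | -14.8 |
| J.T. Realmuto | 276 | 2 | 1097 | 92 | 171 | 8.39% | 15.59% | 11.96% | 7.01% | 33 | 19.3 | -13.7 |
| Leury Garcia | 270 | 0 | 949 | 87 | 136 | 9.17% | 14.33% | 9.26% | 4.42% | 25 | 11.9 | -13.1 |
| Carson Kelly | 187 | 1 | 773 | 56 | 129 | 7.24% | 16.69% | 14.97% | 8.93% | 28 | 16.7 | -11.3 |
| Jake Fraley | 149 | 1 | 678 | 34 | 139 | 5.01% | 20.50% | 22.15% | 14.67% | 33 | 21.9 | -11.1 |
| League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.76% | | | |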

There are a lot of eye-popping walk rates here, and six out of the 10 are still expected to have above-average walk rates. Similar to the xK% overperformers, it will be interesting to see the second halves for some of these players. I have not dug into it, but I would bet pitchers are “avoiding” throwing these players strikes and they’re just generally taking a lot. Carlos Santana is the anti-Yordan Alvarez, with more walks than expected and fewer strikeouts. My wife, who is a statistics whiz, has suggested adding age or career PAs to the model, so perhaps that is something that could be included going forward and may explain both Alvarez and Santana. I’ll toss the pitcher BB tables in here too, just for posterity, even though I’m not as confident in this model:

Positive Regressors (more BBs than anticipated):

xBB Positive Regressors – Pitchers
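| Name | TBF | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| John Gant | 323 | 2 | 1305 | 92 | 236 | 7.05% | 18.08% | 16.41% | 9.81% | 53 | 31.7 | -21.3 |
| Yusei Kikuchi | 392 | 0 | 1489 | 122 | 223 | 8.19% | 14.98% | 8.67% | 5.05% | 34 | 19.8 | -14.2 |
| Justin Dunn | 218 | 0 | 870 | 65 | 141 | 7.47% | 16.21% | 13.30% | 7.00% | 29 | 15.3 | -13.7 |
| Jake Brentz | 169 | 1 | 666 | 48 | 104 | 7.21% | 15.62% | 15.38% | 7.26% | 26 | 12.3 | -13.7 |
| Carlos Martinez | 363 | 2 | 1281 | 115 | 212 | 8.98% | 16.55% | 9.92% | 6.26% | 36 | 22.7 | -13.3 |
| Junior Guerra | 182 | 1 | 732 | 59 | 133 | 8.06% | 18.17% | 15.93% | 8.77% | 29 | 16.0 | -13.0 |
| Triston McKenzie | 212 | 0 | 895 | 52 | 182 | 5.81% | 20.34% | 18.87% | 12.72% | 40 | 27.0 | -13.0 |
| Alex Reyes | 176 | 1 | 717 | 42 | 130 | 5.86% | 18.13% | 18.18% | 11.06% | 32 | 19.5 | -12.5 |
| Kyle Hendricks | 445 | 1 | 1560 | 169 | 223 | 10.83% | 14.29% | 4.49% | 1.84% | 20 | 8.2 | -11.8 |
| Dan Winkler | 137 | 0 | 561 | 45 | 82 | 8.02% | 14.62% | 13.14% | 4.89% | 18 | 6.7 | -11.3 |
| League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.03% | | | |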

Negative Regressors (fewer BBs than anticipated):

xBB Negative Regressors – Pitchers
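| Name | TBF | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jacob deGrom | 324 | 0 | 1226 | 78 | 204 | 6.36% | 16.64% | 3.40% | 8.58% | 11 | 27.8 | 16.8 |
| Corbin Burnes | 345 | 0 | 1361 | 80 | 226 | 5.88% | 16.61% | 4.35% | 9.06% | 15 | 31.3 | 16.3 |
| Zack Greinke | 459 | 0 | 1714 | 160 | 339 | 9.33% | 19.78% | 5.01% | 8.47% | 23 | 38.9 | 15.9 |
| Max Scherzer | 377 | 0 | 1597 | 90 | 270 | 5.64% | 16.91% | 5.84% | 9.61% | 22 | 36.2 | 14.2 |
| Antonio Senzatela | 414 | 0 | 1491 | 121 | 275 | 8.12% | 18.44% | 5.07% | 8.47% | 21 | 35.1 | 14.1 |
| Zack Wheeler | 471 | 1 | 1806 | 132 | 311 | 7.31% | 17.22% | 5.52% | 8.34% | 26 | 39.3 | 13.3 |
| Eduardo Rodriguez | 385 | 0 | 1523 | 123 | 282 | 8.08% | 18.52% | 5.45% | 8.58% | 21 | 33.1 | 12.1 |
| Carlos Rodon | 360 | 1 | 1511 | 70 | 247 | 4.63% | 16.35% | 7.22% | 10.37% | 26 | 37.3 | 11.3 |
| Zach Eflin | 421 | 2 | 1536 | 140 | 253 | 9.11% | 16.47% | 3.33% | 5.97% | 14 | 25.1 | 11.1 |
| Adam Wainwright | 438 | 3 | 1659 | 128 | 306 | 7.72% | 18.44% | 7.08% | 9.52% | 31 | 41.7 | 10.7 |
| League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.03% | | | |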

John Gant shows up on both “good” tables while deGrom and Corbin Burnes show up on both “bad” tables. Antonio Senzatela shows up on +Ks and +BBs, which is an interesting combo, although the fewer balls in play in Colorado the better. Again, it appears an age/experience factor may be showing up here too, which is something to look into going forward.

Over the course of a partial season, you can see that there can be some sizeable effects from applying this model to something like xwOBA. It’s something that I feel is missing from the xwOBA calculations and can be added in to really try to capture a more detailed plate discipline picture to get a better feel for a player’s expected performance. I think further evaluation could yield a more predictive (rather than descriptive) model, using age/experience and/or using a cluster analysis to identify “types” of hitters rather than simply looking at individual zone/outcome combinations.





newsense

This is a thoughtful attempt, but I can’t help but think that treating all players’ rates as determined by these factors, so that any deviance is likely to regress, is a big oversimplification. Some players will be different based upon their approach and skill set. Some of that may relate to age but much more likely comes into play.

millerf94

Great post, I enjoyed reading it. I personally have been working on the plate discipline side of the xwOBA, and xBB is something I definitely want to include in the equation.
My question is, is it really okay to construct xK and xBB models from something that includes In Play%? K, BB, and In Play are the main (and maybe only?) event outcomes that can happen per PA. I think it is natural to have a high negative correlation among K, BB, and In Play, since if one happens the other two do not.