Looking at xK% and xBB% Using StatCast Zones
Mike Podhorzer has recently been publishing articles regarding his xK%, primarily using Baseball-Reference’s strike rates. I have been trying for a few years now to come up with my own using StatCast numbers, and you’d be surprised how little data is required to get a very close xK% and xBB% for both hitters and pitchers.
I love Mike’s work, but that xK% equation is a bit unwieldy, including seven inputs, two of which have a correlation of -.829. It also uses roughly 63% of the pitches seen by hitters by accounting for all strikes, in addition to three other variables on top of that. I wanted to come up with something simpler that didn’t require so much input and wasn’t so overly constrained. After all, if we throw so much data in there that we basically know the outcome before we start, what good is it going forward? To that end, I dove into StatCast to see what might work.
To start with, StatCast has four defined “attack zones”: Heart, Shadow, Chase, and Waste. I wanted to use these since they’re more specific than the In Zone/Out of Zone split in standard plate discipline metrics. Next, I broke the results of those four zones into four outcomes for each zone: In Play, Whiff, Foul, and Take. I have omitted intentional balls and intentional walks from the model since those are not necessarily “skill” based, at least in terms of plate discipline. They do get added back in later after all the calculations are complete.
After this breakdown, I have 16 zone/outcome combinations. I examined all batters who saw 5,000 total pitches from 2016-20 (N=255) and all pitchers who had 5,000 pitches from 2016-20 (N=181). For an idea of frequency for each combination, Shadow/Take is the most frequent at around 20.2% in the hitter set, and Waste/In Play is the least frequent at 0.02%. As I mentioned, I removed IBB from the PA count and BB count, since those PA are basically guaranteed to be walks, Miguel Cabrera antics notwithstanding.
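As a rough illustration of the bookkeeping, the 16 rates can be tallied from pitch-level records along these lines. This is a minimal sketch, not the code used for the article: the function name, the (zone, outcome) tuple format, and the toy pitches are all assumptions, and it assumes each rate is a share of total pitches seen with intentional balls already filtered out.

```python
from collections import Counter

ZONES = ("Heart", "Shadow", "Chase", "Waste")
OUTCOMES = ("In Play", "Whiff", "Foul", "Take")


def zone_outcome_rates(pitches):
    """Return each zone/outcome combination as a share of all pitches.

    `pitches` is an iterable of (zone, outcome) pairs, one per pitch.
    Assumption: intentional balls were removed before calling this.
    """
    counts = Counter(pitches)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pitches supplied")
    # Always emit all 16 combinations, even those never observed.
    return {
        (zone, outcome): counts[(zone, outcome)] / total
        for zone in ZONES
        for outcome in OUTCOMES
    }


# Ten made-up pitches for illustration only, not real data.
sample = (
    [("Shadow", "Take")] * 4
    + [("Heart", "In Play")] * 2
    + [("Shadow", "Whiff")] * 2
    + [("Chase", "Take")]
    + [("Heart", "Foul")]
)
rates = zone_outcome_rates(sample)
print(len(rates))                 # 16 combinations
print(rates[("Shadow", "Take")])  # 4 of 10 pitches
```

Keeping every combination in the output, including ones with a zero count, makes it easier to line players up in a single table later.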
I started with the zone/outcome combination that had the highest correlation to the rate in question and ran a linear regression with it. I then compared each remaining combination against the difference between the xRate and the actual rate. Once I found a set of combinations such that adding another did not appreciably increase the accuracy (a judgment call, to be sure), I used Principal Component Analysis (h/t Lucas Kelly) to try to sharpen the pencil in terms of covariance. The exception was xBB% for pitchers, where the two combinations were not as tied together; that one is an interesting beast to tame. It turns out both rates for hitters and pitchers can be somewhat-well-defined to pretty-well-defined by two zone/outcome combinations each.
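The selection step described above resembles a greedy forward selection, which could be sketched roughly as follows. Everything here is illustrative: the function name, the `min_gain` cutoff (a numeric stand-in for the judgment call on when to stop), and the synthetic data are assumptions, and the PCA step is not shown.

```python
import numpy as np


def forward_select(X, y, names, min_gain=0.01):
    """Greedily add the column most correlated with the current residual.

    Stops when the next column would raise R^2 by less than `min_gain`
    (a hypothetical stand-in for the article's judgment call).
    """
    n_rows, n_cols = X.shape
    total_ss = np.sum((y - y.mean()) ** 2)
    chosen, best_r2 = [], 0.0
    residual = y - y.mean()
    while len(chosen) < n_cols:
        remaining = [j for j in range(n_cols) if j not in chosen]
        # Correlate each unused column with what the model still misses.
        corrs = [abs(np.corrcoef(X[:, j], residual)[0, 1]) for j in remaining]
        candidate = remaining[int(np.argmax(corrs))]
        trial = chosen + [candidate]
        # Ordinary least squares with an intercept column.
        design = np.column_stack([X[:, trial], np.ones(n_rows)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        fitted = design @ coef
        r2 = 1.0 - np.sum((y - fitted) ** 2) / total_ss
        if r2 - best_r2 < min_gain:
            break
        chosen, best_r2, residual = trial, r2, y - fitted
    return [names[j] for j in chosen], best_r2


# Synthetic data, NOT real StatCast numbers: the target depends on two
# of five made-up predictors plus a little noise.
rng = np.random.default_rng(0)
names = ["combo_a", "combo_b", "combo_c", "combo_d", "combo_e"]
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)
selected, r2 = forward_select(X, y, names)
print(selected, round(r2, 3))
```

On this toy data the loop should stop after picking up the two predictors that actually drive the target, which mirrors the finding that two combinations per rate were enough.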
For both hitters and pitchers, using Shadow/Whiff% and Shadow/In Play% provided a very good approximation for K%, with an adjusted R^2 of 0.834 for hitters and 0.800 for pitchers over the 2016-20 sample. These two combinations account for roughly 14% of pitches seen, or less than one out of every seven pitches. Even better, if you look at all players in 2021 with at least 50 PA/TBF, the adjusted R^2 using the xK% equation from the 2016-20 data is 0.645 for hitters and 0.644 for pitchers, so it works pretty well for smaller sample sizes in an independent set as well. Also, if you use the model on the league as a whole, the xK% is very close to the overall league K%. The equation for hitters is:
xK% = 1.89353*S/W% – 2.01758*S/IP% + 26.935%
and for pitchers:
xK% = 2.32362*S/W% – 2.19556*S/IP% + 26.954%
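For anyone who wants to plug in numbers, the two xK% equations translate directly into code. This is a minimal sketch with hypothetical function names; inputs and outputs are in percentage points, and the league-average inputs (6.24% S/W% and 7.56% S/IP%) are taken from the tables later in this post. No IBB adjustment is applied here.

```python
def xk_pct_hitter(shadow_whiff_pct, shadow_in_play_pct):
    """Hitter xK% (percentage points) from the 2016-20 fit."""
    return 1.89353 * shadow_whiff_pct - 2.01758 * shadow_in_play_pct + 26.935


def xk_pct_pitcher(shadow_whiff_pct, shadow_in_play_pct):
    """Pitcher xK% (percentage points) from the 2016-20 fit."""
    return 2.32362 * shadow_whiff_pct - 2.19556 * shadow_in_play_pct + 26.954


# League-average 2021 inputs reported in the tables in this post.
league_sw, league_sip = 6.24, 7.56
print(round(xk_pct_hitter(league_sw, league_sip), 2))   # about 23.50
print(round(xk_pct_pitcher(league_sw, league_sip), 2))  # about 24.85
```

Both results land within a few hundredths of a point of the league xK% values listed in the tables (23.49% for hitters, 24.85% for pitchers).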
For hitters, using Chase/Take% and Shadow/In Play% (again) approximated BB% well, with an adjusted R^2 of 0.816. These two combinations account for roughly 26% of pitches seen. Applying the 2016-20 model to the 2021 hitters gave an adjusted R^2 of 0.577, again a pretty good result for smaller sample sizes in an independent set. Again, using the model on the league as a whole, the xBB% is very close to the overall league BB%. The equation for hitters is:
xBB% = 0.91872*C/T% – 0.96350*S/IP% + 0.086%
Pitchers are a little trickier, but I ended up using the same combinations as hitters for simplicity’s sake. They seemed to result in the best bang for the buck and were actually not as correlated with each other as some other options. The adjusted R^2 of 0.530 is not nearly as good as the other models, and the 2021 adjusted R^2 of 0.391 was also weaker. I would love to delve into this more, but to me, it seems like walk rates are more defined by hitters than pitchers. This is on my list of things to investigate going forward, as I would be interested to see if including a “batters faced” component would be beneficial. I’m not as confident in this model, but here it is nonetheless:
xBB% = 0.96329*C/T% – 1.05340*S/IP% – 0.744%
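The two xBB% equations can be wrapped the same way. This is only a sketch: the `XBB_COEFFS` table and `xbb_pct` function are hypothetical names, values are in percentage points, and the league-average inputs (17.38% C/T% and 7.56% S/IP%) come from the tables later in this post.

```python
# (C/T% coefficient, S/IP% coefficient, constant), copied from the
# hitter and pitcher equations in the text.
XBB_COEFFS = {
    "hitter": (0.91872, -0.96350, 0.086),
    "pitcher": (0.96329, -1.05340, -0.744),
}


def xbb_pct(role, chase_take_pct, shadow_in_play_pct):
    """Expected BB% in percentage points, before any IBB adjustment."""
    ct_coef, sip_coef, constant = XBB_COEFFS[role]
    return ct_coef * chase_take_pct + sip_coef * shadow_in_play_pct + constant


# League-average 2021 inputs reported in the tables in this post.
league_ct, league_sip = 17.38, 7.56
print(round(xbb_pct("hitter", league_ct, league_sip), 2))   # about 8.77
print(round(xbb_pct("pitcher", league_ct, league_sip), 2))  # about 8.03
```

Because the fit is linear, a very low C/T% paired with a very high S/IP% can produce a negative xBB%. The article does not say how such cases are handled, so no clamping is applied in this sketch.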
I do find it particularly interesting that the constant terms in the xK% models are in the 27% range, whereas the constants in the xBB% models are less than 1% in magnitude. If you think about an extreme case where S/W%, S/IP%, and C/T% are all zero, K% goes to about 27% and BB% goes to roughly zero. This makes sense to me. It only takes three strikes to strike out but you have to have four balls to walk, so you would think you’d get to Ks before BBs, all other things being equal.
Now that we’ve done this exercise, let’s look at leaders and laggards for these in the 2021 season. All of these models have IBB removed from PA and BB, the xStat model run, and then the IBB are added back into PA and BB. All data is through the All-Star Break. Let’s look at the hitters’ Ks first:
Positive Regressors (more Ks than anticipated):
Name | PA | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Isiah Kiner-Falefa | 384 | 1 | 1331 | 39 | 152 | 2.93% | 11.42% | 15.89% | 9.42% | 61 | 36.2 | -24.8 |
Jared Walsh | 354 | 4 | 1376 | 82 | 115 | 5.96% | 8.36% | 27.68% | 21.12% | 98 | 74.7 | -23.3 |
Brett Phillips | 187 | 0 | 822 | 49 | 44 | 5.96% | 5.35% | 39.04% | 27.42% | 73 | 51.3 | -21.7 |
Adam Duvall | 290 | 0 | 1122 | 77 | 90 | 6.86% | 8.02% | 30.69% | 23.75% | 89 | 68.9 | -20.1 |
Yordan Alvarez | 315 | 2 | 1304 | 59 | 95 | 4.52% | 7.29% | 26.35% | 20.67% | 83 | 65.1 | -17.9 |
Steven Duggar | 186 | 2 | 768 | 37 | 45 | 4.82% | 5.86% | 32.80% | 23.98% | 61 | 44.6 | -16.4 |
Joey Wendle | 277 | 3 | 1067 | 48 | 97 | 4.50% | 9.09% | 22.38% | 16.93% | 62 | 46.9 | -15.1 |
JaCoby Jones | 105 | 0 | 396 | 36 | 36 | 9.09% | 9.09% | 40.00% | 25.81% | 42 | 27.1 | -14.9 |
Gio Urshela | 314 | 0 | 1221 | 75 | 111 | 6.14% | 9.09% | 24.84% | 20.22% | 78 | 63.5 | -14.5 |
Alec Bohm | 329 | 0 | 1334 | 71 | 101 | 5.32% | 7.57% | 26.14% | 21.74% | 86 | 71.5 | -14.5 |
League Average | | | | | | 6.24% | 7.56% | 23.76% | 23.49% | | | |
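The xK and xK% columns in the table above can be approximately reproduced from the raw counts. The sketch below reflects my reading of the IBB handling (run the hitter model, apply it to PA minus IBB, then divide by full PA); the function name is hypothetical, and the inputs are Isiah Kiner-Falefa's row.

```python
def expected_strikeouts(pa, ibb, pitches, shadow_whiffs, shadow_in_play):
    """Return (xK, xK%) for a hitter, with IBB removed and then added back.

    Assumption: an intentional walk can never be a strikeout, so the
    model rate applies only to the non-IBB plate appearances.
    """
    sw_pct = 100.0 * shadow_whiffs / pitches
    sip_pct = 100.0 * shadow_in_play / pitches
    # Hitter xK% equation from the 2016-20 fit, in percentage points.
    model_pct = 1.89353 * sw_pct - 2.01758 * sip_pct + 26.935
    xk = model_pct / 100.0 * (pa - ibb)
    # Adding the IBB back restores the full PA denominator.
    xk_pct = 100.0 * xk / pa
    return xk, xk_pct


# Isiah Kiner-Falefa's row: 384 PA, 1 IBB, 1331 pitches, 39 S/W, 152 S/IP.
xk, xk_pct = expected_strikeouts(384, 1, 1331, 39, 152)
actual_k = 61
print(round(xk, 1), round(xk_pct, 2), round(xk - actual_k, 1))
```

This comes out at roughly 36.2 expected strikeouts and an xK% near 9.42%, in line with the table row, which suggests the reading of the IBB step is at least close.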
These are the 10 biggest differences in terms of Ks, not K%, so there is no PA minimum here. It’s interesting to see an array of different profiles here, with actual K% ranging from 16% to 40% and xK% ranging from 9% to 27%. This seems to suggest that this could identify underperformers across an array of different approaches. Now for the overperformers:
Negative Regressors (fewer Ks than expected)
Name | PA | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Vladimir Guerrero Jr. | 374 | 5 | 1387 | 102 | 85 | 7.35% | 6.13% | 17.65% | 28.11% | 66 | 105.1 | 39.1 |
Ozzie Albies | 371 | 1 | 1398 | 112 | 112 | 8.01% | 8.01% | 17.79% | 25.87% | 66 | 96.0 | 30.0 |
Brandon Crawford | 302 | 3 | 1195 | 107 | 92 | 8.95% | 7.70% | 21.19% | 28.08% | 64 | 84.8 | 20.8 |
Carlos Santana | 375 | 3 | 1537 | 90 | 133 | 5.86% | 8.65% | 14.93% | 20.40% | 56 | 76.5 | 20.5 |
Max Muncy | 319 | 4 | 1312 | 60 | 72 | 4.57% | 5.49% | 17.87% | 24.21% | 57 | 77.2 | 20.2 |
Juan Soto | 332 | 7 | 1367 | 58 | 87 | 4.24% | 6.36% | 15.66% | 21.66% | 52 | 71.9 | 19.9 |
Cesar Hernandez | 371 | 0 | 1461 | 92 | 83 | 6.30% | 5.68% | 22.10% | 27.40% | 82 | 101.6 | 19.6 |
Charlie Blackmon | 327 | 1 | 1184 | 58 | 98 | 4.90% | 8.28% | 13.46% | 19.45% | 44 | 63.6 | 19.6 |
Chris Taylor | 343 | 1 | 1418 | 111 | 75 | 7.83% | 5.29% | 25.66% | 31.00% | 88 | 106.3 | 18.3 |
Nelson Cruz | 318 | 6 | 1163 | 90 | 101 | 7.74% | 8.68% | 17.92% | 23.61% | 57 | 75.1 | 18.1 |
League Average | | | | | | 6.24% | 7.56% | 23.76% | 23.49% | | | |
This group is almost all players with a better-than-average (that is, lower) K% who are expected to be generally in the league-average range. It could be that I’m missing something here, but pitchers may change their approach to these players in the second half to account for something. Perhaps this is something to revisit at the end of the season. Next up, pitcher K underperformers:
Positive Regressors (fewer Ks than anticipated):
Name | TBF | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Luis Castillo | 449 | 2 | 1767 | 135 | 136 | 7.64% | 7.70% | 21.38% | 27.68% | 96 | 124.3 | 28.3 |
Sandy Alcantara | 476 | 2 | 1760 | 133 | 142 | 7.56% | 8.07% | 21.43% | 26.69% | 102 | 127.0 | 25.0 |
John Gant | 323 | 2 | 1305 | 72 | 92 | 5.52% | 7.05% | 16.72% | 24.15% | 54 | 78.0 | 24.0 |
Griffin Canning | 277 | 0 | 1080 | 95 | 83 | 8.80% | 7.69% | 22.38% | 30.52% | 62 | 84.5 | 22.5 |
Zach Davies | 405 | 0 | 1539 | 91 | 145 | 5.91% | 9.42% | 14.57% | 20.01% | 59 | 81.0 | 22.0 |
Shane McClanahan | 252 | 0 | 987 | 99 | 60 | 10.03% | 6.08% | 28.17% | 36.91% | 71 | 93.0 | 22.0 |
Antonio Senzatela | 414 | 0 | 1491 | 74 | 121 | 4.96% | 8.12% | 15.70% | 20.67% | 65 | 85.6 | 20.6 |
Patrick Sandoval | 232 | 0 | 941 | 89 | 63 | 9.46% | 6.70% | 25.43% | 34.23% | 59 | 79.4 | 20.4 |
Johan Oviedo | 244 | 1 | 914 | 59 | 73 | 6.46% | 7.99% | 16.39% | 24.32% | 40 | 59.3 | 19.3 |
Matt Shoemaker | 284 | 1 | 1064 | 62 | 97 | 5.83% | 9.12% | 14.08% | 20.41% | 40 | 58.0 | 18.0 |
League Average | | | | | | 6.24% | 7.56% | 23.76% | 24.85% | | | |
Just like the hitter underperformers, there is a wide range here. The double-edged sword with this model is it will adjust in real time to spin rate changes, but it’s more looking to strip out sequencing rather than pitch behavior. Now for the pitcher K overperformers:
Negative Regressors (more Ks than anticipated):
Name | TBF | IBB | Pitches | S/W | S/IP | S/W% | S/IP% | K% | xK% | K | xK | xK-K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Jose Berrios | 440 | 0 | 1725 | 78 | 153 | 4.52% | 8.87% | 25.91% | 17.99% | 114 | 79.1 | -34.9 |
Sonny Gray | 265 | 1 | 1033 | 55 | 76 | 5.32% | 7.36% | 30.19% | 23.09% | 80 | 61.2 | -18.8 |
Matt Barnes | 143 | 2 | 602 | 39 | 28 | 6.48% | 4.65% | 44.06% | 31.35% | 63 | 44.8 | -18.2 |
James Karinchak | 160 | 0 | 719 | 50 | 38 | 6.95% | 5.29% | 42.50% | 31.51% | 68 | 50.4 | -17.6 |
Jacob deGrom | 324 | 0 | 1226 | 141 | 78 | 11.50% | 6.36% | 45.06% | 39.71% | 146 | 128.7 | -17.3 |
Freddy Peralta | 385 | 0 | 1615 | 114 | 90 | 7.06% | 5.57% | 35.06% | 31.12% | 135 | 119.8 | -15.2 |
Brandon Woodruff | 432 | 0 | 1695 | 109 | 119 | 6.43% | 7.02% | 29.86% | 26.48% | 129 | 114.4 | -14.6 |
Aaron Nola | 427 | 1 | 1677 | 119 | 132 | 7.10% | 7.87% | 29.51% | 26.10% | 126 | 111.4 | -14.6 |
Corbin Burnes | 345 | 0 | 1361 | 112 | 80 | 8.23% | 5.88% | 37.10% | 33.17% | 128 | 114.4 | -13.6 |
Julio Urias | 429 | 3 | 1579 | 109 | 131 | 6.90% | 8.30% | 27.74% | 24.61% | 119 | 105.6 | -13.4 |
League Average | | | | | | 6.24% | 7.56% | 23.76% | 24.85% | | | |
Some interesting stuff in this table. A couple of relievers show up, and before we start saying that the model is missing something for them, the top five xK% pitchers with at least 100 TBF are Josh Hader (44.96%), Liam Hendriks (41.73%), Devin Williams (39.82%), Raisel Iglesias (39.74%), and Sam Howard (39.72%). They are a combined 16 Ks below their expected Ks (298 xKs combined). Also, holy guacamole Jacob deGrom. I know the model says he should have fewer Ks, but it merely takes him to superhuman rather than ultrahuman. On to BB underperformers for hitters:
Positive Regressors (fewer BBs than anticipated):
Name | PA | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Nick Solak | 354 | 0 | 1405 | 109 | 259 | 7.76% | 18.43% | 5.37% | 9.55% | 19 | 33.8 | 14.8 |
Dansby Swanson | 362 | 4 | 1453 | 100 | 258 | 6.88% | 17.76% | 7.18% | 10.77% | 26 | 39.0 | 13.0 |
Eugenio Suarez | 359 | 0 | 1461 | 105 | 285 | 7.19% | 19.51% | 8.36% | 11.08% | 30 | 39.8 | 9.8 |
Randy Arozarena | 357 | 2 | 1438 | 80 | 264 | 5.56% | 18.36% | 9.52% | 12.09% | 34 | 43.2 | 9.2 |
Yordan Alvarez | 315 | 2 | 1304 | 95 | 239 | 7.29% | 18.33% | 7.62% | 10.48% | 24 | 33.0 | 9.0 |
Teoscar Hernandez | 293 | 1 | 1103 | 92 | 204 | 8.34% | 18.50% | 6.48% | 9.35% | 19 | 27.4 | 8.4 |
Eli White | 166 | 0 | 683 | 36 | 124 | 5.27% | 18.16% | 6.63% | 11.69% | 11 | 19.4 | 8.4 |
Kevin Kiermaier | 209 | 1 | 778 | 52 | 136 | 6.68% | 17.48% | 6.22% | 10.14% | 13 | 21.2 | 8.2 |
Eric Haase | 171 | 0 | 695 | 46 | 127 | 6.62% | 18.27% | 5.85% | 10.50% | 10 | 17.9 | 7.9 |
Garrett Hampson | 301 | 1 | 1186 | 80 | 190 | 6.75% | 16.02% | 5.98% | 8.61% | 18 | 25.9 | 7.9 |
League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.76% | | | |
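The xBB and xBB% columns in the table above follow the same pattern as the strikeout tables, except that intentional walks count as walks when they are added back. The sketch below is my reading of that step rather than the author's code; the function name is hypothetical, and the inputs are Dansby Swanson's row.

```python
def expected_walks(pa, ibb, pitches, chase_takes, shadow_in_play):
    """Return (xBB, xBB%) for a hitter, with IBB removed and then added back.

    Assumption: the model rate applies to the non-IBB plate appearances,
    and each IBB is then added back as one guaranteed walk.
    """
    ct_pct = 100.0 * chase_takes / pitches
    sip_pct = 100.0 * shadow_in_play / pitches
    # Hitter xBB% equation from the 2016-20 fit, in percentage points.
    model_pct = 0.91872 * ct_pct - 0.96350 * sip_pct + 0.086
    xbb = model_pct / 100.0 * (pa - ibb) + ibb
    xbb_pct = 100.0 * xbb / pa
    return xbb, xbb_pct


# Dansby Swanson's row: 362 PA, 4 IBB, 1453 pitches, 258 C/T, 100 S/IP.
xbb, xbb_pct = expected_walks(362, 4, 1453, 258, 100)
actual_bb = 26
print(round(xbb, 1), round(xbb_pct, 2), round(xbb - actual_bb, 1))
```

The result is roughly 39.0 expected walks and an xBB% near 10.77%, which lines up with the table row.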
All but one of these players are expected to have an above-average walk rate per the model. Yordan Alvarez shows up both here and on the K% underperformers list, so he could be in for a good second half if these models are in any way accurate. Now for the overperformers:
Negative Regressors (more BBs than anticipated):
Name | PA | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Yasmani Grandal | 246 | 0 | 1147 | 56 | 235 | 4.88% | 20.49% | 24.39% | 14.20% | 60 | 34.9 | -25.1 |
Yandy Diaz | 323 | 2 | 1262 | 108 | 243 | 8.56% | 19.26% | 16.10% | 10.09% | 52 | 32.6 | -19.4 |
Carlos Santana | 375 | 3 | 1537 | 133 | 309 | 8.65% | 20.10% | 15.73% | 10.94% | 59 | 41.0 | -18.0 |
Jose Altuve | 368 | 2 | 1338 | 136 | 240 | 10.16% | 17.94% | 11.68% | 7.28% | 43 | 26.8 | -16.2 |
Joey Gallo | 351 | 4 | 1474 | 75 | 323 | 5.09% | 21.91% | 20.51% | 16.28% | 72 | 57.1 | -14.9 |
Garrett Cooper | 238 | 0 | 955 | 80 | 145 | 8.38% | 15.18% | 12.18% | 5.96% | 29 | 14.2 | -14.8 |
J.T. Realmuto | 276 | 2 | 1097 | 92 | 171 | 8.39% | 15.59% | 11.96% | 7.01% | 33 | 19.3 | -13.7 |
Leury Garcia | 270 | 0 | 949 | 87 | 136 | 9.17% | 14.33% | 9.26% | 4.42% | 25 | 11.9 | -13.1 |
Carson Kelly | 187 | 1 | 773 | 56 | 129 | 7.24% | 16.69% | 14.97% | 8.93% | 28 | 16.7 | -11.3 |
Jake Fraley | 149 | 1 | 678 | 34 | 139 | 5.01% | 20.50% | 22.15% | 14.67% | 33 | 21.9 | -11.1 |
League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.76% | | | |
There are a lot of eye-popping walk rates here, and six out of the 10 are still expected to have above-average walk rates. Similar to the xK% overperformers, it will be interesting to see the second halves for some of these players. I have not dug into it, but I would bet pitchers are “avoiding” throwing these players strikes and they’re just generally taking a lot. Carlos Santana is the anti-Yordan Alvarez, with more walks than expected and fewer strikeouts. My wife, who is a statistics whiz, has suggested adding age or career PAs to the model, so perhaps that is something that could be included going forward and may explain both Alvarez and Santana. I’ll toss the pitcher BB tables in here too, just for posterity, even though I’m not as confident in this model:
Positive Regressors (more BBs than anticipated):
Name | TBF | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
---|---|---|---|---|---|---|---|---|---|---|---|---|
John Gant | 323 | 2 | 1305 | 92 | 236 | 7.05% | 18.08% | 16.41% | 9.81% | 53 | 31.7 | -21.3 |
Yusei Kikuchi | 392 | 0 | 1489 | 122 | 223 | 8.19% | 14.98% | 8.67% | 5.05% | 34 | 19.8 | -14.2 |
Justin Dunn | 218 | 0 | 870 | 65 | 141 | 7.47% | 16.21% | 13.30% | 7.00% | 29 | 15.3 | -13.7 |
Jake Brentz | 169 | 1 | 666 | 48 | 104 | 7.21% | 15.62% | 15.38% | 7.26% | 26 | 12.3 | -13.7 |
Carlos Martinez | 363 | 2 | 1281 | 115 | 212 | 8.98% | 16.55% | 9.92% | 6.26% | 36 | 22.7 | -13.3 |
Junior Guerra | 182 | 1 | 732 | 59 | 133 | 8.06% | 18.17% | 15.93% | 8.77% | 29 | 16.0 | -13.0 |
Triston McKenzie | 212 | 0 | 895 | 52 | 182 | 5.81% | 20.34% | 18.87% | 12.72% | 40 | 27.0 | -13.0 |
Alex Reyes | 176 | 1 | 717 | 42 | 130 | 5.86% | 18.13% | 18.18% | 11.06% | 32 | 19.5 | -12.5 |
Kyle Hendricks | 445 | 1 | 1560 | 169 | 223 | 10.83% | 14.29% | 4.49% | 1.84% | 20 | 8.2 | -11.8 |
Dan Winkler | 137 | 0 | 561 | 45 | 82 | 8.02% | 14.62% | 13.14% | 4.89% | 18 | 6.7 | -11.3 |
League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.03% | | | |
Negative Regressors (fewer BBs than anticipated):
Name | TBF | IBB | Pitches | S/IP | C/T | S/IP% | C/T% | BB% | xBB% | BB | xBB | xBB-BB |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Jacob deGrom | 324 | 0 | 1226 | 78 | 204 | 6.36% | 16.64% | 3.40% | 8.58% | 11 | 27.8 | 16.8 |
Corbin Burnes | 345 | 0 | 1361 | 80 | 226 | 5.88% | 16.61% | 4.35% | 9.06% | 15 | 31.3 | 16.3 |
Zack Greinke | 459 | 0 | 1714 | 160 | 339 | 9.33% | 19.78% | 5.01% | 8.47% | 23 | 38.9 | 15.9 |
Max Scherzer | 377 | 0 | 1597 | 90 | 270 | 5.64% | 16.91% | 5.84% | 9.61% | 22 | 36.2 | 14.2 |
Antonio Senzatela | 414 | 0 | 1491 | 121 | 275 | 8.12% | 18.44% | 5.07% | 8.47% | 21 | 35.1 | 14.1 |
Zack Wheeler | 471 | 1 | 1806 | 132 | 311 | 7.31% | 17.22% | 5.52% | 8.34% | 26 | 39.3 | 13.3 |
Eduardo Rodriguez | 385 | 0 | 1523 | 123 | 282 | 8.08% | 18.52% | 5.45% | 8.58% | 21 | 33.1 | 12.1 |
Carlos Rodon | 360 | 1 | 1511 | 70 | 247 | 4.63% | 16.35% | 7.22% | 10.37% | 26 | 37.3 | 11.3 |
Zach Eflin | 421 | 2 | 1536 | 140 | 253 | 9.11% | 16.47% | 3.33% | 5.97% | 14 | 25.1 | 11.1 |
Adam Wainwright | 438 | 3 | 1659 | 128 | 306 | 7.72% | 18.44% | 7.08% | 9.52% | 31 | 41.7 | 10.7 |
League Average | | | | | | 7.56% | 17.38% | 8.92% | 8.03% | | | |
John Gant shows up on both “good” tables while deGrom and Corbin Burnes show up on both “bad” tables. Antonio Senzatela shows up on +Ks and +BBs, which is an interesting combo, although the fewer balls in play in Colorado the better. Again, it appears an age/experience factor may be showing up here too, something to look into going forward.
Over the course of a partial season, you can see that there can be some sizeable effects from applying this model to something like xwOBA. It’s something that I feel is missing from the xwOBA calculations and can be added in to really try to capture a more detailed plate discipline picture to get a better feel for a player’s expected performance. I think further evaluation could yield a more predictive (rather than descriptive) model, using age/experience and/or using a cluster analysis to identify “types” of hitters rather than simply looking at individual zone/outcome combinations.
This is a thoughtful attempt, but I can’t help but think that treating all players’ rates as determined by these factors, so that any deviance is likely to regress, is a big oversimplification. Some players will be different based upon their approach and skill set. Some of that may relate to age, but much more than that likely comes into play.
But this is true of any model, right? FIP/xFIP definitely have this issue, as do BABIP and LOB%, which aren’t really models, but are generally considered to be “luck” factors.
I’ve been developing this over the course of a few years, and there are definitely players who are usually consistently over/under where they “should” be. Zack Greinke seems to constantly “beat” his xBB%, as does Trevor Bauer. Luis Castillo seems to underperform his xK% relatively consistently.
Models are intended to capture the bulk of cases. Yes, there will always be outliers, even consistent outliers. But at a certain point, further “refining” of the model can be a bad thing. I feel like this can give a good start on identifying players who could improve/decline in their plate discipline metrics going forward.
Great post, I enjoyed reading it. I personally have been working on the plate discipline side of the xwOBA, and xBB is something I definitely want to include in the equation.
My question is: is it really okay to construct xK and xBB models from something with In Play%? K, BB, and In Play are the main (and maybe only?) event outcomes that can happen per PA, so I think it is natural to have a high negative correlation among them, since if one happens the other two do not.
Your point is definitely valid, and I was wary of that all the way along, but I only included InPlay% from a particular zone, which accounts for less than half (~45%) of the total pitches put into play over the time frame, and less than 8% of all pitches over the time frame, so it’s not like I’m accounting for 100% of the balls in play.
Furthermore, I would venture to say that putting the ball in play is a skill, especially in the Shadow zone, defined by StatCast as the region within one baseball diameter of the outside of the strike zone. The same names pop up a lot at the top and bottom of this list when you break it down by year, so they are earning both their low K and low BB rates.