Estimating Plate-Discipline Stats for Earlier Players

The plate discipline stats at FanGraphs are fantastic. Lots of stuff can be drawn from them, and the articles I’ve linked to only scratch the surface of both what’s already been done and what we can still do with them. So many things are great about them: they’re very stable, they’re good indicators of other statistics that might be less stable, and they’re completely isolated to the batter and pitcher. The problem is, they only go back to 2002 (for the BIS ones) or 2007 (for the Pitchf/x ones). So what if we want plate discipline numbers for players from before then? How do we know how often Babe Ruth or Willie Mays or Hank Aaron swung at pitches inside the zone, or how often they made contact on pitches outside the zone?

Regressions, that’s how.

Using the Baseball Info Solutions plate discipline data (only because it goes back farther, and also has the SwStr% and F-Strike% stats), I ran a multivariate regression in R to estimate each of the plate discipline numbers provided on FanGraphs: O-Swing%, Z-Swing%, Swing%, O-Contact%, Z-Contact%, Contact%, Zone%, F-Strike%, and SwStr%. I used the following stats as predictor variables: BB% and K% (for obvious reasons), ISO (I figured power hitters might be prone to different types of numbers), BABIP (same goes for hitters who can maintain higher BABIPs), HR% (same thinking as ISO), and OBP (combining hitting ability and plate discipline, even if somewhat crudely). My dataset was every qualified hitting season from 2002 until now. I couldn’t use any batted ball data (GB%, FB%, etc.) as a variable because we don’t have that prior to 2002 either. So that was what I had.
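The regression itself was run in R, and that code isn't shown here. As a rough illustration only, here is a minimal Python sketch of the same kind of fit: one ordinary least-squares model per plate discipline stat, with an intercept and the six predictors. The function names (`solve_linear_system`, `fit_ols`) are hypothetical, and the sketch solves the normal equations directly rather than using whatever R does internally.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's lists are not modified.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; the fit is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Least-squares fit of one stat; returns [intercept, coef_1, ..., coef_k].

    rows:    one tuple per player-season, e.g. (BB%, K%, ISO, BABIP, HR%, OBP),
             with every rate expressed as a decimal.
    targets: the stat being modeled (e.g. O-Swing%) for the same seasons.
    """
    # Prepend 1.0 to each row so the first fitted value is the intercept.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

Calling `fit_ols` once per target stat would produce one row of coefficients per stat. Since OBP overlaps heavily with BB% and BABIP, individual coefficients from a fit like this can be hard to interpret on their own.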

Some stats worked better than others – for example, the r^2 for Contact% was an excellent 0.8089, while for Zone% it was a measly 0.1551. And of course, it’s possible that the coefficients would be different for prior eras than they are now. But, hey, what can you do. Here, first, are the r^2s for each statistic, so you know how much to trust each number:

Statistic r^2
O-Swing% 0.3615
Z-Swing% 0.2450
Swing% 0.5222
O-Contact% 0.3956
Z-Contact% 0.7328
Contact% 0.8089
Zone% 0.1551
F-Strike% 0.4374
SwStr% 0.7072
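For reference, r^2 is the share of the variance in the real stat that the regression accounts for, so 0.8089 for Contact% means roughly 81% of the season-to-season variation is explained, and 0.1551 for Zone% means only about 16% is. A minimal sketch of the calculation, assuming these are ordinary (not adjusted) r^2 values; the function name `r_squared` is hypothetical:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always guessing the mean (the baseline).
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

A model that predicts every season perfectly scores 1.0, and one that does no better than guessing the league mean scores 0.0.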

And now for the actual coefficients:

Statistic Intercept BB% K% ISO BABIP HR% OBP
O-Swing% 0.32183 -0.99231 0.09971 -0.18619 0.50728 1.96589 -0.54037
Z-Swing% 0.64669 -0.66798 -0.03129 0.16784 0.23244 1.43928 -0.15409
Swing% 0.4852 -1.15845 0.03932 0.08247 0.14074 1.05097 -0.05289
O-Contact% 1.0226 1.1915 -1.5965 -0.5266 1.4718 1.3388 -1.8966
Z-Contact% 1.0124 0.02288 -0.66107 0.05412 0.02545 -0.8396 -0.04233
Contact% 1.0084 0.40198 -0.95703 -0.01352 0.25118 -0.77417 -0.36001
Zone% 0.48603 -0.72667 0.01344 0.22752 -0.53755 -1.59305 0.71355
F-Strike% 0.61752 -0.66725 0.14433 0.01348 0.04169 -0.2285 -0.02461
SwStr% 0.000416 -0.433719 0.449711 0.014265 -0.125661 0.493577 0.204283


Note that I turned all the percentages, including the plate discipline numbers, into decimals: for example, a BB% of 12.5% becomes 0.125, and an O-Swing% of 20.7% becomes 0.207. If you’re calculating these on your own, keep that in mind.
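To make the calculation concrete, here is a minimal Python sketch that plugs a stat line into the coefficients table. Each estimate is just the intercept plus each coefficient times its input. The function name `estimate_discipline` and the example stat line are hypothetical, and HR% is assumed here to mean home runs per plate appearance, which the table doesn't spell out.

```python
# Coefficients copied from the table above.
# Column order: intercept, BB%, K%, ISO, BABIP, HR%, OBP.
COEFFICIENTS = {
    "O-Swing%":   (0.32183, -0.99231, 0.09971, -0.18619, 0.50728, 1.96589, -0.54037),
    "Z-Swing%":   (0.64669, -0.66798, -0.03129, 0.16784, 0.23244, 1.43928, -0.15409),
    "Swing%":     (0.4852, -1.15845, 0.03932, 0.08247, 0.14074, 1.05097, -0.05289),
    "O-Contact%": (1.0226, 1.1915, -1.5965, -0.5266, 1.4718, 1.3388, -1.8966),
    "Z-Contact%": (1.0124, 0.02288, -0.66107, 0.05412, 0.02545, -0.8396, -0.04233),
    "Contact%":   (1.0084, 0.40198, -0.95703, -0.01352, 0.25118, -0.77417, -0.36001),
    "Zone%":      (0.48603, -0.72667, 0.01344, 0.22752, -0.53755, -1.59305, 0.71355),
    "F-Strike%":  (0.61752, -0.66725, 0.14433, 0.01348, 0.04169, -0.2285, -0.02461),
    "SwStr%":     (0.000416, -0.433719, 0.449711, 0.014265, -0.125661, 0.493577, 0.204283),
}


def estimate_discipline(bb, k, iso, babip, hr, obp):
    """Return regressed plate discipline estimates as decimals.

    All rate inputs must be decimals (12.5% -> 0.125). hr is assumed to be
    home runs per plate appearance (an assumption, not stated in the table).
    """
    # The leading 1.0 lines up with the intercept column.
    inputs = (1.0, bb, k, iso, babip, hr, obp)
    return {
        stat: sum(c * x for c, x in zip(coefs, inputs))
        for stat, coefs in COEFFICIENTS.items()
    }


# Hypothetical stat line, not a real player's season.
example = estimate_discipline(bb=0.10, k=0.18, iso=0.150, babip=0.300, hr=0.03, obp=0.330)
for stat, value in example.items():
    print(f"{stat}: {value * 100:.1f}%")
```

Nothing in this calculation keeps a result between 0% and 100%, so extreme stat lines can produce impossible values.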

There are some strange things in that table that I wouldn’t really expect. Here’s one: a higher O-Contact% leads to a much lower OBP, or maybe vice-versa*. The only logical explanation that I can offer is that balls out of the zone that are hit fall for hits less often, so BABIP and therefore OBP will each be lower. League average BABIP on balls out of the zone in 2013 (based on a quick search I did at Baseball Savant) was .243, well below the overall league average of .297. But that -1.90 coefficient still seems like too much. Some more explainable ones: HR% and Zone% are strongly inversely related (the more dangerous a hitter’s power, the fewer pitches they’ll see in the zone), BB% and O-Swing% are strongly inversely related (the fewer pitches you swing at out of the zone, the more you’ll walk), and K% and SwStr% are fairly strongly positively related (the more you swing and miss, the more you’ll strike out).

To examine these stats a little more, let’s first take a look at the regressed numbers for players who have played since 2002 and compare them to their real numbers. Here’s Barry Bonds’s 2002 (an asterisk marks the regressed, not real, numbers):

O-Swing% Z-Swing% Swing% O-Contact% Z-Contact% Contact% Zone% F-Strike% SwStr%
11.5% 70.1% 36.7% 39.6% 89.8% 80.8% 43.1% 45.1% 6.5%
O-Swing%* Z-Swing%* Swing%* O-Contact%* Z-Contact%* Contact%* Zone%* F-Strike%* SwStr%*
-7.1% 59.5% 24.3% 54.2% 91.3% 87.4% 46.7% 40% 1.5%

Hmmm… not off to the greatest start. Z-Contact, Zone, F-Strike, and Contact percentages were pretty good, but the rest were waaaay off. O-Swing came out negative. As good as Barry Bonds might have been, that just isn’t possible. SwStr% is also pretty far off: only pure contact hitter Marco Scutaro has posted a swinging strike percentage that low since the BIS data started being recorded (1.5% in 2013), and nobody has ever been lower. Not terrible, though. How about Miguel Cabrera’s 2013 MVP season?
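The negative O-Swing% happens because a straight linear model has nothing holding its output between 0% and 100%. One simple guard, which is not part of the original regression and is shown only as a sketch, is to clip each estimate to the valid range. The helper name `clamp_rate` is hypothetical.

```python
def clamp_rate(value, low=0.0, high=1.0):
    """Clip a rate (as a decimal) to the range [low, high]."""
    return max(low, min(high, value))


# Bonds's regressed O-Swing% of -7.1% (from the table above) becomes 0%.
bonds_o_swing = clamp_rate(-0.071)
```

Clipping only hides the symptom. It makes the number possible, not accurate, so any clipped estimate should be treated as a sign that the player sits outside the range the model handles well.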

O-Swing% Z-Swing% Swing% O-Contact% Z-Contact% Contact% Zone% F-Strike% SwStr%
34.1% 77.5% 52.1% 69.6% 87.6% 80.8% 41.5% 60.3% 9.6%
O-Swing%* Z-Swing%* Swing%* O-Contact%* Z-Contact%* Contact%* Zone%* F-Strike%* SwStr%*
22% 71% 45.2% 58.1% 87% 80% 47% 53.9% 8.8%

Hey, not bad! The O-Swing is pretty off, and the O-Contact is a little too low, but other than that they’re all fairly close to the real values. I think we’re getting somewhere here.
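To put a number on "fairly close," here is a small Python sketch that compares the two Cabrera rows above, stat by stat, in percentage points. The values are copied from the tables; the function name `estimate_errors` is hypothetical.

```python
STATS = ["O-Swing%", "Z-Swing%", "Swing%", "O-Contact%", "Z-Contact%",
         "Contact%", "Zone%", "F-Strike%", "SwStr%"]

# Miguel Cabrera, 2013: real and regressed values from the tables above.
ACTUAL = [34.1, 77.5, 52.1, 69.6, 87.6, 80.8, 41.5, 60.3, 9.6]
REGRESSED = [22.0, 71.0, 45.2, 58.1, 87.0, 80.0, 47.0, 53.9, 8.8]


def estimate_errors(actual, regressed):
    """Regressed minus actual, in percentage points, rounded to one decimal."""
    return [round(r - a, 1) for a, r in zip(actual, regressed)]


errors = dict(zip(STATS, estimate_errors(ACTUAL, REGRESSED)))
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)

# List the stats from biggest miss to smallest.
for stat, err in sorted(errors.items(), key=lambda item: -abs(item[1])):
    print(f"{stat}: {err:+.1f}")
print(f"mean absolute error: {mean_abs_error:.1f} points")
```

The biggest misses are O-Swing% (about 12 points low) and O-Contact% (about 11.5 points low), which lines up with those two stats having some of the weaker r^2 values.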

Now let’s look at some seasons for which we don’t have the real numbers. Ever wondered how Babe Ruth’s plate discipline was in 1927?

O-Swing%* Z-Swing%* Swing%* O-Contact%* Z-Contact%* Contact%* Zone%* F-Strike%* SwStr%*
14% 70.9% 40.8% 52.5% 86.9% 80.2% 46.6% 49.2% 7.8%

Not bad. We obviously can’t verify this (at least not without a lot of painstaking effort, and likely not at all), but that seems reasonable enough. Average contact rates in the zone, good swinging strike percentage, not very many swings outside the zone. How about the king of plate discipline, Ted Williams? Here are his numbers from his 1957 season, in which he had a 223 wRC+ and nearly 10 WAR:

O-Swing%* Z-Swing%* Swing%* O-Contact%* Z-Contact%* Contact%* Zone%* F-Strike%* SwStr%*
8.8% 66.1% 36.1% 61.1% 91.2% 86.5% 47.4% 47.5% 4.1%

Wow. Really, really good. That’s a crazy low O-Swing% and yet a fairly middle-of-the-pack Swing% overall, which fits exactly with what we would expect from a man with a famed, disciplined plate approach. He rarely swung and missed, making contact on nearly nine out of ten swings and only whiffing on one out of every twenty-five pitches he saw.

I could really go on and on, but I think I’ll end by showing you the (supposed) single worst season by these regressed plate discipline numbers between 1903 and 2001. See if you can guess who it is:

O-Swing%* Z-Swing%* Swing%* O-Contact%* Z-Contact%* Contact%* Zone%* F-Strike%* SwStr%*
34.4% 75.1% 53.5% 43.3% 78% 67.1% 46.4% 60.8% 16.2%

This will shock you, I’m sure, but… It’s Dave Kingman.


* Most likely, high O-Contact% causes low OBP and not vice-versa. This brings us into dangerous territory, however, because we don’t want to assume that everyone with low OBP has high O-Contact%. There are other factors that go into low OBP as well, and somebody could very easily have a low O-Contact% and a low OBP. It is like this with each of the regressed stats. But this is the best I could really do.

Jonah is a baseball analyst and Red Sox fan. He would like it if you followed him on Twitter @japemstein, but can't really do anything about it if you don't.
