Hitting Stat Correlation Remix

We all love baseball. And, since we’re on FanGraphs, odds are we also love baseball stats. A stat is always intended to measure one thing: how many home runs Miguel Cabrera hit, how many bases Rickey Henderson stole, how often Joey Votto gets on base, and so on.

The savvy fan knows that no one stat tells the whole story. Even WAR, which is our best estimator for how much value a player brings, requires us to dig further into how the player got there. Was it defense? Offense? If it was offense, how’d he get there — lots of walks, lots of home runs, or did he have a high BABIP? And which of these are most likely to repeat?

That’s where correlation comes into play. If there’s a high correlation season to season, then odds are what we’re measuring* is repeatable, and we can expect more of the same going forward. Otherwise, we should expect a regression (positive or negative) toward the league mean the next season.
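To make the season-to-season idea concrete, here is a minimal sketch in Python of how such a correlation could be computed by hand. It is not how the Tableau view is built; the function names, the row layout (`player`, `season`, `pa`, plus a stat column) and the sample numbers are all made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one stat never varies")
    return cov / sqrt(var_x * var_y)


def next_season_pairs(rows, stat, min_pa=300):
    """Pair each player-season with the same player's following season.

    A pair is kept only if BOTH seasons meet the plate-appearance floor.
    """
    by_key = {(r["player"], r["season"]): r for r in rows}
    xs, ys = [], []
    for (player, season), cur in by_key.items():
        nxt = by_key.get((player, season + 1))
        if nxt is None:
            continue
        if cur["pa"] < min_pa or nxt["pa"] < min_pa:
            continue
        xs.append(cur[stat])
        ys.append(nxt[stat])
    return xs, ys


# Hypothetical sample data (not real players or real numbers).
SAMPLE_ROWS = [
    {"player": "A", "season": 2015, "pa": 600, "k_pct": 0.20},
    {"player": "A", "season": 2016, "pa": 610, "k_pct": 0.21},
    {"player": "A", "season": 2017, "pa": 590, "k_pct": 0.19},
    {"player": "B", "season": 2015, "pa": 500, "k_pct": 0.10},
    {"player": "B", "season": 2016, "pa": 520, "k_pct": 0.11},
    {"player": "C", "season": 2015, "pa": 650, "k_pct": 0.30},
    {"player": "C", "season": 2016, "pa": 120, "k_pct": 0.35},  # below the floor
    {"player": "C", "season": 2017, "pa": 600, "k_pct": 0.29},
    {"player": "D", "season": 2016, "pa": 400, "k_pct": 0.15},
    {"player": "D", "season": 2017, "pa": 450, "k_pct": 0.16},
]

xs, ys = next_season_pairs(SAMPLE_ROWS, "k_pct", min_pa=300)
r = pearson_r(xs, ys)
print(f"{len(xs)} season pairs, r = {r:.3f}, r-squared = {r * r:.3f}")
```

Requiring the floor in both seasons is what keeps a 120 PA partial season from being matched against a full one; dropping `min_pa` to 1 lets those noisy pairs back in.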

Or maybe a player has a high line-drive rate. How’d he get there? Does he swing a lot? Does he tend to be a power hitter? There are a lot of relationships within a season as well.

*Even at a seasonal level, we’re not necessarily measuring true talent so much as performance. Estimating true talent involves a lot of regression, and that’s a fun and important study, but not what I chose to focus on for this tool.

A few years back, Steve Staude published a hitting stat correlation tool that let anybody explore these correlations at their leisure. It was a fun way to explore the data, not to mention a neat piece of Excel engineering. I wanted to bring it up to date a bit, and in the process switch from Excel to Tableau. I can’t embed the view in this post, but you can click through to it on Tableau Public.

I decided to include every season in the FanGraphs database (I’m sure I owe a database admin somewhere an apology). By default, it filters on 300 PA for both metrics (either intra-season or next season), but you may drop the floor to 1 PA. It’s a terrible idea and you really shouldn’t do it, but I’m not here to tell you how to live your life.

I also added a yearly trend of the correlation. For most stats, this doesn’t add a lot, but there are some interesting stories. For instance, the yearly correlation between BABIP and AVG for players with 300 PA has been slowly dropping in the last 20 years. Is that reflective of more emphasis on walks, or perhaps of defensive positioning?
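For anyone who wants to reproduce a yearly trend like this outside of Tableau, a rough Python sketch follows. It assumes a hypothetical list of player-season rows with `season`, `pa`, `babip` and `avg` fields; the helper names, the `min_players` cutoff and the sample values are invented for illustration and are not taken from the tool.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one stat never varies")
    return cov / sqrt(var_x * var_y)


def yearly_correlation(rows, stat_x, stat_y, min_pa=300, min_players=3):
    """Within-season correlation of two stats, computed one season at a time.

    Seasons with fewer than `min_players` qualifying hitters are skipped
    (that cutoff is an arbitrary choice made for this sketch).
    """
    by_season = defaultdict(list)
    for r in rows:
        if r["pa"] >= min_pa:
            by_season[r["season"]].append((r[stat_x], r[stat_y]))

    trend = {}
    for season in sorted(by_season):
        pairs = by_season[season]
        if len(pairs) < min_players:
            continue
        xs, ys = zip(*pairs)
        trend[season] = pearson_r(xs, ys)
    return trend


# Hypothetical sample data (not real players or real numbers).
SAMPLE_ROWS = [
    {"season": 1996, "pa": 600, "babip": 0.280, "avg": 0.250},
    {"season": 1996, "pa": 550, "babip": 0.300, "avg": 0.270},
    {"season": 1996, "pa": 620, "babip": 0.320, "avg": 0.290},
    {"season": 1996, "pa": 50, "babip": 0.400, "avg": 0.150},  # below the floor
    {"season": 2006, "pa": 500, "babip": 0.290, "avg": 0.260},
    {"season": 2006, "pa": 480, "babip": 0.310, "avg": 0.275},
    {"season": 2016, "pa": 610, "babip": 0.280, "avg": 0.260},
    {"season": 2016, "pa": 580, "babip": 0.300, "avg": 0.250},
    {"season": 2016, "pa": 640, "babip": 0.320, "avg": 0.270},
    {"season": 2016, "pa": 590, "babip": 0.340, "avg": 0.265},
]

for season, r in yearly_correlation(SAMPLE_ROWS, "babip", "avg").items():
    print(season, round(r, 3))
```

Computing one coefficient per season, rather than pooling every season together, is what lets a slow drift in the relationship show up as a trend line.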

The player with the highest swing rate and lowest strikeout rate? Randall Simon in 2002, with a 63.6% swing rate and a 5.9% K rate, which seems like some sort of joke. All that swinging meant a low 2.9% walk rate, so it’s not like he got away with anything.

The usual caveats about correlation not equaling causation apply. Just because you get a high r-squared doesn’t mean there’s a causal relationship; one always has to apply a common-sense analysis as well. That said, dive in and have some fun.





A data analyst in the Grand Rapids, Michigan area, Mark spends parts of his spare time working with spreadsheets. He vaguely recalls what the sun looks like.

3 Comments
The Kudzu Kid
7 years ago

This is very cool. Thanks for doing this.

bunslow
7 years ago

“For instance, the yearly correlation between BABIP and AVG for players with 300 PA has been slowly dropping in the last 20 years. Reflective of…?”

I would bet that’s due to the rising strikeout rate. The more you strike out, the smaller the share of your BA that is determined by balls in play. In 2016 the league-average K% was 21.1%, while in 1996 it was 16.5%. That is to say, BABIP has around 5% less weight in BA than it did 20 years ago.

If strikeouts were to return to the rate of 20 years ago, I would bet the BABIP-BA correlation would also return to what it was then.