Author Archive

Hitting Stat Correlation Remix

We all love baseball. And, since we’re on FanGraphs, odds are we also love baseball stats. A stat is always intended to measure one thing: how many home runs did Miguel Cabrera hit, how many bases did Rickey Henderson steal, how often does Joey Votto get on base, and so on.

The savvy fan knows that no one stat tells the whole story. Even WAR, which is our best estimator for how much value a player brings, requires us to dig further into how the player got there. Was it defense? Offense? If it was offense, how’d he get there — lots of walks, lots of home runs, or did he have a high BABIP? And which of these are most likely to repeat?

That’s where correlation comes into play. If there’s a high correlation season to season, then odds are what we’re measuring* is repeatable, and we can expect more of the same going forward. Otherwise, we should expect a regression (positive or negative) toward the league mean the next season.

Or maybe a player has a high line-drive rate. How’d they get there? Do they swing a lot? Do they tend to be power hitters? There are a lot of relationships within a season as well.

*Even at a seasonal level, we’re not necessarily measuring true talent so much as performance. Estimating true talent involves a lot of regression, and that’s a fun and important study, but not what I chose to focus on for this tool.

A few years back, Steve Staude published a hitting stat correlation tool that let anybody explore these correlations at their leisure. It was a fun way to explore the data, not to mention a neat piece of Excel engineering. I wanted to bring it up to date a bit, and in the process switch from Excel to Tableau. I can’t embed the view in this post, but you can view it by clicking through to the view on Tableau Public.

I decided to include every season in the FanGraphs database (I’m sure I owe a database admin somewhere an apology). By default, it filters on 300 PA for both metrics (either intra-season or next season), but you may drop the floor to 1 PA. It’s a terrible idea and you really shouldn’t do it, but I’m not here to tell you how to live your life.
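For anyone who wants to try the next-season version of this outside Tableau, here is a minimal sketch in plain Python. It is not how the tool itself is built. The row layout (the `player`, `season`, `pa` and `bb_pct` keys), the function names and the sample numbers are all made up for illustration. The only things taken from the post are the idea of pairing a player's season with his following season and requiring both to clear the PA floor.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)

def next_season_pairs(rows, stat, min_pa=300):
    """Pair each player-season with the same player's following season.

    rows is a list of dicts with hypothetical keys 'player', 'season',
    'pa' and the stat name. A pair is kept only if BOTH seasons reach min_pa.
    """
    by_key = {(r["player"], r["season"]): r for r in rows}
    xs, ys = [], []
    for (player, season), cur in by_key.items():
        nxt = by_key.get((player, season + 1))
        if nxt is not None and cur["pa"] >= min_pa and nxt["pa"] >= min_pa:
            xs.append(cur[stat])
            ys.append(nxt[stat])
    return xs, ys

# Invented sample rows, not real player data.
rows = [
    {"player": "A", "season": 2001, "pa": 600, "bb_pct": 0.10},
    {"player": "A", "season": 2002, "pa": 550, "bb_pct": 0.11},
    {"player": "B", "season": 2001, "pa": 500, "bb_pct": 0.05},
    {"player": "B", "season": 2002, "pa": 480, "bb_pct": 0.06},
    {"player": "C", "season": 2001, "pa": 620, "bb_pct": 0.15},
    {"player": "C", "season": 2002, "pa": 120, "bb_pct": 0.02},  # under the floor
    {"player": "D", "season": 2001, "pa": 400, "bb_pct": 0.08},
    {"player": "D", "season": 2002, "pa": 410, "bb_pct": 0.07},
]

xs, ys = next_season_pairs(rows, "bb_pct", min_pa=300)
print(f"{len(xs)} paired seasons, r = {pearson_r(xs, ys):.3f}")
```

Dropping `min_pa` to 1 lets player C's 120 PA season back into the sample, which is exactly the kind of noise the default floor is there to keep out.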

I also added a yearly trend of the correlation. For most stats, this doesn’t add a lot, but there are some interesting stories. For instance, the yearly correlation between BABIP and AVG for players with 300 PA has been slowly dropping in the last 20 years. Reflective of more emphasis on walks, or perhaps defensive positioning?
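A yearly trend like this one can be roughly sketched as one within-season correlation per year, using only players who clear the PA floor. The code below is an illustration under my own assumptions, not the Tableau calculation: the dict keys (`season`, `pa`, `babip`, `avg`), the `min_players` guard and the sample rows are all hypothetical.

```python
from collections import defaultdict
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)

def yearly_correlation(rows, stat_x, stat_y, min_pa=300, min_players=3):
    """Return {season: r} for two stats measured in the same season.

    Seasons with fewer than min_players qualifying hitters are skipped,
    since r on a couple of points says nothing (an assumption of this sketch).
    """
    by_season = defaultdict(list)
    for r in rows:
        if r["pa"] >= min_pa:
            by_season[r["season"]].append((r[stat_x], r[stat_y]))
    trend = {}
    for season in sorted(by_season):
        points = by_season[season]
        if len(points) >= min_players:
            xs, ys = zip(*points)
            trend[season] = pearson_r(xs, ys)
    return trend

# Invented sample rows, not real player data.
rows = [
    {"season": 2000, "pa": 500, "babip": 0.300, "avg": 0.280},
    {"season": 2000, "pa": 450, "babip": 0.320, "avg": 0.300},
    {"season": 2000, "pa": 600, "babip": 0.280, "avg": 0.250},
    {"season": 2000, "pa": 50, "babip": 0.450, "avg": 0.150},  # under the floor
    {"season": 2001, "pa": 520, "babip": 0.310, "avg": 0.270},
    {"season": 2001, "pa": 480, "babip": 0.290, "avg": 0.275},
    {"season": 2001, "pa": 610, "babip": 0.330, "avg": 0.290},
    {"season": 2002, "pa": 500, "babip": 0.300, "avg": 0.260},
    {"season": 2002, "pa": 505, "babip": 0.310, "avg": 0.280},  # only 2 qualify
]

trend = yearly_correlation(rows, "babip", "avg")
for season, r in trend.items():
    print(season, round(r, 3))
```

Plotting the resulting values by season gives the kind of trend line described above.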

The player with the highest swing rate and lowest strikeout rate? Randall Simon in 2002, with a 63.6% swing rate and a 5.9% K rate, which seems like some sort of joke. All that swinging meant a low 2.9% walk rate, so it’s not like he got away with anything.

The usual caveats about correlation not equaling causation apply. Just because you get a high r-squared doesn’t mean there’s a causal relationship; one always has to apply a common-sense analysis as well. That said, dive in and have some fun.


Defining Balanced Lineups

We’re used to hearing about teams having balanced or deep lineups. Other teams are defined as “stars and scrubs”. While I think we all know what these terms mean, it’s not something that’s ever been quantified (at least, not to my knowledge). Since the issue of depth is an interesting one to me, I thought it’d be fun to tackle this using wOBA.

For each team, I calculated wOBA on a team level, then the weighted standard deviation for all players. This produces each team’s distribution, but since the size of the standard deviation depends on the average (meaning that it’s not standard when comparing teams), I used the coefficient of variation (aka CV, simply standard deviation/average) as the final measure of consistency. The lower the CV, the smaller the spread of wOBA performance.
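As a rough sketch of that calculation, here is one way it could look in Python. Two things here are assumptions rather than details from the post: I weight each player by plate appearances, and I use the population form of the weighted variance. The function name and both sample lineups are invented for illustration.

```python
from math import sqrt

def weighted_cv(players):
    """Return (mean, sd, cv) of wOBA for one team.

    players is a list of (pa, woba) tuples. Each player is weighted by PA,
    which is an assumption of this sketch, as is the population-style variance.
    """
    total_pa = sum(pa for pa, _ in players)
    if total_pa <= 0:
        raise ValueError("team has no plate appearances")
    mean = sum(pa * woba for pa, woba in players) / total_pa
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    variance = sum(pa * (woba - mean) ** 2 for pa, woba in players) / total_pa
    sd = sqrt(variance)
    return mean, sd, sd / mean

# Invented lineups, not real teams.
balanced = [(600, 0.330), (600, 0.320), (600, 0.325), (300, 0.315)]
stars_and_scrubs = [(600, 0.420), (600, 0.280), (600, 0.275), (300, 0.290)]

for name, team in [("balanced", balanced), ("stars and scrubs", stars_and_scrubs)]:
    mean, sd, cv = weighted_cv(team)
    print(f"{name}: wOBA {mean:.3f}, sd {sd:.3f}, CV {cv:.3f}")
```

Dividing by the mean is what makes two teams with different overall wOBA comparable: the same absolute spread counts for less on a team that hits better overall.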
