xHR%: Questing for a Formula (Part 4)

Apologies for the significant delay between the third post and this one. A little Dostoevsky and the end of the quarter really cramp one’s time. Since it’s been a while, it would probably be helpful for mildly interested readers to refresh themselves on Part 1, Part 2, and Part 3.

As a reminder, I have conceptualized a new statistic, xHR%, from which xHR (expected home runs) can and should be derived. Importantly, xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season rather than what will happen or what actually happened. In searching for the best formula possible, I came up with three different variations, all pictured below with explanations.

HRD – Average Home Run Distance. The given player’s HRD is calculated with ESPN’s home run tracker.

AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium.

AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.

Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea.

PA – Plate appearances
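To make the step from xHR% to xHR concrete, here is a minimal sketch. It assumes xHR% is a per-plate-appearance rate in the same sense as HR/PA, so that xHR is simply the rate multiplied by PA. The function name and the sample inputs are hypothetical and are not taken from the data set.

```python
def expected_home_runs(xhr_pct: float, plate_appearances: int) -> float:
    """Turn an expected home run rate (per PA) into an expected home run count.

    Assumption: xHR% is expressed as a fraction of plate appearances,
    so xHR = xHR% * PA.
    """
    if not 0.0 <= xhr_pct <= 1.0:
        raise ValueError("xhr_pct must be a rate between 0 and 1")
    if plate_appearances < 0:
        raise ValueError("plate_appearances cannot be negative")
    return xhr_pct * plate_appearances


# Hypothetical player: a 5% expected home run rate over 600 plate appearances.
sample_xhr = expected_home_runs(0.05, 600)
print(round(sample_xhr, 1))
```

Whatever formula variation produces xHR%, this last multiplication stays the same.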

Now that most everything of importance has been reviewed, it’s time to draw some conclusions. But first, please consider the graphs below.

Expected home runs (in blue) graphed with actual home runs (orange) using the .5 method. I plotted expected home runs and actual home runs instead of xHR% and HR% because it’s easier to see the differences this way.

Expected home runs (in blue) graphed with actual home runs (orange) using the .6 method.

Expected home runs (in blue) graphed with actual home runs (orange) using the .7 method.

Conclusions

Honestly, those graphs look pretty much the same. Yes, as the method increases from .5 through .7, the numbers seem to get more bunched up around the mean, but the differences really aren’t significant between the methods. Nor are the results from those methods particularly different from the actual results. And therein lies the crux of the matter. The formulae suggest that what happened is what should have happened, but I don’t think that’s true.

I know a great deal of luck goes into baseball. I know as a player, as a fan, and as a budding analyst that luck plays a fairly large role in every pitch, every swing, and every flight the ball takes. I don’t know how to quantify it, but I know it’s there and that’s what sites like FanGraphs try to deal with day in and day out. Knowledge is power, and the key to winning sustainably is to know which players need the least amount of luck to play well and acquire them accordingly. Statistics like xFIP, WAR, and xLOB% aid analysts and baseball teams in their lifelong quests for knowledge, whether it be by hobby or trade.

For those reasons, xHR% in its current form is a mostly useless statistic. It fails to tell the tale I want it to tell — that players are occasionally lucky. An average difference of between .6 and 1 home runs per player simply doesn’t cut it because it essentially tells me what really happened. At this juncture it’s basically a glorified version of HR/PA where you have to spend a not insignificant amount of time searching for the right statistics from various sources. But hey, you could use it to impress girls by convincing them you’re smart and know a formula that looks sort of complicated (please don’t do that).

I don’t know how big of a difference there needs to be between what should have happened and what actually happened. Obviously there still has to be a strong relationship between them, but it needs to be weaker than an R² of .95, which is approximately what it was for the three methods.
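For anyone who wants to run the same kind of check, the sketch below computes the two summary numbers mentioned above: the average absolute gap between expected and actual home runs, and R². It assumes R² here means the squared Pearson correlation between the two columns. The five players' numbers are made up purely for illustration and are not from my sample.

```python
from statistics import mean


def r_squared(expected, actual):
    """Squared Pearson correlation between two equal-length samples."""
    if len(expected) != len(actual):
        raise ValueError("samples must have the same length")
    mean_x, mean_y = mean(expected), mean(actual)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, actual))
    var_x = sum((x - mean_x) ** 2 for x in expected)
    var_y = sum((y - mean_y) ** 2 for y in actual)
    return cov ** 2 / (var_x * var_y)


def mean_abs_diff(expected, actual):
    """Average absolute gap between expected and actual values."""
    return mean(abs(x - y) for x, y in zip(expected, actual))


# Made-up illustration data: expected vs. actual home runs for five players.
xhr = [28.4, 12.1, 35.0, 8.7, 19.9]
hr = [29, 11, 36, 9, 20]

print(r_squared(xhr, hr))
print(mean_abs_diff(xhr, hr))
```

A formula that tracks actual home runs this closely will produce an R² near 1 and a small average gap, which is exactly the problem described above.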

All statistics that try to project the future and describe the past are educated shots in the dark. They are similar to the American dollar in that nearly all of their value is derived from our belief in them, in addition to some supposedly logical mathematical assumptions about how they work. Even mathematicians need a god, and if that god happens to be WAR, then so be it.

Even though my formula doesn’t do what I want it to do quite yet, I won’t give up. Did King Arthur and Sir Lancelot give up when they searched for the Holy Grail? No, they searched tirelessly until they were arrested by some black-clad British constables with nightsticks and thrown in the back of a van. I will keep working until I find what I’m looking for, or until I get arrested (but there’s really no reason for me to be).

I know that wasn’t particularly mathematical or analytical in the purest sense, and that it was more of a pseudo-philosophical tract than anything else, but please bear with me. Any suggestions would be helpful. I have some ideas, but I’d appreciate yours as well.

Part 5 will arrive as soon as possible, hopefully with a new formula, new results, and better data.

We hoped you liked reading xHR%: Questing for a Formula (Part 4) by Jackson Mejia!

A busy person, but one who spends his free time in front of a computer screen, fiddling with statistics. And yes, that describes everyone who regularly visits this website.

Member
ZachTheQuack

Hey, Jackson, keep up the good work. Just a few thoughts (disclaimer: a day’s worth of neuroanatomy labs and lectures has passed since I read through your series of articles). Firstly, I just wanted to make sure that you were using data from years 1, 2, and 3 to make descriptive statements about data in year 0 (for lack of a better term), e.g. information from 2012–2014 to evaluate HR in 2015. I ask only because I cannot find any explicit statement as to which year the HRD of the player being evaluated is supposed to be from. If it were…

Member
ZachTheQuack

Well, it depends on what your goal is. If your goal is to decide the luck factor on home runs, the data set that attempts to describe any regressed HR metric will indeed depend on the data for the season in question (and perhaps previous seasons); however, I’m not sure that delving into HR/PA is quite as pertinent as it seems at first blush (although it is, indeed, pertinent). It depends on what one considers to be luck, skill, a skill that is likely to be repeated, &c. The following is a long response that offers a variety of perspectives on the…

Member
ZachTheQuack

Quick personal bio: I’m a former physical chemist who is now in medical school and who, sadly, uses his extensive baseball knowledge and mathtastic skills to make money playing fantasy baseball and daily fantasy rather than for the greater good of the SABRmetric community. But as I find myself too pressed for time with medical school to capitalize on my knowledge anymore, I’m coming around to the idea of sharing a lot of it. I don’t have any articles, but I am considering writing a few over the summer, so keep an eye out. As for why HR/PA in year 1…

Member
ZachTheQuack

Ah, I forgot to mention what possible uses remain for HR/PA as it is currently being used by your model. One possible use might be to use current-season HRD and its deviation from the previous 2 years (or 3-year average) as an indicator of how heavily year 1 HR/PA should be weighted. That is to say, in the event of no difference between year 1 HRD and the previous years 2 and 3, perhaps a factor of zero is given to the current year’s HR/PA and years 2 and 3 are given a 70/30 split (a split based…