xHR%: Questing for a Formula (Part 3)

Part 3 of a series of posts regarding a new statistic, xHR%, and its obvious resultant, xHR. This article will examine formulas 2 and 3. 

As a reminder, I have attempted to create a new statistic, xHR%, from which xHR (expected home runs) can be derived. xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season. In searching for the best formula possible, I came up with three different variations, pictured below.

Today, I’m going to examine formulas 2 and 3 to measure their viability as formulas for xHR%. Hopefully the analysis will shine some light on a murky matter. I expect formula 2 to end up being the better of the two because it probably balances in-season performance with prior performance better than formula 3, which relies more heavily on in-season performance. As a result, formula 3 will likely correlate too well with what actually happened (the same outcome is likely for formula 2).

Methodology

Luckily for me and the readers, the process was a simple one. Pulling data from FanGraphs player pages, ESPN’s Home Run Tracker, and various Google searches, I compiled a data set from which to proceed. From FanGraphs, I collected all information for Part Two of the formula, including plate appearances and home runs. Unfortunately, because a few of the players from the sample were rookies or had fewer than three years of major league experience, I had to use regressed minor league numbers. In some cases, where that data wasn’t applicable, I dug through old scouting reports to find translatable game power numbers based on scouting grades (and used a denominator of 600 plate appearances).

Then, from ESPN’s Home Run Tracker website, I obtained all relevant data for player home-run distance, average home-run distance for the player at home, and league average home-run distance. Due to my limited time, I only used players that qualified for the batting title during the 2015 season, yielding a potentially weak sample of only 130 players. Additionally, before anyone complains, please realize that the purpose of my research at this point is to obtain the most viable formula and refine it from there so that it can be applied across a wider population.

Results for Formula 2

Using Microsoft Excel, I calculated the resultant xHR% and xHR. Some key data points:

League Average HR% (actual):  3.03%

Average xHR%:  2.89%

Average Home Runs: 18.7

Expected Home Runs: 17.8

Please note that there is a significant amount of survivorship bias in this data. That is, because all of these players played enough to qualify for the batting title, they are likely significantly better than replacement level, which is why the percentages and home runs seem so high.

Correlation between xHR% and HR%: 0.974418884

R² for above: 0.949492162

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.4265261

Correlation between xHR and HR: 0.977796283

R² for above: 0.956085571

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.474596069

Results for Formula 3

League Average HR% (actual):  3.03%

Average xHR%:  2.92%

Average Home Runs: 18.7

Expected Home Runs: 18.1

Again, note the survivorship bias that comes with having a slightly skewed sample.

Correlation between xHR% and HR%: 0.986440621

R² for above: 0.973065099

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.4615323

Correlation between xHR and HR: 0.988287804

R² for above: 0.976712783

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.698203408

Mostly Boring Analysis

I have opted to condense the analysis into one section instead of two because it would have otherwise been repetitive and boring.

I understand that that’s a lot to process, but the data really isn’t all that dissimilar. The expected home-run percentage is slightly lower than the actual home-run percentage for both of them, but it isn’t a massive difference by any means. When prorated to a 600 plate appearance season, xHR% for formula 2 predicts that the average player in the sample would have hit 17.3 home runs, while formula 3’s xHR% expects that the average home-run total would have been 17.5. In reality the average player hit 18.2 home runs per 600 plate appearances, so both were fairly close (maybe too close).
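The proration above is simply each rate multiplied by 600 plate appearances. Here is a minimal Python sketch of that arithmetic, using the sample averages reported in the results; the function name `prorate` is my own and is only for illustration.

```python
def prorate(rate_percent, plate_appearances=600):
    """Convert a home-run rate (in percent) into home runs over a number of PA."""
    return rate_percent / 100 * plate_appearances

# Sample-wide averages taken from the results sections above.
rates = {
    "actual HR%": 3.03,
    "formula 2 xHR%": 2.89,
    "formula 3 xHR%": 2.92,
}

# Round to one decimal, matching the precision used in the text.
prorated = {label: round(prorate(rate), 1) for label, rate in rates.items()}

for label, home_runs in prorated.items():
    print(f"{label}: {home_runs} HR per 600 PA")
```

This reproduces the 18.2, 17.3, and 17.5 figures quoted in the paragraph above.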

Both formulas had incredibly high correlations, with formula 3’s being slightly higher. More importantly, formula 2 explains about 95% of the variance, while formula 3 accounts for 97%. The difference between those is relatively unimportant because both explain a very high share of what occurred. Furthermore, p < .001 for both (the actual values are many times lower than that), so the correlations are statistically significant.
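For anyone who wants to check the variance and significance figures, here is a rough Python sketch. R² is just the correlation squared. For significance, I am assuming the usual t-test for a Pearson correlation with the 130 players in the sample; the post's numbers came out of Excel, so treat this as one plausible way to check them rather than the exact procedure used. The function names are illustrative.

```python
import math

N_PLAYERS = 130  # qualified hitters in the 2015 sample

def r_squared(r):
    """Share of variance explained by a simple linear relationship."""
    return r ** 2

def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero.

    Assumes the standard test with n - 2 degrees of freedom.
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Correlations between xHR% and HR% reported in the results sections.
correlations = {"formula 2": 0.974418884, "formula 3": 0.986440621}

for name, r in correlations.items():
    print(name, "R^2:", round(r_squared(r), 3),
          "t:", round(t_statistic(r, N_PLAYERS), 1))
```

With 128 degrees of freedom, a two-sided p < .001 needs a t statistic of only about 3.4, and both formulas clear that bar by a very wide margin.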

Both formulas resulted in slightly lower standard deviations than what actually occurred, which is a recurring theme. In these formulas, the numbers have been clumped a little bit closer together and tend to underestimate rather than overestimate.
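One way to put a number on that clumping is to divide each formula's standard deviation by the actual one. The sketch below does that with the standard deviations reported in the results; the ratio is my own illustrative measure, not something that is part of the xHR% formulas.

```python
# Standard deviations reported in the results sections above.
standard_deviations = {
    "HR%": {"actual": 1.5769373, "formula 2": 1.4265261, "formula 3": 1.4615323},
    "HR": {"actual": 10.43771886, "formula 2": 9.474596069, "formula 3": 9.698203408},
}

def spread_ratio(expected_sd, actual_sd):
    """Ratio below 1 means the expected values are clumped tighter than reality."""
    return expected_sd / actual_sd

ratios = {}
for stat, values in standard_deviations.items():
    for formula in ("formula 2", "formula 3"):
        ratios[(stat, formula)] = spread_ratio(values[formula], values["actual"])
        print(f"{stat}, {formula}: {ratios[(stat, formula)]:.3f}")
```

Every ratio falls between roughly 0.90 and 0.93, so both formulas produce a spread about 7 to 10 percent narrower than what actually happened.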

Players of Interest

Mr. Kole Calhoun – Last season he hit 26 home runs, but by both formulas he should have hit 3-4 fewer. Likely, this is because his only previous full season of home runs was in 2014, when he had only 17, in addition to the fact that I was forced to use scout grades for his third season. The scout grades were particularly off for Calhoun because he wasn’t even expected to be good enough for the majors, let alone be an above-average, high-value outfielder. Even though his overall offensive prowess declined slightly this past season (by 20 points of wRC+), he didn’t appear to be selling out for power, as his power profile numbers (FB%, Pull%, etc.) remained the same. Personally, I would expect him to regress next season, and I think the formula agrees with me.

Mr. Nolan Arenado – Arguably having the most unexpected offensive breakout of the season, he increased his home-run totals from 10 in 2013, to 18 in 2014, and finally to an astonishing 42 in 2015. While his totals were probably slightly Coors-inflated, they were real for the most part because his average home-run distance was excellent, in addition to the fact that 22 of his dingers came on the road. Arenado is young and likely to regress somewhat in the power department, but he is probably around to stay as a significant home-run threat. The formula was likely wrong on this one due to weighting of prior seasons, so go ahead and make the lazy Todd Helton comparison.

Mr. Carlos Gonzalez – Though Arenado’s teammate had the highest home-run total (40) of his career in 2015, it isn’t clear that he was anywhere near his peak statistically. His wRC+ was below his career average by six points, and he was a net below-average player overall. All of this leads to the conclusion that he was selling out for power, which makes sense given that he lost over fifty points of batting average and on-base percentage from his 2010-13 peak years. While a viable argument could be made for his “subpar” performance being due to injuries, a better one could be made that his home runs were in part a result of playing half his games at Coors Field, where he hit 60% of his round-trippers. The formula says he should have hit about seven fewer home runs, which may be a best-case scenario for next season given his penchant for injury. Additionally, while the Rockies are by no means full of talent, if Gonzalez continues his overall downward trend, he could get traded and lose the Coors advantage, or he could lose playing time.

Keep watch for a concluding piece in the next week. Criticism would be highly appreciated, but keep in mind that I’m still in high school and have yet to actually study statistics.





A busy person, but one who spends his free time in front of a computer screen, fiddling with statistics. And yes, that describes everyone who regularly visits this website.

The Best of FanGraphs: March 7-11, 2016 | FanGraphs Baseball
8 years ago

[…] xHR%: Questing for a Formula (Part 3), by Jackson Mejia This high school student has put together three thoughtful articles in his attempt to create xHR%. […]

scotman144
8 years ago

I enjoyed the three posts so far. Have you tried running this on 2012-2014 data and comparing the xHR%’s calculated from that data against 2015 actual stats?

scotman144
8 years ago
Reply to  scotman144

Oh never mind: I totally misread the player-by-player rundowns.