## xHR%: Questing for a Formula (Part 1)

One of the most important developments in statistics, and in its subordinate field, sabermetrics, is the use of multiyear data to produce an expected outcome in a given year. It’s an old concept, one that’s been around for centuries, but within sabermetric circles it likely originated with Bill James. In *Win Shares* (arguably the birth of WAR), the sabermetric response to *Principia Mathematica*, he details a procedure for finding park factors in which the calculator uses a weighted average of several years of data, in conjunction with league averages, to find the park factors for a given ballpark.

Methods such as Mr. James’s allow the amateur sabermetrician (and even the mighty professional statistician) to determine what ought to have happened over a specific time period. The result is, essentially, a descriptive statistic. The best example of a descriptive statistic for the unlearned reader is xFIP, which basically describes what a pitcher’s fielding-independent runs-allowed average would have been if the pitcher had a league-average rate of home runs per fly ball.

Several statistics fluctuate greatly from year to year and are thus considered unstable. Examples include BABIP, HR/FB% for pitchers, and line-drive percentage. HR/FB% in particular is very fluid because all sorts of variables go into whether a ball leaves the park or not. For instance, on a particularly windy day, an otherwise certain dinger might end up in the glove of an expectant center fielder on the warning track instead of in the beer glass of your paunchy friend in the cheap seats. Rendered down, xFIP takes the uncontrollable out of a pitcher’s runs-allowed average.
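To make the xFIP idea concrete, here is a minimal sketch of the commonly published form of the formula, in which the pitcher’s actual home runs are replaced by his fly balls multiplied by the league HR/FB rate. The function name, the sample inputs, and the `fip_constant` value are made up for illustration; the real constant is set each season so that league FIP lines up with league ERA.

```python
def xfip(fly_balls, walks, hit_by_pitch, strikeouts, innings,
         lg_hr_per_fb, fip_constant):
    """Expected FIP: swap actual home runs for fly balls * league HR/FB rate."""
    if innings <= 0:
        raise ValueError("innings must be positive")
    # Home runs the pitcher "should" have allowed at a league-average HR/FB rate.
    expected_hr = fly_balls * lg_hr_per_fb
    numerator = 13 * expected_hr + 3 * (walks + hit_by_pitch) - 2 * strikeouts
    return numerator / innings + fip_constant


# Hypothetical pitcher: 200 fly balls, 50 BB, 5 HBP, 180 K in 200 IP,
# with an assumed 10% league HR/FB rate and an assumed constant of 3.10.
example = xfip(200, 50, 5, 180, 200.0, 0.10, 3.10)
print(round(example, 3))
```

Notice that the pitcher’s own home-run total never enters the calculation, which is exactly how the uncontrollable gets taken out.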

With this, and an excellent article about xLOB% from *The Hardball Times*, in mind, I started developing my own statistic a few days ago. xHR%, as I dubbed it, attempts to find an expected home-run percentage. From there, one can easily find expected home runs (xHR), a more understandable idea to the casual baseball fan, by multiplying xHR% by plate appearances. In order to calculate this, I wrote several different (albeit very similar) formulas:

More likely than not, your eyes glazed over in that section, so I will explain.

*HRD – Average Home Run Distance. The given player’s HRD is calculated with ESPN’s Home Run Tracker.*

*AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium.*

*AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.*

*Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea. In cases where major-league data isn’t available, regressed minor-league numbers will be used. If that data doesn’t exist either, then I will be very irritated and proceed to use translated scouting grades.*

*PA – Plate appearances*

*(For the uninitiated, HR% is HR/PA)*
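For anyone who prefers code to notation, the two simplest relationships in this piece (HR% is HR/PA, and xHR is xHR% multiplied by plate appearances) can be sketched as below. The function names and the sample numbers are my own placeholders, not real player data.

```python
def hr_pct(home_runs, plate_appearances):
    """Home-run percentage: home runs per plate appearance."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return home_runs / plate_appearances


def expected_home_runs(xhr_pct, plate_appearances):
    """Turn an expected home-run percentage back into a home-run count."""
    return xhr_pct * plate_appearances


# Hypothetical hitter: 30 home runs in 600 plate appearances.
actual_rate = hr_pct(30, 600)
# If some xHR% estimate came out at 4.5%, the implied home-run total would be:
implied_hr = expected_home_runs(0.045, 600)
print(actual_rate, round(implied_hr, 1))
```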

Essentially, what I have created is a formula that describes home-run percentage. First off, I used *(.5)(AHRDH) + (.5)(AHRDL)* in the denominator of the first part because a player spends half his time at home and half on the road. If I were so inclined, I could factor in every single stadium that gets visited, weight the average of them, and make that the denominator, but that’s just doing way too much work for a negligible (but likely more accurate) effect. Besides, writing that out in a formula would be a disaster because then there essentially couldn’t be a formula. Furthermore, having half of the denominator come from the player’s home stadium factors in whether or not the stadium is a home-run suppressor or inducer, which helps paint a more accurate picture of the player.
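For the curious, the “too much work” version of the denominator is easier to express in code than in a formula. The sketch below is not part of xHR%; it only illustrates the every-stadium alternative, assuming each park’s average home-run distance is weighted by the plate appearances the player took there. The function name, the weighting choice, and all numbers are hypothetical.

```python
def weighted_park_distance(parks):
    """PA-weighted mean of average home-run distance across parks.

    parks: list of (avg_hr_distance_ft, plate_appearances_in_park) pairs.
    """
    total_pa = sum(pa for _, pa in parks)
    if total_pa <= 0:
        raise ValueError("total plate appearances must be positive")
    return sum(distance * pa for distance, pa in parks) / total_pa


# Hypothetical season: 300 PA at home (395 ft average) and 150 PA each
# in two road parks (405 ft and 400 ft averages).
parks = [(395.0, 300), (405.0, 150), (400.0, 150)]
print(weighted_park_distance(parks))  # 398.75
```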

Dividing the player’s average *HRD* by *(.5)(AHRDH) + (.5)(AHRDL)* allows the calculator to get a good idea of whether or not the player was “lucky” in his home runs. If his average home-run distance is less than the average of the league and his home stadium, then it follows that he is a below-average home-run hitter and his home-run totals ought to be lower.

Since the values in the numerator and the denominator will invariably end up close in value to each other, I decided that this part of the formula could be used as the coefficient (as opposed to just throwing it out) because it will change the end number only slightly. Moreover, the xCo (as I call it) acts as a rough substitute for batted-ball distance and park dimensions in order to factor those into the formula.
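The xCo as described in the text (the player’s average home-run distance divided by an even blend of the home-park and league averages) can be sketched as follows. The function name and the distances are invented for illustration.

```python
def x_co(hrd, ahrdh, ahrdl):
    """Coefficient from the text: HRD / ((.5)(AHRDH) + (.5)(AHRDL)).

    hrd   -- the player's average home-run distance
    ahrdh -- average distance of all home runs hit at his home stadium
    ahrdl -- average distance of all home runs hit league-wide
    """
    blended_average = 0.5 * ahrdh + 0.5 * ahrdl
    if blended_average <= 0:
        raise ValueError("average distances must be positive")
    return hrd / blended_average


# Hypothetical player whose homers average 390 ft, in a home park
# averaging 395 ft and a league averaging 405 ft.
print(x_co(390.0, 395.0, 405.0))  # 0.975
```

A value just under 1, as here, nudges the final number down slightly; a value just over 1 nudges it up.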

The second part, the meat of the formula, uses a weighted average of multiple years of home-run-percentage data to help determine what should have been the home-run percentage in year one (the year being studied). Basically, it helps to throw out any extreme outlier seasons and regress them back a little bit to prior performance without stripping out everything that happened in that season (notice that in every formula the biggest weight is given to the season studied).
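Here is a minimal sketch of that weighted average and of how the xCo would scale it. Two things in it are assumptions on my part: the weights (0.5, 0.3, 0.2) are placeholders chosen only so that the studied season counts most, and dividing by the sum of the weights is simply one way to keep the result on the HR% scale. The function names and the sample seasons are hypothetical as well.

```python
def weighted_hr_pct(seasons, weights):
    """Weighted average of HR% over several seasons.

    seasons -- list of (home_runs, plate_appearances), ordered Y1, Y2, Y3
    weights -- one weight per season, in the same order (placeholders here)
    """
    if len(seasons) != len(weights):
        raise ValueError("need exactly one weight per season")
    rates = [hr / pa for hr, pa in seasons]  # HR% = HR / PA for each year
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)


def xhr_pct(x_co, seasons, weights):
    """Scale the weighted HR% by the xCo coefficient."""
    return x_co * weighted_hr_pct(seasons, weights)


# Hypothetical hitter: 30, 24, and 18 home runs in 600 PA for Y1, Y2, Y3.
seasons = [(30, 600), (24, 600), (18, 600)]
placeholder_weights = (0.5, 0.3, 0.2)  # assumption: not the final weights
base = weighted_hr_pct(seasons, placeholder_weights)  # roughly 0.043
estimate = xhr_pct(0.975, seasons, placeholder_weights)
print(round(base, 4), round(estimate, 4))
```

With these placeholder weights, a 5.0% season following 4.0% and 3.0% seasons gets pulled back toward the earlier years without being erased.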

At this juncture, I cannot say for certain how much weight ought to be given to prior seasons. Obviously, a player can have a meaningful and lasting breakout season, with continued success for the rest of his career, making it inaccurate to heavily weight irrelevant data from a season two years ago. On the other hand, a player can have a false breakout, making it better to include more data from previous seasons. Undoubtedly that will be the subject of future posts. At present, the formula is a developmental one that will no doubt experience heavy changes in the future.

For the interested reader, some prior iterations of the formula are below:

As a reminder, with some small addenda, here is the explanation for each variable:

*HRDY3 – Average Home Run Distance Year Three (year three being the oldest of the three years in the sample). HRD is calculated with ESPN’s home run tracker. HRDY2 and HRDY1 follow the same idea.*

*AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium by any player.*

*AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.*

*Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea. In cases where major-league data isn’t available, regressed minor-league numbers will be used. If that data doesn’t exist either, then I will be very irritated and proceed to use translated scouting grades.*

*PA – Plate appearances*

*(You should be initiated at this point, so figure out HR% for yourself.)*

These formulas were thrown out because the xCo relied too heavily on seasons past to provide an accurate estimate. When I briefly tested them on a few players, they delivered incredibly scattered results. Furthermore, rookies wouldn’t have any data to feed into these iterations, because there’s no such thing as a minor-league or high-school home-run tracker (and if there were, I probably wouldn’t trust it). The first formulas described are overall more elegant and more accurate.

Stay tuned for Part 2, when results will be delivered instead of postulations.