Challenging WAR and Other Statistics as Era-Adjustment Tools

This article is a casual version of my paper “Challenging Nostalgia and Performance Metrics in Baseball” published in Chance which showed, among other things, that wins above replacement (WAR) and the wide class of “versus your peers” statistics are incapable of accurately comparing players across eras. In particular, it was shown that WAR exhibits a very strong bias toward baseball players who played in earlier seasons. A collection of resources and an interactive web app within this framework can be viewed here.

How We Came To This Conclusion

In our research, we split baseball data into two time periods and show that WAR includes too many players from the older era in its all-time rankings. Specifically, the older time period is defined by players who started their career in 1950 or before, and the newer group is defined by players who started their career after 1950. The split date of 1950 corresponds to the US Census that is closest to the integration of baseball in 1947. Prior to 1947, Major League Baseball was a largely all-white, segregated league. Since then it has slowly but surely integrated in America, and the game has steadily risen in popularity abroad. All the while, the world population has continued to grow. Simply put, there are far more people in the baseball-eligible talent pool post-1950 than before.

We find that roughly 20% of the “realistic historic talent pool” belongs to the pre-1950 group. By “realistic historic talent pool” we mean the cumulative population of men ages 20-29 collected every 10 years arising from baseball playing countries (men ages 20-29 serve as a proxy for a concept of talent pool that is otherwise not well-defined). Before 1950, this population is basically just white American men. After 1950, this population includes all American men, as well as men from a plethora of baseball-playing countries.

Now go to your favorite version of WAR and count how many players in the top 25 started their career before 1950. As it stands now, Baseball Reference includes 15 players who started their career before 1950 in its top 25 career bWAR leaders. That’s 60%! FanGraphs’ version of WAR is a little better, with 12 of the top 25 fWAR leaders (48%) being players who started their career before 1950.

Both versions of WAR have all-time rankings of players that are out of alignment with reality. Why is this so? Remember that only roughly 20% of the talent pool was eligible to play in 1950 or before. With this figure in mind, we can compute the probability of observing at least that many old-time players in a top 25 ranking. This probability is calculated via the binomial distribution.

You can perform this calculation using a binomial probabilities calculator (like the one here) by plugging in 0.20 (or 0.178, the value in the paper) for the probability of success, 25 for the number of trials, and, respectively, 12 and 15 for the number of successes that correspond to the fWAR and bWAR rankings. When you do this, you will see a very small value in the last text box, the one corresponding to P(X >= x). This value is the probability of observing at least x players whose career began before integration in a ranking of the top 25 all-time players. A very small value of this probability means that the ranking under study is inconsistent with what we would expect in reality under two assumptions:

1. Baseball talent is uniformly distributed over time.

2. Interest in reaching the major leagues is stable over time.
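
For readers who would rather script the calculation than use an online calculator, here is a minimal Python sketch of the binomial tail probability described above. The function name `binomial_upper_tail` and the variable names are my own illustrative choices, not anything from the paper; the inputs (25 trials, 12 and 15 successes, success probabilities 0.20 and 0.178) are the ones quoted in this article.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


TOP_N = 25  # size of the all-time ranking
# Number of pre-1950 players in each top 25 (counts quoted in the article).
observed = {"fWAR": 12, "bWAR": 15}
# Share of the talent pool that is pre-1950: rounded figure, then the paper's value.
pool_shares = (0.20, 0.178)

for share in pool_shares:
    print(f"pre-1950 share = {share}: expected count = {TOP_N * share:.2f}")
    for label, count in observed.items():
        tail = binomial_upper_tail(count, TOP_N, share)
        print(f"  {label}: P(X >= {count}) = {tail:.2e}")
```

With a 20% share, the expected number of pre-1950 players in a top 25 is 5, so counts of 12 and 15 sit far out in the upper tail. The printed probabilities are only as meaningful as the two assumptions listed above.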

The first of these assumptions specifies that innate talent is evenly dispersed across eras, which is not fully believable. The distribution of innate talent has likely improved over time as the MLB-eligible population has expanded, as noted by Stephen Jay Gould, Christina Kahrl at ESPN, and in Martin B. Schmidt and David J. Berri’s work on concentration of baseball talent in the Journal of Sports Economics. It is therefore likely that assumption No. 1 is to the benefit of the pre-integration players.

The second assumption is largely a balance between three factors: salaries, competition from other sports, and media exposure. The original paper explored the effects that competition from outside sports can have on the conclusions of the original analysis, and it was found that the conclusion of the original analysis holds even when accounting for the rise in popularity of other sports.

Data on historical MLB salaries are sparse, but we can see that the minimum and average salaries were relatively meager as recently as the late 1960s and early 1970s, considering that MLB did not have a great pension plan for fringe MLB players during this time frame. Major League salaries have greatly increased over time, and so has media coverage. The first MLB game to be broadcast on the radio was in 1921, and nationally televised games began in the early 1950s. Given the current climate of the internet, social media platforms, and sports networks, it is hard to imagine pre-integration baseball occupying the imagination of the population, especially before the 1920s.

Simply put, it is straightforward to see that WAR fails to compare players properly across eras when accounting for the evolution of baseball and its popularity over time. Superior statistical accomplishments achieved by players who started their careers way back in the day are a reflection of the inability to properly compare talent across eras. It is highly unlikely that athletes from such a sparsely populated era of available baseball talent could occupy top rankings so abundantly.

Why WAR Is a Biased Statistic

The reasons why WAR is biased in favor of old timers are perhaps easier to understand than the sheer magnitude of this bias, which is quantified in the previous section (in the resources provided or by you with the aid of a binomial probability calculator). Some of these reasons are mentioned in the previous section, and we reiterate Stephen Jay Gould’s article hyperlinked above. For a longer but fascinating and amazingly well-written version of this article, check out Gould’s book Full House: The Spread of Excellence from Plato to Darwin or watch the YouTube video here.

Ted Williams

The original paper draws a line in the sand roughly at 1950, counting all players who began their careers before 1950 as pre-integration. This hard line is not to be taken as an absolute law of nature; it was chosen so that statistical testing could be done using a binomial distribution.

Ted Williams is a legendary player who straddles this line. He performed very well “post integration” and many people seem to think that he should not be placed in the pre-integration camp because of this. That being said, the following often-stated claims are myths that do not hold up when one digs deeper.

Myth 1: Ted Williams played in an integrated MLB (at least as how we understand such a concept today).

Myth 2: Ted Williams’ career overlapped with Hank Aaron and Willie Mays.

Williams played in an American League that was very slow to integrate. (See Armour’s article on this topic; it is a fascinating and eye-opening read.) For our purposes, Williams did not face any Hall of Fame African American, Latin American, or Latin pitchers, and he hardly faced any great ones in their prime.

In my opinion and to the best of my knowledge, the only challengers he had were 1958-60 Camilo Pascual, a washed-up Satchel Paige, and a washed-up Don Newcombe. He also faced Luis Aloma, Rudy Árias, Luis Arroyo, Charlie Beamon, Joe Black, Alex Carrasquel, Webbo Clarke, Sandy Consuegra, Jesse Flores, Mike Fornieles, Julio Gonzalez, Mudcat Grant, Evelio Hernandez, Dave Hoskins, Connie Johnson, Connie Marrero, Julio Moreno, Carlos Pascual, Pedro Ramos, Jose Santiago, and Bob Trice (let me know if I missed anyone). Ted Williams accumulated 6,505 plate appearances from 1947 on, and only 419 of these plate appearances (6.44%) were against these pitchers. Williams faced Mike Garcia 104 times, but Garcia was born in California and was in the minor leagues before 1947.

As for Myth 2, by the time Williams retired, there was no interleague play, no free agency, and no MLB draft. The NL and the AL were fundamentally different leagues that integrated at different rates, and Williams never played against Aaron or Mays in any game in which statistics were tabulated.

Of course, Williams’ counting statistics were diminished because he missed most of five seasons while serving in World War II and the Korean War. He was a hero who became an aviator instead of taking the option to play baseball for the Navy. However, like many of the great players who went to war, Williams came back healthy and fit to play. That was not the case for at least three MLB players and 177 minor leaguers who were killed in these wars. This tally also does not include wounded baseball players or the many soldiers who may have made a career out of the military or pursued professions other than baseball after returning from war. Segregation, a slowly integrating AL, and war efforts stunted the talent pool of Williams’ contemporaries. Therefore, his transcendent talent was able to shine in stat lines which, in one way or another, are compiled or are understood to be versus your peers. If the talent that was Ted Williams were to play today, then his stats would almost certainly be far worse.

None of this is meant to single out Ted Williams. Stan Musial, Mickey Mantle, and others are in a similar boat. The reason for the focus on Williams is that he is occasionally mentioned among the “Mount Rushmore” of the greatest all-time baseball players or position players on the basis of achievements that were posted in a far inferior league by today’s standards. At the present moment, who really knows what Williams’ career would look like if a proper era-adjustment tool existed? He may still end up being a top 10 position player in baseball history. That being said, even if he is a top 10 position player, his observed statistics would be far worse today, and we shouldn’t think that his career .482 OBP has any real meaning when compared with players from more modern eras.

Connections To WAR in the Mainstream

One of the goals of WAR is to provide a framework for people like us to compare players across eras. The following passage is taken directly from the What is WAR? page on FanGraphs:

“The goal of WAR is to provide a holistic metric of player value that allows for comparisons across team, league, year, and era and a framework for player evaluation.”

The work in this article has shown that WAR has struggled in its goal of providing fair comparisons of players who have played in vastly different eras. This matters when we evaluate a player’s career in historical context. Stats like JAWS (which uses bWAR) hide the problems with historical comparisons. See the description of JAWS below, taken from Jay Jaffe’s webpage:

“JAWS is a tool that facilitates the comparison of Hall of Fame candidates with those already enshrined at their position, using Wins Above Replacement to capture both defensive value as well as offensive value and to account for the wide variations in scoring that have occurred throughout baseball history.”

The Hall of Fame players of the past are beneficiaries of some biased performance metrics. With that in mind, comparisons based upon “versus your peers” statistics are therefore slanted in favor of old time players.

PP “Detrending”

Fair warning: this will dive into some theoretical statistics concepts.

FanGraphs may tout that a goal of fWAR is to provide a framework for era comparison. However, they do not state that this framework is absolute or even particularly accurate (it isn’t, but bWAR is worse). Other authors with more dubious methods, though, do float the claim that they successfully “detrend” statistics across eras. This passage is taken from Peterson and Penner’s (PP) paper in the journal Chaos, Solitons & Fractals (which can be accessed here):

“Individual performance metrics are commonly used to compare players from different eras. However, such cross-era comparison is often biased due to significant changes in success factors underlying player achievement rates (e.g. performance enhancing drugs and modern training regimens). Such historical comparison is more than fodder for casual discussion among sports fans, as it is also an issue of critical importance to the multi-billion dollar professional sport industry and the institutions (e.g. Hall of Fame) charged with preserving sports history and the legacy of outstanding players and achievements. To address this cultural heritage management issue, we report an objective statistical method for renormalizing career achievement metrics, one that is particularly tailored for common seasonal performance metrics, which are often aggregated into summary career metrics – despite the fact that many player careers span different eras.”

The methodology of PP detrending is developed in this paper (which can also be viewed on ArXiv; note that we are specifically referencing version v2).

In my paper, I heavily criticized the PP detrending method as an era-adjustment tool for renormalizing baseball statistics: it is inappropriate, out of touch with reality, and not in alignment with the time series literature on the topic of detrending. (Side note: PP detrending was referred to as PPS detrending in my paper. The third author did not appear on the present paper, which was published after mine.)

The present PP detrending paper is more of the same, except they apply a Dickey-Fuller test of stationarity to their old analysis. This test verifies the obvious: the mean-detrending method did detrend nonstationarity of the mean. However, the variation in achievement was not seriously considered by the authors. This is a fundamental mistake in presentation of statistical concepts that was not caught by the peer review process at the non-statistics journal Chaos, Solitons & Fractals.

What the authors should have done is include a figure which displays the “detrended” achievement for every player season under consideration as a function of time. This would give an idea of how well their method “detrends” the distribution of achievement over time, not just the mean. As an example, it is reasonable to guess that PP detrending will fail to detrend the entire distribution of HR totals (PP’s most highlighted example), and that the figure I am calling for in reference to the distribution of HR totals would look something like this:

Why so? Two reasons:

First, look at the results in Table VII on version v2 of the ArXiv preprint. A simple count of Table VII shows that none of the top 50 single-season detrended HR counts occurred in 1950 or later. A closer look reveals decreasing counts in time: there are 23 seasons in the 1920s, 17 seasons in the 1930s, and 10 seasons in the 1940s. Moreover, there are seven detrended HR counts from the 1920s greater than the largest detrended HR count from the 1930s, and there are five detrended HR counts from the 1930s greater than the largest detrended HR count from the 1940s. If the distribution were in fact stationary, then we would expect a much more uniform scattering of years.

The second reason is mathematical. Suppose that home run talent follows a truncated power law, as stated in displayed equation (3) in the published version of the detrending method and displayed equation (4) in the ArXiv version. With a bit of algebra, the truncated power law can be shown to be proportional to a Gamma distribution in the shape parameterization.

To motivate ideas, let X(t) follow a Gamma distribution with fixed scale parameter b and a shape parameter a(t) that is allowed to change with time t. The mean of X(t) is E[X(t)] = a(t)b, and a simple mean-stabilizing transformation can be taken as Y(t) = 1 / [a(t)b]. With this transformation in mind, we can let Z(t) = Y(t)X(t) be the mean-stabilized truncated power law distribution (mean-stabilized Gamma distribution). Notice that the mean E[Z(t)] = 1 for all allowable values of t. However, the variance of Z(t) is Var[Z(t)] = 1/a(t), which still depends on t.
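
For anyone who wants the intermediate step, the variance claim follows from the standard moments of a Gamma distribution with shape a(t) and scale b. The short derivation below uses only the symbols already defined in this section:

```latex
\begin{align*}
\mathrm{E}[X(t)] &= a(t)\,b, \qquad \mathrm{Var}[X(t)] = a(t)\,b^{2}, \\
\mathrm{E}[Z(t)] &= Y(t)\,\mathrm{E}[X(t)] = \frac{a(t)\,b}{a(t)\,b} = 1, \\
\mathrm{Var}[Z(t)] &= Y(t)^{2}\,\mathrm{Var}[X(t)] = \frac{a(t)\,b^{2}}{\left[a(t)\,b\right]^{2}} = \frac{1}{a(t)}.
\end{align*}
```

The scale parameter b cancels out entirely, so the leftover time dependence comes only from the shape parameter a(t).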

Connecting this back to baseball, we know that home runs have increased on average. This implies that a(t) increases as t increases, which further implies that the variance of the mean-stabilized power law distribution Var[Z(t)] = 1/a(t) decreases as t increases. Therefore, mean-stabilization introduces bias into the study of the greatest seasons ever by injecting changes to the variation.
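
The shrinking variance can also be checked by simulation. What follows is a minimal Python sketch, not PP’s actual procedure: it draws from a Gamma distribution with a fixed scale and a few increasing shape values, rescales each draw by its theoretical mean, and reports the sample mean and variance. The shape values, the scale value, and the helper name `mean_stabilized_sample` are hypothetical choices made purely for illustration; they are not estimates from any baseball data.

```python
import random
from statistics import fmean, pvariance

random.seed(0)  # fixed seed so the sketch is reproducible

SCALE_B = 1.5             # fixed scale parameter b (arbitrary illustrative value)
SHAPES = (2.0, 4.0, 8.0)  # hypothetical a(t) values, increasing with time
N_DRAWS = 50_000          # draws per shape value


def mean_stabilized_sample(shape_a, scale_b, n):
    """Draw X ~ Gamma(shape a, scale b) n times and return Z = X / (a * b)."""
    return [
        random.gammavariate(shape_a, scale_b) / (shape_a * scale_b)
        for _ in range(n)
    ]


for shape_a in SHAPES:
    z = mean_stabilized_sample(shape_a, SCALE_B, N_DRAWS)
    print(
        f"a = {shape_a}: mean(Z) = {fmean(z):.3f}, "
        f"var(Z) = {pvariance(z):.3f}, 1/a = {1 / shape_a:.3f}"
    )
```

In each case the sample mean should land near 1 while the sample variance tracks 1/a, so the rescaled seasons from the low-a(t) (early) periods are more spread out, and extreme “detrended” values are more likely to come from them.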

In short, mean-stabilization is not the same as “detrending” and it does not lead to stationarity in general. This is a point that is missing in the research by PP, such as here:

“Notably, as a result and consistent with a stationary data generation process, the league averages are more constant over time after renormalization, thereby demonstrating the utility of this renormalization methods to standardize multi-era individual achievement metrics.”

This passage treats a stable mean (which stationarity implies) as if it in turn implied stationarity. That is not true under the definition of stationarity, as the implication only goes one way. The math of this section coupled with the definition of stationarity should make it clear that PP is confused about the concepts of stationarity, detrending, and mean-stabilization.

[Technical aside 1: PP and PPS apply their mean-stabilization to the raw data directly without parametric consideration. The scale adjustment that they use to stabilize the mean works out similarly to the mathematical argument made in this section.]

[Technical aside 2: In the more recent paper, PP mention that the discrete-variable Log-Series distribution empirically fits better than the truncated power law distribution. However, I expect the same conclusions will hold with respect to the discrete-variable Log-Series distribution.]

Era Bridging

Berry, Reese, and Larkey (BRL) develop an era-bridging technique which they claim serves the role of a statistical time machine that allows for all players to be compared directly. They state:

“The goal is not to judge players relative to their contemporaries, but rather to compare all players directly. Hence the model that we use is a statistical time machine. We use additive models to estimate the innate ability of players, the effects of aging on performance, and the relative difficulty of each year within a sport. We measure each of these effects separated from the others. We use hierarchical models to model the distribution of players and specify separate distributions for each decade, thus allowing the “talent pool” within each sport to change. We study the changing talent pool in each sport and address Gould’s conjecture about the way in which populations change.”

BRL’s addressing of Gould’s conjecture is summarized with the following passage:

“The globalization has been less pronounced in MLB, where players are drawn mainly from the United States and other countries in the Americas. Baseball has remained fairly stable within the United States, where it has been an important part of the culture for more than a century.”

This rationale is not wrong relative to the other sports under study by BRL. However, this rationale ignores segregation (although segregation is mentioned by BRL), increases in the MLB-eligible population relative to available roster spots, and increases in the average overall talent of that population. It also downplays the demographic change in MLB as noted by Armour in a follow-up article.

We found that BRL did not fully account for Gould’s conjecture in the context of baseball. We also found it odd that:

1. The BRL model predicts that a .300 hitter in 1996 will have a lower than .300 average for several seasons from 1900–20. This conflicts with the well-established notion that the talent of baseball players has improved over time.

2. The BRL model includes slightly too many pre-integration players in their average rankings while basically including the right number of pre-integration players in their home run rankings.

Perhaps the decadal hierarchical modeling approach did not properly capture the changing dynamics of the underlying talent pool or the evolution of the approach to hitting. Or perhaps the distributional assumptions that were made were too rigid.

What the Future Holds

The goal is to have more appropriate era-adjustment tools in the not-too-distant future. These tools will balance “versus your peers” statistics with the size of the MLB-eligible population. This would account for the full house of variation of talent that exists in the population.

Acknowledgements

I would like to thank David Dalpiaz at the University of Illinois Department of Statistics and Jarrett Bline at the University of Illinois Department of Statistics and Economics for helpful comments. I would also like to thank the student-run Illini Analytics group for providing a forum to share ideas about Sabermetrics.

Special thanks to FanGraphs and Jay Jaffe for letting me be critical of fWAR and JAWS (all in good fun!). WAR is a very valuable statistic and we are lucky to have it.





Assistant Professor of Statistics, University of Illinois
