Last month, the sabermetrics community descended into complete and utter anarchy over the latest and greatest debate on WAR. Industry heavyweights like Bill James, Tom Tango, and our own Dave Cameron all weighed in on the merits of baseball’s premier metric. After the dust settled, Sam Miller published an article on ESPN igniting a different sort of debate on WAR.
Miller’s piece noted that aside from the possible flaws behind WAR itself, each corner of the internet is calculating it a different way. For pitching specifically, FanGraphs (fWAR), Baseball Reference (rWAR), and Baseball Prospectus (WARP) all publish measures of WAR that oftentimes have significant disagreements. But that’s by design.
Miller brilliantly characterized these three metrics as follows:
- rWAR – “What Happened WAR”
- fWAR – “What Should Have Happened WAR”
- WARP – “What Should Have Should Have Happened WAR”
The rest of the piece is outstanding, and comes highly recommended by this author. In the aftermath, though, Tom Tango of MLB Advanced Media responded with the following challenge:
For you aspiring saberists: take the 10 pitchers that fWAR and rWAR most disagree with in year T, and tell us how they did in year T+1. Do that for say the last 5 years. Repeat the exercise with fWAR, WARP, then with rWAR, WARP. Report your findings. https://t.co/BjtbBRYhMd
— Tangotiger (@tangotiger) November 28, 2017
Given that I humbly consider myself to be an aspiring saberist, I took that challenge. Well, I first took the challenge of college final exams, but then the pitching WAR challenge!
The dataset I worked from included 1,165 qualified individual pitching seasons spanning 2000-2016. For each season, I collected the player’s fWAR, rWAR, WARP, and RA9-WAR, along with his RA9-WAR in the subsequent year. As Tango suggested, using RA9-WAR to look back at our three competing pitching metrics is the most effective way to measure the differences among the metrics themselves.
For those interested in the raw data, feel free to check it out here, and make a copy if you’d like to play around with it yourself.
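As a rough sketch of the setup, loading a dataset like that into plain Python might look like the following. The column names and every number below are assumptions for illustration only, not the linked sheet’s actual layout or values:

```python
# Minimal sketch of loading the seasons data; column names and the
# numeric values are hypothetical, not the real dataset.
import csv
import io

raw = """player,year,fWAR,rWAR,WARP,RA9WAR_next
Pitcher A,2006,3.6,1.9,4.7,4.8
Pitcher B,2017,1.6,2.9,0.4,1.1
"""

numeric = {"fWAR", "rWAR", "WARP", "RA9WAR_next"}
seasons = [
    {k: (float(v) if k in numeric else v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]
print(len(seasons))
```

In practice you would point `csv.DictReader` at the downloaded file instead of an in-memory string.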
Given the nature of the dataset, a logical place to start was a straightforward correlation table. That table is displayed below.
As expected, small differences do exist between the various metrics in their ability to predict future performance. In the sample, fWAR leads both WARP and rWAR by slight margins. For all you statheads out there, a linear regression on the data returns statistically significant p-values for fWAR and WARP, but not rWAR.
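To make the correlation step concrete, here’s a minimal sketch in plain Python with a handful of made-up seasons. The real table was built on all 1,165 seasons; these numbers are purely illustrative:

```python
# Sketch of the correlation table: Pearson r between each WAR flavor
# in Year T and RA9-WAR in Year T+1. All values are invented.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# (fWAR, rWAR, WARP, next-year RA9-WAR) — hypothetical seasons
seasons = [
    (5.1, 4.2, 4.8, 4.5),
    (3.0, 3.9, 2.7, 3.1),
    (6.2, 6.8, 5.9, 5.5),
    (1.8, 0.9, 2.2, 1.5),
    (4.4, 4.0, 4.9, 4.2),
]

target = [s[3] for s in seasons]
for name, idx in (("fWAR", 0), ("rWAR", 1), ("WARP", 2)):
    r = pearson([s[idx] for s in seasons], target)
    print(f"{name} vs. next-year RA9-WAR: r = {r:.3f}")
```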
So that was fun, wasn’t it? With all of the nitty-gritty math out of the way, let’s dive into a few examples. Miller already highlighted Julio Teheran’s strange 2017 season, but as it turns out, there are far more extreme instances of metric disagreement.
Take Felix Hernandez’s 2006 season for example. His first full season in the bigs culminated in an underwhelming 4.52 ERA, but a 3.91 FIP and a 3.37 xFIP were promising signs of future success. Similarly, the WAR metrics were unable to come to any sort of consensus.
By WARP, the 20-year-old Hernandez was the 14th best pitcher in 2006. He was surrounded on the leaderboard by names like Roy Halladay, Randy Johnson, and Greg Maddux. By rWAR, his 2006 season ranked 135th alongside Jose Mesa, Cory Lidle, and interestingly enough, Greg Maddux.
fWAR, on the other hand, seems to have found a happy medium between the other two metrics. Sure enough, it was also the most accurate predictor of Hernandez’s RA9-WAR in 2007.
Taking a step back, I now wanted to determine which of the three metrics was the most accurate predictor of a pitcher’s future RA9-WAR. Just as Tango does, we’ll call the current season “Year T” and the next “Year T+1.” The results of this exercise are displayed below.
Yet again, we see a slight victory for the FanGraphs WAR metric. However, with over 1,100 seasons in our sample, no single metric stands apart from the others. After all, they are designed with the same goal in mind: measuring pitcher value. As you’ll see below, each metric usually ends up with a similar result to the others.
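One way to run that accuracy comparison — and this is my own choice of scoring rule, since there’s more than one reasonable option — is mean absolute error between each metric in Year T and RA9-WAR in Year T+1:

```python
# Score each WAR flavor by mean absolute error against next-year
# RA9-WAR; lower is better. Season values are hypothetical.
def mae(preds, actual):
    """Mean absolute error between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)

# (fWAR, rWAR, WARP, next-year RA9-WAR) — made-up seasons
seasons = [
    (5.1, 4.2, 4.8, 4.5),
    (3.0, 3.9, 2.7, 3.1),
    (6.2, 6.8, 5.9, 5.5),
]

nxt = [s[3] for s in seasons]
for name, idx in (("fWAR", 0), ("rWAR", 1), ("WARP", 2)):
    print(f"{name}: MAE = {mae([s[idx] for s in seasons], nxt):.2f}")
```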
What happens, though, in instances like Teheran’s? When the metrics have stark disagreements with each other, which metric remains most reliable? To answer this question, I dug up the 10 most significant head-to-head disagreements among each of the metrics, and again looked at which version of WAR best predicted the RA9-WAR in Year T+1. Those results are listed below.
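The disagreement hunt itself can be sketched in a few lines — again with invented players and numbers, and taking the top three gaps rather than ten just to keep the example small:

```python
# Rank seasons by the fWAR/rWAR gap, keep the biggest disagreements,
# and check which metric landed closer to next-year RA9-WAR.
# Player names and all values are hypothetical.
rows = [
    # (player, fWAR, rWAR, next-year RA9-WAR)
    ("Pitcher A", 5.6, 2.1, 4.9),
    ("Pitcher B", 2.0, 4.5, 2.4),
    ("Pitcher C", 3.3, 3.1, 3.0),
    ("Pitcher D", 6.0, 3.9, 5.2),
]

# Sort by absolute disagreement, largest first, and keep the top 3.
top = sorted(rows, key=lambda r: abs(r[1] - r[2]), reverse=True)[:3]
for player, fw, rw, nxt in top:
    winner = "fWAR" if abs(fw - nxt) < abs(rw - nxt) else "rWAR"
    print(f"{player}: gap = {abs(fw - rw):.1f}, closer metric = {winner}")
```

The same loop works for the fWAR/WARP and rWAR/WARP pairings by swapping in the relevant columns.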
What stands out to me here is not only that fWAR still appears to be the best forward-looking metric, but also that in nine of its ten most significant disagreements with rWAR, the DIPS approach to WAR won out.
Just as in “The Great WAR Debate of 2017,” this discussion too is entirely dependent on what one intends to use WAR for. Here, we’ve established fWAR as an excellent forward-looking metric. Depending on who you ask, rWAR likely serves its best purpose illustrating, as Miller put it, what did happen. WARP may either be many years ahead of its time, or could still use a fair amount of tweaking. Or both. Regardless, each version of pitching WAR comes with its own purpose, and each purpose has its own theoretical use.