Last month, the sabermetrics community descended into complete and utter anarchy over the latest and greatest debate on WAR. Industry heavyweights like Bill James, Tom Tango, and our own Dave Cameron all weighed in on the merits of baseball’s premier metric. After the dust settled, Sam Miller published an article on ESPN igniting a different sort of debate on WAR.
Miller’s piece noted that aside from the possible flaws behind WAR itself, each corner of the internet is calculating it a different way. For pitching specifically, FanGraphs (fWAR), Baseball Reference (rWAR), and Baseball Prospectus (WARP) all publish measures of WAR that oftentimes have significant disagreements. But that’s by design.
Miller brilliantly characterized these three metrics as follows:
- rWAR – “What Happened WAR”
- fWAR – “What Should Have Happened WAR”
- WARP – “What Should Have Should Have Happened WAR”
The rest of the piece is outstanding, and comes highly recommended by this author. In the aftermath, though, Tom Tango of MLB Advanced Media responded with the following challenge:
For you aspiring saberists: take the 10 pitchers that fWAR and rWAR most disagree with in year T, and tell us how they did in year T+1. Do that for say the last 5 years. Repeat the exercise with fWAR, WARP, then with rWAR, WARP. Report your findings. https://t.co/BjtbBRYhMd
— Tangotiger (@tangotiger) November 28, 2017
Given that I humbly consider myself to be an aspiring saberist, I took that challenge. Well, I first took the challenge of college final exams, but then the pitching WAR challenge!
The dataset I worked from included 1,165 qualified individual pitching seasons spanning 2000-2016. For each season, I collected the player’s fWAR, rWAR, WARP, and RA9-WAR, along with his RA9-WAR in the subsequent year. As Tango suggested, using RA9-WAR to look back at our three competing pitching metrics is the most effective way to measure the differences among the metrics themselves.
For those interested in the raw data, feel free to check it out here, and make a copy if you’d like to play around with it yourself.
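As a rough sketch of the setup, loading a dataset like that into plain Python might look like the following. The column names and every number below are assumptions for illustration only, not the linked sheet’s actual layout or values:

```python
# Minimal sketch of loading the seasons data; column names and the
# numeric values are hypothetical, not the real dataset.
import csv
import io

raw = """player,year,fWAR,rWAR,WARP,RA9WAR_next
Pitcher A,2006,3.6,1.9,4.7,4.8
Pitcher B,2017,1.6,2.9,0.4,1.1
"""

numeric = {"fWAR", "rWAR", "WARP", "RA9WAR_next"}
seasons = [
    {k: (float(v) if k in numeric else v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]
print(len(seasons))
```

In practice you would point `csv.DictReader` at the downloaded file instead of an in-memory string.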
Given the nature of the dataset, a logical place to start was a straightforward correlation table. That table is displayed below.
As expected, small differences do exist between the various metrics in their ability to predict future performance. In the sample, fWAR leads both WARP and rWAR by slight margins. For all you statheads out there, a linear regression on the data returns statistically significant p-values for fWAR and WARP, but not rWAR.
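To make the correlation step concrete, here’s a minimal sketch in plain Python with a handful of made-up seasons. The real table was built on all 1,165 seasons; these numbers are purely illustrative:

```python
# Sketch of the correlation table: Pearson r between each WAR flavor
# in Year T and RA9-WAR in Year T+1. All values are invented.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# (fWAR, rWAR, WARP, next-year RA9-WAR) — hypothetical seasons
seasons = [
    (5.1, 4.2, 4.8, 4.5),
    (3.0, 3.9, 2.7, 3.1),
    (6.2, 6.8, 5.9, 5.5),
    (1.8, 0.9, 2.2, 1.5),
    (4.4, 4.0, 4.9, 4.2),
]

target = [s[3] for s in seasons]
for name, idx in (("fWAR", 0), ("rWAR", 1), ("WARP", 2)):
    r = pearson([s[idx] for s in seasons], target)
    print(f"{name} vs. next-year RA9-WAR: r = {r:.3f}")
```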
So that was fun, wasn’t it? With all of the nitty-gritty math out of the way, let’s dive into a few examples. Miller already highlighted Julio Teheran’s strange 2017 season, but as it turns out, there are far more extreme instances of metric disagreement.
Take Felix Hernandez’s 2006 season for example. His first full season in the bigs culminated in an underwhelming 4.52 ERA, but a 3.91 FIP and a 3.37 xFIP were promising signs of future success. Similarly, the WAR metrics were unable to come to any sort of consensus.
By WARP, the 20-year-old Hernandez was the 14th best pitcher in 2006. He was surrounded on the leaderboard by names like Roy Halladay, Randy Johnson, and Greg Maddux. By rWAR, his 2006 season ranked 135th alongside Jose Mesa, Cory Lidle, and interestingly enough, Greg Maddux.
fWAR, on the other hand, seems to have found a happy medium between the other two metrics. Sure enough, it was also the most accurate predictor of Hernandez’s RA9-WAR in 2007.
Taking a step back, I now wanted to determine which of the three metrics was the most accurate predictor of a pitcher’s future RA9-WAR. Just as Tango does, we’ll call the current season “Year T” and the next “Year T+1.” The results of this exercise are displayed below.
Yet again, we see a slight victory for the FanGraphs WAR metric. However, with over 1,100 seasons in our sample, no single metric stands apart from the others. After all, they are designed with the same goal in mind: measuring pitcher value. As you’ll see below, each metric usually ends up with a similar result to the others.
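One way to run that accuracy comparison — and this is my own choice of scoring rule, since there’s more than one reasonable option — is mean absolute error between each metric in Year T and RA9-WAR in Year T+1:

```python
# Score each WAR flavor by mean absolute error against next-year
# RA9-WAR; lower is better. Season values are hypothetical.
def mae(preds, actual):
    """Mean absolute error between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)

# (fWAR, rWAR, WARP, next-year RA9-WAR) — made-up seasons
seasons = [
    (5.1, 4.2, 4.8, 4.5),
    (3.0, 3.9, 2.7, 3.1),
    (6.2, 6.8, 5.9, 5.5),
]

nxt = [s[3] for s in seasons]
for name, idx in (("fWAR", 0), ("rWAR", 1), ("WARP", 2)):
    print(f"{name}: MAE = {mae([s[idx] for s in seasons], nxt):.2f}")
```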
What happens, though, in instances like Teheran’s? When the metrics have stark disagreements with each other, which metric remains most reliable? To answer this question, I dug up the 10 most significant head-to-head disagreements among each of the metrics, and again looked at which version of WAR best predicted the RA9-WAR in Year T+1. Those results are listed below.
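The disagreement hunt itself can be sketched in a few lines — again with invented players and numbers, and taking the top three gaps rather than ten just to keep the example small:

```python
# Rank seasons by the fWAR/rWAR gap, keep the biggest disagreements,
# and check which metric landed closer to next-year RA9-WAR.
# Player names and all values are hypothetical.
rows = [
    # (player, fWAR, rWAR, next-year RA9-WAR)
    ("Pitcher A", 5.6, 2.1, 4.9),
    ("Pitcher B", 2.0, 4.5, 2.4),
    ("Pitcher C", 3.3, 3.1, 3.0),
    ("Pitcher D", 6.0, 3.9, 5.2),
]

# Sort by absolute disagreement, largest first, and keep the top 3.
top = sorted(rows, key=lambda r: abs(r[1] - r[2]), reverse=True)[:3]
for player, fw, rw, nxt in top:
    winner = "fWAR" if abs(fw - nxt) < abs(rw - nxt) else "rWAR"
    print(f"{player}: gap = {abs(fw - rw):.1f}, closer metric = {winner}")
```

The same loop works for the fWAR/WARP and rWAR/WARP pairings by swapping in the relevant columns.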
What stands out to me here is not only that fWAR still appears to be the best forward-looking metric, but also that in nine of its ten most significant disagreements with rWAR, the DIPS approach to WAR won out.
Just as in “The Great WAR Debate of 2017,” this discussion too is entirely dependent on what one intends to use WAR for. Here, we’ve established fWAR as an excellent forward-looking metric. Depending on who you ask, rWAR likely serves its best purpose illustrating, as Miller put it, what did happen. WARP may either be many years ahead of its time, or could still use a fair amount of tweaking. Or both. Regardless, each version of pitching WAR comes with its own purpose, and each purpose has its own theoretical use.