Pitcher WAR and the Concept of Value
Whenever one draws a conclusion from anything, a whole pile of underlying assumptions gets shepherded into that high-level conclusion. That's a didactic opening sentence, but it has a point, because statistics are full of underlying assumptions. Statistics are also, perhaps not coincidentally, full of high-level conclusions. And those conclusions can be pretty wrong. By about five hundred runs each and every season, in this case.
Relative player value is probably the most important area of sports analysis, but it's not always easy to pin down. It's relatively easy to get a decent idea of value in baseball, for instance, while it's pretty hard to do the same for football. No one really knows the value of a Pro Bowl linebacker compared to a Pro Bowl left guard, for one. People have rough ideas, but those ideas are based more on tradition and ego than on rigorous analysis. Which is why football is still kind of in the dark ages, and baseball isn't. But just because baseball is out of the dark ages doesn't mean it's figured out. It doesn't mean it's even close to figured out.
Because this question right here still exists: what's the value of a starting pitcher compared to a relief pitcher? At first glance this is a question we have a pretty good grasp on. We have WAR, which isn't perfect, yeah, but a lot of the imperfections get filtered out when talking about a position as a whole. You can compare the average WAR for starters with the average WAR for relievers and get a decent answer. If you want to compare the top guys, take the top quartile of each and compare those, and so on. Except, well, no, because underlying assumptions are nasty.
FanGraphs uses FIP-WAR as its primary value measure for pitchers, and it's built on the basic theory that pitchers only really control walks, strikeouts, and home runs, and that everything else is largely randomness rather than easily measurable skill. RA9-WAR isn't a good measure of individual player skill because so much of it depends on factors like defense and the randomness of where batted balls end up. This is correct, of course. But when comparing the relative value of entire positions against each other, RA9-WAR is the way to go. When you add up all the players on all the teams and average them, factors like defense and batted-ball luck get averaged together too. We end up with league-average defense and luck baked in, so RA9-WAR loses its bias. It becomes about as exact as we can get.
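For reference, here's a minimal sketch of the two raw inputs being compared. The FIP formula is the standard one; the 3.10 constant is just a typical illustrative value, since the real constant is set each season so that league FIP matches league ERA.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching: only homers, walks, hit batters,
    and strikeouts count. The constant is recalculated every season so
    league FIP equals league ERA; 3.10 is an illustrative placeholder."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def ra9(runs_allowed, ip):
    """Runs allowed per nine innings: every run counts, earned or not,
    which means defense and batted-ball luck are bundled in."""
    return 9 * runs_allowed / ip
```

Both get converted to wins against a replacement-level baseline to produce FIP-WAR and RA9-WAR, respectively; the point is simply that the two start from very different raw ingredients.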
Is this really a big deal, though? If all the confounding factors of RA9-WAR average out, wouldn't the confounding factors of FIP-WAR average out too? What's so bad about using FIP-WAR to judge value? Well, there's this: from 1995 onward, starting pitchers have never outperformed their peripherals. Relievers? They've outperformed theirs each and every season. And it's not like the opposite happened in 1994; I just had to pick some date to start the analysis. Here's a table comparing RA9-WAR and FIP-WAR for starters in every season from 1995 through 2013, followed by the same table for relievers.
Starter RA9-WAR/FIP-WAR Comparisons
Year | RA9 WAR | FIP WAR | Difference |
---|---|---|---|
1995 | 277.7 | 305.0 | -27.3 |
1996 | 323.2 | 337.1 | -13.9 |
1997 | 302.5 | 336.6 | -34.1 |
1998 | 326.8 | 357.8 | -31.0 |
1999 | 328.7 | 359.7 | -31.0 |
2000 | 323.0 | 348.6 | -25.6 |
2001 | 324.9 | 353.9 | -29.0 |
2002 | 331.4 | 348.6 | -17.2 |
2003 | 315.0 | 346.7 | -31.7 |
2004 | 311.9 | 343.0 | -31.1 |
2005 | 314.8 | 333.0 | -18.2 |
2006 | 317.0 | 345.7 | -28.7 |
2007 | 343.3 | 361.6 | -18.3 |
2008 | 325.7 | 351.9 | -26.2 |
2009 | 325.1 | 351.8 | -26.7 |
2010 | 317.8 | 353.6 | -35.8 |
2011 | 337.3 | 355.6 | -18.3 |
2012 | 311.1 | 337.6 | -26.5 |
2013 | 304.0 | 332.4 | -28.4 |
Reliever RA9-WAR/FIP-WAR Comparisons
Year | RA9 WAR | FIP WAR | Difference |
---|---|---|---|
1995 | 78.4 | 50.3 | 28.1 |
1996 | 73.9 | 61.8 | 12.1 |
1997 | 98.0 | 65.4 | 32.6 |
1998 | 101.6 | 70.4 | 31.2 |
1999 | 99.8 | 68.9 | 30.9 |
2000 | 106.9 | 80.2 | 26.7 |
2001 | 103.3 | 77.6 | 25.7 |
2002 | 91.1 | 76.6 | 14.5 |
2003 | 112.5 | 83.4 | 29.1 |
2004 | 117.7 | 85.1 | 32.6 |
2005 | 115.7 | 96.7 | 19.0 |
2006 | 112.7 | 84.0 | 28.7 |
2007 | 86.8 | 68.2 | 18.6 |
2008 | 104.1 | 79.7 | 24.4 |
2009 | 103.7 | 77.7 | 26.0 |
2010 | 109.0 | 74.9 | 34.1 |
2011 | 91.0 | 73.6 | 17.4 |
2012 | 116.3 | 91.3 | 25.0 |
2013 | 126.6 | 98.5 | 28.1 |
OK, so that's a lot of numbers. The gist, though, is that FIP thinks starters are better than they actually are, and thinks relievers are worse than they actually are. This holds year after year, by margins that rise well above negligible. Starters allow roughly 250 more runs than FIP says they should every season, while relievers allow roughly 250 fewer than FIP says they should, in far fewer innings. In relative terms, that means starters are over-valued by roughly 10% as a group, while relievers are consistently under-valued by roughly 25%, according to FIP-WAR. Now, this isn't a completely new idea. We've known for a while that relievers tend to outperform their peripherals, but the truth is this: relievers really outperform their peripherals, pretty much all the time, always.
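As a rough sanity check on those figures, here's a quick sketch that averages the per-season gaps straight from the tables above. The runs-per-win conversion of 9.5 is an assumption, used only to ballpark the roughly-250-runs claim.

```python
# WAR totals copied from the two tables above (1995-2013).
starter_ra9 = [277.7, 323.2, 302.5, 326.8, 328.7, 323.0, 324.9, 331.4, 315.0,
               311.9, 314.8, 317.0, 343.3, 325.7, 325.1, 317.8, 337.3, 311.1, 304.0]
starter_fip = [305.0, 337.1, 336.6, 357.8, 359.7, 348.6, 353.9, 348.6, 346.7,
               343.0, 333.0, 345.7, 361.6, 351.9, 351.8, 353.6, 355.6, 337.6, 332.4]
reliever_ra9 = [78.4, 73.9, 98.0, 101.6, 99.8, 106.9, 103.3, 91.1, 112.5,
                117.7, 115.7, 112.7, 86.8, 104.1, 103.7, 109.0, 91.0, 116.3, 126.6]
reliever_fip = [50.3, 61.8, 65.4, 70.4, 68.9, 80.2, 77.6, 76.6, 83.4,
                85.1, 96.7, 84.0, 68.2, 79.7, 77.7, 74.9, 73.6, 91.3, 98.5]

RUNS_PER_WIN = 9.5  # assumption: a typical runs-to-wins conversion


def average_gap(ra9_war, fip_war):
    """Mean per-season gap (FIP-WAR minus RA9-WAR) and that gap as a
    share of the RA9-WAR baseline."""
    gaps = [f - r for r, f in zip(ra9_war, fip_war)]
    mean_gap = sum(gaps) / len(gaps)
    return mean_gap, mean_gap / (sum(ra9_war) / len(ra9_war))


for label, ra9_war, fip_war in [("Starters", starter_ra9, starter_fip),
                                ("Relievers", reliever_ra9, reliever_fip)]:
    gap, share = average_gap(ra9_war, fip_war)
    print(f"{label}: FIP-WAR minus RA9-WAR = {gap:+.1f} WAR/season "
          f"(~{gap * RUNS_PER_WIN:+.0f} runs), {share:+.1%} of RA9-WAR")
```

Positive gaps mean FIP gives the group more credit than its actual runs allowed warrant; negative gaps mean the opposite.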
Relievers almost get to play a different game than starters. They don't have to face a lineup a second or third time, they don't have to lean on their third- or fourth-best pitches, they don't have to conserve energy, and so on. There are probably plenty of other reasons relievers hold this edge, too, and those reasons can't be thrown out as randomness, because they show up pretty much every year. Not necessarily on an individual-by-individual basis, but when trying to find the relative value between positions, the advantages of being a reliever are too big to be ignored.
How much better are relievers than starters at getting “lucky”? Well, two stats that have long been treated as luck stats, especially for pitchers, are BABIP and LOB%. FIP assumes that starters and relievers are on even ground as far as these two numbers are concerned. But are they? For reference, here's roughly how each stat is defined, followed by a few comparison tables covering the same range of years as before.
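These are sketches of the standard definitions; the LOB% version follows the FanGraphs-style estimate, and the 1.4 coefficient on home runs should be treated as approximate.

```python
def babip(h, hr, ab, k, sf):
    """Batting average on balls in play: hits minus homers, divided by
    at-bats that ended with the ball in play (plus sacrifice flies)."""
    return (h - hr) / (ab - k - hr + sf)


def lob_pct(h, bb, hbp, r, hr):
    """Left-on-base percentage, estimated from season totals: the share
    of allowed baserunners who never come around to score. Home runs are
    discounted because those runners were never strandable."""
    return (h + bb + hbp - r) / (h + bb + hbp - 1.4 * hr)
```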
BABIP Comparisons
Year | Starter BABIP | Reliever BABIP | Difference |
---|---|---|---|
1995 | 0.293 | 0.290 | 0.003 |
1996 | 0.294 | 0.299 | -0.005 |
1997 | 0.298 | 0.293 | 0.005 |
1998 | 0.298 | 0.292 | 0.006 |
1999 | 0.297 | 0.288 | 0.009 |
2000 | 0.289 | 0.284 | 0.005 |
2001 | 0.290 | 0.286 | 0.004 |
2002 | 0.295 | 0.293 | 0.002 |
2003 | 0.294 | 0.285 | 0.009 |
2004 | 0.298 | 0.292 | 0.005 |
2005 | 0.300 | 0.292 | 0.009 |
2006 | 0.293 | 0.289 | 0.003 |
2007 | 0.291 | 0.288 | 0.003 |
2008 | 0.297 | 0.290 | 0.007 |
2009 | 0.296 | 0.288 | 0.008 |
2010 | 0.292 | 0.283 | 0.008 |
2011 | 0.292 | 0.290 | 0.002 |
2012 | 0.294 | 0.288 | 0.006 |
2013 | 0.293 | 0.287 | 0.006 |
LOB% Comparisons
Year | Starter LOB% | Reliever LOB% | Difference |
---|---|---|---|
1995 | 69.9% | 73.4% | -3.5% |
1996 | 70.9% | 73.2% | -2.4% |
1997 | 69.5% | 72.7% | -3.2% |
1998 | 69.9% | 73.1% | -3.2% |
1999 | 70.6% | 73.2% | -2.7% |
2000 | 71.4% | 74.3% | -2.8% |
2001 | 70.9% | 74.0% | -3.1% |
2002 | 70.2% | 72.3% | -2.0% |
2003 | 70.7% | 73.8% | -3.1% |
2004 | 70.4% | 74.0% | -3.6% |
2005 | 70.6% | 72.9% | -2.3% |
2006 | 70.9% | 74.2% | -3.3% |
2007 | 71.5% | 74.0% | -2.4% |
2008 | 71.3% | 73.9% | -2.6% |
2009 | 71.7% | 74.3% | -2.6% |
2010 | 72.0% | 75.3% | -3.3% |
2011 | 72.0% | 74.6% | -2.6% |
2012 | 73.1% | 76.2% | -3.1% |
2013 | 71.9% | 75.5% | -3.6% |
With the exception of BABIP in ’96, relievers always had better luck than starters. Batters simply don't get on base as often, upon putting the ball in play fairly between the two white lines, when they're facing guys who didn't throw the first pitch of the game. And when batters do get on, they don't come around to score as often. Relievers mean bad news for hitters, if good news means scoring more runs.
Which is why we have to be careful when we issue exemptions to the assumptions of our favorite tools. There are a lot of solid methodologies that go into the formulation of FIP, but FIP is handicapped by the forced assumption that everyone is the same at the things they supposedly can't control. Value is the big idea, probably the biggest idea, and it's entirely shaped by how one chooses to look at something. In this case that something is pitching, and what it means to be a guy who only pitches roughly one inning at a time. Or perhaps it's about this: what it means to be a guy who looks at a guy who pitches roughly one inning at a time, and then decides the worth of the guy who pitches those innings, assuming that one wishes to win baseball games.
The A’s and Rays just spent a bunch of money on relievers, after all. And we’re pretty sure they’re not dumb, probably.
Brandon Reppert is a computer "scientist" who finds talking about himself in the third-person peculiar.
“FIP is handicapped by the forced assumption that everyone is the same at the things that they supposedly can’t control”
All the data you presented here only reinforces my dislike for using the word ‘luck’ to describe almost anything about baseball.
Also, I haven't done the calculations to determine how statistically significant this data is, but it seems very clear that FIP is ignoring some skill(s), which skews the picture it presents away from reality.
There are 535,109 innings of data for starting pitchers in the tables above. For relievers: 270,984 innings of data.
Excellent article. One question I have about the whole starters vs. relievers RA9/FIP comparison is the treatment of inherited runners.
http://www.fangraphs.com/blogs/relief-pitching-in-context/
As explained in the article above, starting pitchers are treated somewhat unfairly when it comes to inherited runners, and I wonder how league-wide cRA9 (from that article, context-RA9 or RE24-based RA9) tracks by year for starters vs. relievers.
The more I try to wrap my sleep-deprived head around it, the more I go back and forth. Over large numbers of innings, the runs allowed would approach the run expectancy tables, so maybe there would be no effect. On the other hand, relievers brought in to face a specific batter usually gain a platoon advantage and probably beat the run expectancy table, which would widen the relievers' advantage over FIP even further.
So, in conclusion, taking inherited runners into account and assigning ‘blame’ for them scoring according to run expectancy tables would either help starters, help relievers, or have no effect, and I’m just not smart enough to work it out.
Tom Tango showed recently that RA is a better indicator of skill than FIP after 6-7 full seasons. It is all about sample size and regression.
It is no longer a question of whether pitchers have some impact on BABIP, LOB%, HR/FB, etc. The issue is, and always has been, how much data is needed to find it among all of the randomness.
And then, of course, there's the notion that nothing is actually random; rather, our data is imprecise or blunt, or we simply don't understand the data we do have well enough.
Even a coin flip isn't random; it's an elegant physics equation. Baseball is quite a bit more complicated, especially because it's a game of relative skill. Obtaining and then understanding granularity is the only way we can get better without having to rely on huge sample sizes. There are so many factors in baseball that we'll never understand all of them, but we can keep getting better at understanding them. We're really just getting started on even the simple vagaries of BABIP. Which is all just too many words to say that we need more data, and we need to use the data we do have better.
This is why I like SIERA. SIERA isn't perfect (it doesn't predict Kershaw's dominance, for instance), but it at least goes further than Ks, BBs, and HRs. SIERA assumes relievers can sustain a lower BABIP than starters because of the advantages you mentioned (less than once through the order, no need to hold back).
Because of the advantage in BABIP, relievers can also sustain a higher LOB% than starters. That's because the more base runners you allow, the lower the LOB% you can expect. Let one runner reach in an inning and, as long as it wasn't on a homer, you don't give up a run. Allow four base runners and you're giving up at least one run, and more likely two or three. SIERA takes this into account as well, by using BABIP along with BB rate to estimate LOB%.
Just for clarification, SIERA doesn't actually use BABIP or LOB%; it simply accounts for the fact that high-strikeout pitchers have added benefits that aren't captured by FIP (such as inducing weaker contact, which drives a lower BABIP) by making the effect of strikeouts on SIERA non-linear (among other adjustments).
Your point about SIERA being better for comparing starters to relievers is right on. Over the past four years:
Group | ERA | FIP | xFIP | SIERA |
---|---|---|---|---|
Starters | 4.10 | 4.05 | 4.01 | 4.07 |
Relievers | 3.72 | 3.83 | 3.92 | 3.58 |
xFIP does the worst job of comparing the two, because it assumes that pitchers have no control over HR/FB rate. FIP is a little better because it does include homers (HR/FB of 9.5% for relievers vs. 10.6% for starters). SIERA, which places greater weight on high strikeout rates (21% for relievers vs. 18% for starters), might actually favor relievers a bit too much given these numbers.
Starters have lineups configured against their handedness, while relievers are able to take advantage of platoon splits. Managers control the game flow around this, which is why we see relievers come in for one batter in the middle of an inning, only to be swapped out for the next batter. Good managers play the matchups, and on top of that, batters' performance against the same starter improves as the game goes on, as they learn his pitch sequencing and see all of his pitches; a reliever who faces a batter only once keeps the advantage of mystique. I do believe WAR probably undervalues relief pitchers, which is why we see relievers get exorbitant contracts in terms of $/WAR, but I think that also has to do with the binary result of having a closer in a game.