Trying to Improve fWAR: Part 1
FanGraphs Wins Above Replacement is considered by many in the sabermetric community to be the holy grail of WAR. And, even though I’m writing a piece that is critical of fWAR, FanGraphs is still the first website I go to when I want to get a basic understanding of a specific player or team’s value. Don’t view this article as an attack on fWAR or FanGraphs, both of which I use frequently; instead, consider this article as constructive criticism.
fWAR, specifically for pitchers, is riddled with minor problems that together make the metric less valuable. In Part 1 of the series, we’re going to look at a hotly debated issue regarding fWAR that has been brought up by other readers before: the fWAR park factors.
According to the FanGraphs glossary, a basic runs park factor is used when calculating fWAR. Because FIP models ERA, using runs park factors for FIP shouldn’t be a problem.
Unfortunately, this idea simply isn’t true. The inputs of FIP (HR/9, BB/9, and K/9) only cover about 30% of plate appearances. Some ballparks (Citi Field, for example) inflate HR/9 and FIP despite suppressing runs in general. If Pitcher fWAR is based on FIP, then FIP park factors, not runs park factors, must be used. Below is a table comparing runs and FIP park factors for different teams/ballparks. The FIP park factor is calculated as ((13*HRPF)+(3*BBPF)-(2*SOPF))/(14), with all of the data coming from the FanGraphs park factors.
Season | Team | Basic | FIP | Difference |
---|---|---|---|---|
2014 | Reds | 101 | 112 | -11 |
2014 | Brewers | 103 | 111 | -8 |
2014 | White Sox | 104 | 111 | -7 |
2014 | Yankees | 103 | 110 | -7 |
2014 | Mets | 95 | 102 | -7 |
2014 | Phillies | 100 | 106 | -6 |
2014 | Dodgers | 96 | 101 | -5 |
2014 | Orioles | 102 | 107 | -5 |
2014 | Blue Jays | 103 | 108 | -5 |
2014 | Astros | 100 | 104 | -4 |
2014 | Indians | 97 | 100 | -3 |
2014 | Padres | 94 | 96 | -2 |
2014 | Mariners | 97 | 97 | 0 |
2014 | Rays | 95 | 95 | 0 |
2014 | Rangers | 106 | 106 | 0 |
2014 | Braves | 99 | 99 | 0 |
2014 | Diamondbacks | 104 | 103 | 1 |
2014 | Cubs | 102 | 101 | 1 |
2014 | Rockies | 117 | 116 | 1 |
2014 | Tigers | 102 | 101 | 2 |
2014 | Nationals | 100 | 97 | 3 |
2014 | Angels | 95 | 92 | 3 |
2014 | Athletics | 97 | 93 | 4 |
2014 | Cardinals | 98 | 94 | 4 |
2014 | Giants | 93 | 88 | 5 |
2014 | Royals | 101 | 96 | 5 |
2014 | Twins | 101 | 95 | 6 |
2014 | Pirates | 97 | 89 | 8 |
2014 | Red Sox | 104 | 96 | 8 |
2014 | Marlins | 101 | 90 | 11 |
In addition, the standard deviation of the difference between the Basic and FIP park factors was a staggering 5.5. Clearly, using runs park factors on FIP significantly helps some teams’ Pitcher fWAR and hurts others’.
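As a rough check, the 5.5 figure can be approximately reproduced from the rounded numbers in the table. This assumes the figure is the sample standard deviation of the Difference column, which is my reading and not something the article states. Because the table values are rounded, the result will not match an unrounded calculation exactly.

```python
import statistics

# Difference column (Basic minus FIP) copied from the 2014 table, top to bottom.
differences = [
    -11, -8, -7, -7, -7, -6, -5, -5, -5, -4,
    -3, -2, 0, 0, 0, 0, 1, 1, 1, 2,
    3, 3, 4, 4, 5, 5, 6, 8, 8, 11,
]

# Sample standard deviation (n - 1 in the denominator); an assumed interpretation.
sample_sd = statistics.stdev(differences)

# Mean absolute gap between the two park factors, as a second view of the spread.
mean_abs_diff = statistics.mean([abs(d) for d in differences])

print(f"teams: {len(differences)}")
print(f"sample standard deviation: {sample_sd:.2f}")
print(f"mean absolute difference: {mean_abs_diff:.2f}")
```

The mean absolute difference is included only as a sanity check on the scale of the disagreement between the two sets of park factors.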
The Marlins, Red Sox, Pirates, Twins, and Royals benefit from runs park factors that overestimate their ballparks’ FIP-inflating ability, which falsely increases their Pitcher fWAR. The Reds, Brewers, White Sox, Yankees, and Mets experience the opposite effect, which falsely decreases theirs.
Looking at the team pitching leaderboards, the effect of this mistake is pronounced on several teams’ fWAR. For example, the Mets, despite ranking 9th in the National League in FIP while playing in a ballpark that inflates FIP by 2%, rank dead last in the National League in Pitcher fWAR. Similarly, the Red Sox rank 5th in the AL in Pitcher fWAR despite ranking 10th in the AL in FIP and playing in a ballpark that suppresses FIP by 4%.
Using FIP park factors instead of runs park factors is a simple change that would vastly improve the accuracy of Pitcher fWAR. In the next segment of “Trying to Improve fWAR”, I’ll examine the league adjustments (or lack thereof) in both Position Player and Pitcher fWAR.
Founder of NothingButNumbers.com
Good Stuff. I’ve noticed this before but never really knew what to attribute this to. Really interesting that it might be park factors.
I wonder if this is why the Reds routinely have seasonal projections of RA that miss the mark on the high side? I had always chalked that up to an expected regression of team defensive performance that simply didn’t happen. But perhaps it’s also not giving their pitchers enough credit.
The crazy thing about this error is how much it affects the opinions of statistically minded analysts.
While I love Yordano Ventura, he is coming off a season where he had a 3.60 FIP in a ballpark that deflates FIP by 4% in a league where the average FIP is 3.79. Zack Wheeler, on the other hand, had a 3.55 FIP in a ballpark that inflates FIP by 2% in a league where the average FIP is 3.69. Both pitchers threw about 185 IP. Pretty similar pitchers, right?
Zack Wheeler was worth 1.8 fWAR and is projected for 1.3 WAR in 2015. Yordano Ventura was worth 2.8 WAR, is projected for 2.1 WAR, and currently ranks 33rd on FanGraphs’ trade value rankings, ahead of the unranked Wheeler, a much better (and also unranked) Jacob deGrom, and Jose freaking Fernandez.
Great work, Noah.
why didn’t i think of this
The Tigers should be 1, unless it’s a rounding thing, right?
It’s a rounding thing. Their FIP park factor was actually exactly 100.5, rounding up to 101. The difference of 1.5 then rounded up to 2.
If 100.5 rounds to 101, then 1.5 must round down. This will avoid this issue. Evens round up, odds round down. Works just as well the other way too.
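The rounding question in this exchange can be sketched in a few lines. The Tigers' Basic factor of 102 comes from the table and the unrounded FIP factor of 100.5 comes from the reply above; treating 102 as exact is an assumption. The helper names are hypothetical, and `round_half_odd` is my interpretation of the "evens round up, odds round down" rule.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value: str) -> int:
    """Round ties away from zero (100.5 -> 101, 1.5 -> 2)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def round_half_even(value: str) -> int:
    """Round ties to the nearest even integer (100.5 -> 100, 1.5 -> 2)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


def round_half_odd(value: str) -> int:
    """Round ties to the nearest odd integer (100.5 -> 101, 1.5 -> 1)."""
    d = Decimal(value)
    lower = int(d.to_integral_value(rounding=ROUND_FLOOR))
    if d - lower == Decimal("0.5"):
        return lower if lower % 2 == 1 else lower + 1
    return round_half_up(value)


basic = Decimal("102")      # from the table; assumed exact
fip_pf = Decimal("100.5")   # unrounded value quoted in the reply
difference = basic - fip_pf  # 1.5

for rule in (round_half_up, round_half_even, round_half_odd):
    shown_fip = rule(str(fip_pf))
    shown_diff = rule(str(difference))
    consistent = int(basic) - shown_fip == shown_diff
    print(rule.__name__, shown_fip, shown_diff, "consistent" if consistent else "inconsistent")
```

Under these assumptions, rounding every tie upward is the only one of the three rules that leaves the displayed row internally inconsistent. Either tie-breaking rule that alternates direction keeps the displayed difference equal to the difference of the displayed factors.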
This is interesting stuff.
I’m thinking now about what FIP and WAR are supposed to indicate. FIP is supposed to be a statistic that describes how a pitcher has done, and is by design park neutral. Translating that into WAR would then be where park factors are added in to decide the pitcher’s performance value. Yet one problem with FIP is that by being park neutral, it doesn’t account for the possibility that its inputs may be affected by the park. So if Fenway Park, for example, increases the total number of batters that will get on base relative to a different park, the FIP will be different.
Looking at an easy example: Let’s pretend that (apparently superhuman) pitcher A carries a 50 K%, 0 BB%, 0 HR%, and 0 HBP%. All of his innings are pitched at these exact rates. He pitches 1 inning and gives up 1 hit, resulting in 2 K. His FIP would be -4 + Constant. Let’s say he pitches another inning and gives up 3 hits, resulting in 3 K. His FIP would be -6 + Constant. In both cases, his rate stats are the same, yet something like his ballpark could end up affecting his FIP.
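The arithmetic in this example can be checked with a minimal sketch of the standard FIP formula, (13*HR + 3*(BB+HBP) - 2*K)/IP + Constant. The function name and the default constant of 0 are illustrative choices, so the printed values are the "-4" and "-6" parts before the constant is added.

```python
def fip(hr: int, bb: int, hbp: int, k: int, ip: float, constant: float = 0.0) -> float:
    """Fielding Independent Pitching from raw totals and innings pitched."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Inning 1: 1 hit allowed, 2 strikeouts, no walks, homers, or hit batters.
first_inning = fip(hr=0, bb=0, hbp=0, k=2, ip=1)

# Inning 2: 3 hits allowed, 3 strikeouts, same 50% strikeout rate.
second_inning = fip(hr=0, bb=0, hbp=0, k=3, ip=1)

print(first_inning, second_inning)
```

The strikeout rate per batter is identical in both innings, but the extra hits in the second inning mean more batters faced per inning, which is what moves the IP-based figure.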
Getting back to park factors translating from FIP to WAR, relying solely on the FIP factors might be less accurate than: adjusting the pitcher’s FIP to match his ballpark (i.e. adjust the K/BB/HR/HBP inputs to match his ballpark, such as a park increasing batters that reach base by 5% so reduce all of those inputs by 5.25%); and then translate FIP into WAR based on how you stated above.
I’m not sure what your point was in the second half of your comment, but I would disagree with your statement that “FIP is designed to be park neutral”.
In reality, FIP is designed to be independent of the defense, not the ballpark. Home runs have the biggest input in FIP and are definitely dependent on the ballpark, and while K and BB park factors are minor, they still have a small effect on FIP.
What I meant by park neutral is probably not how the term is generally used. As you said, the inputs of FIP are affected by the park. I meant that the statistic itself does not account for these differences in producing a statistic that can be compared across parks without being affected by them. Poor wording on my behalf.
What I was trying to say in the second half is that the parks can actually affect BABIP, which affects the total number of batters faced. Even if a pitcher maintains the exact same rate stats (K%/BB%/HR%/HBP%), their raw totals (K/BB/HR/HBP) will be affected by their BABIP, and, thus, by their park to some extent. Since FIP deals with raw totals instead of rate stats, and since different raw totals result in different FIPs despite equivalent rates (i.e. 100 K/50 BB/10 HR/10 HBP produces a different FIP than 50 K/25 BB/5 HR/5 HBP), reducing the raw totals by how much a park factor adjusts the batters that reach base might be more accurate.
It’s a much more minor quibble than the one you pointed out in this article, and it’s probably a bit pedantic of me to think so much about. In fact, it’s probably more of a critique of FIP than of your evaluation. Regardless, I think it’s something that can be improved upon.
Oh my god. Thank you so much for this. I’ve been trying to figure out why fWAR hates Mets pitchers so much – I assumed it was something wonky in the park factors, but I didn’t know enough about it to research it. I KNEW Zack Wheeler and Jon Niese weren’t below average.
I still want to know why the Mets bullpen is a full 2 wins worse than the Astros (and in the negatives) when they have the same FIP-.
One thing I didn’t mention is that fWAR also considers pop ups to be strikeouts in its calculation. The Astros bullpen ranked 6th in the majors in pop ups, the Mets bullpen ranked 26th.
Does that mean FIP ballpark factors take popups into account as well? The amount of foul territory seems pretty significant in that regard.
You’d have to do a separate IFFB park factor calculation, but the significance of IFFB park factors isn’t huge. The main driver of the difference between FIP and runs park factors is the reliance of FIP on home runs.
Another couple of big lacunae are handedness and batted ball profile for batters. I hope you look at those!
I have a question, though: if there are effects like those above, let’s say you have a player that is worth 2 WAR in a neutral park, but he was acquired by a team whose stadium makes him a 3 WAR player because of his tendencies. Should we aim to fix WAR by adding in the unaccounted factors until players are suspended in the grey jelly of neutral true talent, or should WAR reflect the fact that some players are better tailored for some environments and teams know how to exploit this? If park factors aren’t just blanket phenomena that affect all players equally, should there be two WARs? One that’s totally context neutral and one that reflects the value that is created through the interaction of player and environment that is specific to that player?
For example, take Pablo Sandoval. I think there was an Eno Sarris piece recently about him being elite at fouling off the ball in two strike counts. If you play him in a park with a huge foul area, how much more does that affect him compared to the player with the lowest foul ball rate? Here’s a hypothetical: he plays for a team with zero foul area, his teammate is the aforementioned player with the lowest foul ball rate, and Sandoval’s offense is better than his teammate’s by exactly the same amount that his foul balls never getting caught improve it, while having exactly the same defensive value. By the current neutralizing teleology of WAR you’d isolate Sandoval’s production from its context and say that he had the same value as his teammate, which seems problematic.
Does this make any sense? I mean it’s probably not a big deal anyway because if these things exist they’re probably marginal.
I think you make a good point about parks affecting different players in different ways. A solution to this problem would be to evaluate players in such a granular way that the ballpark doesn’t affect the evaluation. If you were to use a statistic that doesn’t include HRs (Tango’s bbFIP for example), you could get a reasonable idea of a player’s value without having to use park factors at all, because all of the inputs are essentially ballpark neutral.
Or you could go through the long and tedious process of park adjusting every component of a player’s stat-line. Personally I’d prefer creating a stat that doesn’t really need to be park adjusted.
This is the reason why they use the runs park factors to begin with. WAR is meant to put a value on what actually happened, not on what would happen in a context neutral state.
A player who hits 30 homers contributes 30 homers. When GMs are putting a value on two players of equal talent, who both would hit 25 homers in a context neutral stadium, the GM will pay more for and choose the player who can turn his true talent 25 homers into an actual 30 homers in his stadium due to extreme home park effects.
The problems illustrated with FIP in this article mistake poor roster construction for problems with WAR. WAR does not underrate high fly ball pitchers playing in homer-happy stadiums; it is baseball itself that is rough on these players.
The bottom line is that WAR is not meant to be predictive of anything. What it is trying to do is act as an accounting tool to simply describe what has happened. Changes to the WAR formula should be things that bring the final numbers closer to lining up with the final totals at season’s end.
I heard you were talking about me?
hey man don’t give me none of that neutralizing teleology
I agree with you. This should be changed immediately.
In addition, while I use IP in the denominator, it would be more precise to use PA instead, namely PA/4.3.
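A small sketch of what the PA/4.3 denominator changes, reusing the two hypothetical innings from the earlier comment thread. The plate appearance counts (4 and 6) are my assumption based on the stated 50% strikeout rate, the function names are illustrative, and the FIP constant is left out.

```python
def fip_numerator(hr: int, bb: int, hbp: int, k: int) -> int:
    """Weighted event total used by FIP: 13*HR + 3*(BB+HBP) - 2*K."""
    return 13 * hr + 3 * (bb + hbp) - 2 * k


def fip_per_ip(hr: int, bb: int, hbp: int, k: int, ip: float) -> float:
    """FIP (without the constant) using innings pitched as the denominator."""
    return fip_numerator(hr, bb, hbp, k) / ip


def fip_per_pa(hr: int, bb: int, hbp: int, k: int, pa: int, pa_per_inning: float = 4.3) -> float:
    """FIP (without the constant) using PA / 4.3 as an innings equivalent."""
    return fip_numerator(hr, bb, hbp, k) / (pa / pa_per_inning)


# Inning A: 4 batters faced (assumed), 2 strikeouts. Inning B: 6 batters faced (assumed), 3 strikeouts.
inning_a_ip = fip_per_ip(0, 0, 0, 2, ip=1)
inning_b_ip = fip_per_ip(0, 0, 0, 3, ip=1)
inning_a_pa = fip_per_pa(0, 0, 0, 2, pa=4)
inning_b_pa = fip_per_pa(0, 0, 0, 3, pa=6)

print(f"IP-based: {inning_a_ip:.2f} vs {inning_b_ip:.2f}")
print(f"PA-based: {inning_a_pa:.2f} vs {inning_b_pa:.2f}")
```

With a PA-based denominator the two innings produce the same value, because identical per-batter rates are no longer distorted by how many batters happened to reach base on balls in play.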
Yeah, I hope this gets changed ASAP.
Thanks for reading! You’re kind of a saber-jdol of mine lol.
*idol
JDOL
FIP is highly flawed, anyway. Why not create a new pitching metric to replace FIP that includes GroundBall%, LineDrive%, and InfieldFly%? Then you can go ahead and use the runs park factor.
…or just use RA9.
SIERA accounts for batted ball types.
In later editions of this series (two or three parts down the road) I’ll suggest new metrics for evaluating pitchers. For now I’ll work in the context of FIP.
FIP has no flaw. It properly combines the statistics it is interested in.
You may be interested in Batted Ball FIP (bbFIP). That also properly combines the statistics it is interested in.
OBP has no flaw. It also properly combines the stats it’s interested in.
The flaw is the user who tries to use the metrics in unintended ways.
In my opinion the only “flaw” with FIP is that it decides to ignore certain statistics that are relevant when calculating WAR. We know that RA Dickey is going to beat his FIP every year because of his low BABIP, which is something that bbFIP captures (more or less) and FIP ignores.
FIP is making a conscious decision to combine statistics that do not involve fielders at all, or any involvement of the fielders is at a minimum. So, HR that just clear the wall, or could be caught, or a catcher saving a strike, etc. That’s why HR, HB, BB, and SO are included.
In some version, infield flies are also included.
But you CANNOT include any other batted balls. It goes against the idea of the “I” in FIP.
That Fangraphs has created a WAR version that uses FIP has NOTHING to do with FIP. FIP is FIP. It’s what it is.
If we want to reweight OBP so that walks count less and HR more, then we don’t rework OBP: we instead create a new metric called wOBA.
And that’s why I have Batted Ball FIP to address the issue of batted balls.
If someone wants to further make it more complicated by adding different adjustment factors based on the percentage of knuckleballs thrown, then call it FIPknuck.
None of that has any bearing on FIP itself.
Agreed. I have no problem with FIP itself. I do take slight issue, however, with a FIP-based WAR. We don’t base WAR off OBP; we base it off the much smarter wOBA. Similarly, we shouldn’t base WAR off of FIP (at least in my opinion); we should base it off of some variant of bbFIP.
Still, I understand that changing the primary statistic that WAR is based on is difficult, both to the fans (it’s hard enough to explain FIP, let alone introduce another new statistic) and in its implementation (there is no batted ball data prior to 2002). That’s why calculating FIP Park Factors correctly is important, even if basing WAR off of FIP is an iffy proposition.
We’ll definitely incorporate this change in some changes we make to fWAR this off-season. Though we may calculate the FIP park factor from scratch as opposed to using the already yearly regressed components. I’ll have to take a look and see what kind of difference it makes.
We may also make the change Tangotiger suggests using PA.
Just to make note, we definitely wanted to see what the community said about this approach as this issue has occasionally come up and I think this quote from the post more or less summed up our response: “According to the FanGraphs glossary, a basic runs park factor is used when calculating fWAR. Because FIP models ERA, using runs park factors for FIP shouldn’t be a problem.”
Anyway, I’m glad this post was well received and thanks Noah as this will help improve our WAR formula. I’ve actually seen your second post in the hopper and we will publish that too. I am curious to hear feedback on that as well and will hold my own comments on it until later.
Am I missing something?
WAR is designed to measure value to the team. It’s not meant to measure value to the average team, or theoretical context neutral value. We seem to have made a decision in the past to not have the stat be an estimation of true talent level, nor on the other side of things have we wanted the stat to incorporate sequencing or ‘clutch’ aspects of a player’s performance.
That said, the reason for using FIP is to factor out the defensive aspect of run prevention for the purpose of allotting that value to defenders. The reason we use park factors in other stats is to measure out true talent performance. Its place in WAR is simply to provide context on the value of a run given the games played. Contributing 3 runs in Coors Field is not as valuable as contributing 3 runs in Petco Park.
Tango’s suggestion of using PAs makes some sense, since the distinction gives credit for caught stealing and double plays to the defense – though pitchers do deserve some credit for that as well (so the change doesn’t seem like too big of a deal).
Moving to FIP based park factors will probably make WAR line up a bit more with true talent, but then we’ll end up talking about how WAR at the team level is overrating teams with a lot of fly ball pitchers in fly ball parks.
We should be crediting pitchers for all batters faced, giving full credit for strikeouts and walks (until the days we include pitch framing) as well as infield pop ups and hit by pitch. They should be losing credit for every home run (not fly ball), and the value of each home run should be weighted based on the BABIP park factor for the stadium (i.e. the likelihood of additional baserunners) and the pitcher’s strikeout rate. The run output value should then be scaled based on the runs park factor.
I ended up running just FIP park factors (not the weighted average of the various park factors that could be involved in FIP), and you actually get much more compressed numbers. I think this is most likely due to FIP having some compression effects on the high/low end, and to HRs perhaps being weighted too heavily in the weighted average:
Terrific.
Coors has a huge BIP park factor, which is why you might see a big difference between the runs factor and the FIP factor.
This is great. I actually realized a few days ago that the way I calculated the FIP park factors was incorrect, but I simply didn’t have the data to calculate them correctly.
Just curious – how did you go about calculating these FIP park factors?
This is wrong:
((13*HRPF)+(3*BBPF)-(2*SOPF))/(14)
You can’t weight the factors this way. You have to weight them by the coefficient times the frequency of those events.
Imagine for example that this is 1908 and there are almost no HR in the league. We obviously would want that to have a tiny impact overall.
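One possible reading of this correction, sketched under stated assumptions: scale each event's league rate by its park factor, rebuild FIP with the usual coefficients, and express the result relative to league FIP. The per-inning rates and the FIP constant below are hypothetical placeholders rather than real league data, and this is not necessarily how FanGraphs computes its factors.

```python
def naive_fip_pf(hr_pf: float, bb_pf: float, so_pf: float) -> float:
    """The article's formula: coefficients applied directly to the park factors."""
    return (13 * hr_pf + 3 * bb_pf - 2 * so_pf) / 14


def weighted_fip_pf(hr_pf: float, bb_pf: float, so_pf: float,
                    hr_rate: float, bb_rate: float, so_rate: float,
                    constant: float) -> float:
    """Park factor from FIP rebuilt with coefficient * frequency weights.

    Rates are league events per inning; park factors are on the 100 scale.
    """
    league_fip = 13 * hr_rate + 3 * bb_rate - 2 * so_rate + constant
    park_fip = (13 * hr_rate * hr_pf / 100
                + 3 * bb_rate * bb_pf / 100
                - 2 * so_rate * so_pf / 100
                + constant)
    return 100 * park_fip / league_fip


# Hypothetical modern-style environment: homers are fairly common.
modern = weighted_fip_pf(110, 100, 100, hr_rate=0.10, bb_rate=0.33, so_rate=0.85, constant=3.1)

# Hypothetical dead-ball-style environment: homers are rare, everything else unchanged.
dead_ball = weighted_fip_pf(110, 100, 100, hr_rate=0.01, bb_rate=0.33, so_rate=0.85, constant=3.1)

print(f"naive:     {naive_fip_pf(110, 100, 100):.1f}")
print(f"modern:    {modern:.1f}")
print(f"dead ball: {dead_ball:.1f}")
```

In this sketch the same 110 home run park factor moves the FIP park factor much less when home runs are rare, which is the behavior the 1908 thought experiment asks for. The naive formula gives the same answer regardless of how often each event occurs.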
Yeah, I realized that when I was thinking about adding infield fly balls into the equation. Either way I made my point, because I didn’t have actual FIP park factors at my disposal in the first place.
I agree, it’s a great point, and that something so obvious (now that it’s pointed out) took this many years to be highlighted shows that we need more careful thinkers out there.
As a matter of curiosity in regard to infield fly balls, how do you separate routine catches from !catches?
What do you do with fly balls that could be caught routinely by either an infielder or an outfielder? You sometimes see three players who could make the catch.
Is there a way to separate outfielders who play shallow or deep, possibly affecting the number of infielder chances?
Although it is possible that there won’t be a major bearing on the matter, I think factors like I mentioned above need to be considered.