Weighting Past Results: Starting Pitchers
My article on weighting a hitter’s past results was supposed to be a one-off study, but after reading a recent article by Dave Cameron I decided to expand the study to cover starting pitchers. The relevant inspirational section of Dave’s article is copied below:
“The truth of nearly every pitcher’s performance lies somewhere in between his FIP-based WAR and his RA9-based WAR. The trick is that it’s not so easy to know exactly where on the spectrum that point lies, and its not the same point for every pitcher.”
Dave’s work is consistently great. This, however, is a rather hand-wavy explanation of things. Is there a way to figure out where pitchers have typically fallen on this scale in the past, so that we can make more educated guesses about a pitcher’s true skill level? We have the data, so we can try.
So, how much weight should be placed on ERA and FIP respectively? Like Dave said, the answer will be different in every case, but we can establish some solid starting points. Also, since we’re trying to predict pitching results and not just measure historical value, we’re going to factor in the very helpful xFIP and SIERA metrics.
Now for the methodology paragraph: to test this I’m going to use every pitcher season since 2002 (when FanGraphs starts recording xFIP/SIERA data) in which a pitcher threw at least 100 innings, and then weight all of the relevant metrics for that season to create an ERA prediction for the following season. I’ll then look at the difference between the following season’s predicted and actual ERA, and calculate the average miss. The smaller the average miss, the better the weights. Simple. As an added note, I have weighted each pitcher’s miss by his innings pitched in the second (predicted vs. actual) season, so that a pitcher who threw 160 innings in that season counts for more than a pitcher who threw only 40.
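To make the scoring concrete, here is a minimal sketch of an innings-weighted average miss. It assumes the "miss" is the absolute difference between predicted and actual ERA, which the write-up above does not spell out. The function name and the sample numbers are made up for illustration and are not from the actual data set.

```python
def weighted_average_miss(rows):
    """Innings-weighted mean absolute miss.

    rows: iterable of (predicted_era, actual_era, next_season_ip) tuples.
    Assumption: the miss is |predicted - actual|, weighted by the innings
    pitched in the season being predicted.
    """
    rows = list(rows)  # allow generators, since we iterate twice
    total_ip = sum(ip for _, _, ip in rows)
    if total_ip <= 0:
        raise ValueError("need at least one pitcher with innings pitched")
    weighted_miss = sum(abs(pred - actual) * ip for pred, actual, ip in rows)
    return weighted_miss / total_ip


# Hypothetical example: a 160 IP pitcher missed by 0.50, a 40 IP pitcher by 1.00.
sample = [(3.50, 4.00, 160), (4.00, 3.00, 40)]
print(round(weighted_average_miss(sample), 4))
```

In this toy example the result is 0.6 rather than the unweighted 0.75, because the 160-inning pitcher's smaller miss counts four times as much as the 40-inning pitcher's.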
How predictive is each of the relevant stats without weights? I am nothing without my tables, so here we go. (There are going to be a lot of tables along the way to our answers. If you’re just interested in the final results, go ahead and skip on down toward the bottom.)
Metric | Average Miss |
---|---|
ERA | .8933 |
FIP | .7846 |
xFIP | .7600 |
SIERA | .7609 |
This doesn’t really tell us anything we don’t already know: SIERA and xFIP are similar, and FIP is a better predictor than ERA. Let’s start applying some weights to see if we can increase accuracy, starting with ERA/SIERA combos.
ERA% | SIERA% | Average Miss |
---|---|---|
50% | 50% | .7750 |
75% | 25% | .8218 |
25% | 75% | .7530 |
15% | 85% | .7527 |
10% | 90% | .7543 |
5% | 95% | .7571 |
We can already see that factoring in just a small amount of ERA improves our results: the average miss drops from .7609 for SIERA alone to .7527 at a 15/85 split. When you’re predicting a pitcher’s future, therefore, you can’t rely fully on xFIP or SIERA to be your fortune teller. You can’t lean on ERA too hard either, though, since once you get above roughly 25% your projections begin to go awry. OK, so we know how SIERA and ERA combine, but what if we use xFIP instead?
ERA% | xFIP% | Average Miss |
---|---|---|
25% | 75% | .7530 |
15% | 85% | .7530 |
10% | 90% | .7549 |
5% | 95% | .7560 |
Using xFIP didn’t really improve our results at all. SIERA consistently outperforms xFIP (or is at worst only marginally beaten by it) throughout pretty much all weighting combinations, and so from this point forward we’re just going to use SIERA. Just know that SIERA is basically xFIP, and that there are only slight differences between them because SIERA makes some (intelligent) assumptions about pitching. Now that we’ve established that, let’s try throwing out ERA and use FIP instead.
FIP% | SIERA% | Average Miss |
---|---|---|
50% | 50% | .7563 |
25% | 75% | .7543 |
15% | 85% | .7560 |
10% | 90% | .7570 |
It’s interesting that ERA/SIERA combos are more predictive than FIP/SIERA combos, even though FIP on its own is more predictive than ERA. This is likely because a lot of pitchers have consistent team factors that show up in ERA but are cancelled out by FIP. We’ll explore that more later, but for now we’re going to see whether any ERA/FIP/SIERA combos give us better results.
ERA% | FIP% | SIERA% | Average Miss |
---|---|---|---|
25% | 25% | 50% | .7570 |
15% | 15% | 70% | .7513 |
10% | 10% | 80% | .7520 |
5% | 15% | 80% | .7532 |
10% | 15% | 75% | .7517 |
15% | 25% | 60% | .7520 |
15% | 25% | 65% | .7517 |
There are three values here that are all pretty good. The important thing to note is that ERA/FIP/SIERA combos offer more consistently good results than any two stats alone. SIERA should be your main consideration, but ERA and FIP should not be discarded, since the best combos miss by roughly .01 less than SIERA alone (.7513 vs. .7609). It’s a small difference, but it’s there.
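The tables above were built by trying weight combinations by hand. A brute-force version of the same idea might look like the sketch below. It assumes each pitcher season is a dict with hypothetical keys `era`, `fip`, `siera`, `next_era` and `next_ip`, and that the miss is an innings-weighted absolute difference; neither the data layout nor the 5% step size comes from the article.

```python
def average_miss(seasons, weights):
    """Innings-weighted mean absolute miss for one (ERA, FIP, SIERA) weighting."""
    w_era, w_fip, w_siera = weights
    total_ip = sum(s["next_ip"] for s in seasons)
    total = 0.0
    for s in seasons:
        predicted = w_era * s["era"] + w_fip * s["fip"] + w_siera * s["siera"]
        total += abs(predicted - s["next_era"]) * s["next_ip"]
    return total / total_ip


def grid_search(seasons, step=5):
    """Try every ERA/FIP/SIERA split (in whole percentage steps) summing to 100%."""
    best = None
    for era_pct in range(0, 101, step):
        for fip_pct in range(0, 101 - era_pct, step):
            siera_pct = 100 - era_pct - fip_pct
            weights = (era_pct / 100, fip_pct / 100, siera_pct / 100)
            score = average_miss(seasons, weights)
            if best is None or score < best[0]:
                best = (score, weights)
    return best  # (lowest average miss, (w_era, w_fip, w_siera))


# Made-up seasons, purely to show the call shape.
toy_seasons = [
    {"era": 3.00, "fip": 3.50, "siera": 4.00, "next_era": 4.00, "next_ip": 150},
    {"era": 5.00, "fip": 4.00, "siera": 3.50, "next_era": 3.50, "next_ip": 100},
]
print(grid_search(toy_seasons))
```

With only a handful of good combinations separated by a few ten-thousandths, a search like this is best read as finding a neighborhood of reasonable weights rather than a single correct answer.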
Now I’m going to go back to something that I mentioned previously–should a player be evaluated differently if he isn’t coming back to the same team? The answer to this is a pretty obvious yes, since a pitcher’s defense/park/source of coffee in the morning will change. Let’s narrow down our sample to only pitchers that changed teams, to see if different numbers work better. These numbers will be useful when evaluating free agents, for example.
ERA% | FIP% | SIERA% | Average Miss (changed teams) |
---|---|---|---|
10% | 15% | 80% | .7932 |
5% | 15% | 80% | .7918 |
2.5% | 17.5% | 80% | .7915 |
2.5% | 20% | 77.5% | .7915 |
2.5% | 22.5% | 75% | .7917 |
As suspected, ERA loses a lot of its usefulness when a player is switching teams, while FIP retains its marginal usefulness and SIERA carries more weight. Another thing to note is that it’s just straight-up harder to predict pitcher performance when a pitcher is changing teams, no matter what metric you use. SIERA’s own average miss rises to .793 when only dealing with pitchers that change teams, a noticeable difference from the .760 value above for all pitchers.
For those of you who have made it this far, it’s time to join back in with those who skipped down toward the bottom. Here’s a handy little chart that shows the optimal weights found above for evaluating pitchers:
Optimal Weights
Team | ERA% | FIP% | SIERA% | Average Miss |
---|---|---|---|---|
Same | 10% | 15% | 75% | .7517 |
Different | 2.5% | 17.5% | 80% | .7910 |
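As a quick illustration of how the table above might be applied, here is a small sketch that blends one season of ERA, FIP and SIERA using the "Same" or "Different" row. The function name and the sample stat line are hypothetical; only the weights come from the table.

```python
# Weights from the Optimal Weights table: (ERA, FIP, SIERA).
OPTIMAL_WEIGHTS = {
    "same": (0.10, 0.15, 0.75),
    "different": (0.025, 0.175, 0.80),
}


def predict_next_era(era, fip, siera, changed_teams=False):
    """Blend last season's ERA, FIP and SIERA into a next-season ERA estimate."""
    key = "different" if changed_teams else "same"
    w_era, w_fip, w_siera = OPTIMAL_WEIGHTS[key]
    return w_era * era + w_fip * fip + w_siera * siera


# Hypothetical pitcher: 3.00 ERA, 3.40 FIP, 3.60 SIERA last season.
print(round(predict_next_era(3.00, 3.40, 3.60), 2))
print(round(predict_next_era(3.00, 3.40, 3.60, changed_teams=True), 2))
```

For this made-up pitcher the estimate is 3.51 if he stays put and 3.55 if he changes teams, since the team-change weights lean less on his low ERA.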
Of course, any reasonable projection should take more than just one year of data into account. The point of this article was not to show a complete projection system, but more to explore how much weight to give to each of the different metrics we have available to us when evaluating pitchers. Regardless, I’m going to expand the study a little bit to give us a better idea of weighting years by establishing weights over a two-year period. I’m not going to show my work here mostly out of an honest effort to spare you from having to dissect more tables, so here are the optimal two year weights:
ERA% Year 1 | FIP% Year 1 | SIERA% Year 1 | ERA% Year 2 | FIP% Year 2 | SIERA% Year 2 | Average Miss |
---|---|---|---|---|---|---|
5% | 5% | 30% | 7.5% | 7.5% | 45% | .742 |
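The two-year weights in the table above can be applied the same way. In this sketch, "Year 2" is assumed to be the more recent season, since it carries the larger weights; the article does not state this explicitly. The function name and input layout are also assumptions for illustration.

```python
# Weights from the two-year table. Assumption: year2 is the more recent season.
TWO_YEAR_WEIGHTS = {
    "year1": {"era": 0.05, "fip": 0.05, "siera": 0.30},
    "year2": {"era": 0.075, "fip": 0.075, "siera": 0.45},
}


def two_year_prediction(year1, year2):
    """Blend two seasons of ERA/FIP/SIERA into one ERA estimate.

    year1, year2: dicts with keys "era", "fip" and "siera".
    """
    total = 0.0
    for label, stats in (("year1", year1), ("year2", year2)):
        for metric, weight in TWO_YEAR_WEIGHTS[label].items():
            total += weight * stats[metric]
    return total


# Hypothetical pitcher whose metrics were all 3.00 two years ago and 4.00 last year.
older = {"era": 3.00, "fip": 3.00, "siera": 3.00}
recent = {"era": 4.00, "fip": 4.00, "siera": 4.00}
print(round(two_year_prediction(older, recent), 2))
```

The six weights sum to 100%, split 40/60 between the two seasons, so this toy pitcher comes out at 3.60.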
As expected, using multiple years increases our accuracy (by roughly .01 ERA per pitcher, from .7517 to .742). Also note that these numbers are for evaluating all pitchers, so if you’re dealing with a pitcher who is changing teams you should tweak ERA down while tweaking FIP and SIERA up. And, again, as Dave stated, each pitcher is a case study: each pitcher warrants their own more specific analysis. But be careful when you’re changing weights. Make sure that you have a really solid reason for your tweaks, and make sure that you’re not tweaking the numbers too much, because when you start thinking that you’re significantly smarter than historical tendencies you can get into trouble. So these are your starting values; carefully tweak from here. Go forth, smart readers.
As a parting gift to this article, here’s a list of the top 20 predictions for pitchers using the two-year model described above. Note that this will inherently exclude one-year pitchers such as Jose Fernandez and pitchers that failed to meet the 100 IP as a starter requirement in either of the past two years. Also note that these numbers do not include any aging curves (aging curves are well outside the scope of this article), which will obviously need to be factored into any finalized projection system.
# | Pitcher | Weighted ERA prediction |
---|---|---|
1 | Clayton Kershaw | 2.93 |
2 | Cliff Lee | 2.94 |
3 | Felix Hernandez | 2.95 |
4 | Max Scherzer | 3.01 |
5 | Stephen Strasburg | 3.03 |
6 | Adam Wainwright | 3.11 |
7 | A.J. Burnett | 3.22 |
8 | Anibal Sanchez | 3.22 |
9 | David Price | 3.24 |
10 | Madison Bumgarner | 3.33 |
11 | Alex Cobb | 3.36 |
12 | Cole Hamels | 3.36 |
13 | Zack Greinke | 3.41 |
14 | Justin Verlander | 3.41 |
15 | Doug Fister | 3.46 |
16 | Marco Estrada | 3.48 |
17 | Gio Gonzalez | 3.53 |
18 | James Shields | 3.53 |
19 | Homer Bailey | 3.57 |
20 | Mat Latos | 3.60 |
Brandon Reppert is a computer "scientist" who finds talking about himself in the third-person peculiar.
Very interesting. What now needs to be done, though, is to do this for each different type of pitchers: groundball pitchers, flyball pitchers, high-strikeout pitchers, pitchers who outperform their FIP, pitchers who underperform their FIP, young pitchers, old pitchers, Matt Cain, starters, relievers, even power pitchers vs. deceptive pitchers. My guess is that could make it a lot more accurate. Good stuff.
Definitely, and more importantly this would strike at the ‘why’: why these factors are not terribly predictive. As always, it’s probably because we don’t understand the components that make up these results well enough.
Also I chuckled heavily at Matt Cain’s inclusion in the list.
You need standard errors in here. Mean differences mean nothing without standard error.
It would make it better, yes, but it doesn’t make this mean “nothing”. Next time I do something like this I’ll use RMSE.
The main reason I used mean differences is because it’s in units that we more intuitively understand when interpreting the data.
Interesting research, although wouldn’t it be easier just to figure out how to adjust FIP to include ground ball and line drive rates?
That would essentially be tERA, which is only marginally effective.
Actually that is just bbFIP.