The Best Predictors of Second-Half ERA

I play a lot of fantasy baseball and am always looking for an edge. When scouting possible waiver and trade pitching targets, I normally compare a player’s ERA with his FIP and xFIP in order to find pitchers who are underperforming their peripherals and are thus undervalued. This is a very common process among fantasy owners. But when are the peripherals not indicative of future performance? Take, for example, Clay Buchholz, who had a 3.26 ERA but a far better 2.62 FIP through 113.1 innings before the All-Star break. He looks like a classic buy-low candidate (I did buy low, and he has a 2.02 ERA in 75.2 IP since I added him on May 15th). However, Steamer has him as a 3.76 ERA/3.54 FIP pitcher, with far different walk and strikeout numbers than those he is currently putting up.

Which numbers do I trust? What is the best predictor of second-half performance? To answer, I went back and pulled first- and second-half splits for pitchers from 2010-2014, and kept only those who had the qualifying innings pitched in both halves, leaving 349 pitcher seasons. This methodology was inspired in large part by Jeff Sullivan’s research on team records. I found ERA, FIP, and xFIP for each half, and a Steamer projection for the entire season. Using this data, I checked which measure correlated most strongly with second-half ERA. The results are below:

  • 1st Half ERA, 2nd Half ERA: .212
  • 1st Half FIP, 2nd Half ERA: .254
  • 1st Half xFIP, 2nd Half ERA: .307
  • Projected ERA, 2nd Half ERA: .315

This is about what I expected. First and second half ERA had a correlation of .21. As we know, no matter how variable ERA can be, an entire half of ERA can still tell us something about future performance, but it is by no means the best.

FIP had a correlation of .25, while xFIP had one of .30. FIP was always thought of as a retrospective statistic, which is why it is used in the calculation of WAR for pitchers, while xFIP is better for predictions. Both of these statistics perform better than first-half ERA, which is a good sanity check for advanced metrics in general: they had better outperform basic statistics.

The preseason projection, denoted by ERAp, performs the best, with a correlation of .31. The fact that three years of prior data is still better than half a year of present data shouldn’t be surprising, but it sort of is. I went into this exercise thinking xFIP would be the best predictor of the second half, but the preseason projections perform better. This result suggests that in-season improvements in K% and BB% should be taken with a grain of salt and regressed.
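For anyone who wants to run the same kind of comparison, here is a minimal sketch of how correlations like the four above could be computed. It assumes one record per pitcher season, and the field names (`era1`, `fip1`, `xfip1`, `era_proj`, `era2`) are hypothetical. The numbers are made up purely to show the mechanics; they are not the 349 pitcher seasons used in this article.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up pitcher seasons for illustration only (NOT the article's data).
# Field names are hypothetical: first-half ERA/FIP/xFIP, preseason
# projected ERA, and second-half ERA.
rows = [
    {"era1": 3.10, "fip1": 3.40, "xfip1": 3.60, "era_proj": 3.70, "era2": 3.80},
    {"era1": 4.50, "fip1": 4.10, "xfip1": 4.40, "era_proj": 4.10, "era2": 4.00},
    {"era1": 2.80, "fip1": 3.20, "xfip1": 3.00, "era_proj": 3.20, "era2": 3.30},
    {"era1": 5.10, "fip1": 4.70, "xfip1": 4.60, "era_proj": 4.30, "era2": 4.90},
    {"era1": 3.90, "fip1": 3.50, "xfip1": 3.80, "era_proj": 3.60, "era2": 3.20},
    {"era1": 3.30, "fip1": 3.90, "xfip1": 3.40, "era_proj": 3.90, "era2": 3.70},
]

predictors = ["era1", "fip1", "xfip1", "era_proj"]
era2 = [row["era2"] for row in rows]

# Correlate each candidate predictor with second-half ERA
correlations = {
    name: pearson_r([row[name] for row in rows], era2) for name in predictors
}

for name, r in correlations.items():
    print(f"{name} vs era2: r = {r:.3f}")
```

With real data, the only change would be loading the first- and second-half splits into `rows` after filtering to pitchers who qualified in both halves.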

We would expect that some combination of the preseason projection and the updated numbers would perform really well. Fortunately, Steamer is constantly updating its projections and releases Rest-of-Season numbers daily. Unfortunately, accessing ROS projections from the All-Star break since 2010 is beyond my coding know-how, so those numbers are unavailable.

We can estimate what those updated ROS projections might look like with a linear regression model. Regressing ERA2 (second-half ERA) on xFIP1 (first-half xFIP) and ERAp provides the best correlation, .35 (this is the square root of the adjusted R-squared number the model spits out).
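A rough sketch of that kind of two-predictor regression is below. It is an assumption-laden illustration, not the exact model behind the .35 figure: the function name `fit_two_predictor_ols` is hypothetical, the fit is ordinary least squares written out by hand, and the inputs are made-up numbers rather than the article's data.

```python
from math import sqrt


def fit_two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 with basic fit statistics."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    # Solve the 2x2 normal equations for the slopes, then the intercept
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    ss_res = sum((c - (b0 + b1 * a + b2 * b)) ** 2 for a, b, c in zip(x1, x2, y))
    ss_tot = sum((c - my) ** 2 for c in y)
    r2 = 1 - ss_res / ss_tot
    k = 2  # number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    # Adjusted R-squared can dip below zero, so clamp before the square root
    r_from_adj = sqrt(max(adj_r2, 0.0))
    return {"b0": b0, "b1": b1, "b2": b2,
            "r2": r2, "adj_r2": adj_r2, "r_from_adj": r_from_adj}


# Made-up inputs for illustration only (NOT the article's data)
xfip1 = [3.6, 4.4, 3.0, 4.6, 3.8, 3.4, 4.0, 3.7]      # first-half xFIP
era_proj = [3.7, 4.1, 3.2, 4.3, 3.6, 3.9, 4.2, 3.5]   # preseason projected ERA
era2 = [3.8, 4.0, 3.3, 4.9, 3.2, 3.7, 4.4, 3.9]       # second-half ERA

result = fit_two_predictor_ols(xfip1, era_proj, era2)
print(f"ERA2 ~ {result['b0']:.2f} + {result['b1']:.2f}*xFIP1 + {result['b2']:.2f}*ERAp")
print(f"adjusted R^2 = {result['adj_r2']:.3f}, sqrt = {result['r_from_adj']:.3f}")
```

The square root of adjusted R-squared sits on the same scale as the single-stat correlations, which is what makes it comparable to them. Going the other way, squaring a correlation gives the share of variance explained, so a value of .35 corresponds to an adjusted R-squared of about .12.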

It’s amazing just how little we can predict. Our best guess only can account for about 30% of the variation in second half ERA. That’s nothing. This stuff is still really hard to predict. Half a season of data just isn’t enough to go off of. But these are just the public stats. I always wonder what kind of numbers front offices use, and how much better (if at all) they perform. From a fantasy perspective, if you use this methodology enough, you should end up better off than the alternative. When it comes down to it, the updated rest-of-season projections should be better than just a single-season xFIP number.





Beau played baseball at Williams College and is currently an MBA/MS Sport Management student at UMass Amherst.

13 Comments
evo34
9 years ago

Nice article. The first step, if you want to improve in-season forecast accuracy, is to adjust xFIP properly. E.g., if a Rockies pitcher and a Mariners pitcher have identical xFIP at the halfway point (and that’s all you know about them), you’d bet heavily that the Mariners pitcher will have the better ERA rest of season. This is because xFIP excludes a decent portion of park effects. [Check xFIP vs. ERA for Rockies team pitching last 10 years if you don’t believe me]. It also excludes a large portion of defense, although that is a much more subtle adjustment to make.

Bottom line: unadjusted xFIP should never be considered an optimal candidate to project RoS ERA.

Mike Podhorzer
9 years ago

Why did you exclude SIERA? It’s better than both FIP and xFIP.

Thinking the same thing
9 years ago
Reply to  Mike Podhorzer

I was thinking the same thing. Maybe it’s harder to obtain than the other stats?

NESCAC
9 years ago

Great article, Beau. Really well written.

rconti35
9 years ago

So this might not be the place for this, but I just joined FanGraphs… can someone please explain how this works?

BenH
9 years ago
Reply to  rconti35

An English teacher of mine once taught her class to never use the word ‘this’ without a noun proceeding it. It makes sentences clearer and it also just makes them sound better.

It is difficult to reply to your post because it is not clear what you want explained. It sounds like you want the whole website Fangraphs explained.

Matt R
9 years ago
Reply to  BenH

What did she have to say about splitting an infinitive, use of single quotes, and the correct spelling of “preceding?”

BenH
9 years ago
Reply to  Matt R

Unless I am misinformed, it is a common misconception that splitting infinitives is wrong.

I like to use single quotes when I’m not actually quoting something, but rather talking about a particular word or phrase. It’s a stylistic choice, but I hardly think it would cause any confusion. In fact, the reason I like it is that I feel like it probably leads to less confusion. But I suppose I could be off base according to most grammarians. If so, I hope you will tell me. I can tell by your use of the question mark inside the quotation marks you know a thing or two.

I definitely meant to use the word ‘proceeding’ because that is the opposite of ‘preceding’ and that is exactly what I meant to say. ‘This’ should be followed by a noun for the sake of clarity, e.g. “this comment,” “this house,” or “this website,” otherwise it might have an unclear antecedent.

I think foremost in any communication should be clarity. After that can come grammar and spelling. Apologies if I came off as rude in my original comment. I did not think of a better way to offer my constructive criticism.

rconti35
9 years ago
Reply to  BenH

My apologies, let me rephrase my post.

This article might not be the appropriate place to ask this question, but I just joined FanGraphs. Can someone please explain how I can learn more about Sabermetrics and/or contribute to the Sabermetric community?

Jonah Pemstein
9 years ago

“Our best guess only can account for about 30% of the variation in second half ERA” — is this correct? What are you referring to with the numbers you use for correlation, r or r^2? You say “this is the square root of the adjusted R-squared number the model spits out”, which to me implies that you are referring to just r. In that case, you have to square it to get the proportion of variance explained by the explanatory variables. So the best guess really accounts for only 12.25% of the variance.

Other than that, good article. Interesting stuff.