Introducing xxxFIP

ERA, FIP, xFIP, and beyond…

There is a wide range of pitching stats available to the discerning baseball fan. From Wins and ERA to DRA- and xBACON, there’s something for all tastes. In this post I’ll introduce a new stat, xxxFIP, which is definitely NSFW (Not Safe For Wise decision-making).

Before diving into the details of xxxFIP, let’s discuss its predecessors and what they are trying to measure.

Earned Run Average is the grandfather of pitching statistics. It has existed since the early 1900s and is one of the most widely accepted measures of pitcher quality. It is a relatively simple statistic: the number of earned runs given up per nine innings pitched. An earned run is a run the pitcher gives up that wasn’t the result of a defensive error. A low ERA is associated with good pitchers. Put simply, giving up fewer runs helps you win games.

There are plenty of caveats when using ERA to predict future performance. ERA can heavily depend on batted ball luck, performance with runners on base, and the quality of the defense behind the pitcher. In small samples, one bad outing can drastically skew a pitcher’s ERA, which makes it an unreliable statistic for measuring a pitcher’s underlying ability to prevent runs over a season.
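To make the small-sample problem concrete, here is a minimal sketch of the ERA calculation. The pitcher and his numbers are made up purely for illustration.

```python
def era(earned_runs: float, innings_pitched: float) -> float:
    """Earned runs allowed per nine innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * earned_runs / innings_pitched


# Hypothetical pitcher: 3 earned runs over his first 20 innings...
before = era(3, 20)
# ...followed by one bad outing of 7 earned runs in 2 innings.
after = era(3 + 7, 20 + 2)
print(f"{before:.2f} -> {after:.2f}")  # prints "1.35 -> 4.09"
```

A single rough outing triples this pitcher’s ERA, even though very little about his underlying ability is likely to have changed.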

This leads us towards Fielding Independent Pitching. In the early 2000s it was realized that pitchers do not have as much control over whether balls in play fall for hits or not as we thought. FIP removes outcomes based on balls in play and judges pitchers on the events they have the most control of: strikeouts, walks, hit by pitches, and home runs. These events are added in a weighted sum to produce a scale similar to ERA. Removing unpredictable batted ball outcomes produces a metric which can predict future ERA better than ERA itself!
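As a rough sketch, the weighted sum can be written as below. The weights of 13, 3, and 2 are the commonly published FIP weights. The additive constant is league- and season-specific (it is chosen so that league FIP matches league ERA), so the default of 3.10 here is only a placeholder assumption, and the season line is invented.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching on an ERA-like scale.

    `constant` varies by league and season; 3.10 is a placeholder.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line: 20 HR, 50 BB, 5 HBP, 200 K in 180 IP.
season_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=180)
print(f"{season_fip:.2f}")  # prints "3.24"
```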

Home runs are an important, but volatile, component of FIP. A statistic which attempts to address this is xFIP. Instead of using home runs in the FIP calculation, this statistic replaces them with the expected home run rate based on the number of flyballs hit against the pitcher along with the league-wide home run per flyball rate. The idea behind this is that pitchers have more control over their flyball rate than their home run rate, so using a league-average home run per flyball rate will reduce the variance in home run rate which is mainly caused by the hitter. For a full breakdown, see this article.
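The substitution can be sketched as follows. The league home run per flyball rate (0.10) and the additive constant (3.10) are placeholder assumptions, since both change from season to season, and the season line is invented.

```python
def xfip(fb, bb, hbp, k, ip, lg_hr_per_fb=0.10, constant=3.10):
    """xFIP: FIP with actual home runs replaced by expected home runs.

    `lg_hr_per_fb` and `constant` are placeholders, not official values.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    expected_hr = fb * lg_hr_per_fb  # flyballs times league HR/FB rate
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line: 200 flyballs, 50 BB, 5 HBP, 200 K in 180 IP.
# The pitcher's actual home run total never enters the calculation.
season_xfip = xfip(fb=200, bb=50, hbp=5, k=200, ip=180)
print(f"{season_xfip:.2f}")
```

Two pitchers with this same line would get the same xFIP whether they had allowed 12 home runs or 28.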

xxxFIP

Now we enter uncharted territory. There have been attempts to produce better estimates of home run rate through Statcast data and by breaking down flyballs in the infield vs. the outfield, but I intend to add estimates of strikeouts and walks as well.

Hence the titular statistic: xxxFIP. The three x’s stand for xK, xBB, and xHR, each of which is generated by my machine-learning based pitch-prediction model, PitchingBot. I’ll go through how these predictions are made in the next section.

After creating xxxFIP, I did search online for “xxxFIP”, just to make sure I wasn’t accidentally plagiarizing anyone. I found four results, three of which were jokes on various social media platforms, and one was an archived FanGraphs chat from 2014. In it, Neil Weinberg makes an offhand comment:

Fun fact, I built something called xxxFIP but haven’t tested it’s predictive nature. It uses those xK, xBB, etc stats.

Damn! Not only did Neil get there first, but he chose the same pun name as me!

There’s no other subsequent reference to xxxFIP, so I’m keeping the name for my own version as it’s too good to pass up on, but hat tip to Neil.

The Three X’s

PitchingBot is a machine learning model which attempts to predict pitch outcomes by using pitch characteristics and contextual information. The inputs used by PitchingBot are:

  • Pitch Type
  • Pitch location as it crosses the plate
  • Vertical and horizontal movement
  • Velocity
  • Spin rate
  • Pitcher arm slot (release point x and z)
  • Pitcher handedness
  • Batter handedness
  • Count (balls and strikes)

I’ve tried several different prediction models, attempting to predict run values or specific events, and also selectively excluding some inputs. For xxxFIP, I’m using the model which uses all the input data and tries to predict the probability of the following events on each pitch:

  • Swing
  • Swinging strike
  • Called strike
  • Ball
  • Foul ball
  • Ball in play
  • Contact
  • Groundball
  • Line drive
  • Flyball

To find the predicted number of strikeouts, I simply add up the probabilities of called and swinging strikes for all pitches thrown in two-strike counts. The same is done for walks, summing the probability of a ball being called for every pitch thrown in a three-ball count.

To get the predicted rates of hit by pitches and flyballs, the probabilities of these events are added together over all the pitches thrown by a pitcher; there is no need to filter by count.

With these predicted statistics, I can plug them into the formula for xFIP and produce the new statistic, xxxFIP. The denominator in the xFIP calculation is usually innings pitched, but here I have used the number of batters faced divided by 4.3 (the average number of batters faced in an inning). This allows for more reliable calculation of the statistic, as an inning pitched is of variable length depending on how many runners reach base. The correlations between xxxFIP and other statistics do not change much when switching between batters faced and innings pitched in the denominator.
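The steps above can be sketched in a few lines. This is not PitchingBot’s actual code: the `Pitch` fields, the function names, the league home run per flyball rate, and the additive constant are all hypothetical stand-ins, and the three pitches are invented. Only the count filters and the 4.3 batters-per-inning conversion come from the description above.

```python
from dataclasses import dataclass

BATTERS_PER_INNING = 4.3  # average batters faced per inning, as stated above


@dataclass
class Pitch:
    balls: int                # balls in the count before the pitch
    strikes: int              # strikes in the count before the pitch
    p_called_strike: float    # model probabilities (hypothetical field names)
    p_swinging_strike: float
    p_ball: float
    p_hbp: float
    p_flyball: float


def expected_counts(pitches):
    """Sum per-pitch probabilities into expected K, BB, HBP and FB totals."""
    return {
        # Strikeouts: called or swinging strikes in two-strike counts only.
        "xK": sum(p.p_called_strike + p.p_swinging_strike
                  for p in pitches if p.strikes == 2),
        # Walks: called balls in three-ball counts only.
        "xBB": sum(p.p_ball for p in pitches if p.balls == 3),
        # Hit by pitches and flyballs: every pitch, regardless of count.
        "xHBP": sum(p.p_hbp for p in pitches),
        "xFB": sum(p.p_flyball for p in pitches),
    }


def xxxfip(pitches, batters_faced, lg_hr_per_fb=0.10, constant=3.10):
    """xFIP formula fed with expected counts; rate and constant are placeholders."""
    if batters_faced <= 0:
        raise ValueError("batters_faced must be positive")
    c = expected_counts(pitches)
    x_hr = c["xFB"] * lg_hr_per_fb
    innings_equivalent = batters_faced / BATTERS_PER_INNING
    numerator = 13 * x_hr + 3 * (c["xBB"] + c["xHBP"]) - 2 * c["xK"]
    return numerator / innings_equivalent + constant


# Three made-up pitches, far too few for a meaningful value.
pitches = [
    Pitch(0, 2, 0.10, 0.20, 0.30, 0.00, 0.10),
    Pitch(3, 1, 0.30, 0.10, 0.25, 0.00, 0.10),
    Pitch(3, 2, 0.10, 0.15, 0.20, 0.01, 0.05),
]
counts = expected_counts(pitches)
value = xxxfip(pitches, batters_faced=2)
```

Because a pitch in a 3-2 count passes both filters, it contributes to the expected strikeout and expected walk totals at the same time.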

Note that none of this requires any knowledge about what happens to the pitches after they have been thrown. The same pitch could be thrown to Mike Trout or Jeff Mathis and it would have the same xxxFIP value. This means that xxxFIP is an attempt to measure batter independent pitching, which comes with benefits and drawbacks.

Pros

A major benefit to using a batter independent metric is that it automatically adjusts for opponent quality, as a pitcher who faces a disproportionate number of home run sluggers will not have his xxxFIP penalized to the same extent as his FIP. This is useful as it allows us to be more confident in the reliability of the xxxFIP over small sample sizes.

Another benefit is that a team does not need to view their pitcher in a real game situation against major league batters to measure xxxFIP. A bullpen session that simulates changes in count based on the pitches thrown could in theory do the job just as well. PitchingBot produces probabilities which would allow this to be performed with good accuracy, as shown in the figures below:

PitchingBot’s predictions of strike rate (including swings) are plotted on the x axis and the error in the predicted strike rate is plotted on the y axis. The size of each point is representative of the number of pitches with that predicted strike rate. The predictions are split by the count. PitchingBot’s predictions of strike rate are accurate to within a 1% margin of error for the majority of situations.

Same as above, but this time for predicted ball in play (BIP) rate. PitchingBot systematically underestimates BIP rate by around 5% for pitches likely to result in a ball in play. These occasions are rare, as the sizes of the data points with large errors are small.

Cons

There are drawbacks to xxxFIP which I should mention before moving on. The inputs that PitchingBot uses do not span the full range of factors which affect pitcher quality. Sequencing of pitches, deceptive deliveries, tunnelling effects, spin mirroring, and more can make a pitcher’s arsenal greater than the sum of its parts. At this stage, PitchingBot can only measure the parts.

In addition, xxxFIP is still somewhat subject to random variation. xK and xBB depend on the counts which a pitcher finds himself in, and therefore xxxFIP is not completely immune from the results of pitches and the idiosyncratic swing decisions made by batters.

Evaluating xxxFIP

We can compare the rates of xK, xBB and xFB to the actual rates and see if there are any large discrepancies. Only pitchers who faced at least 400 batters in a season were used for the comparison.

In each case the expected rates have a reasonable correlation with the actual rates. However, there is an offset: the actual rates are higher than the predicted rates.

Strikeout rates are around 4% higher than expected, walk rates are 12% higher than expected, and flyball rates are 22% higher than expected. I assumed there would be some offset between the predicted rates and the actual rates, as PitchingBot can only make predictions on pitches with complete tracking data. Pitches with incomplete data have to be thrown away, which slightly reduces the expected rates of Ks, BBs, and FBs.

However, I did not expect this difference to vary between Ks, BBs, and FBs. Perhaps pitches which are hit into play are more likely to have incomplete tracking data, and balls in the dirt could have the same problem. Alternatively, PitchingBot might be making poorer predictions on some events compared to others.

Correlations and Use as an ERA Predictor

A useful test is to see how xxxFIP correlates to xFIP, FIP, and ERA. This will tell us whether we should consider testing it as an ERA predictor or throw it straight in the garbage. The following table shows the R^2 measure between xxxFIP and the other stats for pitcher seasons from 2015-2020 with a varying cutoff for minimum batters faced. R^2 can vary between 0 and 1; 0 means that xxxFIP contains no information about the other statistic, while 1 means that xxxFIP is perfectly correlated with the statistic.

Testing R^2 of xxxFIP
Minimum batters faced | xxxFIP-xFIP R^2 | xxxFIP-FIP R^2 | xxxFIP-ERA R^2
10 | 0.49 | 0.27 | 0.10
50 | 0.52 | 0.30 | 0.17
100 | 0.55 | 0.37 | 0.22
400 | 0.65 | 0.50 | 0.25
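For readers who want to reproduce this kind of comparison, here is a minimal sketch of an R^2 calculation with a minimum-batters-faced cutoff. It assumes R^2 means the squared Pearson correlation between two stats; the function names and the pitcher seasons in the example are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one stat never varies")
    return sxy ** 2 / (sxx * syy)


def r_squared_with_cutoff(seasons, min_batters_faced):
    """seasons: iterable of (batters_faced, stat_a, stat_b) tuples."""
    kept = [s for s in seasons if s[0] >= min_batters_faced]
    return r_squared([s[1] for s in kept], [s[2] for s in kept])


# Made-up pitcher seasons: (batters faced, xxxFIP, ERA).
seasons = [
    (25, 3.1, 9.00),
    (120, 3.6, 3.20),
    (450, 3.4, 3.10),
    (600, 4.2, 4.50),
    (700, 4.6, 4.40),
]
print(r_squared_with_cutoff(seasons, min_batters_faced=100))
```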

We can compare this to how xFIP correlates to FIP and ERA:

xFIP to FIP and ERA
Minimum batters faced | xFIP-FIP R^2 | xFIP-ERA R^2
10 | 0.55 | 0.40
50 | 0.60 | 0.35
100 | 0.64 | 0.36
400 | 0.75 | 0.44

Clearly xxxFIP does not correlate as well as xFIP with ERA, as it’s easier to get a good correlation based on what did happen rather than what may have happened. However, there is some correlation, even after a very small number of batters faced.

The next test is to see how xxxFIP, xFIP, FIP, and ERA correlate with the pitcher’s ERA in the following year. The idea behind this is that expected stats stabilize more quickly and should therefore have some predictive power for a statistic dominated by noise, such as ERA.

Next Year ERA R^2
Statistic | Min. 10 batters | Min. 100 batters | Min. 400 batters
xxxFIP | 0.01 | 0.08 | 0.13
xFIP | 0.02 | 0.09 | 0.21
FIP | 0.02 | 0.06 | 0.15
ERA | 0.01 | 0.03 | 0.10

Almost nothing can be said about next year’s ERA in small sample sizes. For a minimum of 100 batters faced, xxxFIP does a better job of predicting next year’s ERA than FIP, but xFIP is better still. Finally, for large samples of at least 400 batters faced, xxxFIP is better than ERA but falls short of FIP and xFIP.

The following table shows the R^2 values when comparing each statistic with itself from one year to the next for each pitcher.

Year-on-Year R^2
Statistic | Min. 10 batters | Min. 100 batters | Min. 400 batters
xxxFIP | 0.17 | 0.32 | 0.45
xFIP | 0.08 | 0.22 | 0.42
FIP | 0.04 | 0.11 | 0.23
ERA | 0.01 | 0.03 | 0.10

xxxFIP shows the greatest correlation year-by-year, especially in small samples. This suggests that it is a more stable measure of pitcher quality than the other metrics.

Reliability

The reliability of a statistic is a useful and often overlooked concept. It has been discussed extensively in a number of articles on FanGraphs. The reliability of a statistic over a number of plate appearances can tell us how much its value is affected by the player’s true talent level vs. noise. I would highly recommend reading the linked articles for a better understanding of reliability and sample size for different statistics.

Reliability goes up with a larger sample size, but how quickly it does so can vary significantly depending on the statistic being measured. A pitcher’s fastball velocity or arm slot is very reliable; after observing only a few plate appearances you know almost everything there is to know about these metrics for a pitcher. On the other hand, a pitcher’s BABIP allowed is very unreliable, as even after a full season’s worth of pitches a player’s BABIP can vary significantly from their true talent level.

Reliability for a statistic is useful for making predictions. The more reliable it is, the less we have to regress to the mean when projecting changes in the statistic in the future. In addition, a statistic which is reliable over small samples is more informative in situations where only small samples are available. It would be ludicrous if a scout were to judge a pitcher based on his BABIP after watching one outing.
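A common way to use reliability in a projection is to shrink the observed value towards the league mean in proportion to how unreliable it is. The sketch below uses that textbook shrinkage formula; it is an illustration rather than anything specific to xxxFIP, and the numbers are invented.

```python
def regress_to_mean(observed, league_mean, reliability):
    """Shrink an observed stat towards the league mean.

    reliability = 1 keeps the observed value untouched;
    reliability = 0 returns the league mean.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return league_mean + reliability * (observed - league_mean)


# Hypothetical pitcher with an observed 2.50 in a league averaging 4.00.
unreliable_stat = regress_to_mean(2.50, 4.00, reliability=0.2)
reliable_stat = regress_to_mean(2.50, 4.00, reliability=0.8)
print(f"{unreliable_stat:.2f} {reliable_stat:.2f}")  # prints "3.70 2.80"
```

The same observed value is pulled most of the way back to average when the stat is unreliable, and mostly left alone when it is reliable.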

To test xxxFIP’s reliability, I shall be using Cronbach’s Alpha. This is explained in detail in this article (also linked above). Without going into much detail, higher alpha means greater reliability. The values of alpha for ERA, FIP, and xFIP were taken from this article.
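For reference, the standard formula for Cronbach’s Alpha can be sketched as below. The data layout is an assumption on my part: each row is one pitcher and each column is one "item" (for example, the value contributed by his first, second, third batter faced, and so on). The linked articles describe how the samples are actually constructed.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows[i][j] = value of item j for pitcher i."""
    if len(rows) < 2:
        raise ValueError("need at least two pitchers")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("every pitcher needs the same number (>= 2) of items")
    # Sample variance of each item across pitchers.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each pitcher's total across all items.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all totals are identical")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Perfectly consistent pitchers: every item tells the same story.
consistent = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
# Items that carry no shared signal about the pitcher.
unrelated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
alpha_consistent = cronbach_alpha(consistent)
alpha_unrelated = cronbach_alpha(unrelated)
```

In these toy cases alpha comes out at 1 when every item agrees about the pitcher and at 0 when the items are unrelated.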

Using alpha as a measure of reliability, xxxFIP is incredibly reliable. After around 40 batters faced, we know more about a pitcher’s true talent xxxFIP than we would know about their true talent ERA over a full season. For the same confidence level in xFIP as xxxFIP, the sample of plate appearances would have to double; for FIP, it would quadruple.

For comparison with other metrics, the reliability of xxxFIP is around the same as K%.

xxxFIP Leaders

Since we are early in the 2021 season, this is the perfect time to start looking at high reliability metrics such as xxxFIP.

At the time of writing (04/08/21) there are 122 pitchers with at least 20 batters faced, and those with the sexiest xxxFIPs are:

xxxFIP Leaders on April 8
Player Name | xxxFIP
Zack Wheeler | 1.74
Tyler Glasnow | 2.43
Corbin Burnes | 2.44
Joe Musgrove | 2.62
Alex Cobb | 2.78

And those with some of the least attractive xxxFIPs include:

xxxFIP Trailers on April 8
Player Name | xxxFIP
Daniel Ponce de Leon | 5.34
Shohei Ohtani | 5.30
Chad Kuhl | 5.13
Carlos Rodón | 4.95
Jorge López | 4.80

I’ve put the current xxxFIP leaderboards online here and will keep them updated throughout the season.

Summary

Using predicted pitch outcomes, I’ve created a metric, xxxFIP, which attempts to predict ERA by isolating the quality of the pitches that a pitcher throws along with the count that they are thrown in. This metric is more reliable than xFIP and FIP and has the potential to be calculated without needing pitches to be thrown against real batters.

There are limitations to xxxFIP. Firstly, it only measures individual pitch quality, which ignores other important factors such as pitch sequencing and the relationship between the pitches in a pitcher’s arsenal. xxxFIP has lower accuracy than FIP and xFIP when predicting ERA on full season sample sizes. Additionally, xxxFIP is produced by a machine learning model which requires a vast quantity of detailed pitch tracking data, meaning that the predictions can lack explainability and there is limited scope to apply xxxFIP in leagues beyond MLB.

Considering that this is a stat which I created purely because I thought the name would be funny, it turned out to be surprisingly successful, and I know I’ll be following the xxxFIP leaderboards closely this season.

This post is adapted from my blog, which can be found here.





11 Comments
Bubba (member)
2 years ago

This is great work!

Myfanwy
2 years ago

Can we see xxxFIP here or do we need to subscribe to your OnlyFanGraphs?

pedeysRSox
2 years ago

How much would removing intentional walks affect those values?

zanderstroud
2 years ago

Is there any issue with tabulating expected K/BB/BIP on a per-pitch basis instead of a per-PA basis? I imagine for something like Belt’s 21-pitch at bat, if you added up the expected outcomes pitch-by-pitch, they would sum up to much more than 1, despite Barria only facing one batter over the course of those 21 pitches.

Overall though, great stuff! Very enjoyable and informative read.

richwp01 (member)
2 years ago

Is there a reason SIERA was not also compared? I like that xxxFIP is already effective at telling us talent after 40 batters faced, even if xFIP is better over a full season’s worth of data.

Mean Mr. Mustard
2 years ago

My understanding of what you’re doing may be inaccurate, but this would seem to have interesting implications for monitoring pitchers rehabbing injuries. Once you have a baseline for a pitcher at “full health”, couldn’t this be used as a progress outline for ramping up from, say, a forearm strain?

Additionally, it seems like it could potentially be used in conjunction with motion sensors or high-speed cameras to help a pitcher find his theoretical best version of a particular pitch.

Perhaps you or someone else in the commentariat could speak as to these ideas?

Mean Mr. Mustard
2 years ago
Reply to Cameron Grove

Hey, thanks for the quick response.

My thinking as far as rehab progress wasn’t necessarily with an eye towards whether a pitcher has their velocity or movement back, but whether this might point to something like, “Pitcher X’s slider this week was right back within his norms but something’s off with the change – let’s go look at the video.”

Having said that, perhaps a better use along the lines of finding a pitcher’s best version of their pitches is in problem diagnosis – if someone’s fastball is a little flat or the slider isn’t breaking right compared to their baseline, this is in some ways a doctor’s chart of their abilities.