## An Attempt at Modeling Pitcher xBABIP Allowed

Despite an influx of information resulting from the advent of Baseball Info Solutions' batted-ball data and the world's introduction to Statcast, surprisingly little is known about pitchers' control over the contact quality that they allow. Public consensus seems to settle on "some," yet in a field so hungry for quantitative measures, our inability to come to a concrete conclusion is maddeningly unsatisfying. In the nearly 20 years since Voros McCracken first proposed the idea that pitchers have no control over the results of batted balls, a tug-of-war has ensued between those who support Defensive Independent Pitching Statistics (DIPS) and those who staunchly argue that contact quality is a skill, one better measured by ERA. Although it seems as if the former camp may prevail, the latter has been resurgent in recent years, as some pitchers have consistently outperformed their DIPS estimates, hinting at the possibility of an under-appreciated skill.

It is also widely assumed that a hitter's BABIP will fluctuate randomly during the season, and that changes in this measure often help to explain a prolonged slump or a hot streak at the plate. Hitters' BABIPs can also vary drastically from year to year, making it difficult to gauge their true-talent levels. Research in this field has been done, however, and there have been numerous attempts to develop a predictive model for this statistic, one that projects how a player should have performed; more succinctly, his expected BABIP, or xBABIP. Inspired by the progress, and admittedly limited success, of these models, I embarked upon a similar project, focusing instead on the BABIP allowed by pitchers rather than that produced by batters. What began as a rather cursory look at exit velocity evolved into a much deeper investigation, and with this expansion of scope, I achieved some success, though not as much as I had hoped.

My research began with a perusal of Statcast data, which I visualized using scatter plots in R to examine each statistic's relationship to BABIP. Most of the plots looked something like this:

In the majority of plots, it seemed as if there may have been some signal, but there was quite a bit of noise, making it difficult to detect anything of significance. This perhaps explains the lack of progress in projecting BABIP: after looking at these plots, it is, quite simply, difficult to do. Despite these obvious challenges, I remained hopeful that I could develop something worthwhile with enough data. Therefore, I began aggregating information, collecting individual pitcher-seasons from FanGraphs, Baseball Savant, Brooks Baseball, and ESPN, then manipulating and storing the data in a workable format using SQL. Since Statcast data only became available to the public in 2015, my sample size is unfortunately a bit limited. I also wanted to incorporate the defense that pitchers had behind them, along with park factors, when creating my model, so I removed all pitchers who had changed teams mid-season from my records. This left me with a grand total of 641 pitcher-seasons (323 from 2015, 318 from 2016), with 188 pitchers appearing in both years. For the remainder of my study, I used the 641 pitcher-seasons to develop the model, but when checking its year-to-year stability and predictive value, I could only use the 188 common data points.
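The team-change filter can be sketched as follows; this is a minimal pandas illustration with hypothetical column names (`pitcher_id`, `season`, `team`) and made-up rows, not the actual SQL I used:

```python
import pandas as pd

# Hypothetical pitcher-season records; a pitcher who changed teams
# mid-season appears under more than one team in the same year.
rows = pd.DataFrame({
    "pitcher_id": [1, 1, 2, 2, 2],
    "season":     [2015, 2016, 2015, 2016, 2016],
    "team":       ["NYY", "NYY", "BOS", "BOS", "TOR"],
})

# Keep only pitcher-seasons spent entirely with one team.
teams_per_season = rows.groupby(["pitcher_id", "season"])["team"].nunique()
single_team = (teams_per_season[teams_per_season == 1]
               .reset_index()[["pitcher_id", "season"]])
kept = rows.merge(single_team, on=["pitcher_id", "season"])
```

Here, pitcher 2's 2016 rows are dropped because he is listed with two teams that year, while his 2015 season survives.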

To begin, I fed 29 variables into R: K/9, BB/9, GB%, average exit velocity, average FB/LD exit velocity, average GB exit velocity, the pitcher's team's UZR, the pitcher's home park's park factor, his Pull/Cent/Oppo and Soft/Med/Hard percentages, and an indicator variable for every PITCHf/x pitch classification. (Looking back on this, I wish I had included more data in my analysis to truly "throw the kitchen sink" at this problem, perhaps including pitch velocity, horizontal and vertical movement, and interaction terms to more accurately represent each individual's repertoire. Alas, I plan on keeping this in mind and possibly revisiting the topic, especially as more Statcast data becomes available.) This resulted in an initial model with an adjusted R-squared of about 0.3; I then ran a backwards stepwise regression with a cutoff p-value of 0.01 to determine which variables were most statistically significant. Here is the R output:
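The selection procedure itself can be sketched as follows; this is a minimal numpy/scipy version of backward elimination with a p-value cutoff, not the exact R routine I ran:

```python
import numpy as np
from scipy import stats

def backward_eliminate(X, y, names, alpha=0.01):
    """Repeatedly drop the least significant predictor until every
    remaining p-value clears the cutoff (the intercept is always kept)."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    names = ["(Intercept)"] + list(names)
    while X.shape[1] > 1:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        dof = len(y) - X.shape[1]
        sigma2 = resid @ resid / dof
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
        pvals = 2 * stats.t.sf(np.abs(beta / se), dof)  # two-sided t-test
        worst = 1 + np.argmax(pvals[1:])                # ignore the intercept
        if pvals[worst] <= alpha:
            break                                       # everything is significant
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names
```

On synthetic data where only one predictor truly drives the response, the noise variables fall out and the real one survives.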

For clarity, the formula: xBABIP = -0.157 + 0.005684 * BB/9 + 0.0009797 * GB% + 0.003142 * GB Exit Velocity - 0.0001483 * Team UZR + 0.005751 * LD%
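To make the formula concrete, here it is as a small function; the sample line (3.0 BB/9, 45% GB, 85 mph GB exit velocity, +5 team UZR, 21% LD) is an invented but representative input:

```python
def xbabip(bb9, gb_pct, gb_ev, team_uzr, ld_pct):
    """xBABIP from the regression coefficients above; percentage
    inputs are on the 0-100 scale, exit velocity in mph."""
    return (-0.157
            + 0.005684 * bb9
            + 0.0009797 * gb_pct
            + 0.003142 * gb_ev
            - 0.0001483 * team_uzr
            + 0.005751 * ld_pct)

print(round(xbabip(3.0, 45.0, 85.0, 5.0, 21.0), 3))  # → 0.291
```

Reassuringly, a typical input lands near the league-average BABIP of roughly .300.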

I again obtain an adjusted R-squared of about 0.3, and I don’t find any of these results to be overly surprising, but to be fair, I had little idea of what to expect. Before examining the accuracy of my entire model, I checked each variable’s individual relationship to BABIP, along with the year-to-year stability of each. These can be found below in pairs:

I was most perplexed by the statistical significance of BB/9, and even after completing my research, I still find no entirely compelling explanation for its inclusion. Typically, BB/9 is considered a measure of control rather than command, but intuitively, these skills seem to be linked; perhaps pitchers with better command and control are able to paint the edges more effectively, thus avoiding the barrel and preventing strong contact. I was disappointed that its relationship to BABIP appeared so weak, but because of its relative year-to-year stability, I hoped that it would retain some predictive power.

Previous research has indicated that ground-ball hitters are able to sustain higher-than-average BABIPs, and thus, GB%'s inclusion in my model should not come as a shock. Again, it would have been nice to see a stronger correlation between GB% and BABIP, but there is obviously quite a bit of noise. However, it does seem that generating ground balls is a repeatable skill, which lends itself nicely to the long-term predictive nature of an xBABIP model.

Again, as previous research has suggested, the inclusion of GB exit velocity is to be expected. However, its correlation with BABIP is not as high as I would have hoped; I suspect this may be a result of the unfair nature of ground balls. In a vacuum, one would expect that lower exit velocities are always better for the pitcher, yet a fortunately placed chopper may actually produce better results for the hitter than a well-struck ground ball hit right at a fielder, and thus, exit velocity's signal may be dampened. There does appear to be some year-to-year correlation, though, which offers some promise of an under-appreciated skill.

Here, I'm surprised by the lack of correlation between UZR and BABIP; I collected this data to control for the quality of the defense behind a pitcher, assuming that this could be a pretty significant factor, and although it did remain in my model, the relationship appears to be quite weak. We should expect a very low year-to-year correlation for UZR, as pitchers who changed teams in the offseason were included in my study, and even for those who remained on the same roster, teams' defensive makeups can change drastically from one season to the next. Thus, the latter graph is rather useless, but I chose to include it for consistency.

Unsurprisingly, LD% has the strongest relationship to BABIP, checking in with an R-squared of about 0.15. I obviously wish that there were a stronger correlation between the two, yet despite the noise, when looking at the data, I think it is fairly evident that there is a signal. And although I have read that LD% fluctuates wildly from year to year, I was shocked by the latter graph. It seems as if this is entirely random, and that this portion of a pitcher’s batted-ball profile can be simply chalked up to luck. This revelation is a bit discouraging, as it suggests that my model may struggle with predictive power, since its most significant variable is almost entirely unpredictable.

I anticipated that more variables would be statistically significant, and I was surprised by their absence from the final model. I assumed that Hard% would be highly correlated with BABIP, but it disappeared from my formula rather quickly. I also assumed that pitchers who generated a high true IFFB% would exhibit suppressed BABIPs, but nothing turned up in the data. And finally, I thought that K/9 might be significant; it can be considered a rough estimate of a pitcher's "stuff," and I speculated that pitchers with high K/9 probably throw pitches with more movement than usual, perhaps making them harder to square up, but my model found nothing.

After considering each of the significant variables individually, I wanted to examine the overall accuracy of my entire model. To do so, I plotted pitchers’ xBABIPs vs. their actual BABIPs, along with the difference:

As mentioned earlier, after incorporating all of the statistically significant variables in my model, I achieve an R-squared of about 0.3, a result that I find satisfying. I obviously wish that my model could have explained more of the variation in the data, and I suspect it could be improved, although I have no idea by how much. There is an inherent amount of luck involved in BABIP, and it is entirely plausible that pitching and defense can in fact account for only 30% of the observed variance, with the rest explained only by chance. Despite the lower-than-desired R-squared, I believe the model remains valid, if only for determining which pitchers over- or under-performed their peripherals; it says nothing about why they did so or whether they can be expected to do so again in the future. The lack of correlation in the difference plot indicates that pitchers have been unable to systematically over- or under-perform their xBABIP from year to year, and along with the residual plot, it suggests that my model is relatively unbiased and does not obviously miss any other variables that contribute to BABIP.
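The fit diagnostics above can be reproduced mechanically; a minimal sketch, assuming `actual` and `predicted` are arrays holding BABIP and xBABIP for the pitcher-seasons:

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: the share of variance in the
    actual values explained by the model's predictions."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

def mean_residual(actual, predicted):
    """An unbiased model should have residuals centered on zero."""
    return float(np.mean(np.asarray(actual) - np.asarray(predicted)))
```

A perfect model scores 1.0, while always predicting the mean scores 0.0, so an R-squared of 0.3 sits meaningfully, if modestly, above the naive baseline.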

After determining that my metric had some value in a retrospective sense, I set out to determine whether it had any predictive power. Because of the lack of year-to-year correlation for most of the statistically significant variables included in the model, I was quite pessimistic, although still hopeful. I first checked the year-to-year stability of both BABIP and xBABIP:

It seems that both measures are almost entirely random, although xBABIP is perhaps just a bit more stable from season to season. Despite this, comparing 2015 BABIP to 2016 xBABIP revealed that, as expected, my model holds little to no predictive power:

Again, although disappointing, this result was to be expected, as the most powerful variable in my model, LD%, fluctuates wildly. Despite this lack of predictive power, I stand by my model’s validity when considering past performance, and as more data accumulates, perhaps it can be adopted in a stronger predictive form.
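The stability checks above are just correlations over paired seasons; here is a minimal pandas sketch with hypothetical column names and made-up values, pairing each repeat pitcher's 2015 and 2016 marks:

```python
import pandas as pd

# Hypothetical per-season table; in practice this held the 188
# pitchers who appeared in both 2015 and 2016.
seasons = pd.DataFrame({
    "pitcher_id": [1, 1, 2, 2, 3, 3],
    "season":     [2015, 2016] * 3,
    "xbabip":     [0.295, 0.301, 0.310, 0.307, 0.288, 0.292],
})

# One row per pitcher, one column per season; drop anyone missing a year.
wide = seasons.pivot(index="pitcher_id", columns="season",
                     values="xbabip").dropna()
year_to_year_r = wide[2015].corr(wide[2016])  # Pearson r between paired seasons
```

Running the same pairing on BABIP and xBABIP in turn gives the two stability plots discussed above.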

Even after concluding that my metric has little predictive value, I thought it would be interesting to look at some of the biggest outliers. 2015's biggest under- and over-achievers (with their 2016 seasons included as well), along with 2016's luckiest and unluckiest pitchers, can be found below:

Although the model holds no predictive power under quantitative analysis, anecdotally, it appears to do a decent job. Each of the 10 pitchers featured as an over- or under-achiever in 2015 saw the absolute value of his difference fall in 2016 (although the sign did change in some cases). In no way am I suggesting that the model is predictive; I just find this to be an odd quirk. I also find it perplexing that George Kontos appears as an over-achiever in both years, and I can think of no explanation for this. Along with outperforming xBABIP, his ERA has also beaten FIP and xFIP in each of the last two seasons and five of the last six, suggesting a wonderful streak of luck, or perhaps hinting that the peripheral metrics are missing something.

Ultimately, although it would have been nice to draw stronger conclusions from my research, I am mostly satisfied with the results. In developing his own model for hitter BABIP, Alex Chamberlain achieved an R-squared of about 0.4 between BABIP and xBABIP, the highest I have found. However, his model included speed score, a seemingly crucial variable that I was unable to account for when analyzing pitchers' BABIPs. With this in mind, I find an R-squared of 0.3 for my model entirely reasonable, and despite its lack of predictive power, I consider it to be a worthy endeavor. As the sample size grows and more Statcast data is released, I plan to revisit my formula in coming offseasons, perhaps refining and improving it.