# Introducing Probabilistic Pitch Scores and xWhiff Metrics

With the advent of the Statcast era, a lot of research has been done in attempts to measure the effectiveness of a particular pitch based on its flight characteristics. As has been noted in the past, quantifying a pitcher’s stuff and command is no easy task. However, over the past few months I have worked to build my own models in an attempt to evaluate the “filth” of any given pitch, taking more of a probability-based approach. I introduce to you my Probabilistic Pitch Scores and xWhiff metrics.

When evaluating the quality of a particular pitch, I focused my interest on three different binary outcome variables: whether or not the batter swung at a pitch, whether or not the batter whiffed on a pitch, and whether or not a pitch was thrown for a strike. Thus, my goal was to train three different types of classification models corresponding to each of these variables: a swing, a miss, and a called strike. For the actual outcomes of these models, I was less interested in the model’s decision and more interested in the predicted probability. For example, if a batter swings on a pitch with given flight characteristics, what is the probability that he will whiff? These probabilities were utilized as the basis of my metrics.

### Data Preprocessing & Variable Selection:

To train the models, I used all available Statcast data since 2015. After some initial preprocessing, I partitioned the data for modeling based on pitcher handedness, batter handedness, and pitch group. Pitch groups combine pitch types with similar velocity/movement profiles, as follows (knuckleballs were ignored in this research):

**Fastball:** FF, FT, SI

**Slider/Cut:** SL, FC

**Curve:** CU, KC, EP

**Offspeed:** CH, FS, SC
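The grouping and partitioning described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the article: the field names `pitch_type`, `p_throws`, and `stand` are assumed to follow the Statcast column names, and the helper functions are hypothetical.

```python
# Pitch groups exactly as listed in the article.
PITCH_GROUPS = {
    "Fastball": ["FF", "FT", "SI"],
    "Slider/Cut": ["SL", "FC"],
    "Curve": ["CU", "KC", "EP"],
    "Offspeed": ["CH", "FS", "SC"],
}

# Invert the mapping so each pitch type points at its group.
PITCH_TYPE_TO_GROUP = {
    pitch_type: group
    for group, pitch_types in PITCH_GROUPS.items()
    for pitch_type in pitch_types
}


def partition_key(pitch):
    """Return (pitcher hand, batter hand, pitch group), or None if ignored."""
    group = PITCH_TYPE_TO_GROUP.get(pitch["pitch_type"])
    if group is None:  # pitch types outside the four groups, e.g. knuckleballs
        return None
    return (pitch["p_throws"], pitch["stand"], group)


def partition_pitches(pitches):
    """Split pitches into one list per (pitcher hand, batter hand, group)."""
    partitions = {}
    for pitch in pitches:
        key = partition_key(pitch)
        if key is not None:
            partitions.setdefault(key, []).append(pitch)
    return partitions
```

Each resulting partition would then get its own swing, whiff, and called strike model.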

*Offspeed Pitches:* Instead of using release speed, I used the difference between release speed and that pitcher’s average fastball velocity. The actual speed is not as important for an offspeed pitch, as the most effective aspect of its velocity comes from being significantly slower than the fastball.
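As a rough sketch of that velocity feature, the snippet below computes each pitcher's average fastball velocity and subtracts it from a pitch's release speed. The sign convention (negative means slower than the fastball), the averaging window (all pitches passed in), and the field names `pitcher` and `release_speed` are assumptions for illustration.

```python
from statistics import mean

FASTBALL_TYPES = {"FF", "FT", "SI"}


def average_fastball_velocity(pitches):
    """Mean release speed of each pitcher's fastballs, keyed by pitcher id."""
    speeds = {}
    for pitch in pitches:
        if pitch["pitch_type"] in FASTBALL_TYPES:
            speeds.setdefault(pitch["pitcher"], []).append(pitch["release_speed"])
    return {pitcher: mean(values) for pitcher, values in speeds.items()}


def velocity_differential(pitch, fastball_velocity):
    """Release speed minus the pitcher's average fastball velocity."""
    return pitch["release_speed"] - fastball_velocity[pitch["pitcher"]]
```

A changeup thrown at 86 mph by a pitcher who averages 96 mph on his fastball would get a differential of -10.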

*Swing Models:* I included a categorical variable for count in the swing models, because the count is a significant factor in determining a batter’s swing decision. For example, a fastball thrown right over the heart of the plate should have an extremely high swing probability, but if it is thrown in a 3-0 count, many batters may choose not to swing there. So the model decisions should be adjusted for those types of situations.

*Whiff models:* I wanted to be able to partially adjust for the batter here. The pitch characteristics are obviously important for inducing a whiff, but it’s also helpful to know whether the batter in question is very effective at making contact or not so much. With that in mind, I included seasonal batter whiff rate, regressed toward the mean depending on how many swings they had taken that year.
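The article does not spell out how the batter whiff rate was regressed, so the following is just one common way to do it: add a fixed number of "phantom" swings at the league-average rate. The function name and the `prior_swings` value of 100 are hypothetical choices, not the values used in the models.

```python
def regressed_whiff_rate(whiffs, swings, league_rate, prior_swings=100):
    """Shrink an observed whiff rate toward the league rate.

    Equivalent to adding `prior_swings` phantom swings at `league_rate`,
    so batters with few swings stay close to the league average.
    """
    return (whiffs + prior_swings * league_rate) / (swings + prior_swings)
```

With this form, a batter with no swings is assigned exactly the league rate, and the observed rate takes over as the swing count grows.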

### Model Training:

Here is the actual methodology behind the models. Swing models were trained on all pitches, whiff models were trained on pitches swung at, and called strike models were trained on pitches taken. The swing and whiff models were trained in the same way. With the partitioned data and each binary dependent variable, I trained Gradient Boosting Machines (GBM) using 5-fold adaptive cross-validation. Hyperparameters for the GBM models were tuned by randomly sampling from a custom grid of values. Finally, I decided to evaluate performance with the ROC score metric instead of accuracy. When dealing with class imbalance in machine learning, accuracy can often be misleading. For an exaggerated example, if a model predicts that 100% of swings don’t result in a whiff, and 90% of all swings are not whiffs in reality, you get a model that is 90% accurate but not one that is necessarily doing a good job. ROC score avoids this issue because it rewards a model for ranking whiffs above non-whiffs, balancing the true positive and true negative rates, rather than for simply predicting the majority class.

The called strike models were trained differently. Here, I trained a Generalized Additive Model (GAM), again with 5-fold adaptive cross-validation, with a smooth term over horizontal and vertical plate location. The smooth was allowed to vary by count, adjusting for the changing size of the strike zone depending on the ball-strike situation (constructed like BP’s called strike probability model). I used ROC score as the performance metric here as well.

Below are the results of each of the models. Each one varied slightly depending on pitch type and handedness, but they all produce similar results. The average ROC score for the swing GBMs was 0.85, while the score for the whiff GBMs was 0.78. I also included the test set ROC for each model, with the 2019 season serving as the testing data. Ultimately, the output of each prediction is represented as a probability and intends to describe the likelihood of one of these outcomes occurring on a particular pitch.

### xWhiff Rate:

The xWhiff metric is based solely on the whiff probability models, which are one component of the Probabilistic Pitch Score metric. Expected Whiff Rate is calculated by summing the whiff probabilities of every pitch swung at against a particular pitcher and dividing by the number of swings against.

*xWhiff Rate: sum(Whiff probability) / Swings*
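In code, the formula is a one-liner. The sketch below assumes the input is the list of predicted whiff probabilities for the pitches a pitcher had swung at; the function name is just for illustration.

```python
def xwhiff_rate(whiff_probabilities):
    """Expected whiff rate: summed whiff probabilities divided by swings."""
    swings = len(whiff_probabilities)
    if swings == 0:
        raise ValueError("xWhiff Rate is undefined with zero swings")
    return sum(whiff_probabilities) / swings
```

For three swings with predicted whiff probabilities of 0.2, 0.4, and 0.3, the xWhiff Rate is 0.9 / 3 = 30%.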

The interpretation of this metric (based on the premise of the original expected statistics) is an estimate of what should have happened: in this case, the percentage of swings against a pitcher that should have resulted in whiffs. It is not 100% precise, but xWhiff Rate gives a good approximation of a pitcher’s stuff. Here are the individual pitch leaders in xWhiff Rate from 2019, along with the actual whiff rates on those pitches (Min. 40 swings).

Some names may appear surprising, but a majority of the xWhiff Rates align pretty well with observed performance in 2019. The most surprising appearance on this list might be T.J. McFarland’s slider at No. 4, as his actual slider whiff rate was more than 17 points lower than expected. However, it’s clear he can generate whiffs with this type of movement.

Here are the top 10 lists for a few other pitch types, as well as the overall xWhiff Rates for each pitch type. By nature, breaking pitches tend to have significantly higher whiff probabilities than fastballs. Full lists can be found here.

As an aside, here are a few of the guys who overperformed and underperformed the most on pitches resulting in a swing, according to xWhiffs *(Note: a lot of the drastic over/under-performers are due to smaller samples)*.

Finally, the overall xWhiff leaders:

### Probabilistic Pitch Scores:

The Probabilistic Pitch Score (PPS) Metric builds on the xWhiff model by adding an aspect for command. First, here is how it is calculated on an individual pitch basis:

$$\text{PPS on Takes} = (1 - \text{Swing Probability}) \times (\text{Called Strike Probability}) \times (\text{Run Value of Called Strike}) - C_{i}$$

$$\text{PPS on Swings} = (\text{Whiff Probability}) \times (\text{Run Value of Whiff}) - C_{i}$$

The score is calculated differently for pitches resulting in a swing as opposed to pitches resulting in no swing. For a taken pitch, I want to know the likelihood that it will result in a called strike. However, I did not want to reward pitchers too much for simply throwing right down the middle. Throwing a meatball over the heart of the plate (good control) is not more valuable than a painted strike on the edge of the zone (good command). Thus, I adjusted the score by accounting for the likelihood that a pitch would be swung at. By including a term for the complement of the pitch swing probability, a pitch with 50% swing probability and 50% called strike probability, likely a well-located pitch on the corner, becomes more valuable than a pitch with 90% swing probability and 90% called strike probability, likely a pitch thrown down the middle. This term is then multiplied by the run value of a called strike, relative to a called ball.

Inspired by this article analyzing the run value of a strike, I took the difference between the average run expectancy of a strike and a ball and multiplied by the wOBA scale to convert this to a run value. Finally, a constant is subtracted from each score. The constant (C_{i}) is simply the average pitch score for each pitch type/pitcher handedness combination.

For a pitch resulting in a swing, we build on the whiff probability used in xWhiffs by multiplying it by the run value of a whiff. This run value was calculated by taking the difference between the average run expectancy of a whiff and a ball in play and then multiplying by the wOBA scale to convert to a run value. The same constant is subtracted from each score.
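To make the two formulas concrete, here is a minimal sketch of the per-pitch calculation. The run values and constant passed in are placeholders, not the values derived in the article, and the dictionary keys (`swung`, `p_swing`, `p_called_strike`, `p_whiff`) are hypothetical names for the model outputs.

```python
def pps_take(swing_prob, called_strike_prob, rv_called_strike, constant):
    """PPS for a taken pitch: (1 - P(swing)) * P(called strike) * RV - C_i."""
    return (1.0 - swing_prob) * called_strike_prob * rv_called_strike - constant


def pps_swing(whiff_prob, rv_whiff, constant):
    """PPS for a pitch swung at: P(whiff) * RV - C_i."""
    return whiff_prob * rv_whiff - constant


def pitch_score(pitch, rv_called_strike, rv_whiff, constant):
    """Pick the formula based on whether the batter swung."""
    if pitch["swung"]:
        return pps_swing(pitch["p_whiff"], rv_whiff, constant)
    return pps_take(
        pitch["p_swing"], pitch["p_called_strike"], rv_called_strike, constant
    )


# The corner-versus-middle comparison from the text, with the run value
# set to 1 and the constant to 0 so only the probability terms matter.
corner = pps_take(0.5, 0.5, 1.0, 0.0)  # 0.5 * 0.5 = 0.25
middle = pps_take(0.9, 0.9, 1.0, 0.0)  # 0.1 * 0.9 = 0.09
```

Even though the pitch down the middle is far more likely to be called a strike, the swing-probability term leaves the corner pitch with the higher score on a take.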

Here is the way to interpret a pitch score. If a given fastball thrown by Gerrit Cole has a score of 0.02, that pitch is worth about 0.02 runs above the average RHP fastball.

To get a measure of a pitcher’s performance for the season, I simply add up all of the scores for each pitch. Here are the highest qualified* scorers for individual pitches in 2019, as well as leaders for some specific pitch types:

*\*Qualified pitchers included guys who threw three pitches per team game. An individual pitch was qualified if a pitcher threw it at least 5% of the time.*

It may seem strange that there are a lot of fastballs at the top of the overall leaderboards, but that is just because fastballs are thrown the most often, thus accumulating more run value on average. The *average* score on a pitch-by-pitch basis, however, appears much different.

Looking at scores on a rate basis gives more credit to pitchers with less volume. Here are the leaders in average PPS per 100 pitches:

Lastly, we can add up a pitcher’s scores across *all* pitches to get an **Arsenal Score** value. This is meant to be an approximation of how well a pitcher’s stuff played over a season, but it may not exactly align with how the results played out on the field.
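The three aggregations (season total for a pitch, average per 100 pitches, and Arsenal Score) could be sketched as follows. This reads "PPS per 100 pitches" as the mean per-pitch score multiplied by 100, which is an assumption, and the function names are illustrative.

```python
def total_pps(scores):
    """Season total for one pitch type: the sum of its per-pitch scores."""
    return sum(scores)


def pps_per_100(scores):
    """Average score per 100 pitches of one pitch type."""
    if not scores:
        raise ValueError("no pitches to average")
    return 100.0 * sum(scores) / len(scores)


def arsenal_score(scores_by_pitch_type):
    """Sum of per-pitch scores across every pitch type a pitcher throws."""
    return sum(sum(scores) for scores in scores_by_pitch_type.values())
```

The total rewards volume, the per-100 figure removes it, and the Arsenal Score rolls every pitch type into a single number for the pitcher.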

### Descriptive Power:

To wrap this all up, I wanted to check the reliability of these metrics. First, I estimated the descriptive power of them, i.e. how well do pitch/arsenal scores and xWhiff Rates describe player performance in the current year.

*Overall xWhiff rate*

A trend I notice with the arsenal scores is that their explanatory power increases with more predictive ERA estimators (i.e. xFIP, SIERA, kwERA). It makes sense that the Arsenal Scores are most highly correlated with kwERA because that estimator focuses solely on strikeouts and walks. I sought to measure a pitcher’s swing-and-miss ability along with command with these pitch scores, so that is a good sign.

The second chart above shows the relationship of overall xWhiff Rates to other metrics. xWhiff Rate has a high correlation with actual whiff rate, another nice thing to see. At the individual pitch level, the R^{2} for xWhiff Rate and actual whiff rate is 0.709, even better.
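For readers who want to reproduce this kind of check, the sketch below computes R² as the squared Pearson correlation between two series, such as xWhiff Rate and actual whiff rate. That is one standard definition for a single-predictor comparison; the article does not state exactly how its R² values were computed.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with 2+ values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Share of variance in one series explained by a linear fit on the other."""
    return pearson_r(xs, ys) ** 2
```

The same function works for the year-over-year comparisons, with one season's values as `xs` and the next season's as `ys`.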

### Predictive Power:

Additionally, I estimated the predictive power of the metrics, i.e. how much can pitch scores and xWhiff Rates tell us about future performance. I also checked the stickiness of the metrics (year-over-year consistency).

*Overall xWhiff rate*

*Individual Pitch xWhiff rate*

As noted above, the xWhiff Rate is pretty consistent from year to year on an individual pitch basis, with an R^{2} of 0.83. On the other hand, the 2018-19 consistency for overall xWhiff Rate was not as high. It does moderately well in predicting actual whiff and strikeout rates. However, this was with a small group of qualified pitchers from 2018 and 2019 consecutively. I would like to get more accurate assessments of the predictive power of xWhiff Rate as I am able to train more models and use more data.

For the pitch scores, there exists more variability. The average individual pitch score has an R^{2} with 2018 pitch scores of 0.41, while the aggregate pitch score has an R^{2} of 0.48. The Arsenal Score falls in between the two measures, with an R^{2} of 0.46.

The Arsenal Scores were not as strong for predicting future performance through ERA estimators. Again, we see that it performed best on kwERA. I would certainly like to see the results of these tests with more than two years of data, but there is certainly some fine tuning that can be done to make these metrics more predictive of future performance.

### Snell’s Stuff:

One stud that my metrics absolutely love is Blake Snell. In 2019, his arsenal ranked fourth among qualified pitchers according to PPS. Here is a breakdown of all of his qualified pitches and their rankings.

*Blake Snell, 2019 Qualified Pitches*

My metrics grade out Snell’s arsenal as one of the most complete in the game. Although his slider and changeup are not ranked as high in terms of pitch scores, they are both firmly in the top ten in Expected Whiff Rate. They still have very high potential to miss bats when he locates them well. As for the fastball and curveball, there is not much else to say; they are flat out dominant. Here was his highest graded curveball last year.

### The Jacob deGrom problem:

The most questionable aspect of my metrics is how they value Jacob deGrom. The back-to-back NL Cy Young award winner is arguably the most dominant starting pitcher in baseball, but I don’t feel this is accurately reflected in the table below. He ranked seventh in overall Arsenal Score for 2019, but I don’t feel very confident in saying there were six pitchers better than deGrom last year.

*Jacob deGrom, 2019 Qualified Pitches*

His four-seamer is the one pitch that I would say is properly valued, as it sits inside the top 10 across all categories. His slider, though, considered one of the nastier ones in the game, seems to get buried on all of these lists. I was curious to see why this was so.

deGrom’s slider is certainly unique. It is over 7 mph faster than the average right-handed slider, with about half as much horizontal movement and over four times as much vertical movement! Looking at the chart below, this pitch behaves more similarly to a cutter:

*Comparing deGrom’s slider to the average slider and cutter*

Cutters tend to get fewer swings-and-misses than sliders, so naturally they would have a lower Expected Whiff Rate on average. Even though deGrom’s slider misses a ton of bats, the inherent qualities that make it more similar to a cutter could be causing his pitch scores to be negatively impacted relative to other sliders. As an experiment, I tried classifying all of deGrom’s sliders as cutters and comparing them to average. His “cutter” would rank second (!) in overall PPS (7.8 expected runs above average) and 55^{th} in PPS/100 (0.008 expected runs above average), certainly a marked improvement in ranking. It’s possible that the uniqueness of deGrom’s slider results in the model undervaluing its effectiveness. Although this could serve as a reasonable explanation, it does not solve the problem with deGrom’s scores. I hope to find a way to address this issue in the future.

### Guys To Follow Based On 2019:

Freddy Peralta, a reliever and spot-starter for the Brewers in 2019, showed some flashes last season despite the overarching numbers. He pitched to a 5.29 ERA but ranked 31st in Arsenal Score. The more predictive stats supported this notion, as he sported a 4.15 xFIP and a 3.80 SIERA. Even more impressive was his fastball, which ranked 18th in the league with a 0.86 PPS/100 and a 28.99% xWhiff Rate. He had a 29.5% whiff rate on the pitch last year, but that number went up to 36.8% in 2020! And indeed, this season ended up going much better for Peralta across the board.

Meanwhile Corbin Burnes, another Brewer, had his slider ranked inside the top three of my highest graded pitches on average in 2019 — 2.19 PPS/100, 48.39% xWhiff Rate. Those are crazy numbers for a relatively under-the-radar relief pitcher. He induced a 58% whiff rate on the slider last year though, so these results actually make sense! He kept it up in 2020, getting 60.3% whiffs and allowing just three hits in 128 offerings of the slider. My model correctly identified the nastiness of this pitch for Burnes based on its exceptional movement profile. The results were clearly not there for Corbin Burnes in 2019 (8.82 ERA, 6.09 FIP), but the 25-year-old turned that around in a big way this season as he helped Milwaukee reach the playoffs. The stuff is very clearly there, and he could be a very solid pitcher for years to come.

The final pitcher I wanted to highlight was Lucas Sims of the Reds. He ranked 11th in overall xWhiff Rate, and his slider ranked 10th with a 1.57 PPS/100. Command may still be an issue for Sims, but he has the wipeout stuff to be very effective. His overall strikeout rate did not end up budging much between 2019 and 2020, but his results were much improved; the 26-year-old almost halved his ERA (4.60 to 2.45) while his advanced metrics glowed. Sims was a stud this season after just an average campaign in 2019, and my metrics suggest he should continue to have a lot to offer in his arsenal.

### Final Thoughts, Future Improvements:

Although I feel that my metrics give a fairly good gauge of a pitcher’s stuff, they are far from perfect. One thing that could have potentially limited my models was computational complexity and time. Due to the amount of data I was working with, it was impossible for me to tune every GBM hyperparameter to the right values (I randomly sampled from a custom tuning grid). With more time, I would be interested to see if I could improve model accuracy by finding more optimal hyperparameter values. It would also be worth exploring whether I can manipulate these probabilities in different ways to get a more accurate measure of pitcher stuff and command. Any metric that isn’t truly capturing deGrom’s greatness can certainly be improved. Finally, I would like to try and incorporate an element for quality of contact allowed. It is really difficult to try and predict results on balls in play based on the quality of a pitch, but including a factor that could account for this may provide a more accurate overall picture of a pitcher’s stuff. Thanks for reading, and I appreciate any feedback!
