Calculating the Odds of Mike Brosseau’s Magic Moment

After watching the great matchup between the Yankees and Rays in the 2020 ALDS, including Mike Brosseau’s epic at-bat against Aroldis Chapman in the deciding Game 5 of that series, I couldn’t help but take a look at the characteristics of the pitch he hit. Chapman is known for having one of the best fastballs in the game and has a long track record of success as a closer. After battling back from 0-2, on the 10th pitch of the at-bat, Brosseau hit a 100.2-mph fastball, thrown with 2,386 rpm of spin and 7.4 feet of extension, over the left-field wall, allowing the Rays to advance to the ALCS.

This pitch was 6.9 mph, 80 rpm, and 1.1 feet above the average velocity, spin rate, and extension for four-seam fastballs in 2020. Given the same location, if the pitch had been a little faster, had more spin, or had been released even closer to home plate, would the result have changed? The aim of this article is to build a model that estimates the chances of Mike Brosseau hitting that home run.

Using Baseball Savant and its wealth of Statcast data and more typical statistics, we can select all the four-seam fastballs thrown in 2020 and their related metrics. The data was cleaned for missing values, four-seam fastballs thrown by position players, eephus pitches, and four-seamers that may have been mislabeled as sliders or changeups. For the latter category, a minimum velocity of 87 mph was used to remove these potential label errors, and pitches with negative pfx_z values were removed as four-seam fastballs are expected to drop less relative to gravity. For pfx_x, the absolute value of the given value was used, as I want to look at the magnitude of the horizontal break as opposed to which side of the plate the movement is going towards.
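The cleaning rules above can be sketched in a few lines of Python. This is only an illustration, not the code used for the article: the field names release_speed, pfx_x, and pfx_z follow the Baseball Savant CSV export, the feet-to-inches conversion assumes the export reports movement in feet, and the sample rows are made up. The position-player and eephus filters are not shown.

```python
MIN_FOUR_SEAM_MPH = 87.0  # velocity cutoff stated in the article
FEET_TO_INCHES = 12.0     # assumption: pfx_x / pfx_z are exported in feet


def clean_four_seamers(rows):
    """Keep plausible four-seamers and derive movement in inches."""
    cleaned = []
    for row in rows:
        velo = row.get("release_speed")
        pfx_x = row.get("pfx_x")
        pfx_z = row.get("pfx_z")
        # Drop rows with missing values.
        if velo is None or pfx_x is None or pfx_z is None:
            continue
        # Drop likely mislabeled sliders/changeups.
        if velo < MIN_FOUR_SEAM_MPH:
            continue
        # Drop pitches with negative vertical movement.
        if pfx_z < 0:
            continue
        cleaned.append({
            "Velocity": velo,
            # Magnitude of horizontal break, regardless of direction.
            "HoriMov": abs(pfx_x) * FEET_TO_INCHES,
            "VertMov": pfx_z * FEET_TO_INCHES,
        })
    return cleaned


# Made-up rows: only the first one survives every filter.
sample = [
    {"release_speed": 100.2, "pfx_x": -0.5, "pfx_z": 1.25},
    {"release_speed": 84.0, "pfx_x": 0.3, "pfx_z": 0.9},
    {"release_speed": 93.0, "pfx_x": 0.4, "pfx_z": -0.1},
    {"release_speed": None, "pfx_x": 0.4, "pfx_z": 1.0},
]
cleaned_sample = clean_four_seamers(sample)
print(cleaned_sample)
```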

The variables selected for the model were HR (as a binary), velocity in mph (Velocity), vertical movement in inches (VertMov), horizontal movement in inches (HoriMov), the pitch's number within the given at-bat (Pitchnum), spin rate (Spin), extension of release (Ext), and whether the pitch was classified as Shadow as given by the zone number in the Statcast output (Shadow: Yes=1, No=0). All pitches in the sample were Shadow or Heart, meaning on the edge of the plate or in the middle of the plate. These variables provide some measure of whether a given pitch will be hit, and hit for a home run. Hitters should have more trouble with higher velocity and with pitches that move more. Spin rate and extension also affect perceived velocity, which is known to make it harder for hitters to square up a baseball.

Also, as a hitter sees more pitches in a given at-bat, he gets more information about how the pitches move and what the pitcher’s strategy is against him. The more pitches a hitter sees, the better chance he has of making hard contact. Finally, location is supremely important in what outcome will result from a given pitch. The closer toward the edges of the plate, the harder a pitch is to hit. There is a reason why pitches in the center of the plate located belt-high are called “meatballs.”

While the above-mentioned variables likely have an impact on whether a pitch is hit for a home run, there are other factors with a strong impact that weren’t accounted for. The weather, the park the pitch was thrown in, wind, humidity, the sequence of pitches thrown, and even whether the pitcher was tipping a certain pitch and the batter was picking up on it all affect whether a pitch results in a home run. For simplicity’s sake, the data was limited to the above variables, and a summary of each is provided in the table below. The sample includes 87,793 four-seam fastballs, 921 of which were hit for home runs.

The 87,793 Four-Seamers Thrown in 2020
         HR     VertMov   HoriMov   Pitchnum   Spin    Ext    Shadow   Velocity
Min      0.00   0.00      0.00      1          1,620   3.20   0.00     87.0
Max      1.00   25.92     23.64     19         3,599   8.50   1.00     102.2
Avg      0.01   16.07     7.36      2.94       2,310   6.37   0.46     93.5
StdDev   0.10   2.93      3.60      1.77       217     0.46   0.50     2.5
Total    921                                                  40,648
SOURCE: Baseball Savant

Because I am attempting to determine the probability of a home run being hit, and HR is a binary variable, a logistic regression will be utilized. This type of regression can be used to estimate the probability that an event will occur and how variables in the regression contribute to this probability. Before selecting which variables to include in the model, it is necessary to check for collinearity between the variables. A correlation matrix is presented.

A Correlation Matrix on 2020 Four-Seamers
            HR       VertMov   HoriMov   Pitchnum   Spin     Ext      Heart    Shadow   Velocity
HR          1.000
VertMov     0.000    1.000
HoriMov     0.011   -0.113     1.000
Pitchnum    0.026   -0.013    -0.009     1.000
Spin       -0.006    0.202    -0.022     0.019      1.000
Ext        -0.001    0.068    -0.088     0.028     -0.054    1.000
Heart       0.079    0.005     0.018     0.008      0.003    0.022    1.000
Shadow     -0.079   -0.005    -0.018    -0.008     -0.003   -0.022   -1.000    1.000
Velocity   -0.015    0.014     0.088     0.088      0.243    0.107    0.009   -0.009    1.000

The largest correlations in the matrix are between velocity and spin rate (0.243) and between spin rate and vertical movement (0.202). Extension is also correlated, negatively, with horizontal movement (-0.088) and spin rate (-0.054). To reduce collinearity and improve the chances of statistically significant terms, vertical movement, extension, and spin rate have been omitted from the model equation. It is likely that velocity captures much of the impact these variables would have on HR, so they would in essence be redundant.
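For readers who want to reproduce a single entry of a matrix like this one, here is a minimal Pearson correlation sketch in plain Python. The function name and the toy Heart/Shadow flags are illustrative, not taken from the article's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy flags: every pitch is either Heart or Shadow, so the two columns
# are perfect complements of each other.
heart = [1, 0, 1, 1, 0, 0]
shadow = [1 - h for h in heart]
heart_shadow_corr = pearson(heart, shadow)
print(heart_shadow_corr)
```

Because the two flags are complements, their correlation is -1, which is why the Heart and Shadow rows of the matrix mirror each other and only one of them can enter the model.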

The equation that will estimate the probability of a home run being hit on a given four-seam fastball is:

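log(p / (1 - p)) = Intercept + B1 × Velocity + B2 × Pitchnum + B3 × HoriMov + B4 × Shadow

where p is the probability of a HR on the pitch and the left-hand side is the log-odds of a HR.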
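To give a feel for how a logistic regression of this kind is estimated, here is a minimal sketch that fits an intercept and a single slope by gradient ascent on the log-likelihood. It is not the procedure behind the article's estimates (statistical packages typically use a faster iterative method), and the ten toy observations are invented purely for illustration.

```python
import math


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit intercept b0 and slope b1 by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = 0.0
        g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Invented toy data: pitch number in the at-bat and whether a HR was hit.
toy_pitchnum = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
toy_hr = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(toy_pitchnum, toy_hr)
print(b0, b1)
```

A positive fitted slope here means the predicted HR probability rises with pitch number in the toy data; the real model simply adds more terms to the same linear predictor.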

The results of the estimated model are presented below:

The Probability of a HR on a Four-Seamer
              Estimate    Std. Error   z-value   Pr(>|z|)
(Intercept)    2.262      1.257          1.8     0.072      .
Velocity      -0.074223   0.014         -5.464   4.66e-08   ***
Pitchnum       0.135152   0.016806       8.042   8.84e-16   ***
HoriMov        0.031381   0.009178       3.419   0.000628   ***
Shadow        -2.299768   0.12144      -18.938   < 2e-16    ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1
Null deviance: 10226.8 on 87792 degrees of freedom
Residual deviance: 9477.4 on 87788 degrees of freedom
AIC: 9487.4
McFadden pseudo R2: 0.07

The resulting coefficients are all significant at the 1% level, with the exception of the intercept, which is significant at the 10% level. All variables with the exception of horizontal movement have the expected signs. An increase in velocity decreases the chance of a home run, the chance of a home run increases as pitch number increases, and the likelihood of a HR indeed goes down if a pitch crosses home plate around its edges vs. in the middle.

Horizontal movement has a positive sign attached to it, although one would expect a ball that moves more to be harder to hit. Despite this, the magnitude of the coefficient on HoriMov is very small relative to the other terms, and it is also significant, so it was kept in the model. A confounding factor that is not being accounted for could be producing the unexpected sign.

Also important is the estimate of how well the model explains the data. With logistic regressions, there is no agreed-upon way to estimate the explanatory power of a model, whereas in a linear regression, this number would be the R². In its place, the McFadden pseudo R² (one minus the ratio of the residual deviance to the null deviance) is used, and the resulting value is .07. This would be an extremely low R² for any model, but context must be considered. Even explaining 7% of what determines whether a HR occurs on a four-seam fastball is meaningful given the total number of four-seam fastballs thrown during a season, even an abbreviated one such as 2020.
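As a quick check on the fit statistics, the pseudo R² and AIC can be recomputed from the two deviances reported in the model output. The sketch assumes ungrouped binary data, where the deviance equals minus twice the log-likelihood, and counts five estimated parameters (the intercept plus four coefficients).

```python
# Values copied from the model output in the article.
null_deviance = 10226.8
residual_deviance = 9477.4
n_params = 5  # intercept + Velocity + Pitchnum + HoriMov + Shadow

# McFadden pseudo R^2: one minus the ratio of residual to null deviance
# (valid here because deviance = -2 * log-likelihood for 0/1 outcomes).
pseudo_r2 = 1 - residual_deviance / null_deviance

# AIC: -2 * log-likelihood plus twice the number of parameters.
aic = residual_deviance + 2 * n_params

print(round(pseudo_r2, 3), round(aic, 1))
```

Both figures line up with the reported values of .07 and 9487.4.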

Additionally, it could be that predicting home runs is simply very difficult, as there are endless factors that affect whether a HR will occur on a given pitch. An additional check on the model is whether the probabilities it predicts for the HR pitches in the data set are greater than those it predicts for the non-HR pitches. After computing the log-odds for each four-seam fastball from the coefficients and the given variable inputs, and converting the result to a predicted probability of a HR, the average predicted probability was calculated for both HR and non-HR fastballs. The results are presented in this table.

Predicting Probability of HR Based on Pitch Type
                 PP (HR)    PP (-HR)   Difference
Avg              0.01816    0.0104     0.0078
Std              0.007509   0.0091
n                921        86872
StdErr           0.000247   3E-05
StdError[Diff]                         0.000249
Confidence: 99%
Multiplier: 2.58
Margin of Error: 0.0006
z-score: 31.1
p-value: 0.00

By taking the difference between the average predicted probabilities for HR and non-HR pitches in the sample and finding the standard error of that difference, we can test whether the difference is different from 0. The z-score for this difference in means is 31.1, making the p-value essentially 0 and the difference statistically significant. While this is by no means a perfect model, it does have qualities that are useful for our task of determining the probability of Mike Brosseau hitting a home run off of the four-seam fastball Aroldis Chapman threw him in Game 5 of the 2020 ALDS.
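The test statistic can be rebuilt from the summary numbers in the table. The sketch below uses the rounded values as printed, so the last digit may differ slightly from a calculation on the full data.

```python
import math

# Summary statistics copied from the table (rounded as printed).
mean_hr, sd_hr, n_hr = 0.01816, 0.007509, 921
mean_no, sd_no, n_no = 0.0104, 0.0091, 86872

# Standard error of each group mean.
se_hr = sd_hr / math.sqrt(n_hr)
se_no = sd_no / math.sqrt(n_no)

# Standard error of the difference between two independent means.
se_diff = math.sqrt(se_hr ** 2 + se_no ** 2)

# z-score for the null hypothesis that the difference is zero.
z_score = (mean_hr - mean_no) / se_diff

# Margin of error at 99% confidence, using the 2.58 multiplier.
margin = 2.58 * se_diff

print(round(se_diff, 6), round(z_score, 1), round(margin, 4))
```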

Using the model described above, we can determine what the chances of Brosseau’s home run were. The results are presented below:

Chances Mike Brosseau Hits That Homer
                       Intercept   Velocity   Pitchnum   HoriMov   Shadow   Log of Odds Ratio   Predicted Probability
Coefficients           2.26        -0.074     0.135      0.031     -2.3
HR Pitch to Brosseau               100.2      10         7.2       Yes
On 10th Pitch          2.26        -7.44      1.35       0.226     -2.3     -5.8976814          0.27%
On First Pitch         2.26        -7.44      0.135      0.226     -2.3     -7.1140494          0.08%

The model predicts that the home run Mike Brosseau hit had a 0.27% chance of happening. Everyone watching that moment could feel how special it was, and this number now provides a way to quantify it. The model also shows how Brosseau more than tripled his odds of hitting the home run, from 0.08% to 0.27%, by seeing nine pitches before hitting the home run on the 10th. This evidence supports the classic notion that hitters tend to fare better the more pitches they see from a given pitcher, and it underlines how important it is for pitchers to retire batters on as few pitches as possible. Also, the magnitude of the coefficient on Shadow shows how important location is for pitching and hitting outcomes.
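The two probabilities can be reproduced from the published coefficients with a few lines of Python. This is a sketch of the arithmetic only; the function name is made up for illustration, and the inputs are the pitch values listed in the table (100.2 mph, 7.2 inches of horizontal movement, Shadow = 1).

```python
import math

# Coefficients copied from the model output.
COEF = {
    "Intercept": 2.262,
    "Velocity": -0.074223,
    "Pitchnum": 0.135152,
    "HoriMov": 0.031381,
    "Shadow": -2.299768,
}


def hr_probability(velocity, pitchnum, horimov, shadow):
    """Predicted HR probability: linear log-odds passed through the logistic function."""
    log_odds = (
        COEF["Intercept"]
        + COEF["Velocity"] * velocity
        + COEF["Pitchnum"] * pitchnum
        + COEF["HoriMov"] * horimov
        + COEF["Shadow"] * shadow
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# The pitch Brosseau hit, as the 10th pitch and as a hypothetical first pitch.
p_tenth = hr_probability(100.2, 10, 7.2, 1)
p_first = hr_probability(100.2, 1, 7.2, 1)

print(f"{p_tenth:.2%}", f"{p_first:.2%}", round(p_tenth / p_first, 2))
```

The only input that changes between the two calls is the pitch number, so the ratio of the two probabilities isolates the effect of the nine extra pitches.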

Future research could improve upon the explanatory power of the model. I wanted to focus only on variables pitchers could control, but the inclusion of launch angle or exit velocity could provide more accurate results. Regardless, using logistic regression to predict an outcome that rarely occurs is difficult. In this data set, 921 homers were observed on 87,793 fastballs, a HR rate of about 1% (.0105).

Additionally, a way to better classify location without adding too many variables would likely improve the model and provide a better picture of how location, such as up/down or in/out, might affect the probability of a homer compared with the shadow/heart method used in this model. Finally, this model focused solely on four-seam fastballs, and it would be interesting to see if certain factors affect different pitch types more than others.





Cameron Grove (member)
3 years ago

This is really cool and interesting, I love the detail you’ve provided for how you made your model! I’ve actually been doing something similar, looking at predicted outcomes based on pitch characteristics but using a machine learning model.

I wrote this up on the community blog a while ago ( https://community.fangraphs.com/pitchingbot-using-machine-learning-to-understand-what-makes-a-good-pitch/ ) but since then I’ve added predictions for specific events from my model.

While I haven’t tried to predict home runs, for the 10th pitch of this at bat my model gave probabilities of:

Swinging strike: 9%

Called strike: 13%

Ball: 0.2%

Foul ball: 36%

Contact: 77%

Groundball: 17%

Line drive: 13%

Flyball: 11%

If we apply a HR/FB rate of around 10% then that gives a HR probability of 1%, not too different from your prediction!

I’ve made an app for looking at my model’s predictions on any pitch if you (or anyone else) wants to check it out:
https://mlpitchquality.shinyapps.io/pitch_tester/

But I don’t want to take anything away from your post, great work!