The Season’s Least Likely Home Run

Jeff recently ran two articles about the season’s worst and best home runs, as measured by exit velocity.  As a small addendum to that, I’d like to include both exit velocity and launch angle to try to determine the season’s least likely home run.  So how do we do such a thing?  Warning!  I’m going to spend a bunch of time talking about R code and machine learning.  If you want to skip all that, feel free to scroll down a bit.  If, on the other hand, you’d like a more in-depth look at running machine learning on Statcast data, hit me up in the comments and I’ll write some more fleshed-out pieces.

As usual, we’re going to rely heavily on Baseball Savant.  Thanks to their Statcast tool, we can download enough information to blindly feed into a machine-learning model to see how exit velocity and launch angle affect the probability of getting a home run.  For instance, if we wanted to make a simple decision tree, we could do something like this.

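# caret provides createDataPartition() and train()
library(caret)
# Read the data
my_csv <- 'hr_data.csv'
data_raw <- read.csv(my_csv)
# Treat HR as a class label so train() fits a classifier
data_raw$HR <- as.factor(data_raw$HR)
# Make training and test sets (70/30 split)
set.seed(1) # arbitrary seed, so the split is reproducible
inTrain <- createDataPartition(data_raw$HR,p=0.7,list=FALSE)
training <- data_raw[inTrain,]
testing <- data_raw[-inTrain,]
# rpart == decision tree
method <- 'rpart'
# train the model
modelFit <- train(HR ~ ., method=method, data=training)
# Show the decision tree
plot(modelFit$finalModel)
text(modelFit$finalModel)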


That looks like what we would expect.  To hit a home run, you want to hit the ball really hard (over 100 MPH) and at the right angle (between 20 and 40 degrees).  So far so good.

Now, decision trees are pretty and easy to interpret, but they’re no good for what we want to do because (a) they’re not as accurate as other, more sophisticated methods and (b) they don’t give meaningful probability values.  Let’s instead use boosting and see how well it does on our test set.

method <- 'gbm' # boosting
modelFit <- train(HR ~ ., method=method, data=training)
# How did this work on the test set?
predicted <- predict(modelFit,newdata=testing)
# Accuracy, precision, recall, F1 score
accuracy <- sum(predicted == testing$HR)/length(predicted)
# posPredValue() and sensitivity() treat the first factor level of HR as the
# "positive" class unless a positive= argument says otherwise
precision <- posPredValue(predicted,testing$HR)
recall <- sensitivity(predicted,testing$HR)
F1 <- (2 * precision * recall)/(precision + recall)

print(accuracy) # 0.973
print(precision) # 0.792
print(recall) # 0.657
print(F1) # 0.718
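
If it helps to see how those four numbers relate to each other, here’s a small base-R sketch that computes them from raw confusion counts.  The counts are made up for illustration (they are not from the Statcast data) and are only chosen so that home runs are rare, which is the situation we’re in.

```r
# Hypothetical counts, invented for illustration (not the article's data).
# Home runs are rare, so most batted balls are true negatives.
tp <- 40    # predicted HR, was a HR
fp <- 10    # predicted HR, was not
fn <- 20    # predicted not a HR, was a HR
tn <- 930   # predicted not a HR, was not

total     <- tp + fp + fn + tn
accuracy  <- (tp + tn) / total
precision <- tp / (tp + fp)   # of the predicted HRs, how many were real
recall    <- tp / (tp + fn)   # of the real HRs, how many were predicted
F1        <- (2 * precision * recall) / (precision + recall)

# A "model" that never predicts a home run is right on every non-HR
baseline_accuracy <- (tn + fp) / total

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              F1 = F1, baseline = baseline_accuracy), 3))
```

With classes this lopsided, always guessing “not a home run” already scores well on accuracy, so precision and recall are the numbers that show whether the model has learned anything about the rare class.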

The accuracy number looks nice, but the precision and recall show that this is far from an amazingly predictive algorithm.  Still, it’s decent, and all we really want is a starting point for the conversation I started in the title, so let’s apply this prediction to all home runs hit in 2016.
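
One way to sketch that last step: score every home run with the model’s predicted probability and take the smallest.  The block below is a self-contained toy, not the actual analysis.  The data are simulated, the column names exit_velocity and launch_angle are hypothetical, and a plain logistic regression (glm) stands in for the boosted model so that it runs without extra packages.  With the caret model, the analogous call would be predict(modelFit, newdata = ..., type = "prob").

```r
set.seed(42)

# Simulated batted balls: stand-ins for the Statcast columns, not real data
n <- 2000
balls <- data.frame(
  exit_velocity = runif(n, 60, 115),  # MPH
  launch_angle  = runif(n, -10, 60)   # degrees
)

# Toy rule: HR chance rises with exit velocity and peaks near 28 degrees
p_true <- plogis(0.15 * (balls$exit_velocity - 100) -
                 0.005 * (balls$launch_angle - 28)^2)
balls$HR <- rbinom(n, 1, p_true)

# Logistic regression as a simple stand-in for the boosted model
fit <- glm(HR ~ exit_velocity + launch_angle + I(launch_angle^2),
           family = binomial, data = balls)

# Score only the balls that actually left the park, then take the minimum
homers <- balls[balls$HR == 1, ]
homers$p_hr <- predict(fit, newdata = homers, type = "response")
least_likely <- homers[which.min(homers$p_hr), ]

print(least_likely)
```

Sorting homers by p_hr instead of taking only the minimum gives a ranked list, which makes it easier to spot and discard obviously bad tracking rows before picking a winner.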

Once you throw out some fairly clear blips in the Statcast data, the “winner,” with a 0.3% chance of turning into a home run, is this beauty from Darwin Barney.*  This baby had an exit velocity of 91 MPH and a launch angle of 40.7 degrees.  For fun, let’s look at where similarly struck balls in the Rogers Centre ended up this year.

* I’m no bat-flip expert, but I believe you can see more of a flip of “I’m disgusted” than “yay” in that clip.

Congrats Darwin Barney!  There are no-doubters, then there are maybes, and then there are wall-scrapers.  They all look the same in the box score, but you can’t fool Statcast.

The Kudzu Kid does not believe anyone actually reads these author bios.


Awesome read! So did he backspin the crap out of it or what? What was the range of velocities/angles you used to identify the similarly struck balls? It’s pretty amazing how much further the Barney ball carried given an environment that should have few other confounding variables (dome @ Rogers Centre).