An Attempt to Predict Hits With Statcast

by tb.25

February 13, 2018

Most of what happens in a baseball game is influenced by chance. A ball hit on the screws can end up in the outstretched glove of a diving fielder. The outfield wall could be just six inches too tall, keeping a home run in the park. Strike three could be called ball four by the home plate umpire. Traditional statistics can’t account for all of this, which is why sabermetricians have developed context-specific statistics like DIPS (defense independent pitching statistics) or wRC (weighted runs created). These stats try to explain the outcomes of batted balls while controlling for defense and ballparks.

I set out to create a model that controls for defense, but from the hitter’s perspective. A model that could predict batted ball outcomes could be used to better evaluate hitters and their quality of contact. Using the batted ball fields in 2017 MLB pitch-by-pitch Statcast data (launch angle, exit velocity, spray angle and outcome), I used a random forest to model whether a batted ball would be a hit or an out. I trained my model on 20% of the data, and felt confident the training set and test set were comparable, with similar means and standard deviations for launch angle, exit velocity and spray angle.
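The comparability check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the article: the data is synthetic (random numbers standing in for Statcast rows), and the helper names `split_rows` and `summarize` are hypothetical.

```python
import random
import statistics

def split_rows(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def summarize(rows, column):
    """Return (mean, standard deviation) for one column."""
    values = [row[column] for row in rows]
    return statistics.mean(values), statistics.stdev(values)

# Synthetic stand-in for batted balls; these are NOT real Statcast values.
rng = random.Random(42)
batted_balls = [
    {
        "launch_angle": rng.gauss(12, 25),
        "exit_velocity": rng.gauss(88, 13),
        "spray_angle": rng.uniform(-45, 45),
    }
    for _ in range(5000)
]

# 20% of the rows go to training, as in the article.
train, test = split_rows(batted_balls, train_fraction=0.2)

for column in ("launch_angle", "exit_velocity", "spray_angle"):
    train_mean, train_sd = summarize(train, column)
    test_mean, test_sd = summarize(test, column)
    print(f"{column:>14}: train mean {train_mean:7.2f}, sd {train_sd:6.2f} | "
          f"test mean {test_mean:7.2f}, sd {test_sd:6.2f}")
```

If the means and standard deviations line up across the two sets for every input, the split is unlikely to have put a different mix of batted balls in training than in testing.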

I chose to use a random forest because it fits multiple decision trees on random subsets of the training set and averages their predictions. Each decision tree makes a series of binary splits on the inputs to arrive at an outcome. Averaging across trees reduces variance, which helps prevent overfitting, something I was afraid of doing. The random forest provided much better accuracy than a logistic regression, my alternative model, which I attribute to the number of trees (10) and the nature of a decision tree versus a regression.
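To make the "many trees, averaged" idea concrete, here is a deliberately simplified sketch: bootstrap aggregation (bagging) of one-split decision stumps. It is a stand-in for a random forest, not the model from this article. A real random forest grows deeper trees and also samples features at each split. The data, the toy "hit" rule, and every function name here are assumptions made for illustration.

```python
import random

def fit_stump(rows, labels):
    """Find the one-split rule (feature, threshold, polarity) with the fewest errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            # Errors if we predict "hit" (1) whenever the value exceeds the threshold.
            errors = sum((row[feature] > threshold) != (label == 1)
                         for row, label in zip(rows, labels))
            # Polarity 1 keeps that rule; polarity 0 flips it.
            for polarity, err in ((1, errors), (0, len(rows) - errors)):
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best[1:]

def predict_stump(stump, row):
    """Apply a single stump to one row, returning 1 (hit) or 0 (out)."""
    feature, threshold, polarity = stump
    above = row[feature] > threshold
    return int(above) if polarity == 1 else int(not above)

def fit_forest(rows, labels, n_trees=10, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict_forest(forest, row):
    """Average the votes; return (share of trees voting hit, predicted class)."""
    hit_share = sum(predict_stump(stump, row) for stump in forest) / len(forest)
    return hit_share, int(hit_share > 0.5)  # an exact tie is called an out

def make_synthetic_balls(n, rng):
    """Synthetic (launch_angle, exit_velocity) rows with a toy hit rule."""
    rows, labels = [], []
    for _ in range(n):
        launch_angle = rng.gauss(12, 25)
        exit_velocity = rng.gauss(88, 13)
        is_hit = exit_velocity > 95      # toy rule, not a real finding
        if rng.random() < 0.10:          # 10% label noise
            is_hit = not is_hit
        rows.append((launch_angle, exit_velocity))
        labels.append(int(is_hit))
    return rows, labels

rng = random.Random(7)
train_rows, train_labels = make_synthetic_balls(300, rng)
test_rows, test_labels = make_synthetic_balls(300, rng)

forest = fit_forest(train_rows, train_labels, n_trees=10)
predictions = [predict_forest(forest, row)[1] for row in test_rows]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(f"held-out accuracy with {len(forest)} bagged stumps: {accuracy:.3f}")
```

Because each stump sees a slightly different resample of the training data, the individual thresholds wobble, and the averaged vote is steadier than any single one. That averaging is the variance reduction referred to above.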

Without further ado, the results (in visual form):

[Figure: Actual Hits & Outs (left); Predicted Hits & Outs (right)]

There’s quite a bit going on in these plots. Let me break it down.

These plots show every fair ball hit in 2017 (with a few misclassifications) at the location where it landed or was caught. The dark blue balls in play are hits, while the light blue balls are outs. On the left are the actual hits and outs, while on the right are the predicted hits and outs. There are almost a hundred thousand points on these plots, making them difficult to sift through. Here is an explanation of these plots in tabular form:

[Table: share of hits and outs predicted correctly]

My model does a much better job at predicting outs than hits. It was correct almost 90% of the time at predicting outs, compared to merely 66% of the time predicting hits. From the perspective of hits being good (the batter’s perspective), 10% of outs were false positives, and 34% of hits were false negatives. I believe my model did better with outs because there are many more outs than hits: league-average BABIP is about .300, meaning a ball in play is a hit 30% of the time and an out 70% of the time. Overall, the model was accurate 81.4% of the time. Despite the high accuracy, the model’s R-squared was only .1769. That is, the model was able to explain 17.7% of the variance in batted ball results.
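The relationship between these numbers can be worked through with a little arithmetic. The sketch below uses the rounded figures quoted above (90% of outs correct, 66% of hits correct, a 30% hit share), so it will not reproduce the article's numbers exactly. The R-squared function also rests on an assumption: that R-squared was computed on hard 0/1 predictions against 0/1 outcomes, which the article does not state.

```python
def overall_accuracy(hit_share, hit_recall, out_recall):
    """Accuracy as the class-weighted average of the per-class hit rates."""
    return hit_share * hit_recall + (1 - hit_share) * out_recall

def r2_for_hard_predictions(accuracy, hit_share):
    """R-squared when both outcomes and predictions are 0 or 1.

    Each wrong prediction adds exactly 1 to the squared error, so
    SSE / n = 1 - accuracy, while SST / n = p * (1 - p) for hit share p.
    """
    return 1 - (1 - accuracy) / (hit_share * (1 - hit_share))

# Rounded inputs taken from the text; the true values may differ slightly.
HIT_SHARE = 0.30   # league-average BABIP of roughly .300
HIT_RECALL = 0.66  # share of actual hits predicted as hits
OUT_RECALL = 0.90  # share of actual outs predicted as outs

accuracy = overall_accuracy(HIT_SHARE, HIT_RECALL, OUT_RECALL)
baseline = 1 - HIT_SHARE  # accuracy from predicting "out" every time
r2 = r2_for_hard_predictions(accuracy, HIT_SHARE)

print(f"accuracy from rounded rates: {accuracy:.3f}")
print(f"always-out baseline:         {baseline:.3f}")
print(f"R-squared (0/1 predictions): {r2:.3f}")
```

With these rounded inputs the sketch gives an accuracy of about 0.83 and an R-squared of about 0.18, in the same neighborhood as the reported 81.4% and .1769. It also shows why the two can coexist: predicting "out" every time would already be right about 70% of the time, so an accuracy in the low 80s explains only a modest share of the variance.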

Overall, I feel this model can help predict batted ball results. Two main drawbacks of the model are that it only predicts hits instead of the type of hit and that it requires more data to increase accuracy. I believe having fielder data, such as shifts and defensive capabilities, would greatly increase the accuracy of the model, though at the risk of overfitting (given the small samples of fielded balls in certain areas).

I plan to explore this model further and look at individual batters to compare their actual hits to the predicted ones.

4 Comments

johndw28

7 years ago

Fascinating, thanks. Did you produce the above chart with R?

tb.25

Reply to johndw28

Thanks, trying to improve my self-taught modeling skills in hopes of working in baseball one day.

I considered using R, but because I used python to manage the data, I modeled in python and used Tableau.

Michael Augustine

Nice work, man! Looking forward to seeing what else you can come up with.

skilled_sox

Nice article. I’d like to discuss some baseball modeling with you. If you’re interested message me on skype, XChamps4ever
