Building a Hitting Prospect Projection Model

by Joshua Mould

February 1, 2021

How well do you think you can predict the future of a minor leaguer? My computer may be able to help. Towards the end of the regular season, I found the prospects page at FanGraphs and started experimenting with it. I have always had a lot of fun thinking about the future and predicting outcomes, so I decided to try to build a model to predict whether or not a prospect would make it to the majors. I had all the data I needed thanks to FanGraphs, and I had recently been looking into similar models built by others to figure out how I could accomplish this project. I realized that all these articles I was reading detailed the results of their models, but not the code and behind-the-scenes work that goes into creating them.

With that in mind, I decided to figure it out on my own. I had a good idea of what statistics I wanted to use, but there were a few issues I needed to consider before I started throwing data around:

Prospects can play multiple years at a single level.
Not all prospects play at all levels of the minor leagues.
What do I do with players who skipped levels?
How can I make this model useful and practical?

Prospects playing multiple years at a single level isn’t too difficult to deal with because I can just aggregate the stats from those seasons. The fact that not all prospects play in every level of the minor leagues before reaching the majors is tough, however, because that makes for a lot of missing data that needs to be handled before building the model. I decided to replace all the missing values with the means of the existing data, and I created variables to indicate whether or not a player’s season stats for that particular level of the minor leagues were real. To make this model useful, I would want to take out certain variables. For example, I figured I wouldn’t need or want Triple-A stats included in the model because typically once a player has reached that level of the minors, you are more interested in how well they will do in the majors.
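For readers who want to see the mechanics, here is a minimal sketch of the two data-preparation steps described above: combining repeated seasons at one level, and filling missing levels with the mean while flagging which values are real. It is not my original code. The field names (`pa`, `bb`, `bb_pct_high_a`, and so on) and all of the numbers are made up for illustration, and it assumes aggregation means summing counting stats and recomputing the rate.

```python
from statistics import mean

# Hypothetical seasons for one player who repeated High-A (made-up numbers).
high_a_seasons = [
    {"pa": 300, "bb": 24},
    {"pa": 200, "bb": 26},
]

def aggregate_level(seasons):
    """Combine seasons at one level by summing counts, then recompute the rate."""
    pa = sum(s["pa"] for s in seasons)
    bb = sum(s["bb"] for s in seasons)
    return {"pa": pa, "bb": bb, "bb_pct": bb / pa}

combined = aggregate_level(high_a_seasons)  # 50 walks in 500 PA

# Hypothetical per-level walk rates; None marks a level the player never played.
players = [
    {"name": "A", "bb_pct_high_a": 0.10, "bb_pct_double_a": 0.08},
    {"name": "B", "bb_pct_high_a": None, "bb_pct_double_a": 0.12},
    {"name": "C", "bb_pct_high_a": 0.06, "bb_pct_double_a": None},
]

def impute_with_indicator(rows, column):
    """Fill missing values with the column mean and add a 'real or not' flag."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    out = []
    for r in rows:
        new = dict(r)
        is_real = r[column] is not None
        new[column + "_real"] = 1 if is_real else 0
        new[column] = r[column] if is_real else fill
        out.append(new)
    return out

imputed = impute_with_indicator(players, "bb_pct_high_a")
imputed = impute_with_indicator(imputed, "bb_pct_double_a")
```

The indicator column lets the model tell a player who really walked at a league-average rate apart from one who simply never played at that level.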

I defined a prospect as any player under the age of 26 in the minor leagues, and I only used stats from the following minor league levels: rookie, Low-A, Single-A, High-A, and Double-A. I also needed to remove current prospects from my training data (because they haven’t yet had the chance to reach the majors), so I dropped every player who currently appears on the FanGraphs prospect page. I also took out players who didn’t get past rookie league, because I wasn’t nearly as interested in them and they made up a huge majority of the seasons that I collected. Side note: It would be a really cool project to take college stats and use them to project whether or not a player would make it past rookie ball, given the number of washouts in my data… maybe a future project.

I used a logistic regression model because of the binary outcome variable that I wanted to predict: Whether a player makes it to the majors or not. The predictor variables I wanted to use were age, BB%, K%, BABIP, ISO, GB%, SwStr%, and SB%. I chose these because these stats appealed to me as variables that isolate certain skills and could have predictive power. I calculated SB% a little differently than the conventional calculation, using it the way Chris Mitchell did in his article at The Hardball Times. He calculated the proportion of times that a player would attempt to steal based on opportunities given with the following formula:

SB% = (SB+CS) / (Singles + Walks + HBP)
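
As a quick illustration, the formula can be written as a small function. The function name and the sample season line are hypothetical, and returning `None` when a player has no opportunities is my assumption rather than something from Mitchell’s article.

```python
def steal_attempt_rate(sb, cs, singles, walks, hbp):
    """Steal attempts per opportunity: (SB + CS) / (1B + BB + HBP)."""
    opportunities = singles + walks + hbp
    if opportunities == 0:
        return None  # assumption: no times on first base means no defined rate
    return (sb + cs) / opportunities

# Hypothetical season line: 25 attempts in 125 opportunities.
rate = steal_attempt_rate(sb=20, cs=5, singles=80, walks=40, hbp=5)
```

Note that this measures how often a player tries to run, not how often he succeeds.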

For the response variable (major leaguer or not major leaguer), I classified a major leaguer as a player who had 600 or more plate appearances in the bigs.
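
To show how the pieces fit together, here is a rough sketch of a logistic regression on this kind of data. It is not the code behind the model in this article: it fits by plain gradient descent using only the Python standard library, uses just two predictors (ISO and K%), and every number in `raw` is invented. The function names, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev

def made_majors(career_mlb_pa, threshold=600):
    """Response variable: 1 if the player reached 600 MLB plate appearances."""
    return 1 if career_mlb_pa >= threshold else 0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def standardize(rows):
    """Z-score each column so gradient descent behaves on small rate stats."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def predict_proba(weights, bias, x):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the average log-loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, target in zip(X, y):
            err = predict_proba(weights, bias, x) - target
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Made-up players: (ISO in Double-A, K% in Double-A, career MLB PA).
raw = [
    (0.200, 0.15, 2500), (0.180, 0.18, 900), (0.150, 0.20, 0), (0.090, 0.28, 0),
    (0.110, 0.25, 120), (0.170, 0.16, 650), (0.080, 0.30, 0), (0.120, 0.22, 0),
]
X = standardize([[iso, k] for iso, k, _ in raw])
labels = [made_majors(pa) for _, _, pa in raw]

weights, bias = fit_logistic(X, labels)
probabilities = [predict_proba(weights, bias, x) for x in X]
```

Each value in `probabilities` is the kind of number that appears in the prediction tables later in the article: an estimated chance, between 0 and 1, of clearing the 600 plate appearance bar.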

Once I had these logistics figured out, I began to put the model together. I started off with every statistic listed above from each level of the minor leagues, and I slowly narrowed it down to only the significant variables in order to reduce overfitting. I used the AUC/ROC curve to evaluate the model, but because the data is so lopsided, with nearly 95% of players in my training set having “never made it to the majors,” I needed to use a different baseline than conventional AUCs for the model. The exact percentage of players who “never made it to the majors” in my training set was 93.52%, so a model that simply says none of the players ever make it would be 93.52% accurate. I used that figure as a rough bar and aimed for an AUC of .9352 or higher, so that the model could hold its own against the default while also providing some insight into which players actually do make it to the majors.
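
To make the baseline concrete, the sketch below computes the accuracy of a model that predicts nobody makes it, using hypothetical counts (162 of 2,500 players) chosen only to reproduce the 93.52% figure. It also computes AUC by pairwise comparison. One caveat worth keeping in mind: accuracy and AUC are not on the same scale. A model that gives every player the same score has an AUC of 0.5 no matter how lopsided the classes are, so comparing an AUC against the accuracy of the default model is a rough yardstick rather than a like-for-like test.

```python
def accuracy(labels, predictions):
    """Share of players whose predicted class matches the actual outcome."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical training set: 162 major leaguers out of 2,500 players.
labels = [1] * 162 + [0] * 2338

# The "default model": nobody makes it, and everyone gets the same score.
default_predictions = [0] * len(labels)
default_scores = [0.0] * len(labels)

default_accuracy = accuracy(labels, default_predictions)  # 2338 / 2500
default_auc = auc(labels, default_scores)                 # every pair is a tie
```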

I ended up including the following variables/stats in the final model:

SB% in Low-A
BB% in High-A
ISO in High-A
Age in Double-A
K% in Double-A
BABIP in Double-A
ISO in Double-A
SB% in Double-A
“Real or Not” in Rookie
“Real or Not” in Double-A

These variables make sense given that most of them come from the higher levels of the minors. If you get to Double-A, there’s a much greater chance that you get to Triple-A or even go straight to the majors. It also of course bodes very well if you do well at Double-A, since this is much closer to playing against major league competition. I also wasn’t expecting GB% and SwStr% to be significant predictors, because much of what they capture is already reflected in BABIP and K%.

In my training set I ended up with an AUC of .952 and the following plot of the AUC-ROC curve:

In the test set I ended up with an even better AUC of .955 and the following plot of the AUC-ROC curve:

At the beginning of this project, I didn’t have much experience with the AUC-ROC curve. The only method of evaluating a logistic model that I had known of was the confusion matrix, which I now know is not always helpful, especially when a continuous predicted probability is more informative than a yes-or-no classification. In addition, when my AUC first appeared very high, I was surprised because I didn’t think I’d be able to build such a strong model that simply. I looked into it more and found that the reason it was so high is that the default model that predicts all prospects to fail would be extremely accurate, so unfortunately the high AUC was not because I was a natural modeling guru. The imbalance between those who make the majors and those who don’t in my training set is so lopsided that it made my model seem extremely effective right off the bat, before I had even taken out any variables.
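
Here is a small sketch of why a single confusion matrix can be limiting. The labels and scores are invented, and the two cutoffs (0.5 and 0.25) are arbitrary choices for illustration. The same set of predicted probabilities produces a different matrix at each cutoff, which is the information an ROC curve summarizes across all cutoffs at once.

```python
def confusion_matrix(labels, scores, threshold):
    """Count true/false positives and negatives at one probability cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

# Ten made-up prospects: two made the majors, eight did not.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.60, 0.30, 0.35, 0.20, 0.10, 0.08, 0.05, 0.04, 0.03, 0.02]

at_half = confusion_matrix(labels, scores, 0.5)      # misses the 0.30 major leaguer
at_quarter = confusion_matrix(labels, scores, 0.25)  # catches him, adds one false alarm
```

With so few major leaguers in the data, a 0.5 cutoff can look accurate while still missing real big leaguers, which is part of why the probabilities themselves are more useful than the yes-or-no calls.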

Seeing accuracy that high with 50 variables in the model is a tricky illusion and, in the case of a model like this, a symptom of overfitting. I knew from the beginning that I would need to reduce the variables in the model, but it was very interesting to see the results of the overfitting appear in the data. With my overfitted predictions, there was a large difference between the predictions for those who made it to the majors and those who didn’t. This resulted in a much lower AUC in the test set and made for a poor model. While playing around with it, I at some points found the model ending up with an AUC of 1, which meant perfect prediction in the training set but isn’t really useful on any other data.

Let’s take a look at the data. Here are the predictions for the top 25 players in WAR between 2006 and 2019:

Prospect Model for Top 25 WAR Players, 2006-19

Rank	Name	Predictions	WAR
1	Mike Trout	.99336898	76.0
2	Buster Posey	.01242905	52.7
3	Andrew McCutchen	.86624516	49.7
4	Ryan Braun	.8791191	43.9
5	Josh Donaldson	.31081991	42.3
6	Paul Goldschmidt	.7544491	41.4
7	Mookie Betts	.99238981	40.2
8	Giancarlo Stanton	.96553409	39.7
9	Freddie Freeman	.86814365	38.0
10	Brett Gardner	.62614691	37.6
11	Jonathan Lucroy	.24545985	37.0
12	Justin Upton	.99592404	36.8
13	Bryce Harper	.98098739	36.7
14	Manny Machado	.96577834	35.8
15	Jose Altuve	.97283983	35.2
16	Yasmani Grandal	.60096726	34.3
17	Christian Yelich	.88193566	34.3
18	Jason Heyward	.99870464	32.9
19	Kyle Seager	.59719139	32.3
20	Nolan Arenado	.7804618	32.2
21	Anthony Rizzo	.87364824	30.3
22	Jacoby Ellsbury	.88251289	30.3
23	Matt Carpenter	.20256203	30.2
24	Lorenzo Cain	.19193733	29.0
25	Francisco Lindor	.95153247	28.9

The most obvious thing to point out is Buster Posey’s abysmal prediction of around a 1% chance to make it to the show. This makes sense given his very short minor league career, which consisted of about 10 games in rookie ball and Low-A combined before 80 games in High-A. He went to Triple-A for a little while (which my model doesn’t take into account) and then went to San Francisco after just a couple of seasons in the minors. Even though he’s a former MVP and three-time World Series champ, I consider this a win for the model: in general, guys who play very few games in the lower minors and don’t play Double-A probably aren’t going to make it to the majors, and Posey was a true outlier and top prospect. This is a place where factoring college stats into the model might help it recognize standouts like Posey who rush to the majors.

Another thing to note is that just like there needed to be a baseline for the AUC because of the lopsided results, there also needs to be a baseline for the predictions. We need to understand that the average minor league player’s chance to make it to the big leagues is around 10%. That means that those who have predictions far above that are likely to be very, very good.

Next I took the top 500 prospects or so from FanGraphs and applied the model to them. The top 50 predictions from the model are below. There are a few names you might notice are missing from the top predictions, including players like Wander Franco, who simply hasn’t had enough playing time at levels the model likes the most, such as Double-A. We may never even see him play at Double-A because of this past minor league season’s cancellation and the possibility he is called up early in 2021.

Top 50 Prospect Model Projections

Rank	Name	Age	wRC+	Prediction
1	Jarred Kelenic	20.3	142.958722	.9778224
2	Luis Robert	22.2	150.628621	.94175248
3	Isaac Paredes	20.7	132.283928	.94124406
4	Jo Adell	20.6	137.274165	.93936551
5	Nick Madrigal	22.6	119.707385	.93719679
6	Dylan Carlson	21.0	130.198868	.93300778
7	Andrés Giménez	21.2	110.71173	.90559978
8	Gavin Lux	21.9	156.305912	.89909189
9	Vidal Bruján	21.7	134.759559	.89442025
10	Keibert Ruiz	21.3	94.7038924	.88379281
11	Daulton Varsho	23.3	145.465397	.8596463
12	Taylor Walls	23.3	133.86938	.83855902
13	Brendan Rodgers	23.2	123.813903	.83243674
14	Jorge Mateo	24.4	80.0309606	.82703768
15	Heliot Ramos	20.2	119.050502	.80713648
16	Luis Garcia	19.5	95.1615171	.79559999
17	Cristian Pache	21.0	114.751941	.78913128
18	Yusniel Diaz	23.1	135.15202	.76919141
19	Drew Waters	20.8	132.118829	.757519
20	Mauricio Dubón	25.3	103.601284	.75656152
21	Alec Bohm	23.2	146.943753	.74234879
22	Austin Hays	24.3	92.4417251	.73148647
23	Anthony Alford	25.3	94.0627314	.72035194
24	Leody Taveras	21.1	95.1196241	.71424263
25	Carter Kieboom	22.2	124.636012	.71109895
26	Oneil Cruz	21.1	137.430851	.70463246
27	Jonathan Araúz	21.2	98.2112133	.70432122
28	Lucius Fox	22.3	99.4395713	.69819464
29	Abraham Toro	22.9	137.205164	.6836528
30	Jason Martin	24.2	99.7303846	.68025898
31	Luis Barrera	24.0	122.854608	.67851418
32	Joey Bart	22.9	140.693228	.67710447
33	Khalil Lee	21.3	117.048416	.66450715
34	Royce Lewis	20.4	111.059658	.64479302
35	Ryan Mountcastle	22.7	118.774006	.64439561
36	Ke’Bryan Hayes	22.8	110.540248	.64018493
37	Brandon Marsh	21.9	118.81056	.61958271
38	Thairo Estrada	23.7	74.5093524	.61888417
39	Luis Santana	20.3	126.071404	.61490306
40	Daz Cameron	22.8	98.7950409	.60065321
41	Jorge Oña	22.8	103.290552	.59374655
42	Josh Lowe	21.7	113.863437	.58122182
43	Randy Arozarena	24.7	133.414235	.56952746
44	Yonny Hernandez	21.5	118.231353	.56648887
45	Domingo Leyba	24.1	107.683533	.5605087
46	Alex Kirilloff	22.0	150.321047	.55860194
47	Omar Estévez	21.7	113.840938	.53150457
48	Sheldon Neuse	24.9	99.6412136	.52316645
49	Connor Wong	23.5	131.336263	.50716012
50	Jahmai Jones	22.2	93.252696	.49505114

It’s really cool to see this kind of thing work in a fortune-telling way. Predicting the future is fascinating to me, and even more satisfying when predicted correctly. Many of these players have already been brought up to the majors and have performed well. Others haven’t performed as well but are destined to turn it around.

Overall, I was very happy with this project because it wasn’t straightforward and I had to work through a few problems along the way. I learned about the different options for handling missing data, like the missing seasons of minor leaguers, and I also know more about how to evaluate logistic models. In addition, I now have a handy-dandy tool for evaluating minor leaguers’ potential for future success.

10 Comments

Ddog (Member since 2020)

4 years ago

Joshua, very impressive use of statistics! Keep up the good work and go Wildcats!

matt_heckman

This work is amazing! Have you considered adding an extra step in to find somewhat of the expected value for the wRC+ of each player? Possibly taking the %chance given by the model of making it to the bigs and multiplying that by the estimated wRC+?

Joshua Mould

Reply to matt_heckman

I had considered adding some aspect of wRC+ and that’s a really interesting idea. I might try that out!

Broken Bat (Member since 2020)

Very neat and the start of some real insight. Can you list some of the players in their 2-4 years in the bigs like Ian Happ, Almora of Cubs? Maybe Quinn and Halsey of Phils? McNeil of Mets? Etc? Does a high ranking (your formula) perhaps help us to wait a little longer on breakthrough status for some guys?

Reply to Broken Bat

Yes, the model does help us wait longer for breakthrough status. I looked at a few of those players, and it turns out that McNeil had a very low prediction, which just goes to show that despite predictions there are a few diamonds in the rough. In addition, Ian Happ had a pretty high prediction, the Cubs have been patient with him, and this past year he put up good numbers.

Josh, thank you for the response to my question.

channelclemente

Does the modeling reveal nonlinearities among variables? Are the groups/variables orthogonal?

Can you go back to peek at Colin Moran? He was valued as a prospect, but his power was suspect. Now he seemingly has shown signs of developing power. How did he rank?

The model gave him a prediction of .37 which is actually very good so it seems like the model probably saw something in him that the Pirates maybe had trouble bringing out in the majors at first. He did have solid minor league numbers with the Astros so it could be an issue with the Pirates development there.

jrogers (Member since 2017)

Do you think the model has trouble with catchers in general, given the also relatively-low values for Lucroy and Grandal? Or unable to say that on such a sample size?
