Building a Hitting Prospect Projection Model

How well do you think you can predict the future of a minor leaguer? My computer may be able to help. Towards the end of the regular season, I found the prospects page at FanGraphs and started experimenting with it. I have always had a lot of fun thinking about the future and predicting outcomes, so I decided to try to build a model to predict whether or not a prospect would make it to the majors. I had all the data I needed thanks to FanGraphs, and I had recently been looking into similar models built by others to figure out how I could accomplish this project. I realized that all these articles I was reading detailed the results of their models, but not the code and behind-the-scenes work that goes into creating them.

With that in mind, I decided to figure it out on my own. I had a good idea of what statistics I wanted to use, but there were a few issues I needed to consider before I started throwing data around:

  1. Prospects can play multiple years at a single level.
  2. Not all prospects play at all levels of the minor leagues.
  3. What do I do with players who skipped levels?
  4. How can I make this model useful and practical?

Prospects playing multiple years at a single level isn’t too difficult to deal with because I can just aggregate the stats from those seasons. The fact that not all prospects play in every level of the minor leagues before reaching the majors is tough, however, because that makes for a lot of missing data that needs to be handled before building the model. I decided to replace all the missing values with the means of the existing data, and I created variables to indicate whether or not a player’s season stats for that particular level of the minor leagues were real. To make this model useful, I would want to take out certain variables. For example, I figured I wouldn’t need or want Triple-A stats included in the model because typically once a player has reached that level of the minors, you are more interested in how well they will do in the majors.
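
As a rough illustration of those two steps, here is a minimal Python sketch with made-up rows. The column layout, the level codes, and names like `real_AA` are assumptions for the example, not the actual code or data behind the model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical season rows: (player, level, plate appearances, walks).
seasons = [
    ("A", "AA", 300, 30),
    ("A", "AA", 200, 30),  # second Double-A season, pooled with the first
    ("B", "AA", 400, 20),
    ("C", "A+", 350, 35),  # never reached Double-A
]

# Step 1: pool counting stats per (player, level), then recompute the rate.
totals = defaultdict(lambda: [0, 0])
for player, level, pa, bb in seasons:
    totals[(player, level)][0] += pa
    totals[(player, level)][1] += bb
bb_rate = {key: bb / pa for key, (pa, bb) in totals.items()}

# Step 2: fill missing Double-A values with the mean of the observed ones,
# and keep an indicator recording whether the value is real (1) or imputed (0).
players = sorted({row[0] for row in seasons})
observed = [bb_rate[(p, "AA")] for p in players if (p, "AA") in bb_rate]
fill_value = mean(observed)
features = {
    p: {
        "bb_rate_AA": bb_rate.get((p, "AA"), fill_value),
        "real_AA": int((p, "AA") in bb_rate),
    }
    for p in players
}
```

Here player C receives the average Double-A walk rate of players A and B, and the `real_AA` flag lets a model tell that value apart from an observed one.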

I defined a prospect as any player under the age of 26 in the minor leagues, and I only used stats from the following minor league levels: rookie, Low-A, Single-A, High-A, and Double-A. I also needed to get rid of current prospects from my training data (because they haven’t been given the chance to get to the majors yet), so I removed the players who currently appear on the FanGraphs prospect page. I also took out players who didn’t get past rookie league, because I wasn’t nearly as interested in them and they made up a huge majority of the seasons that I collected. Side note: It would be a really cool project to take college stats and use them to project whether or not a player would make it past rookie ball, given the number of washouts who made up my data… maybe a future project.
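
The filtering rules above could be sketched in Python roughly as follows. The records, field names, and level codes are hypothetical and only meant to show the logic, not how the data was actually stored.

```python
# Hypothetical player records; field names and level codes are assumptions.
players = [
    {"name": "P1", "age": 22, "levels": {"R", "A", "AA"}},
    {"name": "P2", "age": 27, "levels": {"A", "A+"}},   # too old to count as a prospect
    {"name": "P3", "age": 20, "levels": {"R"}},         # never got past rookie ball
    {"name": "P4", "age": 21, "levels": {"A-", "A"}},   # still on the current prospect list
]
current_prospects = {"P4"}
MODEL_LEVELS = {"R", "A-", "A", "A+", "AA"}  # rookie through Double-A


def in_training_set(player):
    """Apply the age, current-prospect, and rookie-ball filters."""
    played = player["levels"] & MODEL_LEVELS
    return (
        player["age"] < 26
        and player["name"] not in current_prospects
        and bool(played - {"R"})  # reached at least one level above rookie ball
    )


training = [p["name"] for p in players if in_training_set(p)]
```

Only P1 survives all three filters in this toy example.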

I used a logistic regression model because of the binary outcome variable that I wanted to predict: Whether a player makes it to the majors or not. The predictor variables I wanted to use were age, BB%, K%, BABIP, ISO, GB%, SwStr%, and SB%. I chose these because these stats appealed to me as variables that isolate certain skills and could have predictive power. I calculated SB% a little differently than the conventional calculation, using it the way Chris Mitchell did in his article at The Hardball Times. He calculated the proportion of times that a player would attempt to steal based on opportunities given with the following formula:

SB% = (SB+CS) / (Singles + Walks + HBP)
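
As a quick worked example of that formula, here is a small Python function with invented inputs. The function name and the choice to return `None` when there are no opportunities are assumptions for the sketch.

```python
def steal_attempt_rate(sb, cs, singles, walks, hbp):
    """Share of stolen-base opportunities on which the runner attempted a steal."""
    opportunities = singles + walks + hbp
    if opportunities == 0:
        return None  # no times on first base, so the rate is undefined
    return (sb + cs) / opportunities


# Invented season line: 20 steals, 5 caught stealing, 80 singles, 40 walks, 5 HBP.
rate = steal_attempt_rate(sb=20, cs=5, singles=80, walks=40, hbp=5)
```

With these numbers the player attempted a steal on 25 of 125 opportunities, a rate of 20%.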

For the response variable (major leaguer or not major leaguer), I classified a major leaguer as a player who had 600 plate appearances or more in the bigs.
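
To make the setup concrete, the sketch below shows the 600-PA label and the general form of a logistic regression prediction in Python. The coefficients, intercept, and feature names are invented purely for illustration and are not the fitted values from the model.

```python
import math

PA_THRESHOLD = 600  # plate appearances needed to count as a major leaguer


def is_major_leaguer(career_mlb_pa):
    """Binary response: 1 if the player reached the PA threshold, else 0."""
    return int(career_mlb_pa >= PA_THRESHOLD)


def predicted_probability(features, coefficients, intercept):
    """Logistic regression: pass a linear combination through the sigmoid."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Invented coefficients and one invented prospect, for illustration only.
coefficients = {"iso_AA": 8.0, "k_rate_AA": -6.0, "real_AA": 2.0}
intercept = -4.0
prospect = {"iso_AA": 0.200, "k_rate_AA": 0.180, "real_AA": 1}
p = predicted_probability(prospect, coefficients, intercept)
```

Whatever the coefficients are, the output is always a probability between 0 and 1, which is what makes this type of model a natural fit for a yes-or-no outcome.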

Once I had these logistics figured out, I began to put the model together. I started off with every statistic listed above from each level of the minor leagues, and I slowly narrowed it down to only the significant variables in order to reduce overfitting. I used the AUC-ROC curve to evaluate the model, but because the data is so lopsided, with roughly 94% of players in my training set having “never made it to the majors,” I needed to use a different baseline than conventional AUCs for the model. The exact percentage of players who “never made it to the majors” in my training set was 93.52%, so a model that simply says none of the players ever make it would have a 93.52% accuracy. That means I was shooting for an AUC of .9352 or higher so that I could match the default model while also providing some insight on which players actually do make it to the majors.
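
Accuracy and AUC are measured on different scales, so it can help to compute both side by side. The Python sketch below uses a tiny made-up data set and a hand-rolled `auc` helper (an assumption for the example, not the evaluation code used here). It treats AUC as the chance that a randomly chosen major leaguer gets a higher score than a randomly chosen non-major leaguer.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 8 players who never made it (0) and 2 who did (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.05, 0.03, 0.10, 0.40, 0.04, 0.02, 0.35, 0.90]

# Accuracy of the default model that predicts "never makes it" for everyone.
baseline_accuracy = labels.count(0) / len(labels)

# AUC of the toy scores, and of a constant score that ranks nobody above anybody.
model_auc = auc(labels, scores)
constant_auc = auc(labels, [0.0] * len(labels))
```

On this toy data the default model is 80% accurate, the scores earn an AUC of .9375, and the constant score earns an AUC of .5 because it cannot rank any player above another.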

I ended up including the following variables/stats in the final model:

  • SB% in Low-A
  • BB% in High-A
  • ISO in High-A
  • Age in Double-A
  • K% in Double-A
  • BABIP in Double-A
  • ISO in Double-A
  • SB% in Double-A
  • “Real or Not” in Rookie
  • “Real or Not” in Double-A

These variables make sense given that most of them come from the higher levels of the minors. If you get to Double-A, there’s a much greater chance that you get to Triple-A or even go straight to the majors. It of course also bodes very well if you do well at Double-A, since this is much closer to playing against major league competition. I wasn’t expecting GB% and SwStr% to be hugely significant predictors, because most of their information is already captured by BABIP and K%.

In my training set I ended up with an AUC of .952.

[Figure: ROC curve, training set]

In the test set I ended up with an even better AUC of .955.

[Figure: ROC curve, test set]

At the beginning of this project, I didn’t have much experience with the AUC-ROC curve. The only method of evaluating a logistic model that I had known of was the confusion matrix, which I now know is not always helpful, especially when the model’s continuous predicted probability is more informative than a single yes-or-no cutoff. In addition, when my AUC first appeared very high, I was surprised because I didn’t think I’d be able to build such a strong model that simply. I looked into it more and found that the reason it was so high is that the default model that predicts all prospects to fail would be extremely accurate, so unfortunately the high AUC was not because I was a natural modeling guru. The imbalance between those who make the majors and those who don’t in my training set is so lopsided that it made my model seem extremely effective right off the bat, before I had even taken out any variables.
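
One way to see the limitation of the confusion matrix is that it changes with the cutoff chosen to turn probabilities into yes-or-no calls. The Python sketch below uses made-up labels and probabilities, and the helper name is an assumption for the example.

```python
def confusion_matrix(labels, probs, threshold):
    """Count true/false positives and negatives at a given probability cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(labels, probs):
        predicted = int(p >= threshold)
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Toy data: 8 players who never made it (0) and 2 who did (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.01, 0.02, 0.05, 0.03, 0.10, 0.40, 0.04, 0.02, 0.35, 0.90]

at_half = confusion_matrix(labels, probs, threshold=0.5)
at_third = confusion_matrix(labels, probs, threshold=0.3)
```

The same predictions miss one of the two major leaguers at a cutoff of .5 but catch both (with one false alarm) at .3, which is why a threshold-free summary like the ROC curve can be more informative.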

Having 50 variables and an accuracy that appears so high is a tricky illusion, and in the case of a model like this, it leads to overfitting. I knew from the beginning that I would need to reduce the variables in the model, but it was very interesting to see the results of the overfitting appear in the data. With my overfitted predictions, there was a large difference between the predictions for those who made it to the majors and those who didn’t. This resulted in a much lower AUC in the test set and made for a poor model. While playing around with it, I at some points found the model ending up with an AUC of 1, which meant perfect prediction in the training set and isn’t really useful on any other data.

Let’s take a look at the data. Here are the predictions for the top 25 players in WAR between 2006 and 2019:

Prospect Model for Top 25 WAR Players, 2006-19

Rank  Name               Prediction  WAR
1     Mike Trout         .99336898   76.0
2     Buster Posey       .01242905   52.7
3     Andrew McCutchen   .86624516   49.7
4     Ryan Braun         .8791191    43.9
5     Josh Donaldson     .31081991   42.3
6     Paul Goldschmidt   .7544491    41.4
7     Mookie Betts       .99238981   40.2
8     Giancarlo Stanton  .96553409   39.7
9     Freddie Freeman    .86814365   38.0
10    Brett Gardner      .62614691   37.6
11    Jonathan Lucroy    .24545985   37.0
12    Justin Upton       .99592404   36.8
13    Bryce Harper       .98098739   36.7
14    Manny Machado      .96577834   35.8
15    Jose Altuve        .97283983   35.2
16    Yasmani Grandal    .60096726   34.3
17    Christian Yelich   .88193566   34.3
18    Jason Heyward      .99870464   32.9
19    Kyle Seager        .59719139   32.3
20    Nolan Arenado      .7804618    32.2
21    Anthony Rizzo      .87364824   30.3
22    Jacoby Ellsbury    .88251289   30.3
23    Matt Carpenter     .20256203   30.2
24    Lorenzo Cain       .19193733   29.0
25    Francisco Lindor   .95153247   28.9

The most obvious thing to point out is Buster Posey’s abysmal prediction of around a 1% chance to make it to the show. This makes sense given his very short minor league career, which consisted of about 10 games in rookie ball and Low-A combined before 80 games in High-A. He went to Triple-A for a little bit (which my model doesn’t take into account) and then went to San Francisco after just a couple of seasons in the minors. Despite the fact that he’s a former MVP and three-time World Series champ, I count this as a win for the model: in general, guys who play very few games in the lower minors and don’t play Double-A probably aren’t going to make it to the majors, and Posey was a true outlier and top prospect. This is a place where it might help to factor college stats into the model so that it could recognize standouts like Posey who rush to the majors.

Another thing to note is that, just as the AUC needed a baseline because of the lopsided results, the predictions need a baseline too. We need to understand that the average minor league player’s chance to make it to the big leagues is around 10%. That means that those who have predictions far above that are likely to be very, very good.
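
One simple way to read a prediction against that baseline is to divide it by the base rate. The Python sketch below does that for two predictions from the table above, using the rough 10% figure. The function name and the idea of reporting a ratio are assumptions for illustration, not part of the model.

```python
BASE_RATE = 0.10  # rough share of minor leaguers who reach the majors, per the text


def lift_over_base_rate(prediction, base_rate=BASE_RATE):
    """How many times more likely than the average minor leaguer a player is rated."""
    return prediction / base_rate


# Two predictions taken from the top-25 WAR table.
predictions = {"Mike Trout": 0.99336898, "Buster Posey": 0.01242905}
lifts = {name: lift_over_base_rate(p) for name, p in predictions.items()}
```

By this measure Trout was rated nearly 10 times as likely as the average minor leaguer to make it, while Posey was rated at roughly one-eighth of the average.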

Next I took the top 500 prospects or so from FanGraphs and applied the model to them. The top 50 predictions from the model are below. There are a few names you might notice are missing from the top predictions, including players like Wander Franco, who simply hasn’t had enough playing time at levels the model likes the most, such as Double-A. We may never even see him play at Double-A because of this past minor league season’s cancellation and the possibility he is called up early in 2021.

Top 50 Prospect Model Projections

Rank  Name              Age   wRC+        Prediction
1     Jarred Kelenic    20.3  142.958722  .9778224
2     Luis Robert       22.2  150.628621  .94175248
3     Isaac Paredes     20.7  132.283928  .94124406
4     Jo Adell          20.6  137.274165  .93936551
5     Nick Madrigal     22.6  119.707385  .93719679
6     Dylan Carlson     21.0  130.198868  .93300778
7     Andrés Giménez    21.2  110.71173   .90559978
8     Gavin Lux         21.9  156.305912  .89909189
9     Vidal Bruján      21.7  134.759559  .89442025
10    Keibert Ruiz      21.3  94.7038924  .88379281
11    Daulton Varsho    23.3  145.465397  .8596463
12    Taylor Walls      23.3  133.86938   .83855902
13    Brendan Rodgers   23.2  123.813903  .83243674
14    Jorge Mateo       24.4  80.0309606  .82703768
15    Heliot Ramos      20.2  119.050502  .80713648
16    Luis Garcia       19.5  95.1615171  .79559999
17    Cristian Pache    21.0  114.751941  .78913128
18    Yusniel Diaz      23.1  135.15202   .76919141
19    Drew Waters       20.8  132.118829  .757519
20    Mauricio Dubón    25.3  103.601284  .75656152
21    Alec Bohm         23.2  146.943753  .74234879
22    Austin Hays       24.3  92.4417251  .73148647
23    Anthony Alford    25.3  94.0627314  .72035194
24    Leody Taveras     21.1  95.1196241  .71424263
25    Carter Kieboom    22.2  124.636012  .71109895
26    Oneil Cruz        21.1  137.430851  .70463246
27    Jonathan Araúz    21.2  98.2112133  .70432122
28    Lucius Fox        22.3  99.4395713  .69819464
29    Abraham Toro      22.9  137.205164  .6836528
30    Jason Martin      24.2  99.7303846  .68025898
31    Luis Barrera      24.0  122.854608  .67851418
32    Joey Bart         22.9  140.693228  .67710447
33    Khalil Lee        21.3  117.048416  .66450715
34    Royce Lewis       20.4  111.059658  .64479302
35    Ryan Mountcastle  22.7  118.774006  .64439561
36    Ke’Bryan Hayes    22.8  110.540248  .64018493
37    Brandon Marsh     21.9  118.81056   .61958271
38    Thairo Estrada    23.7  74.5093524  .61888417
39    Luis Santana      20.3  126.071404  .61490306
40    Daz Cameron       22.8  98.7950409  .60065321
41    Jorge Oña         22.8  103.290552  .59374655
42    Josh Lowe         21.7  113.863437  .58122182
43    Randy Arozarena   24.7  133.414235  .56952746
44    Yonny Hernandez   21.5  118.231353  .56648887
45    Domingo Leyba     24.1  107.683533  .5605087
46    Alex Kirilloff    22.0  150.321047  .55860194
47    Omar Estévez      21.7  113.840938  .53150457
48    Sheldon Neuse     24.9  99.6412136  .52316645
49    Connor Wong       23.5  131.336263  .50716012
50    Jahmai Jones      22.2  93.252696   .49505114

It’s really cool to see this kind of thing work in a fortune-telling way. Predicting the future is fascinating to me, and even more satisfying when predicted correctly. Many of these players have already been brought up to the majors and have performed well. Others haven’t performed as well but are destined to turn it around.

Overall, I was very happy with this project because it wasn’t straightforward and I had to work through a few problems along the way. I learned about the different options that I could use when I encounter missing data, like the missing seasons of minor leaguers, and I also know more about how to evaluate logistic models. In addition, I now have a handy dandy tool for evaluating minor leaguers’ potential success in the future.





Sophomore Computer Science and Statistics double major at Villanova. I am a Red Sox diehard. I have used R to analyze baseball stats and am also proficient in Java and familiar with Python.

10 Comments
Ddog
3 years ago

Joshua, very impressive use of statistics! Keep up the good work and go Wildcats!

matt_heckman
3 years ago

This work is amazing! Have you considered adding an extra step in to find somewhat of the expected value for the wRC+ of each player? Possibly taking the %chance given by the model of making it to the bigs and multiplying that by the estimated wRC+?

Broken Bat
3 years ago

Very neat and the start of some real insight. Can you list some of the players in their 2-4 years in the bigs like Ian Happ, Almora of Cubs? Maybe Quinn and Halsey of Phils? McNeil of Mets? Etc.? Does a high ranking (by your formula) perhaps help us to wait a little longer on breakthrough status for some guys?

Broken Bat
3 years ago

Josh, thank you for the response to my question.

channelclemente
3 years ago

Does the modeling reveal nonlinearities among variables? Are the groups/variables orthogonal?

Broken Bat
3 years ago

Can you go back to peek at Colin Moran? He was valued as a prospect but his power was suspect. Now he seemingly has shown signs of developing power. How did he rank?

jrogers
3 years ago

Do you think the model has trouble with catchers in general, given the also relatively low values for Lucroy and Grandal? Or is it impossible to say with such a small sample size?