Building a Hitting Prospect Projection Model
How well do you think you can predict the future of a minor leaguer? My computer may be able to help. Towards the end of the regular season, I found the prospects page at FanGraphs and started experimenting with it. I have always had a lot of fun thinking about the future and predicting outcomes, so I decided to try to build a model to predict whether or not a prospect would make it to the majors. I had all the data I needed thanks to FanGraphs, and I had recently been looking into similar models built by others to figure out how I could accomplish this project. I realized that all these articles I was reading detailed the results of their models, but not the code and behind-the-scenes work that goes into creating them.
With that in mind, I decided to figure it out on my own. I had a good idea of what statistics I wanted to use, but there were a few issues I needed to consider before I started throwing data around:
- Prospects can play multiple years at a single level.
- Not all prospects play at all levels of the minor leagues.
- What do I do with players who skipped levels?
- How can I make this model useful and practical?
Prospects playing multiple years at a single level isn’t too difficult to deal with: I can simply aggregate the stats from those seasons. The fact that not all prospects play at every level of the minors before reaching the majors is tougher, because it creates a lot of missing data that has to be handled before building the model. I decided to replace all the missing values with the means of the existing data, and I created “real or not” indicator variables flagging whether a player’s stats at a given level were actually recorded rather than imputed. To keep the model practical, I also wanted to exclude certain variables. For example, I figured I wouldn’t need Triple-A stats in the model, because once a player has reached that level of the minors, you’re typically more interested in how well he will do in the majors.
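My work was done in R, but the mean-imputation-plus-indicator idea is easy to sketch. Here is a minimal Python/pandas version, where `iso_high_a` is a hypothetical stand-in for one of the per-level stats:

```python
import pandas as pd

# Toy data: one row per player; High-A ISO is missing for a player
# who never appeared at that level (column names are illustrative)
df = pd.DataFrame({
    "player": ["A", "B", "C"],
    "iso_high_a": [0.150, None, 0.210],
})

# "Real or not" indicator: did the player actually post stats here?
df["real_high_a"] = df["iso_high_a"].notna().astype(int)

# Replace missing values with the mean of the observed data
df["iso_high_a"] = df["iso_high_a"].fillna(df["iso_high_a"].mean())
```

The indicator column lets the model learn that an imputed (league-average) stat line carries different information than a real one.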
I defined a prospect as any minor leaguer under the age of 26, and I only used stats from the following levels: rookie, Low-A, Single-A, High-A, and Double-A. I also needed to remove current prospects from my training data (they haven’t yet had the chance to reach the majors), so I dropped the names currently listed on the FanGraphs prospect page. I also took out players who never got past rookie ball, both because I wasn’t nearly as interested in them and because they made up a huge majority of the seasons I collected. Side note: it would be a really cool project to take college stats and use them to project whether a player would make it past rookie ball, given the number of washouts in my data… maybe a future project.
I used a logistic regression model because of the binary outcome variable that I wanted to predict: Whether a player makes it to the majors or not. The predictor variables I wanted to use were age, BB%, K%, BABIP, ISO, GB%, SwStr%, and SB%. I chose these because these stats appealed to me as variables that isolate certain skills and could have predictive power. I calculated SB% a little differently than the conventional calculation, using it the way Chris Mitchell did in his article at The Hardball Times. He calculated the proportion of times that a player would attempt to steal based on opportunities given with the following formula:
SB% = (SB+CS) / (Singles + Walks + HBP)
For the response variable (major leaguer or not), I classified a major leaguer as any player with 600 or more plate appearances in the bigs.
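Both definitions are simple enough to write as functions. A quick Python translation of the two formulas above (the actual project was built in R):

```python
def sb_rate(sb, cs, singles, walks, hbp):
    """Chris Mitchell's attempt rate: (SB + CS) / (singles + walks + HBP)."""
    opportunities = singles + walks + hbp
    return (sb + cs) / opportunities if opportunities else 0.0

def is_major_leaguer(career_mlb_pa):
    """Response variable: 1 if the player logged 600+ MLB plate appearances."""
    return int(career_mlb_pa >= 600)
```

Note that `sb_rate` measures how often a player *attempts* to steal given a chance to, which captures aggressiveness as well as raw success.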
Once I had these logistics figured out, I began to put the model together. I started with every statistic listed above from each level of the minors and slowly narrowed it down to only the significant variables, which also reduces overfitting. I used the ROC curve and its AUC to evaluate the model, but because the data is so lopsided, with nearly 95% of players in my training set having never made the majors, I needed a different baseline than usual. The exact percentage of players in my training set who never made it to the majors was 93.52%, so a model that simply predicts that no player ever makes it would be 93.52% accurate. That meant I was shooting for an AUC of .9352 or higher, so that the model would beat that trivial baseline while also providing some insight into which players actually do make it.
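As a rough illustration of this workflow (my actual model was fit in R on the FanGraphs data; the features and class balance below are synthetic stand-ins), fitting a logistic regression and scoring it by AUC looks something like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the prospect table: heavy class imbalance
# (most players never make the majors) and two informative features
n = 2000
X = rng.normal(size=(n, 2))
logit = -3.2 + 1.5 * X[:, 0] + 0.8 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# AUC scores the ranking of predicted probabilities, so unlike raw
# accuracy it isn't flattered by an always-predict-"no" baseline
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

The key point is to evaluate on held-out data with a metric that uses the predicted probabilities, not just the hard 0/1 classifications.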
I ended up including the following variables/stats in the final model:
- SB% in Low-A
- BB% in High-A
- ISO in High-A
- Age in Double-A
- K% in Double-A
- BABIP in Double-A
- ISO in Double-A
- SB% in Double-A
- “Real or Not” in Rookie
- “Real or Not” in Double-A
These variables make sense given that the representation comes mostly from the higher levels of the minors. If you reach Double-A, there’s a much greater chance you reach Triple-A or even jump straight to the majors, and of course performing well there bodes very well, since Double-A is much closer to major league competition. I also wasn’t expecting GB% and SwStr% to be hugely significant predictors, because the information they carry is largely captured by BABIP and K%.
In my training set I ended up with an AUC of .952 and the following plot of the AUC-ROC curve:
In the test set I ended up with an even better AUC of .955 and the following plot of the AUC-ROC curve:
At the beginning of this project, I didn’t have much experience with ROC curves. The only method of evaluating a logistic model I had known was the confusion matrix, which I now know is not always helpful, especially when the predicted probabilities matter more than a single hard classification. When my AUC first came out very high, I was surprised, because I didn’t think I’d be able to build such a strong model that easily. Looking into it more, I found it was so high because even the default model that predicts every prospect to fail would be extremely accurate, so unfortunately the high number was not because I was a natural modeling guru. The class imbalance between those who make the majors and those who don’t is so lopsided that it makes the model look extremely effective right off the bat, before I had even removed any variables.
Having 50 variables and an accuracy that appears so high is a tricky illusion, and in a model like this it leads to overfitting. I knew from the beginning that I would need to trim the variable list, but it was very interesting to watch the overfitting show up in the data. The overfit model separated the players who made the majors from those who didn’t almost perfectly on the training set, but that separation didn’t carry over, resulting in a much lower AUC on the test set and a poor model. At some points while playing around with it, I even got a training AUC of 1, meaning perfect prediction on the training set and essentially no usefulness on any other data.
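The illusion is easy to reproduce: pad a model with noise variables and the training AUC balloons while the test AUC lags behind. A small synthetic sketch of that effect (not my actual data or features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# 2 informative features padded with 48 pure-noise columns,
# mimicking a 50-variable model on a modest sample
n = 300
signal = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 48))
X = np.hstack([signal, noise])
logit = -1.5 + 1.2 * signal[:, 0] + 0.9 * signal[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Weak regularization (large C) lets the noise columns be "learned"
full = LogisticRegression(C=1e4, max_iter=5000).fit(X_tr, y_tr)

auc_train = roc_auc_score(y_tr, full.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])
# auc_train typically comes out far higher than auc_test here
```

Dropping the 48 noise columns (or keeping only significant variables, as I did) shrinks that train/test gap considerably.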
Let’s take a look at the data. Here are the model’s predictions for the top 25 players by WAR between 2006 and 2019:
Rank | Name | Prediction | WAR |
---|---|---|---|
1 | Mike Trout | .99336898 | 76.0 |
2 | Buster Posey | .01242905 | 52.7 |
3 | Andrew McCutchen | .86624516 | 49.7 |
4 | Ryan Braun | .8791191 | 43.9 |
5 | Josh Donaldson | .31081991 | 42.3 |
6 | Paul Goldschmidt | .7544491 | 41.4 |
7 | Mookie Betts | .99238981 | 40.2 |
8 | Giancarlo Stanton | .96553409 | 39.7 |
9 | Freddie Freeman | .86814365 | 38.0 |
10 | Brett Gardner | .62614691 | 37.6 |
11 | Jonathan Lucroy | .24545985 | 37.0 |
12 | Justin Upton | .99592404 | 36.8 |
13 | Bryce Harper | .98098739 | 36.7 |
14 | Manny Machado | .96577834 | 35.8 |
15 | Jose Altuve | .97283983 | 35.2 |
16 | Yasmani Grandal | .60096726 | 34.3 |
17 | Christian Yelich | .88193566 | 34.3 |
18 | Jason Heyward | .99870464 | 32.9 |
19 | Kyle Seager | .59719139 | 32.3 |
20 | Nolan Arenado | .7804618 | 32.2 |
21 | Anthony Rizzo | .87364824 | 30.3 |
22 | Jacoby Ellsbury | .88251289 | 30.3 |
23 | Matt Carpenter | .20256203 | 30.2 |
24 | Lorenzo Cain | .19193733 | 29.0 |
25 | Francisco Lindor | .95153247 | 28.9 |
The most obvious thing to point out is Buster Posey’s abysmal prediction: roughly a 1% chance to make the show. This makes sense given his very short minor league career, which consisted of about 10 games in rookie ball and Low-A combined, then 80 games in High-A. He spent a little time in Triple-A (which my model doesn’t take into account) and then went to San Francisco after just a couple of seasons in the minors. Even though he’s a former MVP and three-time World Series champ, I’d count this as a win for the model: in general, guys who play very few games in the lower minors and never play Double-A probably aren’t going to make the majors, and Posey was a true outlier and top prospect. This is a place where factoring college stats into a model might help it spot standouts like Posey who rush to the majors.
Another thing to note is that just like there needed to be a baseline for the AUC because of the lopsided results, there also needs to be a baseline for the predictions. We need to understand that the average minor league player’s chance to make it to the big leagues is around 10%. That means that those who have predictions far above that are likely to be very, very good.
Next I took the top 500 prospects or so from FanGraphs and applied the model to them. The top 50 predictions from the model are below. There are a few names you might notice are missing from the top predictions, including players like Wander Franco, who simply hasn’t had enough playing time at levels the model likes the most, such as Double-A. We may never even see him play at Double-A because of this past minor league season’s cancellation and the possibility he is called up early in 2021.
Rank | Name | Age | wRC+ | Prediction |
---|---|---|---|---|
1 | Jarred Kelenic | 20.3 | 142.958722 | .9778224 |
2 | Luis Robert | 22.2 | 150.628621 | .94175248 |
3 | Isaac Paredes | 20.7 | 132.283928 | .94124406 |
4 | Jo Adell | 20.6 | 137.274165 | .93936551 |
5 | Nick Madrigal | 22.6 | 119.707385 | .93719679 |
6 | Dylan Carlson | 21.0 | 130.198868 | .93300778 |
7 | Andrés Giménez | 21.2 | 110.71173 | .90559978 |
8 | Gavin Lux | 21.9 | 156.305912 | .89909189 |
9 | Vidal Bruján | 21.7 | 134.759559 | .89442025 |
10 | Keibert Ruiz | 21.3 | 94.7038924 | .88379281 |
11 | Daulton Varsho | 23.3 | 145.465397 | .8596463 |
12 | Taylor Walls | 23.3 | 133.86938 | .83855902 |
13 | Brendan Rodgers | 23.2 | 123.813903 | .83243674 |
14 | Jorge Mateo | 24.4 | 80.0309606 | .82703768 |
15 | Heliot Ramos | 20.2 | 119.050502 | .80713648 |
16 | Luis Garcia | 19.5 | 95.1615171 | .79559999 |
17 | Cristian Pache | 21.0 | 114.751941 | .78913128 |
18 | Yusniel Diaz | 23.1 | 135.15202 | .76919141 |
19 | Drew Waters | 20.8 | 132.118829 | .757519 |
20 | Mauricio Dubón | 25.3 | 103.601284 | .75656152 |
21 | Alec Bohm | 23.2 | 146.943753 | .74234879 |
22 | Austin Hays | 24.3 | 92.4417251 | .73148647 |
23 | Anthony Alford | 25.3 | 94.0627314 | .72035194 |
24 | Leody Taveras | 21.1 | 95.1196241 | .71424263 |
25 | Carter Kieboom | 22.2 | 124.636012 | .71109895 |
26 | Oneil Cruz | 21.1 | 137.430851 | .70463246 |
27 | Jonathan Araúz | 21.2 | 98.2112133 | .70432122 |
28 | Lucius Fox | 22.3 | 99.4395713 | .69819464 |
29 | Abraham Toro | 22.9 | 137.205164 | .6836528 |
30 | Jason Martin | 24.2 | 99.7303846 | .68025898 |
31 | Luis Barrera | 24.0 | 122.854608 | .67851418 |
32 | Joey Bart | 22.9 | 140.693228 | .67710447 |
33 | Khalil Lee | 21.3 | 117.048416 | .66450715 |
34 | Royce Lewis | 20.4 | 111.059658 | .64479302 |
35 | Ryan Mountcastle | 22.7 | 118.774006 | .64439561 |
36 | Ke’Bryan Hayes | 22.8 | 110.540248 | .64018493 |
37 | Brandon Marsh | 21.9 | 118.81056 | .61958271 |
38 | Thairo Estrada | 23.7 | 74.5093524 | .61888417 |
39 | Luis Santana | 20.3 | 126.071404 | .61490306 |
40 | Daz Cameron | 22.8 | 98.7950409 | .60065321 |
41 | Jorge Oña | 22.8 | 103.290552 | .59374655 |
42 | Josh Lowe | 21.7 | 113.863437 | .58122182 |
43 | Randy Arozarena | 24.7 | 133.414235 | .56952746 |
44 | Yonny Hernandez | 21.5 | 118.231353 | .56648887 |
45 | Domingo Leyba | 24.1 | 107.683533 | .5605087 |
46 | Alex Kirilloff | 22.0 | 150.321047 | .55860194 |
47 | Omar Estévez | 21.7 | 113.840938 | .53150457 |
48 | Sheldon Neuse | 24.9 | 99.6412136 | .52316645 |
49 | Connor Wong | 23.5 | 131.336263 | .50716012 |
50 | Jahmai Jones | 22.2 | 93.252696 | .49505114 |
It’s really cool to see this kind of thing work in a fortune-telling way. Predicting the future is fascinating to me, and even more satisfying when I get it right. Many of these players have already been called up to the majors and have performed well. Others haven’t performed as well yet, but may still turn it around.
Overall, I was very happy with this project because it wasn’t straightforward and I had to work through a few challenges along the way. I learned about the different options for handling missing data, like the missing seasons of minor leaguers, and I now know more about how to evaluate logistic models. In addition, I have a handy-dandy tool for evaluating minor leaguers’ potential for future success.
Sophomore Computer Science and Statistics double major at Villanova. I am a Red Sox Diehard. I have used R to analyze baseball stats and am also proficient in Java and familiar with Python.
Joshua, very impressive use of statistics! Keep up the good work and go Wildcats!
This work is amazing! Have you considered adding an extra step in to find somewhat of the expected value for the wRC+ of each player? Possibly taking the %chance given by the model of making it to the bigs and multiplying that by the estimated wRC+?
I had considered adding some aspect of wRC+ and that’s a really interesting idea. I might try that out!
Very neat, and the start of some real insight. Can you list some of the players in their 2-4 years in the bigs, like Ian Happ and Almora of the Cubs? Maybe Quinn and Halsey of the Phils? McNeil of the Mets? Does a high ranking (your formula) perhaps help us wait a little longer on breakthrough status for some guys?
Yes, the model does help us wait longer for breakthrough status. I looked at a few of those players, and it turns out that McNeil had a very low prediction, which just goes to show that despite the predictions there are a few diamonds in the rough. Ian Happ, on the other hand, had a pretty high prediction; the Cubs have been patient with him, and this past year he put up good numbers.
Josh, thank you for the response to my question.
Does the modeling reveal nonlinearities among variables? Are the groups/variables orthogonal?
Can you go back and take a peek at Colin Moran? He was valued as a prospect, but his power was suspect. Now he seemingly has shown signs of developing power. How did he rank?
The model gave him a prediction of .37, which is actually very good, so it seems the model saw something in him that the Pirates maybe had trouble bringing out in the majors at first. He did have solid minor league numbers with the Astros, so it could be a development issue on the Pirates’ side.
Do you think the model has trouble with catchers in general, given the also relatively-low values for Lucroy and Grandal? Or unable to say that on such a sample size?