Building a Hitting Prospect Projection Model
By Joshua Mould
February 1, 2021

How well do you think you can predict the future of a minor leaguer? My computer may be able to help.

Towards the end of the regular season, I found the prospects page at FanGraphs and started experimenting with it. I have always had a lot of fun thinking about the future and predicting outcomes, so I decided to try to build a model to predict whether or not a prospect would make it to the majors. I had all the data I needed thanks to FanGraphs, and I had recently been looking into similar models built by others to figure out how I could accomplish this project. I realized that all the articles I was reading detailed the results of their models, but not the code and behind-the-scenes work that goes into creating them. With that in mind, I decided to figure it out on my own.

I had a good idea of what statistics I wanted to use, but there were a few issues I needed to consider before I started throwing data around:

- Prospects can play multiple years at a single level.
- Not all prospects play at every level of the minor leagues.
- What do I do with players who skipped levels?
- How can I make this model useful and practical?

Prospects playing multiple years at a single level isn't too difficult to deal with, because I can simply aggregate the stats from those seasons. The fact that not all prospects play at every level of the minor leagues before reaching the majors is tougher, however, because it creates a lot of missing data that needs to be handled before building the model. I decided to replace all the missing values with the means of the existing data, and I created indicator variables that record whether or not a player's season stats for each level of the minor leagues were real or imputed. To make this model useful, I would also want to leave certain variables out.
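The imputation step described above can be sketched in a few lines of pandas. The article doesn't show its code, so the column names and the tiny two-column example here are hypothetical; in the real dataset there would be one such pair of columns for every stat at every level.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per prospect, one column per
# level/stat combination. NaN marks a level the player never reached.
df = pd.DataFrame({
    "AA_ISO":   [0.150, np.nan, 0.210],
    "AA_BABIP": [0.310, np.nan, 0.290],
})

for col in ["AA_ISO", "AA_BABIP"]:
    # "Real or not" indicator: 1 if the season stats were observed,
    # 0 if they will be imputed below.
    df[col + "_real"] = df[col].notna().astype(int)
    # Replace missing values with the mean of the observed data.
    df[col] = df[col].fillna(df[col].mean())
```

The indicator columns are what lets the model distinguish a genuinely average Double-A season from a season that never happened and was filled in with the mean.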
For example, I figured I wouldn't need or want Triple-A stats in the model, because once a player has reached that level of the minors, you are typically more interested in how well they will do in the majors. I defined a prospect as any player under the age of 26 in the minor leagues, and I only used stats from the following minor league levels: rookie, Low-A, Single-A, High-A, and Double-A. I also needed to remove current prospects from my training data (because they haven't yet been given the chance to reach the majors), so I dropped every name currently on the FanGraphs prospect page. I also took out players who never got past rookie league, both because I wasn't nearly as interested in them and because they made up a large majority of the seasons I collected. Side note: given the number of washouts in my data, it would be a really cool project to take college stats and use them to project whether or not a player would make it past rookie ball... maybe a future project.

I used a logistic regression model because the outcome variable I wanted to predict is binary: whether or not a player makes it to the majors. The predictor variables I wanted to use were age, BB%, K%, BABIP, ISO, GB%, SwStr%, and SB%. I chose these because they appealed to me as stats that isolate specific skills and could have predictive power. I calculated SB% a little differently than the conventional way, following Chris Mitchell's approach in his article at The Hardball Times. He calculated the proportion of opportunities on which a player attempted to steal with the following formula:

SB% = (SB + CS) / (Singles + Walks + HBP)

For the response variable (major leaguer or not), I classified a major leaguer as any player with 600 or more plate appearances in the bigs. Once I had these logistics figured out, I began to put the model together.
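The two definitions above, the steal-attempt rate and the 600-PA response label, can be written as small helper functions. The function and argument names here are hypothetical, since the article doesn't show its code:

```python
def sb_rate(sb, cs, singles, bb, hbp):
    """Steal-attempt rate per opportunity, following Chris Mitchell's
    formula: (SB + CS) / (Singles + Walks + HBP)."""
    opportunities = singles + bb + hbp
    return (sb + cs) / opportunities if opportunities else 0.0

def is_major_leaguer(mlb_pa):
    """Response variable: 1 if the player reached 600+ MLB plate
    appearances, else 0."""
    return int(mlb_pa >= 600)
```

For example, a player with 20 steals and 5 caught-stealings across 100 times on first base (singles plus walks plus HBP) gets an SB% of 0.25.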
I started off with every statistic listed above from each level of the minor leagues, and I slowly narrowed the list down to only the significant variables, in order to reduce overfitting. I evaluated the model with the ROC curve and its AUC, but the data is lopsided: 93.52% of the players in my training set never made it to the majors. That means a model that simply predicts that no player ever makes it would already be 93.52% accurate, so raw accuracy is nearly meaningless here. I was shooting for a model that at least matched that naive baseline while also providing some insight into which players actually do make it to the majors.

I ended up including the following variables/stats in the final model:

- SB% in Low-A
- BB% in High-A
- ISO in High-A
- Age in Double-A
- K% in Double-A
- BABIP in Double-A
- ISO in Double-A
- SB% in Double-A
- "Real or Not" in Rookie
- "Real or Not" in Double-A

These variables make sense given that the higher levels of the minors dominate the list. If you get to Double-A, there's a much greater chance that you reach Triple-A or even go straight to the majors. It also of course bodes very well if you perform well at Double-A, since that is much closer to major league competition. I also wasn't expecting GB% and SwStr% to be hugely significant predictors, because their information is largely captured by BABIP and K%.

In my training set I ended up with an AUC of .952, and in the test set an even better AUC of .955; the ROC curves for both are plotted below. At the beginning of this project, I didn't have much experience with ROC curves.
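The fit-and-score loop described above can be sketched as follows. The article doesn't name its tooling, so scikit-learn is an assumption here, and the data is synthetic with a similarly lopsided positive rate rather than the real prospect stats:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the prospect data: eight predictor stats,
# a minority positive class ("made the majors").
n = 4000
X = rng.normal(size=(n, 8))
true_coefs = np.array([1.2, 0.8, 0.0, 0.0, 0.5, 0.0, 0.0, -0.7])
logits = X @ true_coefs - 3.0                     # intercept keeps positives rare
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Hold out a test set (stratified so both classes appear in each split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# AUC rewards correct *ranking* of players by probability, not just
# agreement with the dominant "never made it" class.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Variable selection then amounts to refitting with subsets of the columns and keeping only the predictors that remain significant while the test AUC holds up.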
The only method of evaluating a logistic model that I had known of was the confusion matrix, which I now know is not always helpful, especially when the continuous predicted probabilities carry more information than a single yes/no cutoff. In addition, when my AUC first came out very high, I was surprised, because I didn't think I'd be able to build such a strong model that easily. I looked into it more and realized that the imbalance in the data flatters the numbers: a default model that predicts every prospect to fail is already extremely accurate, so the strong-looking results were unfortunately not a sign that I was a natural modeling guru. The split between those who make the majors and those who don't is so lopsided that my model appeared extremely effective right off the bat, before I had even taken out any variables. With 50 variables in play, that apparently high performance is a tricky illusion, and in a model like this it leads to overfitting. I knew from the beginning that I would need to reduce the number of variables, but it was very interesting to see the overfitting show up in the data: the overfitted model separated the eventual major leaguers from everyone else sharply in the training set, but that separation didn't carry over, resulting in a much lower AUC on the test set and a poor model. At some points while playing around with it, I even ended up with a training AUC of 1, meaning perfect prediction on the training set, which isn't really useful on any other data. Let's take a look at the data.
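To make the accuracy illusion concrete, here is a tiny simulation (the numbers are generated, not from the article): a "model" that assigns every prospect the same low probability is highly accurate on data this imbalanced, yet its AUC is exactly 0.5, because it cannot rank players at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 10,000 simulated prospects, ~6.5% of whom reach the majors,
# mirroring the imbalance described in the article.
rng = np.random.default_rng(1)
y = (rng.random(10_000) < 0.0648).astype(int)

# A trivial "model": the same tiny probability for everyone.
constant_prob = np.full(y.shape, 0.01)

# Accuracy looks great, since predicting "fails" matches the base rate...
acc = accuracy_score(y, (constant_prob >= 0.5).astype(int))

# ...but the AUC exposes it: with no variation in the scores there is
# no ranking, and the ROC curve is the diagonal.
auc = roc_auc_score(y, constant_prob)
```

This is why the 93.52% accuracy baseline matters, and why AUC (with that baseline in mind) is a more honest way to judge this model.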
Here are the predictions for the top 25 players in WAR between 2006 and 2019:

Prospect Model for Top 25 WAR Players, 2006-19

Rank  Name               Prediction  WAR
 1    Mike Trout         .99336898   76.0
 2    Buster Posey       .01242905   52.7
 3    Andrew McCutchen   .86624516   49.7
 4    Ryan Braun         .8791191    43.9
 5    Josh Donaldson     .31081991   42.3
 6    Paul Goldschmidt   .7544491    41.4
 7    Mookie Betts       .99238981   40.2
 8    Giancarlo Stanton  .96553409   39.7
 9    Freddie Freeman    .86814365   38.0
10    Brett Gardner      .62614691   37.6
11    Jonathan Lucroy    .24545985   37.0
12    Justin Upton       .99592404   36.8
13    Bryce Harper       .98098739   36.7
14    Manny Machado      .96577834   35.8
15    Jose Altuve        .97283983   35.2
16    Yasmani Grandal    .60096726   34.3
17    Christian Yelich   .88193566   34.3
18    Jason Heyward      .99870464   32.9
19    Kyle Seager        .59719139   32.3
20    Nolan Arenado      .7804618    32.2
21    Anthony Rizzo      .87364824   30.3
22    Jacoby Ellsbury    .88251289   30.3
23    Matt Carpenter     .20256203   30.2
24    Lorenzo Cain       .19193733   29.0
25    Francisco Lindor   .95153247   28.9

The most obvious thing to point out is Buster Posey's abysmal prediction: roughly a 1% chance to make it to the show. This makes sense given his very short minor league career before reaching the majors, which consisted of about 10 games in rookie ball and Low-A combined, followed by 80 games in High-A. He spent a little time in Triple-A (which my model doesn't take into account) and then went to San Francisco after just a couple of seasons in the minors. Even though he's a former MVP and three-time World Series champ, I'd call this a win for the model: in general, players who log very few games in the lower minors and never play Double-A probably aren't going to make the majors, and Posey was a true outlier and a top prospect. This is a place where factoring college stats into the model might help it recognize standouts like Posey who rush to the majors.
Another thing to note: just as the lopsided data required a baseline for the AUC, there also needs to be a baseline for the predictions. The average minor league player's chance to make it to the big leagues is around 10%, so players with predictions far above that mark are likely to be very, very good.

Next I took the top 500 or so prospects from FanGraphs and applied the model to them. The top 50 predictions are below. A few names you might notice are missing, including players like Wander Franco, who simply hasn't had enough playing time at the levels the model likes most, such as Double-A. We may never even see him play at Double-A, given this past minor league season's cancellation and the possibility he is called up early in 2021.

Top 50 Prospect Model Projections

Rank  Name              Age   wRC+        Prediction
 1    Jarred Kelenic    20.3  142.958722  .9778224
 2    Luis Robert       22.2  150.628621  .94175248
 3    Isaac Paredes     20.7  132.283928  .94124406
 4    Jo Adell          20.6  137.274165  .93936551
 5    Nick Madrigal     22.6  119.707385  .93719679
 6    Dylan Carlson     21.0  130.198868  .93300778
 7    Andrés Giménez    21.2  110.71173   .90559978
 8    Gavin Lux         21.9  156.305912  .89909189
 9    Vidal Bruján      21.7  134.759559  .89442025
10    Keibert Ruiz      21.3  94.7038924  .88379281
11    Daulton Varsho    23.3  145.465397  .8596463
12    Taylor Walls      23.3  133.86938   .83855902
13    Brendan Rodgers   23.2  123.813903  .83243674
14    Jorge Mateo       24.4  80.0309606  .82703768
15    Heliot Ramos      20.2  119.050502  .80713648
16    Luis Garcia       19.5  95.1615171  .79559999
17    Cristian Pache    21.0  114.751941  .78913128
18    Yusniel Diaz      23.1  135.15202   .76919141
19    Drew Waters       20.8  132.118829  .757519
20    Mauricio Dubón    25.3  103.601284  .75656152
21    Alec Bohm         23.2  146.943753  .74234879
22    Austin Hays       24.3  92.4417251  .73148647
23    Anthony Alford    25.3  94.0627314  .72035194
24    Leody Taveras     21.1  95.1196241  .71424263
25    Carter Kieboom    22.2  124.636012  .71109895
26    Oneil Cruz        21.1  137.430851  .70463246
27    Jonathan Araúz    21.2  98.2112133  .70432122
28    Lucius Fox        22.3  99.4395713  .69819464
29    Abraham Toro      22.9  137.205164  .6836528
30    Jason Martin      24.2  99.7303846  .68025898
31    Luis Barrera      24.0  122.854608  .67851418
32    Joey Bart         22.9  140.693228  .67710447
33    Khalil Lee        21.3  117.048416  .66450715
34    Royce Lewis       20.4  111.059658  .64479302
35    Ryan Mountcastle  22.7  118.774006  .64439561
36    Ke’Bryan Hayes    22.8  110.540248  .64018493
37    Brandon Marsh     21.9  118.81056   .61958271
38    Thairo Estrada    23.7  74.5093524  .61888417
39    Luis Santana      20.3  126.071404  .61490306
40    Daz Cameron       22.8  98.7950409  .60065321
41    Jorge Oña         22.8  103.290552  .59374655
42    Josh Lowe         21.7  113.863437  .58122182
43    Randy Arozarena   24.7  133.414235  .56952746
44    Yonny Hernandez   21.5  118.231353  .56648887
45    Domingo Leyba     24.1  107.683533  .5605087
46    Alex Kirilloff    22.0  150.321047  .55860194
47    Omar Estévez      21.7  113.840938  .53150457
48    Sheldon Neuse     24.9  99.6412136  .52316645
49    Connor Wong       23.5  131.336263  .50716012
50    Jahmai Jones      22.2  93.252696   .49505114

It's really cool to see this kind of thing work in a fortune-telling way. Predicting the future is fascinating to me, and it's even more satisfying when the prediction comes true. Many of these players have already been brought up to the majors and have performed well; others haven't performed as well yet, but may still turn it around. Overall, I was very happy with this project because it wasn't straightforward and forced me to work through a few things along the way. I learned about the different options for handling missing data, like the missing seasons of minor leaguers, and I now know more about how to evaluate logistic models. On top of that, I now have a handy dandy tool for evaluating minor leaguers' potential for future success.