Using Rookie League Stats to Predict Future Performance

Over the last couple of weeks, I’ve been looking into how a player’s stats, age, and prospect status can be used to predict whether he’ll ever play in the majors. I used a methodology that I named KATOH (after Yankees prospect Gosuke Katoh), which consists of running a probit regression analysis. In a nutshell, a probit regression tells us how a variety of inputs can predict the probability of an event that has two possible outcomes, such as whether or not a player will make it to the majors. While KATOH technically predicts the likelihood that a player will reach the majors, I’d argue it can also serve as a decent proxy for major league success. If something makes a player more likely to make the majors, there’s a good chance it also makes him more likely to succeed there. In the future, I plan to build a companion methodology that takes into account how a player performs in the majors, rather than just whether he gets there.
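
To make the probit idea concrete, here is a minimal Python sketch of how such a model turns a player’s inputs into a probability: it takes a weighted sum of the inputs and passes it through the standard normal CDF. The coefficients, the intercept, the feature names, and the assumption that inputs are measured as deviations from league average are all made up for illustration; they are not KATOH’s fitted values.

```python
from statistics import NormalDist

# Hypothetical coefficients for illustration only; these are NOT KATOH's
# fitted values. Inputs are assumed to be deviations from league average
# (0.0 = exactly league average).
HYPOTHETICAL_COEFS = {
    "age": -0.40,     # being older than league average lowers the score
    "k_rate": -0.30,  # striking out more than average lowers the score
    "iso": 0.25,      # more power raises the score
    "babip": 0.15,    # higher BABIP raises the score
}
HYPOTHETICAL_INTERCEPT = -0.50


def mlb_probability(features, coefs=HYPOTHETICAL_COEFS,
                    intercept=HYPOTHETICAL_INTERCEPT):
    """Probit model: P(reaches majors) = Phi(intercept + sum(b_i * x_i))."""
    z = intercept + sum(coefs[name] * value for name, value in features.items())
    return NormalDist().cdf(z)  # standard normal CDF maps z into (0, 1)


# A league-average hitter versus a young hitter with above-average power.
average = mlb_probability({"age": 0.0, "k_rate": 0.0, "iso": 0.0, "babip": 0.0})
young_slugger = mlb_probability({"age": -1.5, "k_rate": -0.5, "iso": 2.0, "babip": 1.0})
print(f"average: {average:.3f}, young slugger: {young_slugger:.3f}")
```

Because the normal CDF flattens out at both ends, each extra bit of performance moves the probability less and less as a player approaches 0% or 100%.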

For hitters in Low-A and High-A, age, strikeout rate, ISO, BABIP, and whether or not a player was deemed a top-100 prospect by Baseball America all played a role in forecasting future success. Walk rate, while not predictive for players in A-ball, added a little bit to the model for Double-A and Triple-A hitters. Today, I’ll look into what KATOH has to say about players in Rookie leagues. Due to varying offensive environments in different years and leagues, each player’s stats were adjusted to reflect his league’s average for that year. For those interested, here’s the R output based on all players with at least 200 plate appearances in a season in Rookie ball from 1995-2007.

Rookie Output
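
As an aside on the league adjustment mentioned above, here is one way it could be done in Python. The post doesn’t spell out the exact method, so this sketch assumes the simplest version: divide each player’s stat by the unweighted average of that stat among players in the same league and year. The function name, the row layout, and the `_vs_league` suffix are hypothetical.

```python
from collections import defaultdict


def league_adjust(rows, stat):
    """Express `stat` as a ratio to the player's league-year average.

    Assumption: an unweighted mean over the players in `rows`; a
    plate-appearance-weighted mean would be a reasonable alternative.
    Assumes every league-year average is nonzero.
    """
    # First pass: accumulate the sum and count per (league, year).
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = (row["league"], row["year"])
        totals[key][0] += row[stat]
        totals[key][1] += 1

    # Second pass: divide each player's stat by his league-year average.
    adjusted = []
    for row in rows:
        total, count = totals[(row["league"], row["year"])]
        league_avg = total / count
        adjusted.append({**row, f"{stat}_vs_league": row[stat] / league_avg})
    return adjusted


# Made-up sample rows, for illustration only.
sample = [
    {"player": "A", "league": "APP", "year": 2014, "iso": 0.100},
    {"player": "B", "league": "APP", "year": 2014, "iso": 0.200},
    {"player": "C", "league": "PIO", "year": 2014, "iso": 0.300},
]
for row in league_adjust(sample, "iso"):
    print(row["player"], round(row["iso_vs_league"], 3))
```

A ratio above 1 means the player beat his league’s average for that stat, which lets hitters from high-offense and low-offense leagues go into the same regression.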

Just like we saw with hitters in the A-ball leagues, a player’s walk rate is not at all predictive of whether or not he’ll crack the majors. Unlike all of the other levels I’ve looked at so far, a player’s Baseball America prospect status couldn’t tell us anything about his future as a big-leaguer. This was entirely due to the scarcity of top-100 prospects in the sample, as only a handful of players spent the year in rookie ball after making BA’s top-100 list.

The season is less than 40 games old for most rookie league teams, which makes it a little premature to start analyzing players’ stats. But just for kicks, here’s a look at what KATOH says about this year’s crop of rookie-ballers with at least 80 plate appearances through July 28th. This only considers players in the American rookie leagues (the Appalachian, Arizona, Gulf Coast, and Pioneer Leagues), meaning it excludes the Dominican and Venezuelan Summer Leagues. The full list of players can be found here, and you’ll find an excerpt of those who broke the 40% barrier below:

Player              Organization   Age   MLB Probability
Kevin Padlo         COL            17    73%
Bobby Bradley       CLE            18    67%
Alex Verdugo        LAD            18    65%
Luke Dykstra        ATL            18    64%
Yu-Cheng Chang      CLE            18    59%
Magneuris Sierra    STL            18    56%
Juan Santana        HOU            19    54%
Joshua Morgan       TEX            18    50%
Jason Martin        HOU            18    49%
Edmundo Sosa        STL            18    48%
Oliver Caraballo    TEX            19    46%
Sthervin Matos      MIL            20    46%
Alexander Palma     NYY            18    45%
Eloy Jimenez        CHC            17    45%
Javier Guerra       BOS            18    44%
Zach Shepherd       DET            18    44%
Tito Polo           PIT            19    44%
Jose Godoy          STL            19    43%
Henry Castillo      ARI            19    42%
David Gonzalez      DET            20    42%
Dan Jansen          TOR            19    42%
Max George          COL            18    42%
Gleyber Torres      CHC            17    42%
Luis Guzman         WSN            18    41%
Jose Martinez       KCR            17    41%
Alex Jackson        SEA            18    40%
Emmanuel Tapia      CLE            18    40%

What stands out most is that KATOH doesn’t think any of these players are shoo-ins to make it to the majors. Even those who are hitting the snot out of the ball get probabilities that fall short of what we saw for unremarkable performances in Double-A. Kevin Padlo, for example, gets just a 73% probability, despite hitting a ridiculous .317/.463/.619 as a 17-year-old. It’s hard to do much better than that. I think this really speaks to how little rookie ball stats matter in the grand scheme of things. A good offensive showing is obviously better than a poor one, but numbers from this level need to be taken with a huge grain of salt. A hitter’s performance against pitchers who are fresh out of high school just can’t tell us much about how he’ll fare when matched up against more advanced pitching at the higher levels.

Next up, I’ll complete the series by looking at stats from short-season A-ball. Teams at that level are also only a few weeks into their season, but at the very least, it will be interesting to see how KATOH feels about SS A-ballers in general. Next week, I’ll apply the KATOH model to historical prospects and highlight some of its biggest “hits” and “misses” from the past.

Statistics courtesy of FanGraphs, Baseball-Reference, and The Baseball Cube; Pre-season prospect lists courtesy of Baseball America.

Chris works in economic development by day, but spends most of his nights thinking about baseball. He writes for Pinstripe Pundits, FanGraphs and The Hardball Times. He's also on the twitter machine: @_chris_mitchell None of the views expressed in his articles reflect those of his daytime employer.


Wow Chris. I’ve enjoyed this series, but going back to Rookie-league data as a predictor is pretty bold. A couple of questions: 1. How does the “error bar” for the Rookie-league KATOH model compare with the ones you’ve done for higher levels? 2. How much better could the predictions be if you used the last TWO years of data in the KATOH model? 3. Is there a methodology between a probit regression (on a binary outcome) and a “least-squares” regression (on a continuously variable outcome) that might allow you to bucket MLB success into a variable with say 4 defined…

Chris Mitchell

Glad you’re enjoying it! I really appreciate the feedback. KATOH’s a work in progress, and the comments on these articles have given me some good ideas on how to improve the model in the future. 1. I haven’t looked too deeply into that yet, but it’s on my radar. I’d imagine the lower-level models are a lot more fluky. Not only are these players far from the majors, but it’s also a smaller sample for R and A- players, whose seasons don’t start up until June. 2. Not a clue. Worth looking into though. 3. I’m planning on rolling…


Interesting post! This is the first one I’ve read in the series. There is a methodology called multinomial logistic regression that allows you to have more than two categories as your dependent variable. It is a nice extension to standard logistic regression. Here’s a pretty nice intro to the topic:

Chris Mitchell

Thanks! I’ll be sure to read up on that.