Using Rookie League Stats to Predict Future Performance

August 2, 2014

Over the last couple of weeks, I’ve been looking into how a player’s stats, age, and prospect status can be used to predict whether he’ll ever play in the majors. I used a methodology that I named KATOH (after Yankees prospect Gosuke Katoh), which consists of running a probit regression analysis. In a nutshell, a probit regression tells us how a variety of inputs can predict the probability of an event that has two possible outcomes, such as whether or not a player will make it to the majors. While KATOH technically predicts the likelihood that a player will reach the majors, I’d argue it can also serve as a decent proxy for major league success. If something makes a player more likely to make the majors, there’s a good chance it also makes him more likely to succeed there. In the future, I plan to engineer an alternative methodology to go along with this one, one that takes into account how a player performs in the majors rather than just whether he gets there.
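
To make the probit idea concrete, here is a minimal Python sketch of how a probit model turns a set of inputs into a probability. The coefficients, the intercept, and the feature names are made up purely for illustration and are not KATOH’s fitted values; the inputs are assumed to be standardized so that 0 means league average.

```python
from statistics import NormalDist

# Hypothetical coefficients, chosen only for illustration.
# They are NOT the fitted KATOH values, which this sketch does not reproduce.
HYPOTHETICAL_COEFS = {
    "age": -0.30,     # older relative to the league -> less likely
    "k_rate": -0.40,  # more strikeouts -> less likely
    "iso": 0.35,      # more power -> more likely
    "babip": 0.20,    # higher BABIP -> more likely
}
HYPOTHETICAL_INTERCEPT = -0.50


def probit_probability(features, coefs=HYPOTHETICAL_COEFS,
                       intercept=HYPOTHETICAL_INTERCEPT):
    """Probit link: P(reaches majors) = Phi(intercept + sum(coef * x)).

    `features` maps stat names to standardized values (0 = league average).
    """
    z = intercept + sum(coefs[name] * x for name, x in features.items())
    return NormalDist().cdf(z)  # standard normal CDF


# Two hypothetical hitters: a league-average one, and a young hitter with power.
average = {"age": 0.0, "k_rate": 0.0, "iso": 0.0, "babip": 0.0}
young_slugger = {"age": -1.0, "k_rate": 0.0, "iso": 1.5, "babip": 0.0}

print(f"{probit_probability(average):.1%}")
print(f"{probit_probability(young_slugger):.1%}")
```

Because the weighted sum is passed through the normal CDF, the output always lands between 0 and 1, which is what lets it be read as a probability of reaching the majors.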

For hitters in Low-A and High-A, age, strikeout rate, ISO, BABIP, and whether or not the player was deemed a top 100 prospect by Baseball America all played a role in forecasting future success. And walk rate, while not predictive for players in A-ball, added a little bit to the model for Double-A and Triple-A hitters. Today, I’ll look into what KATOH has to say about players in Rookie leagues. Due to varying offensive environments in different years and leagues, each player’s stats were adjusted relative to his league’s average for that year. For those interested, here’s the R output based on all players with at least 200 plate appearances in a season in Rookie ball from 1995-2007.
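
A note on the league adjustment mentioned above: the exact formula isn’t spelled out here, so the following Python sketch shows only one plausible approach, dividing a player’s stat by his league’s average for that year. The function name and the league averages are hypothetical, and a difference from average would be an equally reasonable choice.

```python
def league_adjusted(player_value, league_average):
    """Express a stat as a ratio to the league-year average (1.0 = average)."""
    if league_average <= 0:
        raise ValueError("league average must be positive")
    return player_value / league_average


# Hypothetical league-year ISO averages, for illustration only.
LEAGUE_ISO = {
    ("Pioneer", 2007): 0.150,
    ("Gulf Coast", 2007): 0.100,
}

# The same raw ISO means different things in different run environments.
raw_iso = 0.150
pioneer = league_adjusted(raw_iso, LEAGUE_ISO[("Pioneer", 2007)])
gulf_coast = league_adjusted(raw_iso, LEAGUE_ISO[("Gulf Coast", 2007)])
print(round(pioneer, 2), round(gulf_coast, 2))  # prints: 1.0 1.5
```

Under these made-up averages, a .150 ISO is exactly average in the hitter-friendly league but 50% better than average in the pitcher-friendly one, which is the kind of distortion the adjustment is meant to remove.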

Just like we saw with hitters in the A-ball leagues, a player’s walk rate is not at all predictive of whether or not he’ll crack the majors. Unlike all of the other levels I’ve looked at so far, a player’s Baseball America prospect status couldn’t tell us anything about his future as a big-leaguer. This was entirely due to the scarcity of top-100 prospects in the sample, as only a handful of players spent the year in rookie ball after making BA’s top-100 list.

The season is less than 40 games old for most rookie league teams, which makes it a little premature to start analyzing players’ stats. But just for kicks, here’s a look at what KATOH says about this year’s crop of rookie-ballers with at least 80 plate appearances through July 28th. This only considers players in the American rookie leagues (the Appalachian, Arizona, Gulf Coast, and Pioneer Leagues), meaning it excludes the Dominican and Venezuelan Summer Leagues. The full list of players can be found here, and you’ll find an excerpt of those who reached the 40% mark below:

Player	Organization	Age	MLB Probability
Kevin Padlo	COL	17	73%
Bobby Bradley	CLE	18	67%
Alex Verdugo	LAD	18	65%
Luke Dykstra	ATL	18	64%
Yu-Cheng Chang	CLE	18	59%
Magneuris Sierra	STL	18	56%
Juan Santana	HOU	19	54%
Joshua Morgan	TEX	18	50%
Jason Martin	HOU	18	49%
Edmundo Sosa	STL	18	48%
Oliver Caraballo	TEX	19	46%
Sthervin Matos	MIL	20	46%
Alexander Palma	NYY	18	45%
Eloy Jimenez	CHC	17	45%
Javier Guerra	BOS	18	44%
Zach Shepherd	DET	18	44%
Tito Polo	PIT	19	44%
Jose Godoy	STL	19	43%
Henry Castillo	ARI	19	42%
David Gonzalez	DET	20	42%
Dan Jansen	TOR	19	42%
Max George	COL	18	42%
Gleyber Torres	CHC	17	42%
Luis Guzman	WSN	18	41%
Jose Martinez	KCR	17	41%
Alex Jackson	SEA	18	40%
Emmanuel Tapia	CLE	18	40%

What stands out most is that KATOH doesn’t think any of these players are shoo-ins to make it to the majors. Even those who are hitting the snot out of the ball get probabilities that fall short of what we saw for unremarkable performances in Double-A. Kevin Padlo, for example, gets just 73%, despite hitting a ridiculous .317/.463/.619 as a 17-year-old. It’s hard to do much better than that. I think this really speaks to how little rookie ball stats matter in the grand scheme of things. A good offensive showing is obviously better than a poor one, but numbers from this level need to be taken with a huge grain of salt. A hitter’s performance against pitchers who are fresh out of high school just can’t tell us much about how he’ll fare when matched up against more advanced pitching at the higher levels.

Next up, I’ll complete the series by looking at stats from short-season A-ball. Teams at that level are also only a few weeks into their season, but at the very least, it will be interesting to see how KATOH feels about SS A-ballers in general. Next week, I’ll apply the KATOH model to historical prospects and highlight some of its biggest “hits” and “misses” from the past.

Statistics courtesy of FanGraphs, Baseball-Reference, and The Baseball Cube; Pre-season prospect lists courtesy of Baseball America.

4 Comments


Wow Chris. I’ve enjoyed this series, but going back to Rookie-league data as a predictor is pretty bold.

A few questions:

1. How does the “error bar” for the Rookie-league KATOH model compare with the ones you’ve done for higher levels?

2. How much better could the predictions be if you used the last TWO years of data in the KATOH model?

3. Is there a methodology between a probit regression (on a binary outcome) and a “least-squares” regression (on a continuously variable outcome) that might allow you to bucket MLB success into a variable with say 4 defined states? For example:

0 = never reached MLB,
1 = MLB but less than 150 PA or 50 IP,
2 = too much MLB time for status 1, but < 15 WAR in first 7 seasons,
3 = 15+ WAR in first 7 seasons.

And would such a regression create a less "noisy" model than one that just regressed on say MLB WAR?

Great stuff – I'm eagerly anticipating the upcoming installments!

Chris Mitchell

Reply to tz

Glad you’re enjoying it! I really appreciate the feedback — KATOH’s a work in progress, and the comments on these articles have given me some good ideas on how to improve the model in the future.

1. I haven’t looked too deeply into that yet, but it’s on my radar. I’d imagine the lower-level models are a lot more fluky. Not only are these players far from the majors, but it’s also a smaller sample for R and A- players, whose seasons don’t start up until June.

2. Not a clue. Worth looking into though.

3. I’m planning on rolling out something like that in November with the final 2014 stats. I’m still pondering how to go about doing it. I was thinking of using a linear regression to get an “expected” WAR over his first X seasons, but something like that could be interesting too.

David

Interesting post! This is the first one I’ve read in the series. There is a methodology called multinomial logistic regression that allows you to have more than two categories as your dependent variable. It is a nice extension to standard logistic regression. Here’s a pretty nice intro to the topic: http://www.ats.ucla.edu/stat/r/dae/mlogit.htm.

Reply to David

Thanks! I’ll be sure to read up on that.