Modeling Walk Rate Between Minor League Levels

by datadidit

October 1, 2016

After reading through Projecting X by Mike Podhorzer, I decided to try to predict some rate statistics between minor league levels. Mike states in his book, “Projecting rates makes it dramatically easier to adjust a forecast if necessary.” So if a player is injured or will only get a certain number of plate appearances in a year, I can still attempt to project his performance. The first rate statistic I’m going to attempt to project is walk rate between minor league levels. This article will cover the following:

Raw Data

Data Cleaning

Correlation and Graphs

Model and Results

Examples

Raw Data

For my model I used the last seven years of minor league data (2009-2015) from Baseball Reference. Accounting for the Short-Season A (SS-A) through AAA affiliates, I ended up with 28,316 data points for my analysis.

Data Cleaning

I’m using R, and my original dataframe had each year of a player’s data in a separate row. To do the calculations I wanted, I needed to move each player’s career minor league data onto the same row. I also needed to filter on plate appearances in a season to get rid of noise, such as a player on a rehab assignment in the minor leagues or a player who was injured for most of the year and only had 50-100 plate appearances. The minimum I settled on was 200 plate appearances for a player to be factored into the model. To remove more noise, I’m only attempting to model player performance between full-season leagues (A, A+, AA, AAA). Once the data was cleaned I had the following number of data points for each pair of levels:

A to A+ : 1129
A+ to AA: 1023
AA to AAA: 705
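As a rough sketch of the cleaning steps described above (in Python rather than the R used for the analysis, and not the actual code), the idea is to drop short-season and low-PA seasons and then pair each player’s walk rate at one level with his walk rate at the next. The field names (`player`, `level`, `pa`, `bb`), the sample rows, and the rule of keeping only the first qualifying season at each level are all assumptions made for illustration.

```python
FULL_SEASON_LEVELS = ["A", "A+", "AA", "AAA"]
MIN_PA = 200  # minimum plate appearances stated in the article


def build_level_pairs(seasons, min_pa=MIN_PA):
    """Return {(lower, upper): [(lower_bb_rate, upper_bb_rate), ...]}.

    `seasons` is a list of one-row-per-season dicts, assumed sorted by year.
    """
    # Collapse the one-row-per-season layout into one record per player.
    by_player = {}
    for row in seasons:
        if row["level"] not in FULL_SEASON_LEVELS or row["pa"] < min_pa:
            continue  # drop short-season leagues and small samples
        levels = by_player.setdefault(row["player"], {})
        # Assumption: keep only the first qualifying season at each level.
        levels.setdefault(row["level"], row["bb"] / row["pa"])

    # Pair each level with the next one up for every player who has both.
    pairs = {}
    for levels in by_player.values():
        for lower, upper in zip(FULL_SEASON_LEVELS, FULL_SEASON_LEVELS[1:]):
            if lower in levels and upper in levels:
                pairs.setdefault((lower, upper), []).append(
                    (levels[lower], levels[upper])
                )
    return pairs


# Made-up rows for illustration only, not Baseball Reference data.
seasons = [
    {"player": "p1", "year": 2012, "level": "A", "pa": 450, "bb": 45},
    {"player": "p1", "year": 2013, "level": "A+", "pa": 400, "bb": 36},
    {"player": "p2", "year": 2012, "level": "A", "pa": 80, "bb": 10},  # under 200 PA
    {"player": "p2", "year": 2013, "level": "A+", "pa": 500, "bb": 50},
    {"player": "p3", "year": 2012, "level": "SS-A", "pa": 300, "bb": 30},  # short-season
]
pairs = build_level_pairs(seasons)
print(pairs)
```

Only the first made-up player survives here: the second falls under the 200 PA minimum at A-ball and the third only played in a short-season league.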

Correlation and Graphs

I was able to get strong correlation numbers for walk rate between minor league levels. You can see the results below:

A to A+ : .6301594
A+ to AA: .6141332
AA to AAA: .620662
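Assuming these are Pearson correlation coefficients (the default for R’s `cor` function), the calculation for one pair of levels can be sketched in plain Python. The walk rates below are made up for illustration and are not the article’s data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (lower level BB%, upper level BB%) values for five players.
lower_bb = [0.05, 0.08, 0.10, 0.12, 0.15]
upper_bb = [0.07, 0.06, 0.11, 0.09, 0.13]
print(f"r = {pearson_r(lower_bb, upper_bb):.4f}")
```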

Here are the graphs for each level:

*A to A+ walk rate graph*

*A+ to AA walk rate graph*

*AA to AAA walk rate graph*

Model and Results

The linear models for each level are:

A to A+: A+ BB% = .63184*(A BB%) + .02882
A+ to AA: AA BB% = .6182*(A+ BB%) + .0343
AA to AAA: AAA BB% = .5682*(AA BB%) + .0342
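Models of this form are what an ordinary least squares fit produces (for example, R’s `lm` with a single predictor). Assuming that is how the coefficients above were obtained, here is a minimal Python sketch of the fit. The input pairs are made up, so the printed slope and intercept will not match the article’s values.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(bb_rate, slope, intercept):
    """Project the next level's walk rate from the current level's."""
    return slope * bb_rate + intercept


# Made-up (A BB%, A+ BB%) pairs, not the article's data.
a_bb = [0.05, 0.08, 0.10, 0.12, 0.15]
a_plus_bb = [0.06, 0.08, 0.09, 0.11, 0.12]

slope, intercept = fit_line(a_bb, a_plus_bb)
print(f"A+ BB% = {slope:.4f}*(A BB%) + {intercept:.4f}")
print(f"projected A+ BB% for a .100 A walk rate: {predict(0.10, slope, intercept):.4f}")
```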

In order to interpret the success or failure of my results I compared how close I was to getting the actual walk rate. FanGraphs has a great rating scale for walk rate at the major league level:

That scale gives a classification for multiple tiers of walk rate. While it’s based on major league data, it’s a good starting point for deciding on a margin of error for my model. The mean difference between adjacent tiers in the FanGraphs table is .0183, which I rounded to a margin of error of .02. So if my predicted walk rate for a player was within .02 of his actual walk rate I counted the prediction as correct, and if the error was greater than that I counted it as incorrect. Here are the model’s results for each level:

A to A+
- Incorrect: 450
- Correct: 679
- Percentage Correct: ~.6014
A+ to AA
- Incorrect: 445
- Correct: 578
- Percentage Correct: ~.565
AA to AAA
- Incorrect: 278
- Correct: 427
- Percentage Correct: ~.6056
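The correct/incorrect scoring described above can be sketched as a small Python function. The predicted and actual walk rates below are made up, and treating an error exactly equal to the margin as correct is an assumption, since the article only says “within .02”.

```python
def score_predictions(predicted, actual, tolerance=0.02):
    """Count predictions whose absolute error is within the tolerance."""
    correct = sum(
        1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance
    )
    total = len(predicted)
    return {
        "correct": correct,
        "incorrect": total - correct,
        "pct_correct": correct / total,
    }


# Hypothetical predicted vs. actual walk rates for four players.
predicted = [0.090, 0.075, 0.110, 0.060]
actual = [0.100, 0.120, 0.105, 0.085]

print(score_predictions(predicted, actual, tolerance=0.02))
print(score_predictions(predicted, actual, tolerance=0.03))
```

With these made-up numbers, the player who misses by .025 counts as incorrect at a .02 margin but correct at a .03 margin, which shows how sensitive the score is to the cutoff.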

When I moved the cutoff up a percentage point to .03, the model’s results improved drastically:

A to A+
- Incorrect: 228
- Correct: 901
- Percentage Correct: ~.798
A+ to AA
- Incorrect: 246
- Correct: 777
- Percentage Correct: ~.7595
AA to AAA
- Incorrect: 144
- Correct: 561
- Percentage Correct: ~.7957
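As a quick sanity check, the percentages at both cutoffs can be recomputed from the correct and incorrect counts reported above, and each pair of counts should add up to that level’s sample size. This is just a Python sketch of the arithmetic; the dictionary layout is my own.

```python
# (correct, incorrect) counts as reported in the article, keyed by cutoff.
results = {
    0.02: {"A to A+": (679, 450), "A+ to AA": (578, 445), "AA to AAA": (427, 278)},
    0.03: {"A to A+": (901, 228), "A+ to AA": (777, 246), "AA to AAA": (561, 144)},
}

# Sample sizes after cleaning, as listed in the Data Cleaning section.
sample_sizes = {"A to A+": 1129, "A+ to AA": 1023, "AA to AAA": 705}


def pct_correct(correct, incorrect):
    """Share of predictions that landed within the cutoff."""
    return correct / (correct + incorrect)


for cutoff, by_level in results.items():
    for level, (correct, incorrect) in by_level.items():
        # Each pair of counts should account for every player at that level.
        assert correct + incorrect == sample_sizes[level]
        print(f"cutoff {cutoff:.2f}, {level}: {pct_correct(correct, incorrect):.4f}")
```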

Examples

Numbers are cool but where are the actual examples? OK, let’s start off with my worst prediction. The largest error I had between levels was A to A+, and the error was more than 10 percentage points (~.1105). The player in this case was Joey Gallo. A quick glance at his player page shows his A walk rate was only .1076 and his A+ walk rate was .2073, an improvement of about 10 percentage points between levels. So why did this happen, and why didn’t my model do a better job of predicting it? Currently the model only accounts for the previous season’s walk rate, but what if a player is getting a lot of hits at one level and stops swinging as much at the next? Gallo only hit .245 in his year at A-ball, so that wasn’t the case here. More investigation is required to see how the model can get closer on edge cases like this.
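To make the Gallo miss concrete, here is the A to A+ model applied to his numbers in a few lines of Python. The coefficients and walk rates are the ones quoted in this article; the variable names are just for illustration.

```python
# A to A+ model from the article: A+ BB% = .63184*(A BB%) + .02882
SLOPE = 0.63184
INTERCEPT = 0.02882

gallo_a_bb = 0.1076       # Gallo's walk rate at A
gallo_a_plus_bb = 0.2073  # Gallo's actual walk rate at A+

predicted = SLOPE * gallo_a_bb + INTERCEPT
error = gallo_a_plus_bb - predicted

print(f"predicted A+ BB%: {predicted:.4f}")
print(f"error: {error:.4f}")
```

The model projects an A+ walk rate of about .0968, which leaves the error of about .1105 mentioned above.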

*Gallo Dataframe Snippet*

The lowest error threshold I could set and still get results back was ~.00004417. That very close prediction belongs to Erik Gonzalez. I don’t know Erik Gonzalez, so I continued to look for results. Setting the threshold to .0002 brought back Stephen Lombardozzi as one of my six results. Lombo’s interesting to hardcore Nats fans (like myself), but I wanted to keep looking for a more notable name. Finally, after upping the threshold to .003 for the A to A+ data, I saw that the model predicted Houston Astros multi-time All-Star 2B Jose Altuve’s walk rate within a .003 margin of error.

*Altuve Dataframe Snippet*

What’s Next:

Improve algorithm for generating combined season dataframe
Improve model to get a lower error rate
Predict strikeout rate between levels
Eventually would like to predict more advanced statistics like wOBA/OPS/wRC+
