Basic Machine Learning With R (Part 2)

(For part 1 of this series, click here)

Last time, we learned how to run a machine-learning algorithm in just a few lines of R code. But how can we apply that to actual baseball data? Well, first we have to get some baseball data. There are lots of great places to get some — Bill Petti’s post I linked to last time has some great resources — but heck, we’re on FanGraphs, so let’s get the data from here.

You probably know this, but it took forever for me to learn it — you can make custom leaderboards here at FanGraphs and export them to CSV. This is an amazing resource for machine learning, because the data is nice and clean and comes in a very user-friendly format. So that’s what we’ll do to build today’s model, which will try to predict pitcher WAR from the other counting stats. I’m going to use this custom leaderboard (if you’ve never made a custom leaderboard before, play around there a bit to see how you can customize things). If you click on “Export Data” on that page you can download the CSV that we’ll be using for the rest of this post.

Let’s load this data into R. Just like last time, all the code presented here is on my GitHub. Reading CSVs is super easy — assuming you named your file “leaderboard.csv”, it’s just this:

pitcherData <- read.csv('leaderboard.csv',fileEncoding = "UTF-8-BOM")

Normally you wouldn’t need the “fileEncoding” bit, but for whatever reason FanGraphs CSVs use a particularly annoying character encoding. You may also need to use the full path to the file if your working directory is not where the file is.
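
If your file lives somewhere other than your working directory, you can either change the working directory or hand read.csv the full path. A quick sketch (the path here is just an illustration, so substitute your own):

# Option 1: point R's working directory at wherever you saved the CSV
# (this path is hypothetical -- use your own)
setwd('~/Downloads')
# Option 2: skip setwd and give read.csv the full path directly
pitcherData <- read.csv('~/Downloads/leaderboard.csv', fileEncoding = "UTF-8-BOM")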

Let’s take a look at our data. Remember the “head” function we used last time? Let’s change it up and use the “str” function this time.

> str(pitcherData)
'data.frame':	594 obs. of  16 variables:
 $ Season  : int  2015 2015 2014 2013 2015 2016 2014 2014 2013 2014 ...
 $ Name    : Factor w/ 231 levels "A.J. Burnett",..: 230 94 47 ...
 $ Team    : Factor w/ 31 levels "- - -","Angels",..: 11 9 11  ...
 $ W       : int  19 22 21 16 16 16 15 12 12 20 ...
 $ L       : int  3 6 3 9 7 8 6 4 6 9 ...
 $ G       : int  32 33 27 33 33 31 34 26 28 34 ...
 $ GS      : int  32 33 27 33 33 30 34 26 28 34 ...
 $ IP      : num  222 229 198 236 232 ...
 $ H       : int  148 150 139 164 163 142 170 129 111 169 ...
 $ R       : int  43 52 42 55 62 53 68 48 47 69 ...
 $ ER      : int  41 45 39 48 55 45 56 42 42 61 ...
 $ HR      : int  14 10 9 11 15 15 16 13 10 22 ...
 $ BB      : int  40 48 31 52 42 44 46 39 58 65 ...
 $ SO      : int  200 236 239 232 301 170 248 208 187 242 ...
 $ WAR     : num  5.8 7.3 7.6 7.1 8.6 4.5 6.1 5.2 4.1 4.6 ...
 $ playerid: int  1943 4153 2036 2036 2036 12049 4772 10603 ...

Sometimes the CSV needs cleaning up, but this one is not so bad. Other than “Name” and “Team”, everything shows as a numeric data type, which isn’t always the case. For completeness, I want to mention that if a column that was actually numeric showed up as a factor variable (this happens A LOT), you would convert it in the following way:

pitcherData$WAR <- as.numeric(as.character(pitcherData$WAR))

Now, which of these potential features should we use to build our model? One quick way to explore good possibilities is by running a correlation analysis:

cor(subset(pitcherData, select=-c(Season,Name,Team,playerid)))

Note that in this line, we’ve removed the columns that are either non-numeric or are totally uninteresting to us. The “WAR” column in the result is the one we’re after — it looks like this:

            WAR
W    0.50990268
L   -0.36354081
G    0.09764845
GS   0.20699173
IP   0.59004342
H   -0.06260448
R   -0.48937468
ER  -0.50046647
HR  -0.47068461
BB  -0.24500566
SO   0.74995296
WAR  1.00000000
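
As an optional aside, if you’d rather not scan the whole matrix by eye, you can pull out just the WAR column and sort it. This reuses the same subset call from above:

# Optional: grab just the WAR column of the correlation matrix and sort it,
# so the strongest positive and negative relationships stand out
corMatrix <- cor(subset(pitcherData, select=-c(Season,Name,Team,playerid)))
sort(corMatrix[,'WAR'], decreasing = TRUE)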

Let’s take a first crack at this prediction with the columns that show the most correlation (both positive and negative): Wins, Losses, Innings Pitched, Earned Runs, Home Runs, Walks, and Strikeouts.

goodColumns <- c('W','L','IP','ER','HR','BB','SO','WAR')
library(caret)
# hold out 30% of the rows for testing, train on the other 70%
inTrain <- createDataPartition(pitcherData$WAR,p=0.7,list=FALSE)
training <- pitcherData[inTrain,goodColumns]
testing <- pitcherData[-inTrain,goodColumns]

You should recognize this setup from what we did last time. The only difference here is that we’re choosing which columns to keep; with the iris data set we didn’t need to do that. Now we are ready to run our model, but which algorithm do we choose? Lots of ink has been spilled about which is the best model to use in any given scenario, but most of that discussion is wasted. As far as I’m concerned, there are only two things you need to weigh:

  1. how *interpretable* you want the model to be
  2. how *accurate* you want the model to be

If you want interpretability, you probably want linear regression (for regression problems) and decision trees or logistic regression (for classification problems). If you don’t care about other people being able to make heads or tails out of your results, but you want something that is likely to work well, my two favorite algorithms are boosting and random forests (these two can do both regression and classification). Rule of thumb: start with the interpretable ones. If they work okay, then there may be no need to go to something fancy. In our case, there already is a black-box algorithm for computing pitcher WAR, so we don’t really need another one. Let’s try for interpretability.
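
If you do eventually want to try one of the fancier options, one nice thing about caret is that swapping algorithms is mostly a one-word change. Here is a minimal sketch, assuming the same training data frame we just built and that the randomForest package is installed; we won’t use this model anywhere else in this post.

# Not needed for this post -- just a sketch of swapping in a random forest.
# caret handles the tuning; this uses its default bootstrap resampling.
rfFit <- train(WAR ~ ., method = 'rf', data = training)
rfFit$results  # resampled RMSE and R-squared for each tuning value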

We’re also going to add one other wrinkle: cross-validation. I won’t say too much about it here except that, in general, it gives you a more trustworthy estimate of how well your model will perform on data it hasn’t seen, and all it costs you is the “trainControl” stuff below. If you’re interested, please do read about it on Wikipedia.

method <- 'lm' # linear regression
# 10-fold cross-validation, repeated 10 times
ctrl <- trainControl(method = 'repeatedcv',number = 10, repeats = 10)
modelFit <- train(WAR ~ ., method=method, data=training, trControl=ctrl)

Did it work? Was it any good? One nice quick way to tell is to look at the summary.

> summary(modelFit)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.38711 -0.30398  0.01603  0.31073  1.34957 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.6927921  0.2735966  -2.532  0.01171 *  
W            0.0166766  0.0101921   1.636  0.10256    
L           -0.0336223  0.0113979  -2.950  0.00336 ** 
IP           0.0211533  0.0017859  11.845  < 2e-16 ***
ER           0.0047654  0.0026371   1.807  0.07149 .  
HR          -0.1260508  0.0048609 -25.931  < 2e-16 ***
BB          -0.0363923  0.0017416 -20.896  < 2e-16 ***
SO           0.0239269  0.0008243  29.027  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4728 on 410 degrees of freedom
Multiple R-squared:  0.9113,	Adjusted R-squared:  0.9097 
F-statistic: 601.5 on 7 and 410 DF,  p-value: < 2.2e-16

Whoa, that’s actually really good. The adjusted R-squared is over 0.9, which is fantastic. We also get something else nice out of this, which is the significance of each variable, helpfully indicated by a 0-3 star system. We have four variables that were three-stars; what would happen if we built our model with just those features? It would certainly be simpler; let’s see if it’s anywhere near as good.

> model2 <- train(WAR ~ IP + HR + BB + SO, method=method, data=training, trControl=ctrl)
> summary(model2)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.32227 -0.27779 -0.00839  0.30686  1.35129 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.8074825  0.2696911  -2.994  0.00292 ** 
IP           0.0228243  0.0015400  14.821  < 2e-16 ***
HR          -0.1253022  0.0039635 -31.614  < 2e-16 ***
BB          -0.0366801  0.0015888 -23.086  < 2e-16 ***
SO           0.0241239  0.0007626  31.633  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4829 on 413 degrees of freedom
Multiple R-squared:  0.9067,	Adjusted R-squared:  0.9058 
F-statistic:  1004 on 4 and 413 DF,  p-value: < 2.2e-16

Awesome! The results still look really good. But of course, we need to be concerned about overfitting, so we can’t be 100% sure this is a decent model until we evaluate it on our test set. Let’s do that now:

# Apply to test set
predicted2 <- predict(model2,newdata=testing)
# R-squared
cor(testing$WAR,predicted2)^2 # 0.9108492
# Plot the predicted values vs. actuals
plot(testing$WAR,predicted2)

[Scatter plot: predicted WAR vs. actual WAR for the test set]

Fantastic! This is about as good as we could have hoped for, and now we have an interpretable version of pitcher WAR, specifically,

WAR = -0.8 + 0.02*IP - 0.13*HR - 0.04*BB + 0.02*SO
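
If you want to use the model on a new stat line, predict() works just like it did on the test set. Here’s a quick sketch with made-up numbers (nothing from the leaderboard):

# A hypothetical 200-inning, 200-strikeout season -- the numbers are invented,
# just to show how you'd get a prediction out of the fitted model
hypothetical <- data.frame(IP = 200, HR = 20, BB = 50, SO = 200)
predict(model2, newdata = hypothetical)
# The rounded formula above is for readability; for actual predictions, stick
# with predict() and the full-precision coefficients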

Most of the time, machine learning does not come out as nice as it has in this post and the last one, so don’t expect miracles every time out. But you can occasionally get some really cool results if you know what you’re doing, and at this point, you kind of do! I have a few ideas about what to write about for part 3 (likely the final part), but if there’s something you really would like to know how to do, hit me up in the comments.




