Basic Machine Learning With R (Part 1) by The Kudzu Kid February 8, 2017 You’ve heard of machine learning. How could you not have? It’s absolutely everywhere, and baseball is no exception. It’s how Gameday knows how to tell a fastball from a cutter and how the advanced pitch-framing metrics are computed. The math behind these algorithms can go from the fairly mundane (linear regression) to seriously complicated (neural networks), but good news! Someone else has wrapped up all the complex stuff for you. All you need is a basic understanding of how to approach these problems and some rudimentary programming knowledge. That’s where this article comes in. So if you like the idea of predicting whether a batted ball will become a home run or predicting time spent on the DL, this post is for you. We’re going to use R and RStudio to do the heavy lifting for us, so you’ll have to download them (they’re free!). The download process is fairly painless and well-documented all over the internet. If I were you, I’d start with this article. I highly recommend reading at least the beginning of that article; it not only has an intro to getting started with R, but information on getting baseball-related data, as well as some other indispensable links. Once you’ve finished downloading RStudio and reading that article head back here and we’ll get started! (If you don’t want to download anything for now you can run the code from this first part on R-Fiddle — though you’ll want to download R in the long run if you get serious.) Let’s start with some basic machine-learning concepts. We’ll stick to supervised learning, of which there are two main varieties: regression and classification. To know what type of learning you want, you need to know what problem you’re trying to solve. If you’re trying to predict a number — say, how many home runs a batter will hit or how many games a team will win — you’ll want to run a regression. If you’re trying to predict an outcome — maybe if a player will make the Hall of Fame or if a team will make the playoffs — you’d run a classification. These classification algorithms can also give you probabilities for each outcome, instead of just a binary yes/no answer (so you can give a probability that a player will make the Hall of Fame, say). Okay, so the first thing to do is figure out what problem you want to solve. The second part is figuring out what goes into the prediction. The variables that go into the prediction are called “features,” and feature selection is one of the most important parts of creating a machine-learning algorithm. To predict how many home runs a batter will hit, do you want to look at how many triples he’s hit? Maybe you look at plate appearances, or K%, or handedness … you can go on and on, so choose wisely. Enough theory for now — let’s look at a specific example using some real-life R code and the famous “iris” data set. This code and all subsequent code will be available on my GitHub. data(iris) library('caret') inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE) training <- iris[inTrain,] model <- train(Species~.,data=training,method='rf') Believe it or not, in those five lines of code we have run a very sophisticated machine-learning model on a subset of the iris data set! Let’s take a more in-depth look at what happened here. data(iris) This first line loads the iris data set into a data frame — a variable type in R that looks a lot like an Excel spreadsheet or CSV file. The data is organized into columns and each column has a name. That first command loaded our data into a variable called “iris.” Let’s actually take a look at it; the “head” function in R shows the first five rows of the dataset — type head(iris) into the console. > head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa As you hopefully read in the Wikipedia page, this data set consists of various measurements of three related species of flowers. The problem we’re trying to solve here is to figure out, given the measurements of a flower, which species it belongs to. Loading the data is a good first step. library(caret) If you’ve been running this code while reading this post, you may have gotten the following error when you got here: Error in library(caret) : there is no package called 'caret' This is because, unlike the iris data set, the “caret” library doesn’t ship with R. That’s too bad, because the caret library is the reason we’re using R in the first place, but fear not! Installing missing packages is dead easy, with just the following command install.packages('caret') or, if you have a little time and want to ensure that you don’t run into any issues down the road: install.packages("caret", dependencies = c("Depends", "Suggests")) The latter command installs a bunch more stuff than just the bare minimum, and it takes a while, but it might be worth it if you’re planning on doing a lot with this package. Note: you should be planning to do a lot with it — this library is a catch-all for a bunch of machine-learning tools and makes complicated processes look really easy (again, see above: five lines of code!). inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE) We never want to train our model on the whole data set, a concept I’ll get into more a little later. For now, just know that this line of code randomly selects 70% of our data set to use to train the model. Note also R’s “<-” notation for assigning a value to a variable. training <- iris[inTrain,] Whereas the previous line chose which rows we’d use to train our model, this line actually creates the training data set. The “training” variable now has 105 randomly selected rows from the original iris data set (you can again use the “head” function to look at the top 5). model <- train(Species~.,data=training,method='rf') This line of code runs the actual model! The “train” function is the model-building one. “Species~.” means we want to predict the “Species” column from all the others. “data=training” means the data set we want to use is the one we assigned to the “training” variable earlier. And “method=’rf'” means we will use the very powerful and very popular random-forest method to do our classification. If, while running this command, R tells you it needs to install something, go ahead and do it. R will run its magic and create a model for you! Now, of course, a model is no good unless we can apply it to data that the model hasn’t seen before, so let’s do that now. Remember earlier when we only took 70% of the data set to train our model? We’ll now run our model on the other 30% to see how good it was. # Create the test set to evaluate the model # Note that "-inTrain" with the minus sign pulls everything NOT in the training set testing <- iris[-inTrain,] # Run the model on the test set predicted <- predict(model,newdata=testing) # Determine the model accuracy accuracy <- sum(predicted == testing$Species)/length(predicted) # Print the model accuracy print(accuracy) Pretty good, right? You should get a very high accuracy doing this, likely over 95%*. And it was pretty easy to do! If you want some homework, type the following command and familiarize yourself with all its output by Googling any words you don’t know: confusionMatrix(predicted, testing$Species) *I can’t be sure because of the randomness that goes into both choosing the training set and building the model. Congratulations! You now know how to do some machine learning, but there’s so much more to do. Next time we’ll actually play around with some baseball data and explore some deeper concepts. In the meantime, play around with the code above to get familiar with R and RStudio. Also, if there’s anything you’d specifically like to see, leave me a comment and I’ll try to get to it.