Basic Machine Learning With R (Part 3)
by The Kudzu Kid
March 7, 2017

Previous parts in this series: Part 1 | Part 2

If you’ve read the first two parts of this series, you already know how to do some pretty cool machine-learning stuff, but there’s still a lot to learn. Today, we will be updating this nearly seven-year-old chart featured on Tom Tango’s website. We haven’t done anything with Statcast data yet, so that will be cool. More importantly, though, this will present us with a good opportunity to work with an imperfect data set. My motto is “machine learning is easy; getting the data is hard,” and this exercise will prove it. As always, the code presented here is on my GitHub.

The goal today is to take exit velocity and launch angle, and then predict the batted-ball type from those two features. Hopefully by now you can recognize that this is a classification problem. The question becomes, where do we get the data we need to solve it?

Let’s head over to the invaluable Statcast search at Baseball Savant to take care of this. We want to restrict ourselves to just balls in play, and to simplify things, let’s just take 2016 data. You can download the data from Baseball Savant in CSV format, but if you ask it for too much data, it won’t let you. I recommend taking the data a month at a time, like in this example page. You’ll want to scroll down and click the little icon in the top right of the results to download your CSV.

Go ahead and do that for every month of the 2016 season and put all the resulting CSVs in the same folder (I called mine statcast_data). Once that’s done, we can begin processing it.

Let’s load the data into R using a trick I found online (Google is your friend when it comes to learning a new programming language, or even using one you’re already pretty good at!).

    # Read every CSV in the folder and stack them into one data frame.
    # stringsAsFactors = TRUE matches the output shown below; R >= 4.0 defaults to FALSE.
    filenames <- list.files(path = "statcast_data", full.names=TRUE)
    data_raw <- do.call("rbind", lapply(filenames, read.csv, header = TRUE, stringsAsFactors = TRUE))

The columns we want here are “hit_speed”, “hit_angle”, and “events”, so let’s create a new data frame with only those columns and take a look at it.

    data <- data_raw[,c("hit_speed","hit_angle","events")]
    str(data)

    'data.frame': 127325 obs. of 3 variables:
     $ hit_speed: Factor w/ 883 levels "100.0","100.1",..: 787 11 643 ...
     $ hit_angle: Factor w/ 12868 levels "-0.01 ",..: 7766 1975 5158 ...
     $ events   : Factor w/ 25 levels "Batter Interference",..: 17 8 11 ...

Well, it had to happen eventually. See how all of these columns are listed as “Factor” even though some of them are clearly numeric? Let’s convert those columns to numeric values. Going through as.character first matters: calling as.numeric directly on a factor returns the internal level codes (787, 11, 643, ...), not the values you see printed.

    data$hit_speed <- as.numeric(as.character(data$hit_speed))
    data$hit_angle <- as.numeric(as.character(data$hit_angle))

There is also some missing data in this data set. There are several ways to deal with such issues, but we’re simply going to remove any rows with missing data.

    data <- na.omit(data)

Let’s next take a look at the data in the “events” column, to see what we’re dealing with there.

    unique(data$events)

    Field Error          Flyout               Single
    Pop Out              Groundout            Double Play
    Lineout              Home Run             Double
    Forceout             Grounded Into DP     Sac Fly
    Triple               Fielders Choice Out  Fielders Choice
    Bunt Groundout       Sac Bunt             Sac Fly DP
    Triple Play          Fan interference     Bunt Pop Out
    Batter Interference
    25 Levels: Batter Interference Bunt Groundout ... Sacrifice Bunt DP

The original classification from Tango’s site had only five levels (POP, GB, FLY, LD, HR), but we’ve got over 20. We’ll have to (a) restrict to rows that look like something we can classify and (b) convert them to the levels we’re after.
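Before touching the real data, it can help to try steps (a) and (b) on a toy factor. This is only a minimal base-R sketch: the `events` vector and the `lookup` table below are made-up stand-ins for illustration, not the Statcast data or the mapping used in this post.

```r
# Toy stand-in for data$events (made-up values, not the real Statcast data)
events <- factor(c("Groundout", "Single", "Pop Out", "Home Run", "Flyout", "Double"))

# Hypothetical lookup from raw event names to target classes
lookup <- c("Pop Out" = "Pop", "Flyout" = "Fly", "Groundout" = "GB",
            "Lineout" = "Liner", "Home Run" = "HR")

# (a) keep only the events we know how to classify
kept <- events[events %in% names(lookup)]

# (b) relabel, then rebuild the factor so unused levels disappear
classes <- factor(unname(lookup[as.character(kept)]))

levels(classes)
```

"Single" and "Double" drop out in step (a) because they tell us the result of the play, not the type of batted ball. Rebuilding the factor at the end is what keeps stale levels from hanging around.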
Thanks to another tip I got from Googling, we can do both with the plyr package:

    library(plyr)
    # (b) Map the raw event names onto the five batted-ball classes
    data$events <- revalue(data$events, c("Pop Out"="Pop",
        "Bunt Pop Out"="Pop","Flyout"="Fly","Sac Fly"="Fly",
        "Bunt Groundout"="GB","Groundout"="GB","Grounded Into DP"="GB",
        "Lineout"="Liner","Home Run"="HR"))
    # (a) Keep only the rows whose event is now one of the five target classes
    data <- data[data$events %in% c("Pop","Fly","GB","Liner","HR"),]
    # Take another look to be sure
    unique(data$events)
    # The data looks good except there are too many levels. Let's re-factor
    data$events <- factor(data$events)
    # Re-index to be sure
    rownames(data) <- NULL
    # Make 100% sure!
    str(data)

Oof! See how much work that was? We’re several dozen lines of code into this problem and we haven’t even started the machine learning yet! But that’s fine; the machine learning itself is the easy part. Let’s do that now.

    library(caret)
    # Hold out 30% of the rows for testing, stratified by event type
    inTrain <- createDataPartition(data$events,p=0.7,list=FALSE)
    training <- data[inTrain,]
    testing <- data[-inTrain,]

    method <- 'rf' # sure, random forest again, why not
    # train the model
    ctrl <- trainControl(method = 'repeatedcv', number = 5, repeats = 5)
    modelFit <- train(events ~ ., method=method, data=training, trControl=ctrl)

    # Run the model on the test set
    predicted <- predict(modelFit,newdata=testing)
    # Check out the confusion matrix
    confusionMatrix(predicted, testing$events)

    Prediction   GB  Pop  Fly   HR Liner
         GB    9059    5    4    1   244
         Pop      3 1156  123    0    20
         Fly      6  152 5166  367   457
         HR       0    0  360 1182    85
         Liner  230   13  449   77  2299

We did it! And the confusion matrix looks pretty good: most of the counts sit on the diagonal, which means most test-set predictions match the true batted-ball type.
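To put a number on that, here is a small sketch that recomputes overall accuracy and per-class hit rates by hand from the counts printed above. caret’s confusionMatrix output reports accuracy too; this just makes the arithmetic visible. The names `cm`, `accuracy`, `recall`, and `precision` are mine, and the matrix is typed in from the printed output rather than taken from the model object.

```r
# Counts copied from the confusion matrix above.
# Rows are predictions, columns are the true (reference) classes.
classes <- c("GB", "Pop", "Fly", "HR", "Liner")
cm <- matrix(
  c(9059,    5,    4,    1,  244,
       3, 1156,  123,    0,   20,
       6,  152, 5166,  367,  457,
       0,    0,  360, 1182,   85,
     230,   13,  449,   77, 2299),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Share of all test rows that landed on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Of the balls that truly were class X, how many did we call X?
recall <- diag(cm) / colSums(cm)

# Of the balls we called X, how many truly were X?
precision <- diag(cm) / rowSums(cm)

round(accuracy, 3)
round(recall, 3)
round(precision, 3)
```

Overall the model gets roughly 88% of the test set right. Ground balls are the easiest class to spot, while home runs and liners are the hardest, which makes sense given how much they overlap with fly balls.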
All we need to do now is visualize the model’s predictions, and we can make a very pretty picture of them with the amazing Plotly package for R:

    #install.packages('plotly')
    library(plotly)
    # Exit velocities from 40 to 120
    x <- seq(40,120,by=1)
    # Hit angles from 10 to 50
    y <- seq(10,50,by=1)
    # Make a data frame of the relevant x and y values
    plotDF <- data.frame(expand.grid(x,y))
    # Add the correct column names
    colnames(plotDF) <- c('hit_speed','hit_angle')
    # Add the classification
    plotPredictions <- predict(modelFit,newdata=plotDF)
    plotDF$pred <- plotPredictions

    p <- plot_ly(data=plotDF, x=~hit_speed, y = ~hit_angle, color=~pred, type="scatter", mode="markers") %>%
        layout(title = "Exit Velocity + Launch Angle = WIN")
    p

[Plot: predicted batted-ball type for each combination of hit_speed (x-axis) and hit_angle (y-axis)]

Awesome! It’s a *little* noisy, but overall not too bad. And it does kinda look like the original, which is reassuring.

That’s it! That’s all I have to say about machine learning. At this point, Google is your friend if you want to learn more. There are also some great classes online you can try, if you’re especially motivated. Enjoy, and I look forward to seeing what you can do with this!