Author Archive

Fixing “On Pace” Numbers

Suppose I tell you that a baseball team has just started the season 10-0. You literally know nothing about the team besides this information. What is a reasonable expectation for the number of games this team will win? Even if you don’t know the answer offhand, you probably know that the answer is not “162.” Tom Tango has been taking to Twitter recently to mock these “on-pace” numbers, and for good reason — saying the above hypothetical team is “on pace” for 162 wins has no real meaning in reality. So how do we fix it? I’m going to proceed in a way that a Bayesian statistician might, but mostly explaining the logic behind the reasoning, rather than going through any complicated math. So follow me if you want to see how a statistician thinks.
Read the rest of this entry »

Don’t Give Up on Devon Travis

Devon Travis is having a rough start to the 2017 season. As I’m writing this, he has “hit” .148/.207/.222, good for a wRC+ of 16 and WAR of -0.5. Fans are openly wondering if he should be sent back to triple-A. But all is not lost! If you look past the surface stats, there is hope for the young Blue Jay. Let’s explore.
Read the rest of this entry »

Important Michael Lorenzen Update

Michael Lorenzen has made some headlines this year as a hitter. He hit a home run! He has a 232 wRC+! Wow! But here’s an interesting tidbit about him you may not have known — he also pitches occasionally! As I’m typing this, he has pitched 11 innings with a 9.82 K/9 and a 2.45 BB/9, a pretty good start which suggests that his 5.73 ERA will come down soon. The most interesting part to me, though, is that he is getting these results in a completely different manner than the way he pitched in 2016.

We took a look at Michael Lorenzen’s 2016 season a little while back and noted that while he was throwing his sliders very hard, he simply wasn’t getting any results with them. Thanks to the Statcast search at Baseball Savant we can take a look at his 17 sliders this year. The first thing that jumps out is the velocity — he is averaging 86 MPH on his slider, significantly down from the 91.5 MPH average in 2016. In fact, he’s maxed out at 89 MPH this year, which means his average slider velocity from a year ago is two ticks higher than his maximum slider velocity this year. And that’s with everyone’s velocity looking higher this year.

But the second thing that you’ll notice is that the results when he throws the slider are really good:

Michael Lorenzen Sliders by Result, (early) 2017
Result Count
Ball 3
Called Strike 4
Foul 2
Groundout 3
Flyout 1
Swinging Strike 4
SOURCE: Baseball Savant

In 2016 Lorenzen threw 67 sliders of 94+ MPH and got four swinging strikes. He already has that many swinging strikes on his sliders in 2017, in only 17 pitches. He’s yet to allow a base hit on the pitch, and has only missed the zone three times. This is a pitch that’s really become a weapon for him, after being a serious liability last year.

Now if that were the only thing that’s different about Lorenzen, it would be fairly interesting. But it’s not. Based on results alone, he looks like a completely different pitcher than last year:

View post on

Last year his cutter was his best pitch; this year it’s been his worst. Last year his slider was his worst pitch; this year it’s been slightly above-average. Now, just like in every other baseball article you’ll read this month, I will include the caveat that it’s early. But to see this kind of a swing in results is intriguing. I would suspect he’ll be going to his off-speed stuff a bit more in the coming months. It’s working for him, and it might help make his fastball look even faster. Don’t get distracted by his hitting — his pitching is the thing to keep an eye on. He’s put bits and pieces of it together in the past, and if he can put it all together now, watch out.

Let’s Build Our Own Catch Probability Metric

By now you’ve seen the Statcast Catch Probabilities. They’re great! Or, at the very least, they’re a shiny new toy to play with until the regular season rolls around. But, as you may have noticed, there are a few frustrating details about it — namely, the actual math behind the statistic is completely opaque, and the details about when an individual catch happened are hard to find. So let’s fix those two problems! We’ll create a catch probability metric that anyone can compute in Excel, using data that anyone can download easily.

You may have noticed a problem with this plan, though — the data that is used for the official Statcast catch probability isn’t easily accessible. We’ll have to make do with what we can get from the Statcast search at Baseball Savant. Specifically, instead of using hang time and distance traveled, we’ll use exit velocity and launch angle. Note that this completely disregards defensive positioning and it even disregards the horizontal angle off the bat*! It’s going to make for a less perfect metric, of course, but (spoiler alert) it will turn out okay.

*This really makes more sense if you think about it in terms of probability of the hitter making an out. The old saying goes “hit ’em where they ain’t” but in recent years we’ve come to understand that it’s really “hit it hard and in the air.”

I’m not going to go into the details of how I computed this metric; it’s standard machine learning stuff. If you want to follow along with the computation, I’ve put my code up on GitHub. Instead of going through all that here, I’ll just jump to the finish line: the formula for catch probability ends up being

1/(1+exp(-(-10.152 + 0.057 * hit_speed + 0.218 * hit_angle)))

Now you might be worried that such a simple formula, excluding tons of information, might be totally worthless. I was worried about that too! But applying this formula to a test set revealed this formula to be surprisingly accurate:

Catch Probability Assessment
Statistic Value
Accuracy 0.8385
Precision 0.8338
Recall 0.8671
F1 0.8501

(if you’ve never seen those numbers before — closer to 1 is better. Trust me, it’s pretty good.)

Well, that’s all well and good, but how can you get this for yourself and play around with it? Start by downloading the data you’re interested in from Baseball Savant. For instance, you can get all the data from, say, May 1 of last year by going here. Download the CSV with the link at the bottom and then you can simply add the above formula in a new column in Excel. If you need a concrete example of how this looks in Google Sheets, I’ve put one here.

Okay, now you’ve got this, but what are you going to do with it? One possibility is to use this to try to figure out which plays the official metric estimated as being difficult. For instance, let’s say you’ve noticed that Miguel Sano made two highlight-quality plays but you don’t know Mike Petriello well enough to ask him which ones those are. Just compute your own probabilities and you’re off! Although, as expected, the numbers differ. Our numbers do have Sano making two plays in the 0-25% range, but they’re not the same ones that Statcast flagged (sorry about the quality of the GIFs).

Catch #1: estimated catch probability 18.3%

Catch #2: estimated catch probability 21.3%

The Twins announcers praised his first step in the former video, while in the second they talked about how the ball “hung up” for Sano to be able to catch it. Not spectacular plays by any means, but neither were the other two, of course.

Finally, because I’m sure you’re curious, here’s the top catch of 2016 according to this metric (estimated catch probability: 8.6%).

Of course it’s a Kevin Kiermaier catch. Hey, at least we know we’re doing something right.

Basic Machine Learning With R (Part 3)

Previous parts in this series: Part 1 | Part 2

If you’ve read the first two parts of this series, you already know how to do some pretty cool machine-learning stuff, but there’s still a lot to learn. Today, we will be updating this nearly seven-year-old chart featured on Tom Tango’s website. We haven’t done anything with Statcast data yet, so that will be cool. More importantly, though, this will present us with a good opportunity to work with an imperfect data set. My motto is “machine learning is easy — getting the data is hard,” and this exercise will prove it. As always, the code presented here is on my GitHub.

The goal today is to take exit velocity and launch angle, and then predict the batted-ball type from those two features. Hopefully by now you can recognize that this is a classification problem. The question becomes, where do we get the data we need to solve it? Let’s head over to the invaluable Statcast search at Baseball Savant to take care of this. We want to restrict ourselves to just balls in play, and to simplify things, let’s just take 2016 data. You can download the data from Baseball Savant in CSV format, but if you ask it for too much data, it won’t let you. I recommend taking the data a month at a time, like in this example page. You’ll want to scroll down and click the little icon in the top right of the results to download your CSV.

View post on

Go ahead and do that for every month of the 2016 season and put all the resulting CSVs in the same folder (I called mine statcast_data). Once that’s done, we can begin processing it.

Let’s load the data into R using a trick I found online (Google is your friend when it comes to learning a new programming language — or even using one you’re already pretty good at!).

filenames <- list.files(path = "statcast_data", full.names=TRUE)
data_raw <-"rbind", lapply(filenames, read.csv, header = TRUE))

The columns we want here are “hit_speed”, “hit_angle”, and “events”, so let’s create a new data frame with only those columns and take a look at it.

data <- data_raw[,c("hit_speed","hit_angle","events")]


'data.frame':	127325 obs. of  3 variables:
 $ hit_speed: Factor w/ 883 levels "100.0","100.1",..: 787 11 643 ...
 $ hit_angle: Factor w/ 12868 levels "-0.01               ",..: 7766 1975 5158  ...
 $ events   : Factor w/ 25 levels "Batter Interference",..: 17 8 11 ...

Well, it had to happen eventually. See how all of these columns are listed as “Factor” even though some of them are clearly numeric? Let’s convert those columns to numeric values.

data$hit_speed <- as.numeric(as.character(data$hit_speed))
data$hit_angle <- as.numeric(as.character(data$hit_angle))

There is also some missing data in this data set. There are several ways to deal with such issues, but we’re just simply going to remove any rows with missing data.

data <- na.omit(data)

Let’s next take a look at the data in the “events” column, to see what we’re dealing with there.



 [1] Field Error         Flyout              Single             
 [4] Pop Out             Groundout           Double Play        
 [7] Lineout             Home Run            Double             
[10] Forceout            Grounded Into DP    Sac Fly            
[13] Triple              Fielders Choice Out Fielders Choice    
[16] Bunt Groundout      Sac Bunt            Sac Fly DP         
[19] Triple Play         Fan interference    Bunt Pop Out       
[22] Batter Interference
25 Levels: Batter Interference Bunt Groundout ... Sacrifice Bunt DP

The original classification from Tango’s site had only five levels — POP, GB, FLY, LD, HR — but we’ve got over 20. We’ll have to (a) restrict to columns that look like something we can classify and (b) convert them to the levels we’re after. Thanks to another tip I got from Googling, we can do it like this:

data$events <- revalue(data$events, c("Pop Out"="Pop",
      "Bunt Pop Out"="Pop","Flyout"="Fly","Sac Fly"="Fly",
      "Bunt Groundout"="GB","Groundout"="GB","Grounded Into DP"="GB",
      "Lineout"="Liner","Home Run"="HR"))
# Take another look to be sure
# The data looks good except there are too many levels.  Let's re-factor
data$events <- factor(data$events)
# Re-index to be sure
rownames(data) <- NULL
# Make 100% sure!

Oof! See how much work that was? We’re several dozen lines of code into this problem and we haven’t even started the machine learning yet! But that’s fine; the machine learning itself is the easy part. Let’s do that now.

inTrain <- createDataPartition(data$events,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]

method <- 'rf' # sure, random forest again, why not
# train the model
ctrl <- trainControl(method = 'repeatedcv', number = 5, repeats = 5)
modelFit <- train(events ~ ., method=method, data=training, trControl=ctrl)

# Run the model on the test set
predicted <- predict(modelFit,newdata=testing)
# Check out the confusion matrix
confusionMatrix(predicted, testing$events)


Prediction   GB  Pop  Fly   HR Liner
     GB    9059    5    4    1   244
     Pop      3 1156  123    0    20
     Fly      6  152 5166  367   457
     HR       0    0  360 1182    85
     Liner  230   13  449   77  2299

We did it! And the confusion matrix looks pretty good. All we need to do now is view it, and we can make a very pretty visualization of this data with the amazing Plotly package for R:

# Exit velocities from 40 to 120
x <- seq(40,120,by=1)
# Hit angles from 10 to 50
y <- seq(10,50,by=1)
# Make a data frame of the relevant x and y values
plotDF <- data.frame(expand.grid(x,y))
# Add the correct column names
colnames(plotDF) <- c('hit_speed','hit_angle')
# Add the classification
plotPredictions <- predict(modelFit,newdata=plotDF)
plotDF$pred <- plotPredictions

p <- plot_ly(data=plotDF, x=~hit_speed, y = ~hit_angle, color=~pred, type="scatter", mode="markers") %>%
    layout(title = "Exit Velocity + Launch Angle = WIN")

View post on

Awesome! It’s a *little* noisy, but overall not too bad. And it does kinda look like the original, which is reassuring.

That’s it! That’s all I have to say about machine learning. At this point, Google is your friend if you want to learn more. There are also some great classes online you can try, if you’re especially motivated. Enjoy, and I look forward to seeing what you can do with this!

The Least Interesting Player of 2016

Baseball is great! We all love baseball. That’s why we’re here. We love everything about it, but we especially love the players who stick out. You know, the ones who’ve done something we’ve never seen before, or the ones that make us think, “Wow, I didn’t know that could happen.” It’s fun to look at players who are especially good — or, let’s face it, especially bad — at some aspect of this game. They’re the most interesting part of this game we love.

But not everyone can be interesting. Some players are just plain uninteresting! Like this guy.

OMG taking a pitch? That’s boring. You’re boring everybody. Quit boring everyone!

You caught a routine fly ball? YAWN! Wake me when something interesting happens.

But it’s hopeless; nothing interesting will ever happen with Stephen Piscotty. I’m sure the two GIFs above have convinced you that he was the least interesting player in baseball last year. But, on the off-chance that you have some lingering doubts, we can quantify it. I’ve made a custom leaderboard of various statistics for all qualified batters in 2016. For each of these statistics, I computed the z-score and the square of the z-score. In this way, we can boil down how interesting each player was to one number — the sum of the squared z-scores. The idea is that if a player was interesting in even one of these statistics, they’d have a high number there. Here are the results:

Click through for an interactive version

I don’t need to tell you who the guy on the far right is. On the flip side, though, there are two data points on the left that stick out. The slightly higher of the two is Marcell Ozuna, with an interest score of 1.627. The one on the very far left is Stephen Piscotty, with an interest score of 0.997. That’s right — if you sum the squares of his z-scores, you don’t even get to 1! This is as boring and average as baseball players get.

Where the real fun begins, though, is when you start making scatter plots of these statistics against each other. I’ve made an interactive version where you can play around with making these yourself, but here are a few highlights:



ISO vs. wRC+

Pretty boring, right? But wait, there’s more! Let’s investigate a little further what went into his interest score. Remember how we summed his squared z-scores and got a value below 1? Well, let’s look at the individual components that went into that sum.

The Most Boring Table Ever
Statistic Squared z-score
LD% 0.108
GB% 0.002
PA 0.296
G 0.220
OPS 0.001
BB% 0.057
SLG 4.888e-05
WAR 0.007
BABIP 0.141
K% 0.103
IFFB% 0.0004
ISO 5.313e-05
FB% 0.007
wOBA 0.022
AVG 1.69e-29
wRC+ 0.025
OBP 0.006

Yes, you’re reading that right — where he stood out the most was in games played and plate appearances. Yay, we got to see that much more boring! Also, I think it is especially apt that his AVG was EXACTLY league average.

All right, time to step back and be serious for a second. As Brian Kenny is always reminding us, there is great value in being a league-average hitter. Piscotty was worth 2.8 WAR last year, just his second year in the league. He’s already a very valuable contributor to a very good team. Maybe it’s time we started noticing guys who do everything just as well as everyone else, and value their contributions too?

(Nah, I’m going to go back and pore over Barry Bonds’s early-2000s stats for the next few hours.)

All the code used to generate the data and visualizations for this post can be found on my GitHub.

Basic Machine Learning With R (Part 2)

(For part 1 of this series, click here)

Last time, we learned how to run a machine-learning algorithm in just a few lines of R code. But how can we apply that to actual baseball data? Well, first we have to get some baseball data. There are lots of great places to get some — Bill Petti’s post I linked to last time has some great resources — but heck, we’re on FanGraphs, so let’s get the data from here.

You probably know this, but it took forever for me to learn it — you can make custom leaderboards here at FanGraphs and export them to CSV. This is an amazing resource for machine learning, because the data is nice and clean, and in a very user-friendly format. So we’ll do that to run our model, which today will be to try to predict pitcher WAR from the other counting stats. I’m going to use this custom leaderboard (if you’ve never made a custom leaderboard before, play around there a bit to see how you can customize things). If you click on “Export Data” on that page you can download the CSV that we’ll be using for the rest of this post.

View post on

Let’s load this data into R. Just like last time, all the code presented here is on my GitHub. Reading CSVs is super easy — assuming you named your file “leaderboard.csv”, it’s just this:

pitcherData <- read.csv('leaderboard.csv',fileEncoding = "UTF-8-BOM")

Normally you wouldn’t need the “fileEncoding” bit, but for whatever reason FanGraphs CSVs use a particularly annoying character encoding. You may also need to use the full path to the file if your working directory is not where the file is.

Let’s take a look at our data. Remember the “head” function we used last time? Let’s change it up and use the “str” function this time.

> str(pitcherData)
'data.frame':	594 obs. of  16 variables:
 $ Season  : int  2015 2015 2014 2013 2015 2016 2014 2014 2013 2014 ...
 $ Name    : Factor w/ 231 levels "A.J. Burnett",..: 230 94 47 ...
 $ Team    : Factor w/ 31 levels "- - -","Angels",..: 11 9 11  ...
 $ W       : int  19 22 21 16 16 16 15 12 12 20 ...
 $ L       : int  3 6 3 9 7 8 6 4 6 9 ...
 $ G       : int  32 33 27 33 33 31 34 26 28 34 ...
 $ GS      : int  32 33 27 33 33 30 34 26 28 34 ...
 $ IP      : num  222 229 198 236 232 ...
 $ H       : int  148 150 139 164 163 142 170 129 111 169 ...
 $ R       : int  43 52 42 55 62 53 68 48 47 69 ...
 $ ER      : int  41 45 39 48 55 45 56 42 42 61 ...
 $ HR      : int  14 10 9 11 15 15 16 13 10 22 ...
 $ BB      : int  40 48 31 52 42 44 46 39 58 65 ...
 $ SO      : int  200 236 239 232 301 170 248 208 187 242 ...
 $ WAR     : num  5.8 7.3 7.6 7.1 8.6 4.5 6.1 5.2 4.1 4.6 ...
 $ playerid: int  1943 4153 2036 2036 2036 12049 4772 10603 ...

Sometimes the CSV needs cleaning up, but this one is not so bad. Other than “Name” and “Team”, everything shows as a numeric data type, which isn’t always the case. For completeness, I want to mention that if a column that was actually numeric showed up as a factor variable (this happens A LOT), you would convert it in the following way:

pitcherData$WAR <- as.numeric(as.character(pitcherData$WAR))

Now, which of these potential features should we use to build our model? One quick way to explore good possibilities is by running a correlation analysis:

cor(subset(pitcherData, select=-c(Season,Name,Team,playerid)))

Note that in this line, we’ve removed the columns that are either non-numeric or are totally uninteresting to us. The “WAR” column in the result is the one we’re after — it looks like this:

W    0.50990268
L   -0.36354081
G    0.09764845
GS   0.20699173
IP   0.59004342
H   -0.06260448
R   -0.48937468
ER  -0.50046647
HR  -0.47068461
BB  -0.24500566
SO   0.74995296
WAR  1.00000000

Let’s take a first crack at this prediction with the columns that show the most correlation (both positive and negative): Wins, Losses, Innings Pitched, Earned Runs, Home Runs, Walks, and Strikeouts.

goodColumns <- c('W','L','IP','ER','HR','BB','SO','WAR')
inTrain <- createDataPartition(pitcherData$WAR,p=0.7,list=FALSE)
training <- data[inTrain,goodColumns]
testing <- data[-inTrain,goodColumns]

You should recognize this setup from what we did last time. The only difference here is that we’re choosing which columns to keep; with the iris data set we didn’t need to do that. Now we are ready to run our model, but which algorithm do we choose? Lots of ink has been spilled about which is the best model to use in any given scenario, but most of that discussion is wasted. As far as I’m concerned, there are only two things you need to weigh:

  1. how *interpretable* you want the model to be
  2. how *accurate* you want the model to be

If you want interpretability, you probably want linear regression (for regression problems) and decision trees or logistic regression (for classification problems). If you don’t care about other people being able to make heads or tails out of your results, but you want something that is likely to work well, my two favorite algorithms are boosting and random forests (these two can do both regression and classification). Rule of thumb: start with the interpretable ones. If they work okay, then there may be no need to go to something fancy. In our case, there already is a black-box algorithm for computing pitcher WAR, so we don’t really need another one. Let’s try for interpretability.

We’re also going to add one other wrinkle: cross-validation. I won’t say too much about it here except that in general you’ll get better results if you add the “trainControl” stuff. If you’re interested, please do read about it on Wikipedia.

method = 'lm' # linear regression
ctrl <- trainControl(method = 'repeatedcv',number = 10, repeats = 10)
modelFit <- train(WAR ~ ., method=method, data=training, trControl=ctrl)

Did it work? Was it any good? One nice quick way to tell is to look at the summary.

> summary(modelFit)

lm(formula = .outcome ~ ., data = dat)

     Min       1Q   Median       3Q      Max 
-1.38711 -0.30398  0.01603  0.31073  1.34957 

              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.6927921  0.2735966  -2.532  0.01171 *  
W            0.0166766  0.0101921   1.636  0.10256    
L           -0.0336223  0.0113979  -2.950  0.00336 ** 
IP           0.0211533  0.0017859  11.845  < 2e-16 ***
ER           0.0047654  0.0026371   1.807  0.07149 .  
HR          -0.1260508  0.0048609 -25.931  < 2e-16 ***
BB          -0.0363923  0.0017416 -20.896  < 2e-16 ***
SO           0.0239269  0.0008243  29.027  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4728 on 410 degrees of freedom
Multiple R-squared:  0.9113,	Adjusted R-squared:  0.9097 
F-statistic: 601.5 on 7 and 410 DF,  p-value: < 2.2e-16

Whoa, that’s actually really good. The adjusted R-squared is over 0.9, which is fantastic. We also get something else nice out of this, which is the significance of each variable, helpfully indicated by a 0-3 star system. We have four variables that were three-stars; what would happen if we built our model with just those features? It would certainly be simpler; let’s see if it’s anywhere near as good.

> model2 <- train(WAR ~ IP + HR + BB + SO, method=method, data=training, trControl=ctrl)
> summary(model2)

lm(formula = .outcome ~ ., data = dat)

     Min       1Q   Median       3Q      Max 
-1.32227 -0.27779 -0.00839  0.30686  1.35129 

              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.8074825  0.2696911  -2.994  0.00292 ** 
IP           0.0228243  0.0015400  14.821  < 2e-16 ***
HR          -0.1253022  0.0039635 -31.614  < 2e-16 ***
BB          -0.0366801  0.0015888 -23.086  < 2e-16 ***
SO           0.0241239  0.0007626  31.633  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4829 on 413 degrees of freedom
Multiple R-squared:  0.9067,	Adjusted R-squared:  0.9058 
F-statistic:  1004 on 4 and 413 DF,  p-value: < 2.2e-16

Awesome! The results still look really good. But of course, we need to be concerned about overfitting, so we can’t be 100% sure this is a decent model until we evaluate it on our test set. Let’s do that now:

# Apply to test set
predicted2 <- predict(model2,newdata=testing)
# R-squared
cor(testing$WAR,predicted2)^2 # 0.9108492
# Plot the predicted values vs. actuals

View post on

Fantastic! This is as good as we could have expected from this, and now we have an interpretable version of pitcher WAR, specifically,

WAR = -0.8 + 0.02 * IP + -0.13 * HR + -0.04 * BB + 0.02 * K

Most of the time, machine learning does not come out as nice as it has in this post and the last one, so don’t expect miracles every time out. But you can occasionally get some really cool results if you know what you’re doing, and at this point, you kind of do! I have a few ideas about what to write about for part 3 (likely the final part), but if there’s something you really would like to know how to do, hit me up in the comments.

Basic Machine Learning With R (Part 1)

You’ve heard of machine learning. How could you not have? It’s absolutely everywhere, and baseball is no exception. It’s how Gameday knows how to tell a fastball from a cutter and how the advanced pitch-framing metrics are computed. The math behind these algorithms can go from the fairly mundane (linear regression) to seriously complicated (neural networks), but good news! Someone else has wrapped up all the complex stuff for you. All you need is a basic understanding of how to approach these problems and some rudimentary programming knowledge. That’s where this article comes in. So if you like the idea of predicting whether a batted ball will become a home run or predicting time spent on the DL, this post is for you.

We’re going to use R and RStudio to do the heavy lifting for us, so you’ll have to download them (they’re free!). The download process is fairly painless and well-documented all over the internet. If I were you, I’d start with this article. I highly recommend reading at least the beginning of that article; it not only has an intro to getting started with R, but information on getting baseball-related data, as well as some other indispensable links. Once you’ve finished downloading RStudio and reading that article head back here and we’ll get started! (If you don’t want to download anything for now you can run the code from this first part on R-Fiddle — though you’ll want to download R in the long run if you get serious.)

Let’s start with some basic machine-learning concepts. We’ll stick to supervised learning, of which there are two main varieties: regression and classification. To know what type of learning you want, you need to know what problem you’re trying to solve. If you’re trying to predict a number — say, how many home runs a batter will hit or how many games a team will win — you’ll want to run a regression. If you’re trying to predict an outcome — maybe if a player will make the Hall of Fame or if a team will make the playoffs — you’d run a classification. These classification algorithms can also give you probabilities for each outcome, instead of just a binary yes/no answer (so you can give a probability that a player will make the Hall of Fame, say).

Okay, so the first thing to do is figure out what problem you want to solve. The second part is figuring out what goes into the prediction. The variables that go into the prediction are called “features,” and feature selection is one of the most important parts of creating a machine-learning algorithm. To predict how many home runs a batter will hit, do you want to look at how many triples he’s hit? Maybe you look at plate appearances, or K%, or handedness … you can go on and on, so choose wisely.

Enough theory for now — let’s look at a specific example using some real-life R code and the famous “iris” data set. This code and all subsequent code will be available on my GitHub.

inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE)
training <- iris[inTrain,]
model <- train(Species~.,data=training,method='rf')

Believe it or not, in those five lines of code we have run a very sophisticated machine-learning model on a subset of the iris data set! Let’s take a more in-depth look at what happened here.


This first line loads the iris data set into a data frame — a variable type in R that looks a lot like an Excel spreadsheet or CSV file. The data is organized into columns and each column has a name. That first command loaded our data into a variable called “iris.” Let’s actually take a look at it; the “head” function in R shows the first five rows of the dataset — type


into the console.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

As you hopefully read in the Wikipedia page, this data set consists of various measurements of three related species of flowers. The problem we’re trying to solve here is to figure out, given the measurements of a flower, which species it belongs to. Loading the data is a good first step.


If you’ve been running this code while reading this post, you may have gotten the following error when you got here:

Error in library(caret) : there is no package called 'caret'

This is because, unlike the iris data set, the “caret” library doesn’t ship with R. That’s too bad, because the caret library is the reason we’re using R in the first place, but fear not! Installing missing packages is dead easy, with just the following command


or, if you have a little time and want to ensure that you don’t run into any issues down the road:

install.packages("caret", dependencies = c("Depends", "Suggests"))

The latter command installs a bunch more stuff than just the bare minimum, and it takes a while, but it might be worth it if you’re planning on doing a lot with this package. Note: you should be planning to do a lot with it — this library is a catch-all for a bunch of machine-learning tools and makes complicated processes look really easy (again, see above: five lines of code!).

inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE)

We never want to train our model on the whole data set, a concept I’ll get into more a little later. For now, just know that this line of code randomly selects 70% of our data set to use to train the model. Note also R’s “<-” notation for assigning a value to a variable.

training <- iris[inTrain,]

Whereas the previous line chose which rows we’d use to train our model, this line actually creates the training data set. The “training” variable now has 105 randomly selected rows from the original iris data set (you can again use the “head” function to look at the top 5).

model <- train(Species~.,data=training,method='rf')

This line of code runs the actual model! The “train” function is the model-building one. “Species~.” means we want to predict the “Species” column from all the others. “data=training” means the data set we want to use is the one we assigned to the “training” variable earlier. And “method=’rf'” means we will use the very powerful and very popular random-forest method to do our classification. If, while running this command, R tells you it needs to install something, go ahead and do it. R will run its magic and create a model for you!

Now, of course, a model is no good unless we can apply it to data that the model hasn’t seen before, so let’s do that now. Remember earlier when we only took 70% of the data set to train our model? We’ll now run our model on the other 30% to see how good it was.

# Create the test set to evaluate the model
# Note that "-inTrain" with the minus sign pulls everything NOT in the training set
testing <- iris[-inTrain,]
# Run the model on the test set
predicted <- predict(model,newdata=testing)
# Determine the model accuracy
accuracy <- sum(predicted == testing$Species)/length(predicted)
# Print the model accuracy

Pretty good, right? You should get a very high accuracy doing this, likely over 95%*. And it was pretty easy to do! If you want some homework, type the following command and familiarize yourself with all its output by Googling any words you don’t know:

confusionMatrix(predicted, testing$Species)

*I can’t be sure because of the randomness that goes into both choosing the training set and building the model.

Congratulations! You now know how to do some machine learning, but there’s so much more to do. Next time we’ll actually play around with some baseball data and explore some deeper concepts. In the meantime, play around with the code above to get familiar with R and RStudio. Also, if there’s anything you’d specifically like to see, leave me a comment and I’ll try to get to it.

The Worst Pitch in Baseball

Quick thought experiment for you: what’s the worst pitch a pitcher can throw? You might say “one that results in a home run” but I disagree. Even in batting practice, hitters don’t hit home runs all the time, right? In fact, let’s quantify it — according to Baseball Savant there were 806 middle-middle fastballs between 82 and 88 MPH thrown in 2016. Here are the results of those pitches:

2016 Grooved Fastballs
Result Count Probability
Strike 296 36.7%
Ball 1 0.1%
Out 191 23.7%
Single 49 6.1%
Double 17 2.1%
Triple 4 0.5%
Home Run 36 4.5%
Foul 212 26.3%
SOURCE: Baseball Savant

So 86% of the time, we have a neutral or positive result for the pitcher, and the remaining 14% something bad happens. Not great, but when a pitcher *does* give up a homer on one of these pitches, there wasn’t really more than a 5% chance of that happening.

No, for my money, the worst thing a pitcher can do is to throw an 0-2 pitch that has a high probability of hitting a batter. The pitcher has a huge built-in advantage on 0-2, and by throwing this pitch he throws it all away and gives the batter a free base (or, at best, runs the count to 1-2). But everyone makes mistakes.

That’s Clayton Kershaw, hitting the very first batter he saw in 2015 with an 0-2 pitch. Here’s Vin Scully, apparently unwilling to believe Kershaw could make such a mistake, calling the pitch:

Strike two pitch on the way, in the dirt, check swing, and it might have hit him on the foot, and I believe it did. So Wil Myers, grazed by a pitch on an 0-2 count, hit on the foot and awarded first base. So Myers…and actually, he got it on his right knee when you look at the replay.

I was expecting more of a reaction from Kershaw — for reference, check out this reaction to throwing Freddie Freeman a sub-optimal pitch — but we didn’t get one. I wouldn’t worry about him, though — he’s since thrown 437 pitches on 0-2 counts without hitting a batter.

Kershaw is pretty good at avoiding this kind of mistake, but the true champion of 0-2 HBP avoidance is Yovani Gallardo*, who has thrown well over 1,200 0-2 pitches in his career without hitting a batter once. Looking at a heat map of his 0-2 pitches to right-handers (via Baseball Savant), you can see why — it’s hard to hit a batter when you’re (rightly) burying the pitch in the opposite batter’s box.

*Honorable mention: Mat Latos, who has thrown nearly as many 0-2 pitches as Gallardo without hitting a batter

Of course, 0-2 HBPs are fairly rare events, so it shouldn’t be too surprising to find that a few pitchers have managed to avoid them entirely. In fact, most pitchers are well under 1% of batters hit on 0-2 pitches. To get a global overview of how all pitchers did, let’s look at a scatter plot of average 0-2 velocity versus percent of HBPs in such counts over the past three years (click through for an interactive version):

I think one of these data points sticks out a bit to you.

I hate to pick on the guy, but that’s Nick Franklin, throwing the only 0-2 pitch of his life, and hitting Danny Espinosa when a strikeout would have (mercifully) ended the top of the ninth of this game against the Nationals. Interestingly, Franklin was much more demonstrative than Kershaw was, clapping his hands together and then swiping at the ball when it came back from the umpire. He probably knew that was his best opportunity to record a strikeout in the big leagues, and instead he gave his man a free base. Kevin Cash! Give this man another chance to redeem himself. He doesn’t want to be this kind of outlier forever.

Happy Trails, Josh Johnson

Josh Johnson could pitch. In this decade, seven players have put up a season in which they threw 180+ innings with a sub-60 ERA-: Clayton Kershaw (three times), Felix Hernandez (twice), Kyle Hendricks and Jon Lester in 2016, Zack Greinke and Jake Arrieta in 2015, and Josh Johnson in 2010. That was the second straight excellent year for Johnson, making the All-Star team in both 2009 and 2010, and finishing fifth in the Cy Young balloting the latter year. Early in 2011 he just kept it going, with a 0.88 ERA through his first few starts. In four of his first five starts that year, he took a no-hitter into the fifth inning. Dusty Baker — a man who has seen quite a few games of baseball in his life and normally isn’t too effusive in his praise of other teams’ players — had this to say at that point:

“That guy has Bob Gibson stuff. He has power and finesse, instead of just power. That’s a nasty combination.”

It seemed like he was going to dominate the NL East for years to come.

Josh Johnson felt pain. His first Tommy John surgery was in 2007, when he was just 23. His elbow had been bothering him for nearly a year before he finally got the surgery. His manager was optimistic at the time:

“I think he’ll be fine once he gets that rehab stuff out of the way,” Gonzalez said. “You see guys who underwent Tommy John surgery, they come back and pitch better.”

But the hits kept coming. His excellent 2010 season was cut short because of shoulder issues (though he didn’t go on the DL) and his promising 2011 season came up short because of shoulder issues. Those same issues had been bothering him all season but he pitched through the pain for two months.

“It took everything I had to go and say something,” he said. “Once I did, it was something lifted off my shoulders. Let’s get it right and get it back to feeling like it did at the beginning of the season.”

“I’m hoping [to return by June 1st],” he said. “You never know with this kind of stuff. You’ve got to get all the inflammation out of there. From there it should be fine.”

That injury cost him the rest of the season.

Josh Johnson loved baseball. Think about something you loved doing, and your reaction if someone told you that you had to undergo painful surgery with a 12-month recovery time in order to continue doing it. Imagine you did that, but then later on, someone told you that you had to do it again if you wanted even an outside chance of performing that activity, but the odds were pretty low. Josh Johnson had three Tommy John surgeries, because they gave him a glimmer of hope of continuing to play baseball.

Josh Johnson had a great career. It’s only natural to look at a career cut short by injuries and ask “what if?” but he accomplished plenty. He struck out Derek Jeter and Ichiro in an All-Star Game, threw the first pitch in Marlins Park, and made over $40 million playing the game he loved. He even lucked his way into hitting three home runs. Now he’s a 33-year-old millionaire in retirement; I think he did all right.