
Forecasting League-wide Strikeout and Homer Rates

Two of the more notable league-wide trends in MLB today are rising home run and strikeout rates.  Strikeouts have consistently trended upward over the past 35 or so years.  Home-run rate, meanwhile, has moved up and down a bit more, but has also increased during that span overall.

An accurate long-term forecast of trends such as these could be valuable.  As this Beyond the Box Score article illustrates, ideal roster construction changes in tandem with the league-wide run-scoring environment.  When offense is scarce, power hitters see their value go up.  When offense is plentiful, speedy contact hitters become somewhat more valuable.

In the following paragraphs, I will attempt to project strikeout percentage and home-run rate — measured as plate appearances per home run — for the 2017-2026 seasons.  First I will take a univariate approach (i.e., use only past patterns in the data to predict future values). Then, I will try to improve the model by adding in an external regressor variable.

Strikeout Rate

First, here’s a plot of the raw data.

Strikeouts rose fairly steadily from the early 1920s to the late 1960s, dipped for about 10 years, then started to tick back up again around 1980.  They’ve been on the rise ever since, and at an especially accelerated pace since 2005.

I considered several classes of time-series models to represent this data, including Auto-Regressive Integrated Moving Average (ARIMA), exponential smoothing state-space (ETS), and artificial neural network models.  I used AICc to narrow down the field of candidates.  I then split the data into a training set and a test set, fit each remaining model on the training data, and evaluated its forecast accuracy with a rolling forecast origin, using mean absolute error and median absolute prediction error as the yardsticks.
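To make the rolling-forecast-origin idea concrete, here is a minimal Python sketch.  It is not the code behind the results in this post: the two forecasters are simple stand-ins, the function names are my own, and the `toy` series is made up purely for illustration (it is not MLB data).  The training window grows by one season at a time, each model forecasts the next point, and the absolute errors are summarized by their mean and median.

```python
from statistics import mean, median


def naive_forecast(history, h):
    """Random walk without drift: repeat the last observed value."""
    return history[-1]


def drift_forecast(history, h):
    """Random walk with drift: extend the average historical change h steps."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + h * slope


def rolling_origin_errors(series, forecaster, min_train, h=1):
    """Absolute h-step-ahead errors, growing the training window one step at a time."""
    errors = []
    for origin in range(min_train, len(series) - h + 1):
        train = series[:origin]          # everything observed before the origin
        actual = series[origin + h - 1]  # the value h steps past the training data
        errors.append(abs(actual - forecaster(train, h)))
    return errors


# Hypothetical toy series, NOT real strikeout data.
toy = [10.0, 10.4, 10.9, 11.1, 11.8, 12.0, 12.7, 13.1, 13.4, 14.0]

for name, forecaster in [("naive", naive_forecast), ("drift", drift_forecast)]:
    errs = rolling_origin_errors(toy, forecaster, min_train=5)
    print(name, "MAE:", round(mean(errs), 3), "median AE:", round(median(errs), 3))
```

On a steadily trending series like this toy one, the drift forecaster should come out ahead, which is the same kind of comparison used to pick among the candidate models.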

The data had to be differenced once to make it approximately stationary, after which there was little to no auto-correlation remaining.  Given this fact, it shouldn’t be too surprising that the best-performing model was a random walk with drift.  Below are forecasts from this model for the next decade, along with 80% and 95% prediction intervals.

Year Forecast Low 80 High 80 Low 95 High 95
2017 21.21 20.49 21.92 20.11 22.30
2018 21.31 20.30 22.33 19.76 22.87
2019 21.42 20.17 22.67 19.51 23.33
2020 21.52 20.07 22.97 19.31 23.74
2021 21.63 20.00 23.26 19.14 24.12
2022 21.74 19.94 23.53 18.99 24.48
2023 21.84 19.89 23.79 18.86 24.82
2024 21.95 19.86 24.04 18.75 25.15
2025 22.05 19.83 24.28 18.65 25.46
2026 22.16 19.80 24.52 18.55 25.77

The model projects that K% will keep rising, at roughly 0.1 percentage points per year, but more slowly than it has over the past decade.
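For readers who want to see the mechanics, a random walk with drift can be sketched in a few lines of Python.  This is an illustrative sketch rather than the code that produced the table: the function name and the `toy` series are my own inventions, and the interval formula assumes normally distributed errors.  The drift is just the average year-over-year change, which is why the point forecasts in the table climb by a constant step, and the intervals widen with the horizon because the shocks accumulate.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def rw_drift_forecast(series, horizon, levels=(0.80, 0.95)):
    """Point forecasts and prediction intervals from a random walk with drift."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    drift = mean(diffs)    # average year-over-year change
    sigma = stdev(diffs)   # spread of the changes around the drift
    n = len(diffs)
    rows = []
    for h in range(1, horizon + 1):
        point = series[-1] + h * drift
        # h accumulated shocks, plus the uncertainty in the estimated drift
        se = sigma * sqrt(h * (1 + h / n))
        row = {"h": h, "point": point}
        for level in levels:
            z = NormalDist().inv_cdf(0.5 + level / 2)
            row[level] = (point - z * se, point + z * se)
        rows.append(row)
    return rows


# Hypothetical toy series, NOT real strikeout data.
toy = [10.0, 10.4, 10.9, 11.1, 11.8, 12.0, 12.7, 13.1, 13.4, 14.0]

for row in rw_drift_forecast(toy, horizon=3):
    lo80, hi80 = row[0.80]
    print(row["h"], round(row["point"], 2), round(lo80, 2), round(hi80, 2))
```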

Home Run Rate

I used the same general process to fit a model for the home run data, except I first applied a Box-Cox transformation to stabilize the variance.  This time, some auto-correlation remained after differencing.  The best-performing model turned out to be an ARIMA(0,1,1).

Once again, 80% and 95% prediction intervals are given from that model along with the point forecasts.

Year Forecast Low 80 High 80 Low 95 High 95
2017 34.86 31.87 38.58 30.52 40.95
2018 34.86 31.39 39.37 29.85 42.36
2019 34.86 30.98 40.08 29.30 43.66
2020 34.86 30.63 40.74 28.83 44.91
2021 34.86 30.31 41.36 28.42 46.13
2022 34.86 30.03 41.96 28.04 47.32
2023 34.86 29.77 42.54 27.70 48.50
2024 34.86 29.53 43.10 27.39 49.69
2025 34.86 29.31 43.65 27.11 50.87
2026 34.86 29.10 44.19 26.84 52.06
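The intervals in the table above are lopsided: for 2017, the 80% interval reaches 2.99 below the point forecast but 3.72 above it.  That is what you would expect when intervals are built on a Box-Cox scale and then transformed back.  The Python sketch below illustrates the effect under loudly stated assumptions: the Box-Cox parameter actually used is not given here, so the sketch uses lambda = 0 (a plain log transform), and the `sigma` and `alpha` values are made up.  The standard error formula is the standard one for an ARIMA(0,1,1) without drift, where `alpha` equals one plus the moving-average coefficient.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def box_cox(y, lam):
    """Box-Cox transform; lam == 0 is the log transform."""
    return log(y) if lam == 0 else (y ** lam - 1) / lam


def inv_box_cox(w, lam):
    """Inverse of box_cox."""
    return exp(w) if lam == 0 else (lam * w + 1) ** (1 / lam)


def flat_forecast_intervals(level_t, sigma, alpha, lam, horizon, coverage=0.80):
    """Back-transformed intervals for a flat ARIMA(0,1,1)-style forecast.

    level_t and sigma live on the transformed scale.
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    rows = []
    for h in range(1, horizon + 1):
        # forecast standard error h steps ahead on the transformed scale
        se = sigma * sqrt(1 + (h - 1) * alpha ** 2)
        lo, hi = level_t - z * se, level_t + z * se
        rows.append((inv_box_cox(level_t, lam), inv_box_cox(lo, lam), inv_box_cox(hi, lam)))
    return rows


LAM = 0        # assumption: log transform, not the lambda used for the table
SIGMA = 0.08   # hypothetical residual standard deviation on the log scale
ALPHA = 0.5    # hypothetical smoothing weight (1 + MA coefficient)

for point, lo, hi in flat_forecast_intervals(box_cox(34.86, LAM), SIGMA, ALPHA, LAM, 3):
    print(round(point, 2), round(lo, 2), round(hi, 2))
```

The intervals are symmetric on the transformed scale, but after back-transformation the upper side is always wider than the lower side, matching the shape of the table.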

The projection is flat, but with a decrease in home-run rate from one every 32.90 PA in 2016 to one every 34.86 PA going forward.  If plate appearances remain constant, this would mean a 315 home-run reduction across MLB, or just over 10 per team.
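The arithmetic behind that reduction can be spelled out in a few lines of Python.  One input is an assumption brought in from outside this post: the 2016 league total of 5,610 home runs.  The two PA/HR figures come from the text above.

```python
HR_2016 = 5610              # assumption: 2016 MLB home-run total, not stated in this post
PA_PER_HR_2016 = 32.90      # observed 2016 rate
PA_PER_HR_FORECAST = 34.86  # flat ARIMA(0,1,1) forecast
TEAMS = 30

# Hold plate appearances fixed at the level implied by the 2016 figures.
plate_appearances = HR_2016 * PA_PER_HR_2016
projected_hr = plate_appearances / PA_PER_HR_FORECAST

reduction = HR_2016 - projected_hr
per_team = reduction / TEAMS

print("league-wide reduction:", round(reduction))
print("per team:", round(per_team, 1))
```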

Modeling with Regressors

The difficult part with including regressors in the model is finding ones that are known into the future.  Exit velocity, for example, is something that would probably be quite helpful if you were trying to predict home-run rate.  However, since we don’t actually know what it will be in a given season until after that season is over, it doesn’t do much good for forecasting purposes.

One variable I was able to consider was the percentage of home runs and strikeouts in previous years that came from particularly young or old players.  My theory was that if an unusually high percentage of home runs (or strikeouts) came from players who were nearing the ends of their careers, league-wide numbers would be more likely to drop in the coming years (and vice versa if the sources of strikeouts or power were unusually concentrated among young players).

As it turns out, considering age was not especially useful when I back-tested the strikeout model.  Considering the number of old power hitters was not very useful either.  However, the percentage of home runs that came from players under 25 was a significant predictor of home-run rate in future years.

I created a variable called “Youth Index” that averaged the percentage of home runs from young players in the previous five seasons, weighted by their correlations to home-run rate in the season in question.  To avoid having to forecast Youth Index separately, I used a slightly different model for each step in the forecast, each considering only known data.  For example, for the 2017 forecast, data from each of the 2012-2016 seasons is available, but for the 2018 forecast, 2017 data is not.  Thus, the Youth Index predictor for 2018 used only data from 2-5 seasons back, the 2019 Youth Index predictor used only data from 3-5 seasons back, etc.  I limited the forecast to five seasons ahead, by which point the model started to converge with the univariate forecast anyway.

Year Forecast Low 80 High 80 Low 95 High 95
2017 36.27 33.15 40.16 31.74 42.65
2018 36.25 32.84 40.61 31.32 43.45
2019 36.03 32.38 40.81 30.77 44.00
2020 35.59 31.71 40.77 30.02 44.31
2021 35.67 31.37 41.62 29.54 45.84
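The horizon-specific Youth Index described above can be sketched in Python as follows.  This is a rough illustration under stated assumptions, not the exact calculation behind the table: the function name is my own, the shares and lag weights are hypothetical placeholders, and normalizing the weights so they sum to one is my assumption about how the weighted average is formed.

```python
def youth_index(share_by_season, weight_by_lag, target_season, last_known_season, max_lag=5):
    """Weighted average of young-player HR share, using only seasons already observed.

    Lag k refers to the season k years before target_season.  Lags that point
    past last_known_season are skipped, so longer horizons use fewer seasons.
    """
    numerator = 0.0
    total_weight = 0.0
    for lag in range(1, max_lag + 1):
        season = target_season - lag
        if season > last_known_season:
            continue  # this season has not been played yet
        weight = weight_by_lag[lag]
        numerator += weight * share_by_season[season]
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no observed seasons within max_lag of the target season")
    return numerator / total_weight


# Hypothetical inputs, NOT the real shares or correlations.
shares = {2012: 0.10, 2013: 0.11, 2014: 0.12, 2015: 0.15, 2016: 0.18}
weights = {1: 0.5, 2: 0.4, 3: 0.3, 4: 0.2, 5: 0.1}

for target in range(2017, 2022):
    print(target, round(youth_index(shares, weights, target, last_known_season=2016), 4))
```

By 2021 the index rests on a single season (2016), which is one way to see why the forecast stops at five seasons ahead.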

*Note: the red and green lines are 80% and 95% prediction intervals, just like on the other graphs.  They only look different because I created this graph manually rather than with an R package.

The updated forecast projects a more aggressive rebound in PA/HR (i.e., decrease in home-run rate).  The overall difference between the two forecasts is not huge, but it is not nothing either.  Interestingly enough, the model is over 90% confident that PA/HR will rise next season: the lower bound of the 80% interval for 2017 (33.15) already sits above the 2016 mark of 32.90.

Ultimately, both home-run and strikeout rates are influenced by a wide array of factors, many of which are difficult or even impossible to consider in a long-ish term forecast like this.  The prediction intervals aren’t as narrow as I’d like, which suggests the observed data may end up deviating quite a bit from these projections.  Nonetheless, I think this is a good starting point.