Predicting 2015 Starting Pitcher Performance Using Regression Trees

Projecting starting pitcher performance has proved more difficult than projecting hitter performance, mostly because pitcher skill level and performance tend to be more volatile. Another issue is that pitcher performance indicators rely heavily on batted-ball outcomes. This means a team’s defense and luck (e.g., softly hit balls that drop for hits), both of which are mostly out of the pitcher’s control, become a large part of run prevention. This realization has led to the development of a variety of pitching statistics that attempt to reduce pitcher performance to metrics that rely only on outcomes under the pitcher’s control, such as walks, strikeouts, and home runs (e.g., Fielding Independent Pitching, FIP). Given that these metrics are the state of the art for summarizing and describing a player’s past performance (though they are not necessarily predictive measures; see Dave Cameron’s 2011 article here), it is useful to develop ways to predict these metrics from statistics recorded in prior seasons. As such, the goal of the current analysis was to develop models, using various regression tree methods, that best predict starting pitcher performance metrics.
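For readers unfamiliar with FIP, the sketch below shows the commonly published form of the calculation in Python. It is an illustration rather than part of the analysis: the function name is hypothetical, and the default constant of 3.10 is only a placeholder, since the actual constant is recalculated each season to put FIP on the league ERA scale.

```python
def fip(hr, bb, hbp, k, ip, fip_constant=3.10):
    """Fielding Independent Pitching from outcomes the pitcher controls.

    fip_constant is an assumed placeholder; the real value changes by season.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    # Home runs and free passes raise FIP; strikeouts lower it.
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant
```

Note that balls in play never enter the formula, which is what makes the metric independent of defense.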

Data
Data for these analyses were compiled from several different sources, including Fangraphs.com and the ‘Lahman’ and ‘Retrosheet’ packages in R. Data were aggregated from the prior three seasons (2012-2014), as well as the 2015 regular season. The final data set included the average 2012-2014 performance statistics of starting pitchers who also pitched at least 50 innings during the 2015 season (N=127). The primary outcome was 2015 pitcher Wins Above Replacement (WAR). Predictors included aggregated values of over 30 performance metrics from the prior three seasons, including standard and advanced statistics (e.g., K-BB%), batted-ball measures (e.g., GB%), quality-of-contact statistics (e.g., hard contact %), and PITCHf/x measures (e.g., average fastball velocity).

Analytic Approach
The goal of this analysis was to use several different data modeling techniques to develop models that best predicted pitcher performance during the 2015 season from pitching data from the 2012-2014 seasons. Three separate techniques were used, all of which fall within the general family of Classification and Regression Tree (CART) methods. CART methods use search algorithms to find the variables that are most important for prediction, then determine the best possible cut point on each selected predictor in order to partition the data into multiple regions of the predictor space (Breiman, Friedman, Olshen, & Stone, 1984; Steinberg & Colla, 2009). These procedures allow for non-linear associations and higher-order interaction effects. Regression trees were grown using several different packages in R, including the rpart and party packages. These packages are capable of growing large regression trees, but they also include cost complexity and control parameters that allow for the assessment of overfitting and the reduction of tree size. Next, a technique known as boosting, implemented in the gbm package in R, was used to identify the predictors of highest importance for predicting pitcher performance. Although similar to ensemble CART methods that re-sample the data to grow multiple large regression trees (e.g., bootstrap aggregation), boosting is a slow-learning algorithm that grows regression trees sequentially, not independently. Each tree is fit to the residuals left by the trees before it, so each new tree targets what the current model still predicts poorly.
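To make the cut-point search concrete, here is a minimal Python sketch of how a regression tree might pick the best split on a single predictor by minimizing the summed squared error of the two resulting groups. It is a simplified illustration, not the rpart implementation; the function names and any numbers fed to it are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (cut_point, total_sse) for the best binary split on x."""
    pairs = sorted(zip(x, y))
    best_cut, best_total = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point exists between identical predictor values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2  # midpoint between neighbors
        left = [outcome for _, outcome in pairs[:i]]
        right = [outcome for _, outcome in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_total:
            best_cut, best_total = cut, total
    return best_cut, best_total
```

A full tree repeats this search over every predictor, keeps the predictor and cut point with the lowest error, and then recurses within each of the two resulting groups.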

Results
First, the complete dataset was split in half in order to create training and test data sets. Next, the training data were used to fit a regression tree predicting 2015 WAR from all variables in the dataset. In the first model, liberal control parameters were set for the size of the tree, meaning a large tree was grown that selected all the best possible predictors. Each chosen predictor was then optimally split until each pitcher could be placed into a terminal node. The results from the initial model demonstrated that average strikeout rate per plate appearance (K%) was the best predictor of WAR, with an optimal split at 22.39%. The initial model R² indicated that 97% of the variance in WAR could be explained by this regression tree. Despite the high amount of variance explained, this model has likely overfit the data. In other words, the model is fit too closely to this particular data set, which means it is too complex and unlikely to replicate across other samples. Reducing the size of the tree, or pruning the tree, will result in higher bias but will reduce variance in the predicted values.
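The split-half design and the two fit statistics reported throughout can be sketched in Python as follows. This assumes a random 50/50 split and the usual definitions of MSE and R² (one minus residual over total sum of squares); the post does not state exactly how the split was drawn or how R² was computed, so treat these helpers as hypothetical.

```python
import random


def split_half(rows, seed=0):
    """Shuffle a copy of rows and return (training_half, test_half)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def mse(actual, predicted):
    """Mean squared error of the predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of outcome variance explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_total = sum((a - mean) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total
```

Comparing these statistics on the training half against the held-out half is what exposes overfitting: a tree that memorizes the training pitchers scores well on the first and poorly on the second.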

Initial Regression Tree Overfit to the Sample Data

In order to determine the optimal tree size (i.e., to prune the tree), cost complexity pruning with 10-fold cross-validation was performed on the training data set. Based on the model deviance, the optimal tree size was determined to be between 4 and 6 terminal nodes. After pruning the tree, the R² was reduced to .68, but the mean squared error (MSE) was also reduced from 6.8 to 3.6 in the training data set. Next, the optimized tree was applied to the test data set, which produced an R² of .57 and an MSE of 1.4. Surprisingly, after the initial split on K%, the next-best predictors were quality-of-contact statistics (go here for more detailed information). Although there is a large amount of measurement error in these variables, it is still interesting that these measures are predictive of WAR.
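The idea behind cost complexity pruning can be sketched in a few lines of Python: each candidate subtree is scored by its residual sum of squares plus a penalty, alpha, for every terminal node, and the subtree with the lowest score wins. The function names are hypothetical and this is only the scoring step; in practice cross-validation is used to choose alpha, which is omitted here.

```python
def cost_complexity(rss, n_leaves, alpha):
    """Penalized fit: residual sum of squares plus alpha per terminal node."""
    return rss + alpha * n_leaves


def best_tree_size(rss_by_size, alpha):
    """Pick the number of terminal nodes with the lowest penalized fit.

    rss_by_size maps a number of terminal nodes to the RSS of the best
    subtree of that size.
    """
    return min(
        rss_by_size,
        key=lambda size: cost_complexity(rss_by_size[size], size, alpha),
    )
```

With alpha at zero the largest tree always wins because it has the lowest raw error; as alpha grows, smaller trees are preferred, which is the bias and variance trade-off described above.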

An inherent problem with regression trees is that continuous predictors with more unique values are more likely to be chosen because they offer a higher number of possible split points. The party package in R attempts to control for this issue by taking into account the distributional properties of the predictors (Hothorn, Hornik, & Zeileis, 2006). As such, similar models were fit predicting 2015 WAR using the party package in R. Results were similar to those from the rpart package: average strikeout rate was again the best predictor, with a split at 22.3%. However, the data supported only one split, partitioning pitchers into those above and below a strikeout rate of 22.3% (see Figure 2 below). Although this model explained substantially less variance in WAR (R² = .29) than the larger tree, it is likely to have higher stability and predictive utility in new samples.
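One way to picture this approach is as a significance test run before any split is made: a predictor is considered only if its association with the outcome is stronger than chance would produce, and splitting stops when no predictor passes. The Python sketch below uses a simple Monte Carlo permutation test as a stand-in. It is not the party package's algorithm, which computes its p-values differently, and the function names are hypothetical.

```python
import random


def abs_covariance(x, y):
    """Absolute (unnormalized) covariance between x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffled outcomes at least as associated with x as the real ones."""
    rng = random.Random(seed)
    observed = abs_covariance(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs_covariance(x, shuffled) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

Because the decision to split rests on a p-value rather than on the raw reduction in error, a predictor gains no advantage simply from having more candidate cut points.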

Optimized Regression Tree using the Party Package
Figure 2.

Finally, boosted regression trees were fit to the data to examine the optimal predictors of 2015 WAR. The number of trees (B=1,700) was chosen by examining the decline in the squared error loss for the out-of-bag sample. The shrinkage parameter was set to λ = .001 with an interaction depth of d=1. For the training data, the MSE was 1.79 and the R² was .59. The model was then tested against the left-out half of the dataset (the test data set), which produced an MSE of 1.98 and an R² of .55. Given the small differences in R² and MSE between the training and test data sets, this model appears to be relatively consistent. The most important predictors were identified using the variable importance (relative influence) measure in the gbm package. Average strikeout rate (K%), average fastball velocity, and average strikeout rate minus walk rate (K-BB%) were the most important predictors of 2015 WAR. For the relative influence of all variables, refer to the table below.
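The boosting procedure described above can be sketched in Python as follows. It assumes squared error loss, a single predictor, and trees with one split each (matching an interaction depth of 1). It is a simplified illustration rather than the gbm implementation, which also subsamples the data and handles many predictors, and the function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Best one-split tree on x; returns (cut, left_mean, right_mean)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for pos in range(1, len(order)):
        low, high = x[order[pos - 1]], x[order[pos]]
        if low == high:
            continue  # no cut point between identical values
        left = [residuals[i] for i in order[:pos]]
        right = [residuals[i] for i in order[pos:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((v - left_mean) ** 2 for v in left)
        error += sum((v - right_mean) ** 2 for v in right)
        if best is None or error < best[0]:
            best = (error, (low + high) / 2, left_mean, right_mean)
    if best is None:
        raise ValueError("x needs at least two distinct values")
    return best[1:]


def boost(x, y, n_trees=100, shrinkage=0.1):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(y) / len(y)  # start from the mean outcome
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [a - p for a, p in zip(y, predictions)]
        cut, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((cut, left_mean, right_mean))
        # Only a small, shrunken step is taken toward each new stump.
        predictions = [
            p + shrinkage * (left_mean if xi < cut else right_mean)
            for p, xi in zip(predictions, x)
        ]
    return base, shrinkage, stumps


def predict(model, xi):
    """Prediction for one value of the predictor."""
    base, shrinkage, stumps = model
    return base + shrinkage * sum(
        left if xi < cut else right for cut, left, right in stumps
    )
```

A small shrinkage value means each tree contributes very little, which is why a value as low as .001 calls for a large number of trees such as the 1,700 used here.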

Order of Variable Importance Predicting 2015 WAR

Table 1.

Based on these results, it is clear that K% is a strong predictor of future WAR, which is not surprising because pitcher WAR is based on FIP (derived from K, BB, and HR outcomes). Average fastball velocity and K-BB% also came out as relatively strong predictors of WAR in the boosted regression tree models. Quality of contact was found to be an important predictor, but more analysis should be done in other samples to see if these measures have consistent predictive ability.