Often I would like to have an estimate of a player’s true talent in a past year. Projection systems focus on predicting future performance from past results, but what I wanted was the best estimate of a player’s expected performance in a given year, based on his results in that year and the surrounding years.
I wanted to find suitable weights to assign to performance in the given year, plus the years immediately before and after, along with the right amount of regression to the mean. But I kept running into the same mental block: how do I assign a weight to the given year’s performance, since that is exactly what I am trying to “predict”?
A suggestion from J.Cross, with help from Tangotiger, on the blog for The Book, was to split the data from the year in question in half and use one half to “predict” the other half. I could then build a multiple regression model, using the data from the surrounding years and the league average, that would give proper weights to each year, as well as the necessary amount of regression to the league average.
I split the data in half by taking each player’s performance in games on odd- and even-numbered dates, and then used the even-numbered dates to predict performance on the odd-numbered dates. (The original suggestion was to use odd- and even-numbered innings, but for many players this doesn’t split the data in half. For instance, players who hit in the top few slots of the lineup could end up with over 60% of their plate appearances in odd-numbered innings.)
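The split described above can be sketched as follows. This is only an illustration: it assumes that "odd- and even-numbered dates" refers to the day of the month, and the game log is made up rather than taken from Retrosheet.

```python
from datetime import date


def split_by_date_parity(games):
    """Split (game_date, plate_appearances) records into odd- and even-date halves.

    Assumption: parity is taken from the day of the month.
    """
    odd, even = [], []
    for game_date, pa in games:
        (odd if game_date.day % 2 else even).append((game_date, pa))
    return odd, even


# Hypothetical game log: (date, plate appearances in that game).
games = [
    (date(1951, 4, 17), 4),
    (date(1951, 4, 18), 5),
    (date(1951, 4, 20), 3),
    (date(1951, 4, 21), 4),
]

odd_half, even_half = split_by_date_parity(games)
odd_pa = sum(pa for _, pa in odd_half)
even_pa = sum(pa for _, pa in even_half)
print(odd_pa, even_pa)
```

Splitting by date rather than by inning keeps whole games together, so lineup position should not tilt plate appearances toward one half.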
Details of the Model
First I will describe the details of the model for wOBA; models for On-Base Percentage and Slugging Percentage were similar, and helped to confirm the general framework. The initial model uses the year immediately prior to, and the year immediately after, the year of interest; that is, a three-year window. For the previous and succeeding years, I also used only plate appearances on even-numbered dates, so that the wOBA in each year was estimated using similar numbers of plate appearances. Using Retrosheet’s data, my population was all three-year windows of players’ plate appearances from 1950 to 2010 (so the earliest three-year window was 1950-52, where the year of interest would be 1951). A player needed to have plate appearances in all three years in the window to make it into the model; this gave me 25,038 observations for the model.
For example, the first row in the regression model is Cal Abrams in 1951. His wOBA on odd-numbered dates in 1951 was .382 in 92 plate appearances. His wOBA on even-numbered dates was .315 in 1950, .388 in 1951, and .269 in 1952 (in 27, 94, and 53 plate appearances respectively). League wOBA in 1951 was .329. Each row was weighted in the model by the total number of plate appearances for that row; i.e., 266 plate appearances for Cal Abrams (1951).
This model produced the following estimation equation:
0.151 x (wOBA in Year-1) + 0.336 x (wOBA in Year) +
0.200 x (wOBA in Year+1) + 0.321 x (League wOBA)
All terms were significant, with standard errors of approximately 0.006 for all four terms. The average number of PA used in finding the wOBA for each term was 140 for Year-1, 147 for Year, and 140 for Year+1.
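As a quick check of how the equation is applied, here is a small sketch that plugs the Cal Abrams (1951) row into it. The function and variable names are illustrative only; the coefficients and inputs are the ones quoted above.

```python
# Regression coefficients from the three-year wOBA model.
COEF_PREV, COEF_YEAR, COEF_NEXT, COEF_LEAGUE = 0.151, 0.336, 0.200, 0.321


def estimate_woba(prev_woba, year_woba, next_woba, league_woba):
    """Apply the estimation equation to one row of even-date wOBA values."""
    return (COEF_PREV * prev_woba
            + COEF_YEAR * year_woba
            + COEF_NEXT * next_woba
            + COEF_LEAGUE * league_woba)


# Cal Abrams: even-date wOBA in 1950, 1951, 1952, plus 1951 league wOBA.
abrams_1951 = estimate_woba(0.315, 0.388, 0.269, 0.329)
print(f"{abrams_1951:.3f}")  # 0.337
```

The result is the model's estimate for his odd-date wOBA in 1951, which was .382 in only 92 plate appearances.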
It is customary to convert these regression coefficients to weights that can be applied to each year’s performance, by dividing each coefficient by the 0.336 for (wOBA in Year). Doing this gives factors of 0.45 for Year-1, 1 for Year, and 0.60 for Year+1. The r for the model was 0.60, so using the method suggested by Tangotiger, this implies that we need to add 200 PA of league-average wOBA for the regression to the mean component. 3-year models for OBP and SLG produced similar factors.
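The arithmetic behind these numbers can be sketched as below. The conversion to weights follows the text directly. The regression step is an assumption: it reads Tangotiger's method as r = PA / (PA + X), with PA taken as the weighted average plate appearances per row, which is not spelled out above.

```python
coef_prev, coef_year, coef_next = 0.151, 0.336, 0.200

# Convert coefficients to weights relative to the year of interest.
w_prev = coef_prev / coef_year  # about 0.45
w_next = coef_next / coef_year  # about 0.60

# Average PA behind each wOBA term, as reported for the model.
pa_prev, pa_year, pa_next = 140, 147, 140
effective_pa = w_prev * pa_prev + pa_year + w_next * pa_next

# Assumed form of the method: r = PA / (PA + X), solved for X.
r = 0.60
ballast_pa = effective_pa * (1 - r) / r

print(round(w_prev, 2), round(w_next, 2), round(ballast_pa))
```

Under that reading the ballast comes out a little under 200 PA, in line with the round figure used here.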
I repeated this model with 5-year windows, so each observation consisted of performance from Year-2, Year-1, Year, Year+1, and Year+2, which left about 16,600 observations. I did the same for the models for OBP and SLG.
Conclusions and Example
Taking all of these models as a whole, they suggest round weighting factors of .5^n for Year-n, and .6^n for Year+n, with 200 PA of league-average performance added for regression to the mean. Since it may offend some of our sensibilities to have different factors for Year-n and Year+n, and to make things easier, we can use 0.55^n for both Year-n and Year+n. (I could not come up with an explanation for the fact that Year+n has a greater weight than Year-n, but the phenomenon persisted in every regression and subset I tried. Since I like fractions I will probably use 5/9 in the future.)
Returning to the Cal Abrams (1951) example, we would estimate his true talent On-Base Percentage in 1951 as
[.5 x (18 times on-base in 1950) + (78 times on-base in 1951) +
.6 x (67 times on-base in 1952) + (200 x .336 league-average OBP)]
divided by
[.5 x (53 PA in 1950) + (186 PA in 1951) +
.6 x (189 PA in 1952) + (200 league-average PA)]
which gives a final estimate of .370.
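The same calculation can be written as a small function. This is a sketch under the weights suggested above, not a definitive implementation; the function name and the (offset, successes, PA) layout are illustrative choices.

```python
def true_talent_rate(seasons, league_rate, w_before=0.5, w_after=0.6,
                     ballast_pa=200):
    """Weighted rate estimate with regression to the league average.

    seasons: list of (offset, successes, pa), where offset 0 is the year
    of interest, -n is n years before, and +n is n years after.
    """
    numerator = ballast_pa * league_rate
    denominator = float(ballast_pa)
    for offset, successes, pa in seasons:
        if offset < 0:
            weight = w_before ** (-offset)
        elif offset > 0:
            weight = w_after ** offset
        else:
            weight = 1.0
        numerator += weight * successes
        denominator += weight * pa
    return numerator / denominator


# Cal Abrams: times on base and PA in 1950, 1951, and 1952.
abrams = [(-1, 18, 53), (0, 78, 186), (1, 67, 189)]

obp_split = true_talent_rate(abrams, league_rate=0.336)
obp_symmetric = true_talent_rate(abrams, league_rate=0.336,
                                 w_before=0.55, w_after=0.55)
print(f"{obp_split:.3f}", f"{obp_symmetric:.3f}")  # 0.370 0.370
```

For this player the separate .5 and .6 weights and the single 0.55 weight give the same estimate to three decimal places.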
A more accurate true talent estimate could be found by applying an age adjustment to the performance from the preceding and succeeding years. For most players this will not change much, since for young players, the adjustment for the preceding years will be positive and that for succeeding years will be negative, and vice versa for older players. But for certain players, such as those at age 27 (where the age adjustments on both sides would need to be positive), and players at the beginning or end of their careers (where data is not available before or after the year in question), the age adjustments could be more important.