Update: The previous version of this post, published last week, contained a data error that has now been fixed. Steamer/Razzball and Pod projections have been added and the hitter sample has been corrected from the prior version of this article.
Welcome to my 5th annual forecast review. Each year, every projection submitted to me at http://www.bbprojectionproject.com is tested for error (RMSE) and overall predictive power (R^2), and is then ranked. I present both RMSE and R^2 because each has its uses. RMSE is a standard measure of forecast error, but it penalizes general optimism or pessimism about the run environment, even if a forecast has low error once that bias is controlled for. For instance, Marcel is very good at predicting the run environment and the FanGraphs Fans are pretty terrible at it, so Marcel will usually have a better RMSE than the Fans. R^2, on the other hand, is a better test of the relative performance of players because it ignores any general bias that pervades a forecasting system. Marcel tends to rank lower on this metric than other systems due to its rigid formula, whereas more sophisticated methods like ZIPS or Steamer tend to do better.
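To make the distinction concrete, here is a minimal sketch (the numbers are invented for illustration, not data from the competition) of how a uniformly optimistic forecast fares under each metric:

```python
import numpy as np

def rmse(forecast, actual):
    """Root mean squared error: penalizes systematic bias directly."""
    return np.sqrt(np.mean((forecast - actual) ** 2))

def r_squared(forecast, actual):
    """Squared correlation between forecast and actual: a constant
    bias shifts the forecast but leaves the correlation untouched."""
    return np.corrcoef(forecast, actual)[0, 1] ** 2

actual = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical HR totals
biased = actual + 5.0                          # perfect ordering, 5-HR optimism
noisy = np.array([12.0, 17.0, 33.0, 38.0])     # roughly unbiased, but noisier

# The biased forecast is "worse" by RMSE (5.0 vs ~2.5) yet "better"
# by R^2 (1.0 vs ~0.95) -- the Marcel-vs-Fans pattern described above.
print(rmse(biased, actual), r_squared(biased, actual))
print(rmse(noisy, actual), r_squared(noisy, actual))
```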
Comparisons are based on the set of players that every system projected. This amounts to 70 pitchers and 141 hitters for 2014. This is certainly limiting, but there is an inherent tradeoff in the number of projection systems that can be analyzed vs. the number of players that are projected by all systems. My policy is to consider as many projection systems as possible, as long as the number of players doesn’t get too low.
Now, on to the contest!
This year certainly saw some interesting results. By the R^2 metric, the best forecaster for hitters (Dan Rosenheck) only published forecasts for hitter categories; evidently there’s some benefit to specialization when it comes to projecting baseball players. The best pitcher forecasts came from Mike Podhorzer’s Pod forecasts. The best composite score came from my own personal forecast brew, which is computed by an algorithm that estimates weights for the other main-line forecasts. In a sense, this is not an original forecast, so I now mark forecasts that I know use other forecasts as inputs with an “*” (I realize that, to some degree, most everyone calibrates their forecasts to what they see other people doing). The next two forecasts are of this same type: the AggPro and Steamer/Razzball forecasts. The top “structural” forecast was Pod, followed by ZIPS, Rotovalue, and CBS.
In terms of RMSE, Dan Rosenheck ran away with the hitters, and my weighted average did the best among pitchers. The top overall performers across categories were MORPS, Marcel, Rotovalue, and AggPro.
Overall, there are a few interesting comparisons to be made between projection systems across different years. Among the open-source stats community, Steamer vs. ZIPS is always interesting to watch. In prior years, Steamer has been better; this year, however, ZIPS made huge gains and beat Steamer. Marcel had a typical year, with a very favorable ranking on RMSE but not R^2. The FanGraphs Fans had a down year, finishing near the bottom in most metrics. CBS Sportsline was the top forecast from a major media company; such forecasts, in general, tend to do poorly. Finally, nearly every projection submitted beat the naïve previous-season benchmark, in which the 2014 forecast is simply the player’s actual 2013 performance. At least we’re all doing something right.
Thank you again to all who submitted projections. I invite anyone who is interested to submit their top-line hitter and pitcher projections to me at email@example.com. Your projection will be put up on http://www.bbprojectionproject.com as soon as I receive it, unless you want me to embargo it until the end of the season, which some people choose to do for fantasy baseball or other proprietary reasons. All the code (STATA) and data for these evaluations are available upon request. If I’m using the wrong version of anyone’s projections (which can happen!), please let me know.
Welcome to the 3rd annual forecast competition, where each forecaster who submits projections to bbprojectionproject.com is evaluated on RMSE and model R^2 relative to actuals (see last year’s results here). The categories evaluated are AVG, Runs, HR, RBI, and SB for hitters, and Wins, ERA, WHIP, and Strikeouts for pitchers. RMSE is a popular metric for evaluating forecast accuracy, but I actually prefer R^2. R^2 removes average bias (see here) and effectively evaluates forecasted player-by-player variation, making it more useful when attempting to rank players (i.e., for fantasy baseball purposes).
Here are the winners for 2014 for R^2 (more detailed tables are below):
And here are the winners for the RMSE portion of the competition:
I’m beginning to notice some trends in the results across years. First, systems that include averaging do particularly well. This is pretty well established by now, but it’s always useful to reflect upon. I’ve been asked in the past to perform evaluations separating forecasts computed by averaging from more “structural” forecasts that do not incorporate information from others’ forecasts. I decided not to do this because the nature of the baseball forecasting “season” makes it impossible to be sure a forecast was created without taking others’ forecasts into account. This influence can be direct (forecasting as a weighted average of others’ forecasts), but it can also occur in subtler ways, such as selecting a model based on forecasts that others have put forward. Second, the FanGraphs Fans always fascinate me: they can be heavily biased, yet they contain some of the best unique and relevant information for forecasting player variation. The takeaway from the Fans is that crowdsourced averaging works, as long as you can remove the bias in some way, or sidestep it by focusing on ordinal ranks instead.
Some additional notes: it would be interesting to decompose these aggregate stats into rates multiplied by playing time, but it’s difficult to gather all of this for each projection system, so I focus on top-line output metrics. Also, absolute rankings are presented, but many of them are likely statistically indistinguishable from one another. If someone wants to run Diebold-Mariano tests, the data used in this comparison can be downloaded from bbprojectionproject.com.
Thanks for reading, and please submit your projections for next year! Also, as always, I welcome any comments, and I’ll do my best to respond.
R^2 Detailed Tables
RMSE Detailed Tables
Evaluating 2012 Projections
Hello loyal readers. It’s time for the annual evaluation of last year’s player projections. Last year saw Gore, Snapp, and Highly’s AggPro forecasts win among hitter projections (http://www.fangraphs.com/community/comparing-2011-hitter-forecasts/) and Baseball Dope win among pitchers (http://www.fangraphs.com/community/comparing-2011-pitcher-forecasts/). In general, projections computed using averages or weighted averages tended to perform best among hitters, while for pitchers, structural models computed using “deep” statistics (K/9, HR/FB%, etc.) did better.
In 2012, there were 12 projections submitted for hitters and 12 for pitchers (11 submitted projections for both). The evaluation only considers players where every projection system has a projection.
This article is the second of a two-part series evaluating 2011 baseball player forecasts. The first looked at hitters and found that forecast averages outperform any particular forecasting system. For pitchers, the results appear somewhat reversed: structural forecasts computed using “deep” statistics (K/9, HR/FB%, etc.) seem to have done particularly well.
As with the other article, I will look at two main bases of comparison: Root Mean Squared Error both with and without bias. Bias is important to consider because it is easily removed from a forecast and can mask an otherwise good forecasting approach. For example, Fangraphs Fan hitter projections are often quite biased, but are very good at predicting numbers once this bias is removed.
This article is an update to the article I wrote last year on Fangraphs.
This year, I’m going to look at the forecasting performance of 12 different baseball player forecasting systems. I will look at two main bases of comparison: Root Mean Squared Error both with and without bias. Bias is important to consider because it is easily removed from a forecast and it can mask an otherwise good forecasting approach. For example, Fangraphs Fan projections are often quite biased, but are very good at predicting numbers when this bias is removed.
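As a sketch of what “removing bias” means here (the numbers are invented for illustration, not taken from any of the systems evaluated):

```python
import numpy as np

# Hypothetical runs-scored forecasts for five players vs. actuals
forecast = np.array([85.0, 95.0, 70.0, 110.0, 60.0])
actual = np.array([78.0, 88.0, 65.0, 101.0, 53.0])

bias = np.mean(forecast - actual)  # average over-prediction (7 runs here)

rmse_raw = np.sqrt(np.mean((forecast - actual) ** 2))
rmse_debiased = np.sqrt(np.mean(((forecast - bias) - actual) ** 2))

# Subtracting the constant bias shrinks the error substantially,
# revealing how good the forecast's player-to-player ordering really is.
print(bias, rmse_raw, rmse_debiased)
```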
In two previous articles, I considered the ability of freely available forecasts to predict hitter performance (part 1 and part 2), and how forecasts can be used to predict player randomness (here). In this article, I look at the performance of the same six forecasts as before (ZIPS, Marcel, CHONE, Fangraphs Fans, ESPN, CBS), but instead look at starting pitchers’ wins, strikeouts, ERA, and WHIP.
Results are quite different than for hitters. ESPN is the clear winner here, with the most accurate forecasts and the ones with the most unique and relevant information. Fangraphs Fan projections are highly biased, as with the hitters, yet they add a large amount of distinct information, and thus are quite useful. Surprisingly, the mechanical forecasts are, for the most part, failures. While ZIPS has the least bias, it is encompassed by other models in every statistic.* Marcel and CHONE are also poor performers with no useful and unique information, but with higher bias.
This article explores the ability to predict the randomness of players’ performance in 5 standard hitting categories: HRs, Runs, RBIs, SBs, and AVG. There have been efforts to do so by forecasters, most notably by Tango’s “reliability score.” (See Matt Klaassen’s article) I also test the idea that variation among forecasts (among ESPN, CHONE, Fangraphs Fans, ZIPS, Marcel, and CBS Sportsline) can predict player randomness as well.
I find that 1) variance among forecasts is a strong predictor of actual forecast error variance for HRs, Runs, RBIs and Steals, but a weak one for batting average, 2) Tango’s reliability score serves as a weak predictor of all 5 stats, and that 3), the forecast variance information dominates Tango’s measures in all categories but AVG.
Now let’s set up the analysis. Say, for example, that three forecasts say that Player A will hit 19, 20, and 21 home runs, respectively, and Player B will hit 10, 20, and 30 home runs. Does the fact that there is agreement in Player A’s forecast and disagreement in Player B’s provide some information about the randomness of Player A’s eventual performance relative to Player B’s?
To answer this, we first need a measure of dispersion of the forecasts. I define the forecast variance as the variance of the six forecasts for each stat, for each player; taking the square root of this number gives the standard deviation of the forecasts. So, the standard deviation of the forecasts of Player A’s HRs would be 1, and the standard deviation of the forecasts for Player B would be 10.
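The dispersion measure can be computed in a few lines; here is a sketch using the Player A and Player B numbers from the example (three forecasts shown for brevity, where the article uses six):

```python
import statistics

# HR forecasts for the two example players
player_a = [19, 20, 21]   # forecasters agree
player_b = [10, 20, 30]   # forecasters disagree

# Consensus forecast: the simple average across systems
consensus_a = statistics.mean(player_a)   # 20
consensus_b = statistics.mean(player_b)   # 20

# Forecast dispersion: the sample standard deviation across systems
sd_a = statistics.stdev(player_a)   # 1.0
sd_b = statistics.stdev(player_b)   # 10.0
```

Both players have the same consensus forecast, but very different dispersion, which is exactly the information being tested as a predictor of randomness.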
Next we turn to some regression analysis.* The dependent variable is the absolute error of a particular player’s consensus forecast, where the consensus is defined as the average of the six different forecasts (for both Players A and B in the example, the consensus forecast is 20 home runs). This absolute error is my measure of performance randomness. Controlling for the projected counting stats, we can estimate it as a function of some measure of forecast reliability.
Tango’s reliability score is one such measure, and the forecast standard deviation is another. What we would predict is that Tango’s score (where 0 means least reliable and 1 means most) would have a negative effect on the error. We would also predict that the forecast standard deviation would have a positive effect on the error. Now let’s see what the data tell us:
We see that HRs are the statistic for which errors are most easily forecasted, errors for Rs, RBIs, and SBs are moderately forecastable, and errors for AVG are not very forecastable. We see this because of the negative and statistically significant coefficients for Tango’s score and the positive and statistically significant coefficients on the standard deviation measure. In regressions with both measures, the standard deviation measure encompasses Tango’s measure, except in the AVG equation.
So what does this all mean? If you’re looking at rival forecasts, 80% of the standard deviation between the HR forecasts and about 50% of the standard deviation of the forecasts of the other stats is legitimate randomness. This means that you can tell how random a player’s performance will be by the variation in the forecasts, especially home runs. If you don’t have time to compare different forecasts, then Tango’s reliability score is a rough approximation, but a pretty imprecise measure.
*For those of you unfamiliar with regression analysis, imagine a graph of dots and drawing a line through it. Now imagine the graph has 3 or 4 dimensions and do the same, where the line is drawn such that the sum of squared distances between the dots and the line is minimized.
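The regression described above can be sketched as follows. Everything here is simulated (no real forecast data): players are generated so that the absolute consensus error truly grows with forecast disagreement, and ordinary least squares is then used to recover a positive coefficient on the forecast standard deviation, controlling for the projected stat.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

proj_hr = rng.uniform(5, 40, n)     # projected HR totals (control variable)
fcast_sd = rng.uniform(0.5, 8, n)   # std dev across rival forecasts

# Stylized absolute consensus error: grows with disagreement by construction
abs_error = 2.0 + 0.2 * proj_hr + 1.2 * fcast_sd + rng.normal(0, 2, n)

# OLS of abs_error on a constant, proj_hr, and fcast_sd
X = np.column_stack([np.ones(n), proj_hr, fcast_sd])
beta, *_ = np.linalg.lstsq(X, abs_error, rcond=None)

# beta[2] estimates the effect of forecast disagreement on error (~1.2)
print(beta)
```

A positive, significant `beta[2]` is the pattern the article reports for HRs, Runs, RBIs, and SBs.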
In Part 1 of this article, I looked at the ability of individual projection systems to forecast hitter performance. The six different projection systems considered are Zips, CHONE, Marcel, CBS Sportsline, ESPN, and Fangraphs Fans, and each is freely available online. It turns out that when we control for bias in the forecasts, each of the forecasting systems is, on average, pretty much the same. In what follows here, I show that the Fangraphs Fan projections and the Marcel projections contain the most unique, useful information. Also, I show that a weighted average of the six forecasts predicts hitter performance much better than any individual projection.
Forecast encompassing tests can be used to determine which of a set of individual projections contains the most valuable information. Based on the forecast encompassing test results, we can calculate a weighted average of the six forecasts that outperforms any individual forecast.
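Here is a sketch of the encompassing idea on simulated data (the two forecasts and their noise levels are invented). Regressing actuals on a constant and the rival forecasts yields combination weights; in sample, that OLS combination can never fit worse than any single forecast it combines, and a near-zero weight on one forecast would indicate it is encompassed by the other.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

talent = rng.uniform(5, 35, n)           # latent "true" HR talent
f1 = talent + rng.normal(0, 3, n)        # forecast 1: less noisy
f2 = talent + rng.normal(0, 5, n)        # forecast 2: noisier, but distinct info
actual = talent + rng.normal(0, 4, n)    # realized performance

# Encompassing-style regression: actual ~ const + f1 + f2
X = np.column_stack([np.ones(n), f1, f2])
w, *_ = np.linalg.lstsq(X, actual, rcond=None)
combined = X @ w   # the weighted-average forecast

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# The weighted average beats either forecast alone (in sample)
print(rmse(combined, actual), rmse(f1, actual), rmse(f2, actual))
```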
There are a number of published baseball player forecasts that are freely available online. As Dave Allen notes in his article on Fangraphs Fan Projections, and as I find as well, some projections are definitely better than others. Part 1 of this article examines the overall fit of each of six different player forecasts: Zips, CHONE, Marcel, CBS Sportsline, ESPN, and Fangraphs Fans. I find that the Marcel projections are the best based on average error, followed by the Zips and CHONE projections. However, if we control for the over-optimism of each of these projection systems, the forecasts are virtually indistinguishable.
This second result is important in that it requires us to dig a little deeper to see how much each of these forecasts is actually helping to predict player performance. This is addressed in Part 2 of this article.
The tool generally used to compare the average fit of a set of forecasts is Root Mean Squared Forecasting Error (RMSFE). This measure is imperfect in that it doesn’t consider the relative value of an over-projection versus an under-projection; for example, in earlier rounds of a fantasy draft we may be drafting to limit risk, while in later rounds we may be seeking risk. That being said, RMSFE is pretty easy to understand and is thus the standard for comparing the average fit of a projection.
Table 1 shows the RMSFE of each of the projection systems in each of the five main fantasy categories for hitters. Here, we see that the “mechanical” projection systems (Marcel, Zips, and CHONE) do best compared to the three “human” projections. Each value is the standard deviation of the error of a particular forecast. In other words, about 2/3 of the time, a player projected by Marcel to score 100 runs will score between 75 and 125 runs.
Table 1. Root Mean Squared Forecasting Error
Another important measure is bias. Bias occurs when a projection consistently over- or under-predicts. Bias inflates the RMSFE, so a simple bias correction may improve a forecast’s fit substantially. In Table 2, we see that the human projection systems exhibit substantially more bias than the mechanical ones.
Table 2. Average Bias
We can get a better picture of which forecasting system is best by correcting for bias in the individual forecasts. Table 3 presents the bias-corrected RMSFEs. What we see is a tightening of the results across the forecasting systems: each one performs about the same.
Table 3. Bias-corrected Root Mean Squared Forecasting Error
So where does this leave us if these six forecasts are basically indistinguishable? As it turns out, evaluating the performance of individual forecasts doesn’t tell the whole story. There may be useful information in each of the different forecasting systems, so an average or a weighted average of forecasts may prove a better predictor than any individual forecast. Part 2 of this article examines this in some detail. Stay tuned!