Comparing 2010 Hitter Forecasts Part 2: Creating Better Forecasts
In Part 1 of this article, I looked at the ability of individual projection systems to forecast hitter performance. The six different projection systems considered are Zips, CHONE, Marcel, CBS Sportsline, ESPN, and Fangraphs Fans, and each is freely available online. It turns out that when we control for bias in the forecasts, each of the forecasting systems is, on average, pretty much the same. In what follows here, I show that the Fangraphs Fan projections and the Marcel projections contain the most unique, useful information. Also, I show that a weighted average of the six forecasts predicts hitter performance much better than any individual projection.
Forecast encompassing tests can be used to determine which of a set of individual projections contain the most valuable information. Based on the forecast encompassing test results, we can calculate a forecast that is a weighted average of the six forecasts that will outperform any individual forecast.
The term “forecast encompassing” sounds complicated, but it’s a simple concept. The idea is that if one projection contains no unique information helpful for forecasting beyond another projection, then that projection is said to be “forecast encompassed” and can be discarded. When we are left with a group of forecasts that don’t encompass each other, each must contain some unique, relevant information.
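As a rough illustration of how such a test can be run (this is a standard regression-based variant, not necessarily the exact procedure used here), you can regress one forecast’s errors on the difference between the two competing forecasts; a slope near zero means the second forecast adds nothing. The sketch below assumes the actual 2010 totals and two projection series are available as aligned arrays, and all names are illustrative.

```python
# A minimal sketch of a pairwise forecast encompassing test (regression-based),
# assuming actual 2010 totals and two projection series are aligned NumPy arrays.
# All variable names are illustrative; this is not necessarily the exact test used above.
import numpy as np
import statsmodels.api as sm

def encompassing_test(actual, forecast_a, forecast_b):
    """Test whether forecast_a encompasses forecast_b.

    Regress forecast_a's errors on (forecast_b - forecast_a). If the slope is
    statistically indistinguishable from zero, forecast_b contains no unique
    information beyond forecast_a and can be discarded.
    """
    errors_a = actual - forecast_a
    X = sm.add_constant(forecast_b - forecast_a)
    fit = sm.OLS(errors_a, X).fit()
    return fit.params[1], fit.pvalues[1]  # slope and its p-value

# Hypothetical usage:
# slope, p = encompassing_test(actual_hr, marcel_hr, cbs_hr)
# A large p-value suggests CBS adds nothing beyond Marcel for HRs.
```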
Table 1 shows the optimal forecast weights after forecast encompassing tests have eliminated the forecasts with duplicate or irrelevant information. One thing we see is that the Fangraphs Fan projections contain a large amount of unique information relevant for forecasting in every statistical category. Marcel projections are relevant in four categories. ESPN and CHONE projections are useful in only two categories each, Zips in one, and the CBS projections contain no unique, useful information according to these tests.
Table 1. Optimal Forecast Weights
| System | Runs | HRs | RBIs | SBs | AVG |
|---|---|---|---|---|---|
| Marcel | 0.22 | 0.53 | 0.25 | 0.38 | |
| Zips | 0.30 | | | | |
| CHONE | | | 0.44 | | 0.44 |
| Fangraphs Fans | 0.19 | 0.47 | 0.31 | 0.29 | 0.55 |
| ESPN | 0.29 | | | 0.33 | |
| CBS | | | | | |
Using these weights, we can compute a forecast for each statistic that is a weighted average of these six publicly available forecasts. Table 2 shows the Root Mean Squared Forecasting Errors (RMSFE) of this composite forecast versus the other six forecasts. Here, we see that the weighted average performs substantially better than any individual forecast.
Table 2. Root Mean Squared Forecasting Error
| System | Runs | HRs | RBIs | SBs | AVG |
|---|---|---|---|---|---|
| Marcel | 24.43 | 7.14 | 23.54 | 7.37 | 0.0381 |
| Zips | 25.59 | 7.47 | 26.23 | 7.63 | 0.0368 |
| CHONE | 25.35 | 7.35 | 24.12 | 7.26 | 0.0369 |
| Fangraphs Fans | 29.24 | 7.98 | 32.91 | 7.61 | 0.0396 |
| ESPN | 26.58 | 8.20 | 26.32 | 7.28 | 0.0397 |
| CBS | 27.43 | 8.36 | 27.79 | 7.55 | 0.0388 |
| Weighted Average | 21.74 | 6.62 | 21.71 | 6.77 | 0.0338 |
Even when we correct for the over-optimism of the six base projections, the weighted average forecast still does better in every category, though not by as much.
Table 3. Bias-corrected Root Mean Squared Forecasting Error
| System | Runs | HRs | RBIs | SBs | AVG |
|---|---|---|---|---|---|
| Marcel | 23.36 | 6.83 | 22.81 | 7.28 | 0.0348 |
| Zips | 22.98 | 7.02 | 23.52 | 7.59 | 0.0341 |
| CHONE | 22.96 | 6.85 | 22.33 | 7.24 | 0.0341 |
| Fangraphs Fans | 23.24 | 6.88 | 23.53 | 7.08 | 0.0340 |
| ESPN | 23.03 | 7.27 | 23.62 | 7.14 | 0.0357 |
| CBS | 22.91 | 7.29 | 23.90 | 7.27 | 0.0347 |
| Weighted Average | 21.74 | 6.62 | 21.71 | 6.77 | 0.0338 |
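For readers who want to reproduce these comparisons, here is a minimal sketch of the RMSFE and bias-corrected RMSFE calculations, assuming the projected and actual 2010 stat totals are available as arrays aligned by player; the variable names are illustrative, and the only article-specific inputs are the Table 1 HR weights in the usage comment.

```python
# A minimal sketch of the RMSFE and bias-corrected RMSFE calculations behind
# Tables 2 and 3, assuming projected and actual totals are aligned NumPy arrays.
import numpy as np

def rmsfe(forecast, actual):
    """Root mean squared forecasting error."""
    return np.sqrt(np.mean((forecast - actual) ** 2))

def bias_corrected_rmsfe(forecast, actual):
    """RMSFE after removing the forecast's average bias (over-optimism)."""
    bias = np.mean(forecast - actual)
    return rmsfe(forecast - bias, actual)

# Hypothetical usage with the Table 1 HR weights (Marcel 0.53, Fans 0.47):
# combined_hr = 0.53 * marcel_hr + 0.47 * fans_hr
# print(rmsfe(combined_hr, actual_hr), bias_corrected_rmsfe(combined_hr, actual_hr))
```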
So what is the takeaway from this two-part series comparing six of the freely available sets of hitter forecasts?
1) Without correcting for the over-optimism (bias) of the forecasts, the mechanical forecasts, Marcel, CHONE, and Zips, outperform the others.
2) When correcting for the biases, no set of forecasts is any better than another.
3) A weighted average of the forecasts performs much better than any individual forecast.
4) Forecast encompassing tests indicate that the Fangraphs Fan projections and the Marcel projections contain the most unique and relevant information in them compared to the other forecasts considered.
“3) A weighted average of the forecasts performs much better than any individual forecast.”
Well, duh. We shouldn’t expect any single model to outperform all other models in all five categories, which is what would have to happen for this statement to have any chance of being untrue. Otherwise, you could just weight the model that is best in each category at 1.00 and make the same assertion.
What you really want to know, though, is not whether a blended model would have performed better after the season, but whether a blended model can consistently outperform the individual models, with the weights having been set prior to the season.
@Marver: Easy there! You’re exactly right. That’s why I’m working on a set of forecasts for 2011 that are in part based on the weights from 2010. Then we’ll see if the approach works or not.
Wouldn’t it be a good idea to run this sort of analysis back several years to see if there’s any consistent weighting of the different systems, or if it’s just noise?
@Everett: I haven’t had time to dig up all of the old forecasts, but that would be good to do. The problem is that the Fangraphs Fan projections, one of the top performers in my analysis, are only one year old. For older forecasts and forecast systems, check out Nate Silver’s work at http://www.baseballprospectus.com/unfiltered/?p=564
@Will…then just include Fangraphs’ projections with a much smaller weight for 2011, or build two models: one with and one without.
Ultimately, the exercise you’re trying to do will prove very difficult due to year-to-year variation in the ideal weights, plus the fact that many projection systems incur tweaks to their logic, coding, etc. that further distort their year-to-year weights.
I ran this analysis last year on a few sources and came to the conclusion that the weights were unstable year-to-year, producing an edge that was negligible and ultimately not worth the time/assets that went into it. That’s not to say it isn’t a good article to write!
@Marver: You raise some good points. Weights will vary from year to year; that is certain. I am also sure that the methods behind particular forecast systems (ESPN and CBS especially) will vary from year to year. Those are shortcomings, for sure.
However, there is quite a bit of formal study on forecast averaging and this is the general result:*
1) forecast averages computed using previously optimal weights are better than
2) forecast averages computed using a simple average of other forecasts, which are better than
3) any single forecast
Again, this is something that we should be investigating, and the next step is to get some forecasts based on this procedure that we can start to look at next year.
*see Stock and Watson (2004), “Combination forecasts of output growth in a seven country data set,” Journal of Forecasting, 23, 405-430 and follow their citations if you want to look into this more.
I absolutely agree; they are better. The problem is that this is certainly more true in some fields than others, and baseball projecting is relatively untested in comparison to other fields in which projection models are prevalent, like weather.
I’ve done basically the exact same thing you’re about to replicate and while I found that the result is a better projection system than any of its constituent parts, the difference was small in terms of added applied value to fantasy baseball teams. The difference was especially small when comparing the time put into developing/grading the system to other studies that could have been completed in the same amount of time.
@Will I think your articles are very interesting. I don’t have a statistics background and I’ve wondered for years why people didn’t take the useful (unique?) data from the various projection systems to develop a weighted “super” system.
Do you plan on looking at pitchers as well? What about expanding the hitter categories (K’s, BB’s, XBH’s)?
@Marver: Isn’t fangraphs awesome? We wouldn’t be having this conversation on any other site. Maybe we should be doing some work together.
@Jeremiah: Thanks!! There are 2 things that I’d like to do now that these articles are out. First, I want to get some hitter forecasts on record for next year so I can see if this whole idea works in practice. Second, I’d like to do the same thing I’ve done for hitters to pitchers. I don’t have plans right now to expand the hitter categories, but that’s a pretty natural extension of what I’ve done here if someone else wants to take a look at it.
For the 2010 season, I used forecasts computed as a simple average of several systems (Zips, Marcel, CHONE, and ESPN) and it worked rather well. For the 2011 season I plan on adding a simple weighting to my forecasts.
1) What do you feel is a good way to weight the various projections? I had initially thought of ranking the six projections, giving 6 to Marcel, 5 to FG Fans, 4 to CHONE, etc. The denominator would be the sum of the ranks, 21, so Marcel would be weighted 6/21 and CBS 1/21. Is this too simple?
2) Once I have created my projections, I want to do an ESPN-like player-rater calculation to give weights to marginal production in each roto category. I usually play in a points H2H league where such a calculation is easy. Do you have any experience performing such calculations? Any insight?
3) Does anyone know of any sites where rotoheads can contribute or co-develop such projection resources?
@Brett: Your intuition was right in doing a simple average of each of those forecasting systems. That’s usually pretty tough to beat. You’re also on the right track trying to weight the forecasts by the ones that have historically performed the best and have the most useful information in them.
My article here says that you should use different weights depending on the category. For example, when you want to forecast HRs, it’s best to do about 50% marcel and 50% fangraphs fans and ignore the other systems because they don’t add anything beyond those two. For SBs, it’s best to do 1/3 marcel, 1/3 fangraphs fans, and 1/3 ESPN.
If you had a limited amount of time, what I would do is take the marcel projections and the fangraphs fan projections and do a simple average of the two. It’s tough to go wrong there.
As for your questions about 2), I don’t. I’m sure you can find some stuff here or at razzball.com. I know they do some point share analysis over there.
As for 3), yes! Fangraphs! Upload your projections in the fan projections page. If you posted a link to an excel file of your projections somewhere online, then this time next year I could see how well you did relative to the other systems. I plan on doing this as well. Frankly, this is the only way to figure out which methods of forecasting work and which ones don’t.
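To make the weighting arithmetic in this exchange concrete, here is a tiny illustrative sketch of the rank-based weights Brett describes (6/21, 5/21, …) alongside the simple 50/50 Marcel/Fans average suggested above; the ordering of the systems and the HR numbers are made up.

```python
# Illustrative only: rank-based weights (best system gets rank 6, worst gets 1)
# and a simple 50/50 average of two hypothetical HR projections for one player.
systems = ["Marcel", "Fangraphs Fans", "CHONE", "Zips", "ESPN", "CBS"]
ranks = {name: len(systems) - i for i, name in enumerate(systems)}  # Marcel -> 6, ..., CBS -> 1
total = sum(ranks.values())  # 6 + 5 + 4 + 3 + 2 + 1 = 21
rank_weights = {name: rank / total for name, rank in ranks.items()}
print(rank_weights)  # Marcel = 6/21 ~ 0.29, CBS = 1/21 ~ 0.05

# Simple 50/50 Marcel/Fans average for a hypothetical player's HR projection:
marcel_hr, fans_hr = 24, 28
print(0.5 * marcel_hr + 0.5 * fans_hr)  # 26.0
```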
Will,
I saw your weighting by source in the article above. I was under the impression that these weights were for 2010. Is there any reason to believe that any system is better at projecting a given category from year-to-year?
Also, do you plan to do the same review for pitching forecasts?
Brett
@Brett: see ^^^^^^^:
“However, there is quite a bit of formal study on forecast averaging and this is the general result:
1) forecast averages computed using previously optimal weights are better than
2) forecast averages computed using a simple average of other forecasts, which are better than
3) any single forecast”
I plan on doing the same thing for pitcher forecasts in the next couple of months.
@Will – since CHONE is off the free market, how would you suggest a simple weighted average of these three systems: fangraph fans, ZIPS and Marcel?
Thanks!
Jeremiah
@marver: i have forecasted that you need more fiber in your diet.
@jaywrong: Marver had some good points. It’s your comment that has no place here.
@Marver: keep ’em coming! Do you have any forecasts for 2011? I think we should compile forecasts somewhere and do a big comparison at the end of the year. Maybe like 20 or so, including the main ones, then a bunch of personal forecasts from different people trying different things (average, weighted average, subjective, etc).
@fangraphs crew: Is there any way to upload forecasts en masse as opposed to manually entering individual stats for individual players? Then, is there any way to get access to the individual forecasts done by the users here so we can see who did the best?
@Jeremiah: It’s not as simple as removing CHONE’s weight and splitting it amongst the remaining systems, because CHONE is potentially duplicating information found in the encompassed projections. I’d have to re-specify and re-run my routines in order to get CHONE-less weights. In the absence of that, I’d just do a 1/3, 1/3, 1/3 average of Fans, ZIPS, and Marcel, or a 50/50 split of Fans and Marcel.
Will,
How does this help you in any way shape or form?
Pick any of the RMSFE numbers for runs as an example. Knowing that the projections are going to be off by plus or minus 20-some runs doesn’t help a whole lot, does it? That means if a player is predicted to score 90 runs, he could score anywhere from less than 70 to more than 110 runs.
Sure, now you’ve done the statistical analysis to know how accurate the projections are and you have basically shown that all of the projections have a big enough error that they really can’t be trusted. But we need something to base our draft picks on, do we not?
I guess my question is, how do you use any of this information as an advantage come draft day?
@Brian: I’ve gone through periods where I ask the same question.
What it comes down to though, is that any sort of projection, ranking, draft order, etc, is going to be uncertain. The real question is, despite this uncertainty, can we rank players based on their expected performance? The answer to this question is yes.
While we may have difficulty getting any single particular player right, we can do much better, on average, by having a solid draft list constructed using solid projections. Any increase in the accuracy of our forecasts will make our draft lists better. If by bias correcting, weighting our forecasts, and averaging them, we can make a forecast that’s 5%-10% better, then I think that’s worth it, even if the individual forecasts are still pretty random.
@Will: I’m totally convinced. That sounds like a great idea: compile a bunch of forecasts and create your own to gain that advantage over everyone else, because in the end, isn’t that what we are all looking for? A way to dominate our friends so we can boast about being the best.
Are you going to have these projections somewhere to share so that the rest of us can see them, or are you just describing a way for us to do our own, more accurate projections?
@Brian: Both. I’m putting a website together where I’ll gather the main online projections and allow users to submit their own projections. Then, when the season is over, we can see what systems did the best.
The beta is up at http://www.williamlarson.com/projections
If you have any other ideas (or anyone else!!!) let me know what you’d like out of a site like that.
I found that using ~2/3 Bill James and ~1/3 Marcel produced the highest Pearson correlation and the lowest RMSE against actual results for hitters.
I did this a little over a year ago with the 2009 and 2008 stats. I’m not really good enough with SQL to go back any further.
PECOTA was best for projecting pitchers, if I recall correctly.
Keep up the good work Will,
Matt
@Matt: I’m surprised you found that. I also ran Bill James’ numbers for hitters (and pitchers, for that matter) and found them to be pretty poor performers that didn’t add any new information beyond the six freely available systems. What stats were you using to compare?
Will, I’m just not sure you understand sample size, and in-sample data vs. out-of-sample data. You cannot try to find optimal weights using one year of data. There is way too much variance in baseball performance to think that the stat-specific weights you mention (below) are anything but noise. The only way to prove otherwise is to generate your optimized weights on a set of training data, and check the performance on a (completely different) set of test data.
So to say you “should do” or “it’s best to do” various stat-specific system weightings based on this extremely limited study can only do more harm than good.
“My article here says that you should use different weights depending on the category. For example, when you want to forecast HRs, it’s best to do about 50% marcel and 50% fangraphs fans and ignore the other systems because they don’t add anything beyond those two. For SBs, it’s best to do 1/3 marcel, 1/3 fangraphs fans, and 1/3 ESPN.”
@evo34: Evo, I’m very clear about the limitations of my work. This article is looking at 2010 hitter forecasts and is thus a purely ex post analysis. Weighted averages using historical weights are useful for forecasting in other areas, so I’m presenting the hypothesis that this weighted forecast will be better than a simple average forecast. Your hypothesis that “there is way too much variance in baseball performance to think that the stat-specific weights you mention (below) are anything but noise” is testable as well. Before you make bold statements about my work (such as “this extremely limited study can only do more harm than good”), please recognize the limitations of your statements as well.
I frankly don’t care whose hypothesis is correct or not–I just want to figure out how to make better baseball player forecasts. In the future, I hope that you approach my work in this spirit as opposed to the adversarial posture that you’ve chosen to take.
Well done, Will.
Obviously there will be year-to-year deviation, but since the factors underlying the mechanical predictions should remain consistent, a historical accumulation of projection data should be helpful. Even the fan projections should be consistent on some level…
Will, if you’ve done any prediction work whatsoever (stocks, sports, weather, anything), you should know that you CANNOT optimize parameters of a system on the same data you are using to test said system, and expect it to be successful. This is Data Mining 101.
What you have done in this article is describe what has occurred in the past. By itself, that would be fine. Not very useful, but fine. But you take the reckless step of claiming that you have found the best system to use to predict the future: “My article here says that you should use different weights depending on the category.”
That’s a completely inappropriate conclusion based on the (lack of) analysis you have done.
Evo,
Both of our hypotheses are testable. We will see after this season. If my weighted forecast-averaging approach is worthless, we will be able to see it clearly in the data!
–Will
This is exactly the same mentality that led to your original erroneous conclusions. You cannot test any model-creation hypothesis on one season of data.
It’s critical to understand this when you are in the business of prediction.