The Elusive Clutch Hitter
It’s (almost) spring (training), and a young man’s thoughts turn to baseball metrics. I’ll start with two charts:
If you haven’t heard anything about what Albert Pujols’ value will be over the next 7-10 years, I suggest you go here… or here… or here… or take a look at the discussion here. Haven’t had enough? Read on.
Somebody is going to sign Pujols to a massive contract in the next 12 months. That contract will likely be hard to justify on projected on-field value alone. If you are a fan of the team that gets him, after you get done saying PUJOLSAWESOMEBASEBALLYAY, you may want to know what his expected value will be, and then take some time rationalizing the contract to yourself and justifying it to rival fans.
In two previous articles, I considered the ability of freely available forecasts to predict hitter performance (part 1 and part 2) and how forecasts can be used to predict player randomness (here). In this article, I look at the same six forecasts as before (ZiPS, Marcel, CHONE, Fangraphs Fans, ESPN, CBS), but this time for starting pitchers’ wins, strikeouts, ERA, and WHIP.
Results are quite different from those for hitters. ESPN is the clear winner here, with the most accurate forecasts and the most unique, relevant information. Fangraphs Fan projections are highly biased, as with the hitters, yet they add a large amount of distinct information and thus are quite useful. Surprisingly, the mechanical forecasts are, for the most part, failures. While ZiPS has the least bias, it is encompassed by other models in every statistic.* Marcel and CHONE are also poor performers with no useful, unique information, but with higher bias.
After reading an extremely interesting piece by Jeff Passan on the legendary sabermetrics whiz Voros McCracken, I have to admit it left me a bit down in the dumps. How could the man who basically redefined the sabermetric movement not be involved in baseball in some form or fashion? It doesn’t seem right, or fair, that the man who created ‘defense independent pitching’ (DIPS) statistics wasn’t good enough for the game anymore.
Maybe it affected me on a more personal level, and it was gut-check time: if Voros wasn’t accepted and embraced by the baseball world, what chance in hell did I ever have? Now is the time for you to snicker, or snidely remark ‘fat chance in the first place,’ and, to be honest, I would be saying the exact same thing. But I have a confession, and on some level every fellow ‘baseball nerd’ who writes about the game we love was affected in the same manner: we lost a bit of hope.
Over the last week or so, various reputable baseball analysis sites have been digging into the relationship between infield fly ball rates (IFFB%) and home run per fly ball rates (HR/FB). The discussion was prompted by a blog post by Rory Paap at Paapfly.com called “Matt Cain ignores xFIP, again and again,” which generated a response from Dave Cameron here at Fangraphs.
Paap suggested FIP and xFIP do Cain a disservice because they don’t give him his due credit for possessing the “unique skill” of inducing harmless fly ball contact, a theory that David Pinto at Baseball Musings attempted to quantify last October. Cameron’s response included some interesting analysis that looked at the best pitchers from 2002-2007 in terms of HR/FB rate and compared their IFFB% over that span to what they posted the next three seasons. His conclusion?
Is there some skill to allowing long fly outs? Maybe. But if you can identify which pitchers are likely to keep their home run rates low while giving up a lot of fly balls before they actually do it, then you could make a lot of money in player forecasting.
Simply out of curiosity, I decided to throw my hat into the ring and see if I could find a trend between IFFB% and HR/FB rate. My theory was that if IFFB% and HR/FB rate showed some sort of correlation, then plotting HR/FB rate as a function of IFFB% would show a clear inverse trend (meaning that a higher IFFB% would more likely generate a lower HR/FB rate, and vice versa).
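To make that concrete, here is a minimal sketch of such a check in Python. The file name and column labels are assumptions (e.g., a CSV of qualified pitcher seasons exported from a leaderboard), not the actual data I used:

```python
import pandas as pd

# Hypothetical export of qualified pitcher seasons with infield-fly and HR-per-fly-ball rates
df = pd.read_csv("pitcher_batted_ball.csv")

# A clear inverse relationship would show up as a meaningfully negative correlation
corr = df["IFFB%"].corr(df["HR/FB"])
print(f"Correlation between IFFB% and HR/FB: {corr:.3f}")
```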
This article explores the ability to predict the randomness of players’ performance in five standard hitting categories: HRs, Runs, RBIs, SBs, and AVG. Forecasters have made efforts to do so, most notably with Tango’s “reliability score” (see Matt Klaassen’s article). I also test whether variation among forecasts (ESPN, CHONE, Fangraphs Fans, ZiPS, Marcel, and CBS Sportsline) can predict player randomness.
I find that (1) variance among forecasts is a strong predictor of actual forecast error variance for HRs, Runs, RBIs, and SBs, but a weak one for batting average; (2) Tango’s reliability score is a weak predictor for all five stats; and (3) the forecast variance information dominates Tango’s measure in every category but AVG.
Now let’s set up the analysis. Say, for example, that three forecasts say that Player A will hit 19, 20, and 21 home runs, respectively, and Player B will hit 10, 20, and 30 home runs. Does the fact that there is agreement in Player A’s forecast and disagreement in Player B’s provide some information about the randomness of Player A’s eventual performance relative to Player B’s?
To answer this, we first need a measure of dispersion of the forecasts. I define the forecast variance as the variance of the six forecasts for each stat, for each player; taking the square root of this number gives the standard deviation of the forecasts. Using just the three example forecasts above, the standard deviation of the forecasts of Player A’s HRs would be 1, and the standard deviation of the forecasts for Player B would be 10.
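As a quick illustration (standard library only; the hypothetical players use just three forecasts rather than six), the calculation looks like this:

```python
import statistics

# Hypothetical HR forecasts for the two example players
forecasts = {
    "Player A": [19, 20, 21],
    "Player B": [10, 20, 30],
}

for player, hr in forecasts.items():
    consensus = statistics.mean(hr)   # consensus (average) forecast
    spread = statistics.stdev(hr)     # standard deviation across the forecasts
    print(f"{player}: consensus = {consensus:.0f} HR, forecast SD = {spread:.0f}")
```

Both players have a consensus of 20 home runs, but Player A’s forecast standard deviation is 1 while Player B’s is 10.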
Next we turn to some regression analysis.* The dependent variable is the absolute error of a particular player’s consensus forecast (defined as the average of the six different forecasts); for both players A and B in the example, the consensus forecast would be 20 home runs. This absolute error is my measure of performance randomness. Controlling for the projected counting stats, we can estimate this absolute error as a function of some measure of forecast reliability.
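As a sketch of the setup (assuming a pandas DataFrame with one row per player and hypothetical column names; this is not the exact code behind the tables below), the fullest specification could be run with statsmodels:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns: abs_error = |actual - consensus forecast|,
# mean_forecast = consensus forecast, forecast_sd = SD of the six forecasts,
# tango_reliability = Tango's reliability score (0 = least reliable, 1 = most)
df = pd.read_csv("hr_forecasts.csv")  # assumed input file

X = sm.add_constant(df[["mean_forecast", "forecast_sd", "tango_reliability"]])
fit = sm.OLS(df["abs_error"], X).fit()
print(fit.summary())  # corresponds to the column [3] specifications in the tables below
```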
Tango’s reliability score is one such measure, and the forecast standard deviation is another. What we would predict is that Tango’s score (where 0 means least reliable and 1 means most) would have a negative effect on the error. We would also predict that the forecast standard deviation would have a positive effect on the error. Now let’s see what the data tell us:
Runs:
| R absolute error | |||
| [1] | [2] | [3] | |
| R Standard Deviation | 0.45 | 0.44 | |
| (0.27) | (0.32) | ||
| R mean forecast | 0.05 | 0.02 | 0.03 |
| (0.06) | (0.05) | (0.06) | |
| Tango’s reliability measure | -8.15 | -0.59 | |
| (9.09) | (10.60) | ||
| Constant | 22.94 | 14.93 | 15.36 |
HRs:
Dependent variable: HR absolute error

| | [1] | [2] | [3] |
| --- | --- | --- | --- |
| HR Standard Deviation | 0.82 | | 0.78 |
| | (0.30) | | (0.32) |
| HR mean forecast | 0.20 | 0.12 | 0.13 |
| | (0.03) | (0.04) | (0.04) |
| Tango’s reliability measure | | -3.26 | -0.84 |
| | | (2.52) | (2.69) |
| Constant | 5.32 | 2.31 | 2.94 |
RBIs:
Dependent variable: RBI absolute error

| | [1] | [2] | [3] |
| --- | --- | --- | --- |
| RBI Standard Deviation | 0.44 | | 0.34 |
| | (0.28) | | (0.31) |
| RBI mean forecast | 0.09 | 0.05 | 0.08 |
| | (0.05) | (0.05) | (0.05) |
| Tango’s reliability measure | | -12.52 | -7.83 |
| | | (9.12) | (10.08) |
| Constant | 23.78 | 12.66 | 18.37 |
SBs:
Dependent variable: SB absolute error

| | [1] | [2] | [3] |
| --- | --- | --- | --- |
| SB Standard Deviation | 0.50 | | 0.41 |
| | (0.24) | | (0.27) |
| SB mean forecast | 0.37 | 0.30 | 0.31 |
| | (0.03) | (0.04) | (0.04) |
| Tango’s reliability measure | | -3.47 | -1.90 |
| | | (2.19) | (2.42) |
| Constant | 3.80 | 0.75 | 2.30 |
AVG:
Dependent variable: AVG absolute error

| | [1] | [2] | [3] |
| --- | --- | --- | --- |
| AVG Standard Deviation | 0.567 | | 0.287 |
| | (0.689) | | (0.713) |
| AVG mean forecast | -0.085 | -0.107 | -0.083 |
| | (0.091) | (0.090) | (0.092) |
| Tango’s reliability measure | | -0.023 | -0.022 |
| | | (0.014) | (0.015) |
| Constant | 0.069 | 0.054 | 0.066 |
We see that HRs are the statistic whose errors are most easily forecast, errors for Runs, RBIs, and SBs are moderately forecastable, and errors for AVG are not very forecastable. This shows up in the negative and statistically significant coefficients on Tango’s score and the positive and statistically significant coefficients on the standard deviation measure. In regressions with both measures, the standard deviation measure encompasses Tango’s measure, except in the AVG equation.
So what does this all mean? If you’re looking at rival forecasts, roughly 80% of the standard deviation among the HR forecasts, and about 50% of the standard deviation among the forecasts for the other stats, is legitimate randomness. In other words, you can gauge how random a player’s performance will be from the variation in the forecasts, especially for home runs. If you don’t have time to compare different forecasts, Tango’s reliability score is a rough, though fairly imprecise, approximation.
*For those of you unfamiliar with regression analysis, imagine a graph of dots and drawing a line through it. Now imagine the graph has 3 or 4 dimensions and do the same; the line is drawn so that the sum of the squared distances between the dots and the line is minimized.
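In symbols, ordinary least squares picks the coefficients $\beta$ that minimize the sum of squared vertical distances between the data points and the fitted line (or plane):

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \left(y_i - x_i'\beta\right)^2$$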
Last year, Manny Acta made a splash by dropping Grady Sizemore to second in the batting order. This year, he’s considering moving him back to leadoff. Is either the right move? And how should the rest of the lineup look?
In Part 1 of this article, I looked at the ability of individual projection systems to forecast hitter performance. The six projection systems considered are ZiPS, CHONE, Marcel, CBS Sportsline, ESPN, and Fangraphs Fans, each freely available online. It turns out that when we control for bias in the forecasts, the forecasting systems are, on average, pretty much the same. In what follows, I show that the Fangraphs Fan projections and the Marcel projections contain the most unique, useful information, and that a weighted average of the six forecasts predicts hitter performance much better than any individual projection.
Forecast encompassing tests can be used to determine which of a set of individual projections contain the most valuable information. Based on the encompassing test results, we can construct a weighted average of the six forecasts that outperforms any individual forecast.
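As a rough sketch of what a combined forecast could look like (hypothetical file and column names; the weights here come from a simple regression of actuals on the six forecasts, standing in for the formal encompassing-test machinery):

```python
import pandas as pd
import statsmodels.api as sm

systems = ["zips", "chone", "marcel", "cbs", "espn", "fans"]  # one forecast column per system
df = pd.read_csv("hr_forecasts.csv")  # assumed input: six forecasts plus actual outcomes

# Regress actual HRs on the six forecasts; the fitted coefficients serve as combination weights
X = sm.add_constant(df[systems])
fit = sm.OLS(df["actual_hr"], X).fit()

df["combined_forecast"] = fit.predict(X)
print(fit.params)  # larger weights suggest a system carries more unique information
```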
We hear about position scarcity all the time, but category scarcity also plays a role in valuing players. In 2000, 47 players hit at least 30 HR (hmm, wonder why?) as compared to just 18 players in 2010. Mark Reynolds hit 32 HR last year and tied for 10th in baseball. Many fantasy owners continued to start Reynolds every day despite his sub-Mendoza .198 average because his power was so valuable. Had Reynolds hit 32 HR with a .198 average back in 2000, he would have been riding the digital pine. Power wasn’t at a premium back then.
And that’s category scarcity in a nutshell. In fact, position scarcity is really just a function of category scarcity. Shortstop is only considered shallow because there are so few players who can contribute across the board. A quick look at any shortstop rankings shows how rapidly talent plummets at the position.
There are a number of published baseball player forecasts that are freely available online. As Dave Allen notes in his article on Fangraphs Fan Projections, and as I find as well, some projections are definitely better than others. Part 1 of this article examines the overall fit of each of six different player forecasts: ZiPS, CHONE, Marcel, CBS Sportsline, ESPN, and Fangraphs Fans. I find that the Marcel projections are the best based on average error, followed by the ZiPS and CHONE projections. However, if we control for the over-optimism of each of these projection systems, the forecasts are virtually indistinguishable.
This second result is important in that it requires us to dig a little deeper to see how much each of these forecasts is actually helping to predict player performance. This is addressed in Part 2 of this article.
The tool that is generally used to compare the average fit of a set of forecasts is Root Mean Squared Forecasting Error (RMSFE). This measure is imperfect in that it doesn’t consider the relative value of an over-projection versus an under-projection; for example, in early rounds of a fantasy draft we may be drafting to limit risk, while in later rounds we may be seeking risk. That said, RMSFE is easy to understand and is thus the standard for comparing the average fit of a projection.
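For reference, the RMSFE for a given system and category is just the square root of the average squared miss over the $N$ players in the sample:

$$\text{RMSFE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2},$$

where $\hat{y}_i$ is the projected value and $y_i$ the actual value for player $i$.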
Table 1 shows the RMSFE of each projection system in each of the five main fantasy categories for hitters. Here, we see that the “mechanical” projection systems (Marcel, ZiPS, and CHONE) outperform the three “human” projections. Each value is the standard deviation of the error of a particular forecast; in other words, about two-thirds of the time, a player projected by Marcel to score 100 runs will score between roughly 75 and 125 runs.
Table 1. Root Mean Squared Forecasting Error
| | Runs | HRs | RBIs | SBs | AVG |
| --- | --- | --- | --- | --- | --- |
| Marcel | 24.43 | 7.14 | 23.54 | 7.37 | 0.0381 |
| ZiPS | 25.59 | 7.47 | 26.23 | 7.63 | 0.0368 |
| CHONE | 25.35 | 7.35 | 24.12 | 7.26 | 0.0369 |
| Fangraphs Fans | 29.24 | 7.98 | 32.91 | 7.61 | 0.0396 |
| ESPN | 26.58 | 8.20 | 26.32 | 7.28 | 0.0397 |
| CBS | 27.43 | 8.36 | 27.79 | 7.55 | 0.0388 |
Another important measure is bias. Bias occurs when a projection consistently over- or under-predicts. Bias inflates the RMSFE, so a simple bias correction may improve a forecast’s fit substantially. In Table 2, we see that the human projection systems exhibit substantially more bias than the mechanical ones.
Table 2. Average Bias
| | Runs | HRs | RBIs | SBs | AVG |
| --- | --- | --- | --- | --- | --- |
| Marcel | 7.12 | 2.09 | 5.82 | 1.16 | 0.0155 |
| ZiPS | 11.24 | 2.55 | 11.62 | 0.73 | 0.0138 |
| CHONE | 10.75 | 2.67 | 9.14 | 0.61 | 0.0140 |
| Fangraphs Fans | 17.75 | 4.03 | 23.01 | 2.80 | 0.0203 |
| ESPN | 13.26 | 3.78 | 11.59 | 1.42 | 0.0173 |
| CBS | 15.09 | 4.08 | 14.17 | 2.05 | 0.0173 |
We can get a better picture of which forecasting system is best by correcting for bias in the individual forecasts. Table 3 presents the bias-corrected RMSFEs. The results tighten considerably, and after the correction each forecasting system performs about the same.
Table 3. Bias-corrected Root Mean Squared Forecasting Error
| | Runs | HRs | RBIs | SBs | AVG |
| --- | --- | --- | --- | --- | --- |
| Marcel | 23.36 | 6.83 | 22.81 | 7.28 | 0.0348 |
| ZiPS | 22.98 | 7.02 | 23.52 | 7.59 | 0.0341 |
| CHONE | 22.96 | 6.85 | 22.33 | 7.24 | 0.0341 |
| Fangraphs Fans | 23.24 | 6.88 | 23.53 | 7.08 | 0.0340 |
| ESPN | 23.03 | 7.27 | 23.62 | 7.14 | 0.0357 |
| CBS | 22.91 | 7.29 | 23.90 | 7.27 | 0.0347 |
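For anyone who wants to replicate the correction, here is a minimal sketch (hypothetical file and column names, one system and one category at a time) that produces numbers in the spirit of Tables 1-3:

```python
import numpy as np
import pandas as pd

# Hypothetical columns: projected (one system, one category) and actual
df = pd.read_csv("marcel_runs.csv")  # assumed input file

errors = df["projected"] - df["actual"]
bias = errors.mean()                                        # average bias (as in Table 2)
rmsfe = np.sqrt((errors ** 2).mean())                       # raw RMSFE (as in Table 1)
rmsfe_corrected = np.sqrt(((errors - bias) ** 2).mean())    # bias-corrected RMSFE (as in Table 3)

print(f"bias = {bias:.2f}, RMSFE = {rmsfe:.2f}, bias-corrected RMSFE = {rmsfe_corrected:.2f}")
```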
So where does this leave us if these six forecasts are basically indistinguishable? As it turns out, evaluating the performance of individual forecasts doesn’t tell the whole story. There may be useful information in each of the different forecasting systems, so an average or a weighted average of forecasts may prove to be a better predictor than any individual forecast. Part 2 of this article examines this in some detail. Stay tuned!