Expected Run Differential: Using Statcast to Build a Team Performance Metric

In order for a team to win the World Series, it needs to win a whole bunch of games. In order to win games, a team needs to score more runs than its opponents. Over time, we’ve come to accept that a team’s winning percentage, while very important, is an imperfect predictor of how likely a team is to win and lose games in the future.

Take the year-to-year performances of teams over the last three seasons (the time frame of focus for this post). The correlation between a team’s year one and year two winning percentages (Win%) is limited (R2 of 0.19). Extending the sample back to 1995 improves the correlation, but only slightly (R2 of 0.25).

Replace a team’s year one winning percentage with a team’s year one run differential per game (RD/G) and you’re left with a slightly stronger correlation (R2 of 0.21). Again, a slightly stronger correlation exists if we extend the sample back to 1995 (R2 of 0.26).

Now, a lot of different things go into scoring more runs than your opponent does, namely hitting, pitching, running and fielding well. In the Statcast era (2015-17, which is why I limited myself to the small sample above), we have new ways of examining hitting and pitching. Instead of being limited to what actually happened, we can observe what was expected to happen, given the combinations of exit velocity and launch angle associated with a batted ball.

That led me to wonder if there was any use in creating a version of RD/G that was regressed from components taken (in part) from this Statcast data. Intuitively, there should be, as RD/G suffers from two big issues—batted ball luck and cluster luck—which could introduce statistical noise and drown out the metric’s signal.

Batted ball luck comes from the fact that a team may run a high RD/G not because they have been hitting and/or pitching well, but because they have been getting lucky results relative to the underlying contact. Vice versa for artificially low RD/G caused by unlucky results.

Cluster luck comes from the fact that a team can score an unsustainably high number of runs if hits are clustering in a small number of innings—the classic example compares two teams that produced nine hits in a game. The one that gets one hit per inning may end up with zero runs scored, while the one that gets nine hits in one inning may score a handful of runs. A team that is scoring a lot of runs because it is generating lots of hits is likely experiencing more sustainable success.

If we produce a RD/G based on xwOBA (alongside some other metrics), we may be able to overcome both issues. This expected run differential per game (xRD/G) would avoid a great deal of the batted ball luck problem, as it rewards teams for generating lots of good contact and limiting good contact from the pitching side of the equation. It would also overcome a great deal of the cluster luck problem, as xwOBA is unaffected by the order or clustering of batted ball events. The xRD/G that I will elaborate on in this post certainly seems like a useful contribution to the discourse. For example, a team’s year one xRD/G is much more strongly correlated to its year two winning percentage (R2 of 0.37) than either RD/G or Win%.

[Statistical note: I’m going to be using R2 frequently in this post. R2 measures the share of the variation in one variable that is explained by another. It ranges from 0 to 1. Interpreting an R2 requires context, as 0.37 may be considered high in one context and low in another. In this context, comparing different variables from different time frames, a lower R2 would be expected. The key then is comparing the R2 of different relationships, as I’ve done above.]
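For readers who want to reproduce this kind of comparison, here is a minimal sketch of the calculation. It assumes each R2 comes from a simple one-variable linear fit, in which case R2 equals the squared Pearson correlation. The function name and the sample numbers are made up for illustration and are not taken from the team data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("R2 is undefined when either variable is constant")
    return cov ** 2 / (var_x * var_y)


# Illustrative values only (not real team data).
year_one = [1, 2, 3, 4]
year_two = [1, -1, 1, -1]
print(round(r_squared(year_one, year_two), 2))
```

Feeding in each team's year one metric as `xs` and its year two Win% as `ys` would give the kind of figure quoted throughout this post.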

How is xRD/G calculated?

xRD/G seeks to estimate the run differential per game that a team would have been expected to produce given the team’s batting xwOBA, starting pitching xwOBA, relief pitching xwOBA, baserunning runs (BsR per 600 PA) and defensive runs saved (DRS per 150 games, when possible given data split limitations). Given the recent conversation around what “expected” means in these new x-stats, let me make clear that I agree with Craig Edwards’ take: “I have always interpreted the ‘expected’ to mean ‘what might have been expected to happen given neutral park and defense.'” That said, as we have already seen, xRD/G has predictive value as well.

I started working on xRD/G by regressing the RD/G produced by a team in a given season (from 2015-17) against that team’s batting xwOBA, SP xwOBA, RP xwOBA, BsR/600 and DRS/150. [I used DRS given that it accounts for pitcher and catcher defence, unlike UZR.] These five stats explain about 79% of the variation in a team’s full-season run differential per game and were each highly significant. I opted against using a constant term as it was not statistically significant, nor did it increase adjusted R2.
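A regression without a constant term can be sketched in a few lines. The version below is an assumption about the general technique (ordinary least squares through the origin, solved via the normal equations), not the exact software or procedure used for this post. The function name and toy data are hypothetical.

```python
def fit_no_intercept(rows, targets):
    """Ordinary least squares without a constant term.

    rows    -- list of feature lists, one per team-season
    targets -- list of observed RD/G values, in the same order
    Returns one coefficient per feature.
    """
    k = len(rows[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for other in range(k):
            if other != col:
                factor = aug[other][col] / aug[col][col]
                aug[other] = [a - factor * b for a, b in zip(aug[other], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Toy data generated from y = 2*x1 - 3*x2, so the fit should recover 2 and -3.
toy_rows = [[1, 0], [0, 1], [1, 1], [2, 1]]
toy_targets = [2, -3, -1, 1]
print(fit_no_intercept(toy_rows, toy_targets))
```

With real data, each row would hold a team-season's batting xwOBA, SP xwOBA, RP xwOBA, BsR/600 and DRS/150. In practice a statistics package is the better choice, since it also reports significance levels and handles near-collinear inputs more robustly than the normal equations.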

Then, I incorporated interaction terms, particularly between 1) batting xwOBA and BsR/600, 2) SP xwOBA and DRS/150 and 3) RP xwOBA and DRS/150. The eight terms explained 81% of the variation in a team’s full-season RD/G. However, I found that the latter two were statistically insignificant. After removing them, I was left with six highly significant variables that still explained 81% of the variation in full-season RD/G. These six variables comprise the full xRD/G equation that I chose to settle on:

xRD/G = 23.31*(Batting xwOBA) – 2.52*(BsR/600) + 8.34*(Batting xwOBA)*(BsR/600) – 13.16*(SP xwOBA) – 10.19*(RP xwOBA) + 0.004*(DRS/150)
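For anyone who wants to plug numbers into the equation, here is a direct transcription as a small function. The coefficients are the rounded ones printed above. The function and parameter names are my own labels, and the sample inputs are made up rather than taken from any real team.

```python
def xrd_per_game(bat_xwoba, bsr_600, sp_xwoba, rp_xwoba, drs_150):
    """Full-season xRD/G using the rounded coefficients from the equation above."""
    return (
        23.31 * bat_xwoba
        - 2.52 * bsr_600
        + 8.34 * bat_xwoba * bsr_600   # interaction: batting xwOBA x base running
        - 13.16 * sp_xwoba
        - 10.19 * rp_xwoba
        + 0.004 * drs_150
    )


# Hypothetical team: slightly above-average bats, slightly stingy pitching.
print(round(xrd_per_game(0.330, 1.0, 0.310, 0.310, 10.0), 3))
```

As a sanity check, a team that is exactly average everywhere (every xwOBA input at .317, with BsR/600 and DRS/150 at zero) comes out within a couple hundredths of a run of zero.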

The coefficients are mostly straightforward. A higher RD/G is correlated with a higher batting xwOBA and lower SP/RP xwOBA. Better defence leads to a better run difference. However, the correlation between base running and run difference is a little tricky to read because of the interaction term. In a nutshell, base running is good, but it’s more useful for teams that have more base runners (higher batting xwOBA). [Let me explain via an example. The average team batting xwOBA in the sample is .317. For a team with an average batting xwOBA, a one-unit increase in BsR/600 is associated with a 0.13 run increase in its RD/G. For teams with a low batting xwOBA (around .300), base running isn’t associated with a change in RD/G. For teams with a high batting xwOBA (around .350), a one-unit increase in BsR/600 is associated with a 0.39 run increase in RD/G.]
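The base running example can be checked by collecting the two BsR/600 terms of the equation: the change in xRD/G per one-unit increase in BsR/600 is -2.52 + 8.34*(Batting xwOBA). The sketch below uses the rounded coefficients, so its output only approximately matches the figures quoted above, which were presumably computed from unrounded values. The names are hypothetical.

```python
BSR_COEF = -2.52          # coefficient on BsR/600
INTERACTION_COEF = 8.34   # coefficient on (Batting xwOBA) * (BsR/600)


def bsr_marginal_effect(bat_xwoba):
    """Change in xRD/G per one-unit increase in BsR/600 at a given batting xwOBA."""
    return BSR_COEF + INTERACTION_COEF * bat_xwoba


# Batting xwOBA at which base running has no net association with RD/G.
BREAK_EVEN_XWOBA = -BSR_COEF / INTERACTION_COEF

for xwoba in (0.300, 0.317, 0.350):
    print(xwoba, round(bsr_marginal_effect(xwoba), 3))
print("break-even batting xwOBA:", round(BREAK_EVEN_XWOBA, 3))
```

The break-even point lands at roughly .302, which is why base running shows almost no association with RD/G for the low-xwOBA team in the example.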

There is a great deal of wiggle room in producing other versions of xRD/G. I opted to build the equation by regressing teams’ full-season RD/G against six variables taken from the same time frame. Alternate versions of xRD/G could be built by regressing teams’ RD/G over smaller periods (month, half-season, etc.) against variables taken from the same time frame. Alternate versions could also put more emphasis on predictiveness, by regressing teams’ RD/G over some period against variables from a previous period of time.

Similarly, I opted to go with xRD/G because it seemed most fruitful after a brief analysis of potential alternatives. I also played around with expected runs scored and allowed per game (xRS/G and xRA/G) and expected win percentage (xWin%). While not as initially fruitful as xRD/G, these are ideas worth coming back to. As such, consider the analysis in this post to only be a jumping off point in building an all-in-one team performance statistic based (in part) on Statcast variables.

Testing the reflectiveness, predictiveness and consistency of xRD/G

When examining a new metric, there are three key questions to answer.

1) How well does the metric reflect what has happened?

Pretty well. A team’s xRD/G explains 75% of the variation in same year winning percentage. For context, RD/G explains 85% of this variation. That RD/G is better than xRD/G at telling us what happened is not surprising. After all, wins require teams to score runs and limit runs against. A team’s full-season xRD/G is also highly correlated to RD/G, explaining about 83% of its variation. The slope of the trendline is roughly one.

2) How well does the metric predict what will happen?

Predictive power is the true strength of xRD/G, which is interesting because it wasn’t specifically built to predict. As mentioned earlier, a team’s full-season xRD/G explains 37% of the variation in next season’s winning percentage, compared to only 21% for the team’s first-year RD/G and 19% for the team’s first-year Win%.

Similarly, a team’s full-season xRD/G is a better predictor of next-season RD/G than RD/G itself. While a team’s first-year RD/G explains only 20% of the variation in second-year RD/G, a team’s first-year xRD/G explains 36% of this variation.

xRD/G is also useful for in-season prediction. Let’s split the three seasons of data into halves, demarcated by each season’s all-star break. Because DRS splits are unavailable, I’ve had to create a modified xRD/G that excludes DRS/150, built with the following equation:

xRD/G = 26.05*(Batting xwOBA) – 2.91*(BsR/600) + 9.73*(Batting xwOBA)*(BsR/600) – 13.49*(SP xwOBA) – 12.69*(RP xwOBA)

Let’s start with some context: a team’s first-half Win% explains 27% of the variation in its second-half Win%, while a team’s first-half RD/G explains 34% of this variation. However, a team’s first-half xRD/G explains 39% of the variation in its second-half Win%, despite the forced exclusion of DRS/150. Theoretically, including DRS/150 would make a team’s first-half xRD/G even more predictive of its second-half record.

The predictive power of xRD/G is even more evident when explaining a team’s second-half RD/G. While first-half RD/G explains only 24% of the variation in its second-half cousin, first-half xRD/G explains 33% of this variation.

3) How consistent is this metric over time?

Beyond its predictiveness, xRD/G is a relatively consistent metric. Again, let’s first look at the other two stats for some context. A team’s Win% is the least consistent of the bunch. As observed earlier, a team’s full-season Win% explains only 19% of the variation in its next-season Win%, while a team’s first-half Win% explains only 27% of the variation in its second-half Win%.

RD/G is about as consistent as Win%. As observed earlier, a team’s full-season RD/G explains about 20% of the variation in its next-season RD/G, while a team’s first-half RD/G explains about 24% of the variation in its second-half RD/G. In contrast, a team’s xRD/G is much more consistent both from year-to-year (R2 of 0.35) and from half-to-half (R2 of 0.47).

xwOBA vs. wOBA

Given my intention of using Statcast data to create xRD/G, incorporating batter, starting pitcher and relief pitcher xwOBA into xRD/G was an obvious choice. However, it is fair to ask whether xRD/G would be an even better metric if wOBA was used in place of xwOBA.

In order to test this, I built two versions of xRD/G based on wOBA. For the sake of consistency, I used the same variables as above, but with wOBA instead of xwOBA.

The full-season version, which includes DRS/150:

wOBA-xRD/G = 28.42*(Batting wOBA) – 0.49*(BsR/600) + 1.67*(Batting wOBA)*(BsR/600) – 13.63*(SP wOBA) – 15.03*(RP wOBA) + 0.0006*(DRS/150)

And the half-season version, which excludes DRS/150:

wOBA-xRD/G = 29*(Batting wOBA) – 0.45*(BsR/600) + 1.55*(Batting wOBA)*(BsR/600) – 13.69*(SP wOBA) – 15.56*(RP wOBA)

Unsurprisingly, the base running and fielding variables are not statistically significant, likely because wOBA already reflects those aspects of the game: good base running helps batters get extra bases (leading to a higher batter wOBA), and good fielding helps limit the number/quality of base hits that a pitcher allows (leading to a lower SP/RP wOBA).

It would appear that xwOBA makes a more useful foundation for xRD/G than does wOBA. The xwOBA-based xRD/G is more reflective of what happened in a given season, in terms of both RD/G (R2 of 0.83 vs. 0.73 for the wOBA-based version) and Win% (R2 of 0.75 vs. 0.69). It can better predict a team’s future RD/G in both a season-to-season (R2 of 0.36 vs. 0.29) and half-to-half time frame (R2 of 0.33 vs. 0.30). Similarly, it is more predictive of a team’s future Win% in both a season-to-season (R2 of 0.37 vs. 0.31) and half-to-half time frame (R2 of 0.39 vs. 0.36). It also has the edge in terms of half-to-half consistency (R2 of 0.47 vs. 0.33). The one edge that the wOBA-based xRD/G has is in season-to-season consistency (R2 of 0.35 vs. 0.42).

xRD/G vs. FanGraphs’ Projections

A much bigger test of xRD/G’s predictive power is the FanGraphs Playoff Odds projections: “FanGraphs Projections Mode…uses a combination of Steamer and ZiPS projections and the FanGraphs Depth Charts to calculate the winning percentage of each remaining game in the MLB season.” Conveniently, one can find a rest-of-season Win% projection for any date since the start of the 2016 season (so this section will focus only on the 2016-17 seasons).

FanGraphs’ preseason Win% projections explain 43% of the variation in teams’ full-season Win%. As noted earlier, a team’s previous-season xRD/G explains about 37% of the variation in a team’s Win%. So, while it’s more predictive of Win% than a team’s previous-season RD/G and Win%, xRD/G comes up a little short when matched up with the FanGraphs preseason projection.

We can repeat the test using FanGraphs rest-of-season Win% projections at the 2016 and 2017 all-star breaks. In this case, the FanGraphs projected Win% is less correlated with future Win% than its preseason version. This midseason projected Win% explains 40% of the variation in a team’s second-half Win%, more than either first-half Win% (27%) or RD/G (34%).

However, first-half xRD/G has even more predictive power. Earlier, we saw that (from 2015-17) a team’s first-half xRD/G explains 39% of the variation in its second-half Win%. However, over the last two seasons, a team’s first-half xRD/G explains 46% of the variation in its second-half Win%.

The idea that first-half xRD/G was less predictive of second-half Win% in 2015 than in 2016-17 makes a lot of sense. As has been well-documented by FanGraphs, FiveThirtyEight and countless others, the 2015 all-star break represented a turning point in MLB. There is very strong evidence that baseballs were altered at that point to help them travel farther, leading to a power surge across the majors.

This change is also reflected in a key component of xRD/G: xwOBA. The largest half-to-half xwOBA gap in the last three seasons occurred in 2015, when MLB batters produced a combined .302 xwOBA before the all-star break and a .315 mark afterwards. In fact, from 2015 to 2016 to 2017, two trends emerged: the absolute half-to-half gap in xwOBA shrank (from 0.013 to 0.007 to 0.003), while the share of variation in second-half Win% explained by first-half xRD/G grew (from 32% to 38% to 52%).

Similarly, the ability of FanGraphs’ midseason Win% projection to predict a team’s second-half Win% improved from 2016 to 2017. In 2016, it explained 34% of the variation in second-half Win%, while in 2017 it was able to explain 45% of this variation. However, both of these single-season marks fall short of xRD/G. Going forward, if MLB’s run-scoring environment continues to be stable over the course of the season (as it was in 2017), the midseason predictive power of xRD/G may continue to be quite strong.

An important test

A big issue with building xRD/G is the limited sample size. Not only is the Statcast era limited to three full seasons but, since xRD/G is a team-level stat, I only have 30 observations per season. One of my concerns is that I’m using 2015-17 data to build the xRD/G equation, then going back to the same data and testing the metric’s predictive power, which could lead to artificially positive results.

In order to test for this, I decided to rebuild xRD/G (temporarily, for this purpose only) using only data from 2015-16. Then, I’d only use 2017 data to test this new metric’s predictiveness.
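The logic of this check is a standard holdout split: fit on some seasons, then evaluate on a season the fit never saw. Below is a minimal sketch of the splitting step only. The row layout, field names and values are hypothetical placeholders rather than the actual data used here.

```python
def holdout_split(rows, holdout_season):
    """Separate the rows used to fit the equation from the rows used to test it."""
    train = [r for r in rows if r["season"] != holdout_season]
    test = [r for r in rows if r["season"] == holdout_season]
    if not train or not test:
        raise ValueError("both the training and holdout sets must be non-empty")
    return train, test


# Made-up rows for illustration only; these are not real team data.
rows = [
    {"season": 2015, "team": "A", "rd_per_game": 0.40},
    {"season": 2015, "team": "B", "rd_per_game": -0.25},
    {"season": 2016, "team": "A", "rd_per_game": 0.10},
    {"season": 2016, "team": "B", "rd_per_game": 0.05},
    {"season": 2017, "team": "A", "rd_per_game": 0.30},
    {"season": 2017, "team": "B", "rd_per_game": -0.60},
]

train, test = holdout_split(rows, holdout_season=2017)
print(len(train), "rows to fit on;", len(test), "rows held out")
```

The coefficients would then be estimated from `train` alone, so that any relationship with 2017 outcomes cannot have been baked in during the fitting step.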

The full-season version, which includes DRS/150:

xRD/G = 22.24*(Batting xwOBA) – 1.68*(BsR/600) + 5.65*(Batting xwOBA)*(BsR/600) – 11.51*(SP xwOBA) – 10.85*(RP xwOBA) + 0.005*(DRS/150)

And the half-season version, which excludes DRS/150:

xRD/G = 25.70*(Batting xwOBA) – 2.28*(BsR/600) + 7.71*(Batting xwOBA)*(BsR/600) – 14.13*(SP xwOBA) – 11.64*(RP xwOBA)

The results suggest that this particular concern is nothing to worry about. This version of 2016 xRD/G was able to account for 34% of the variation in a team’s 2017 Win%, implying that its predictive power was much stronger than that of a team’s 2016 record (12%) or RD/G (16%) and roughly equal to that of FanGraphs’ 2017 preseason Win% projections (35%). Moreover, this version of a team’s 2017 first-half xRD/G explained 50% of the variation in second-half Win%, a better mark than a team’s first-half record (31%), RD/G (39%) and midseason FanGraphs projected rest-of-season record (45%).

“Predicting” the Postseason

Finally, let’s examine how well xRD/G predicts the outcome of playoff series relative to RD/G, Win% and FanGraphs’ playoff odds. This section is mainly for fun, as we are working with a very small sample of series. Moreover, I will assume that all predictions are equal—whether a team has a one run edge in xRD/G or a 0.01 run edge, they will be viewed as the predicted winner.
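The scoring rule just described can be written out as a short sketch. The team labels and values are placeholders, and sending ties to the second-listed team is an arbitrary assumption of mine, since the post does not say how ties would be handled.

```python
def count_correct_picks(series, metric):
    """Count the series in which the team with the higher metric value won.

    series -- list of (team_a, team_b, winner) tuples
    metric -- dict mapping each team to its value (e.g. xRD/G)
    """
    correct = 0
    for team_a, team_b, winner in series:
        # Any edge counts as a pick, however small; ties go to team_b.
        pick = team_a if metric[team_a] > metric[team_b] else team_b
        if pick == winner:
            correct += 1
    return correct


# Placeholder teams and values, for illustration only.
metric = {"A": 0.85, "B": 0.84, "C": 0.20, "D": 0.60}
series = [("A", "B", "A"), ("C", "D", "C"), ("A", "D", "A")]
print(count_correct_picks(series, metric), "of", len(series))
```

Running the same series list through dictionaries of xRD/G, RD/G and Win% gives the head-to-head tallies compared in this section.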

From 2015-17, xRD/G was superior to both Win% and RD/G at predicting series winners, correctly predicting the winner in 19 of 27 series (vs. 16 correct predictions for the other two metrics). This is impressive, as the team that Win% predicts to win a series is always also the home team (except for the 2016 World Series), which is sort of an unfair advantage.

Focusing on the last two seasons allows us to include FanGraphs’ playoff odds in our comparison. Over the 2016-17 seasons, xRD/G, RD/G and FanGraphs’ playoff odds each made correct predictions in 13 of 18 series, while Win% made 12. Again, the fact that xRD/G is at least as predictive as Win% and FanGraphs’ playoff odds is impressive (for xRD/G), as the latter two account for home-field advantage as well as quality.

Let’s have a look at the individual series predictions. In the 2015 postseason, xRD/G was correct six times, erring exclusively in series that the Royals won. xRD/G was the only metric of the bunch to pick the Mets to win a series, let alone make the World Series. In 2016, xRD/G almost ran the table, erring only when it gave the Red Sox a slight edge against Cleveland in the ALDS. Also, as a Jays fan, I can’t help but note that only one team that made the LDS over the last three seasons had a negative xRD/G: the 2016 Texas Rangers. The 2015 Rangers are the second-worst LDS team in the bunch, by xRD/G. xRD/G had its worst postseason in 2017. It joined the other predictors by whiffing on Cleveland over the Yankees and the Nationals over the Cubs. xRD/G was sort of low on the Astros last season, at least relatively speaking, figuring that the Yankees and Dodgers would edge them in the ALCS and WS. For what it’s worth, those were both seven-game series.

Concluding Thoughts

xRD/G seems like an idea with a great deal of potential. Over our relatively small sample, xRD/G has been more strongly correlated with a team’s future record (whether next season or next half-season) than simple metrics like a team’s record or actual run differential per game. There is also evidence that it can be a better predictor of future Win% than FanGraphs’ team projections, particularly at the all-star break. It even holds its own in postseason series predictions, despite not accounting for home-field advantage.

xRD/G is also relatively consistent, another key strength. It is not improved by replacing xwOBA with wOBA, implying that the Statcast data is key to its usefulness. Finally, the ability of first-half xRD/G to predict a team’s second-half record has improved each year of the Statcast era, likely because, for unrelated reasons, the MLB run-scoring environment has been more stable with each passing year. This implies that continued stability might allow xRD/G to explain around half of the variation in a team’s second-half record.

There’s also a great deal of potential in the fact that there is a lot to play around with here and improve upon. An xRD/G built specifically to predict future records may, logically, be better at predicting future records than the version built in this post. Other foundational stats could be brought in which may enhance xRD/G’s reflectiveness, predictiveness and consistency. There are a lot of different threads to explore from here.

Finally, since I presented a number of different equations, let me end with what I consider to be the “proper” equation for xRD/G, which I use for posts on Jays from the Couch:

xRD/G = 23.31*(Batting xwOBA) – 2.52*(BsR/600) + 8.34*(Batting xwOBA)*(BsR/600) – 13.16*(SP xwOBA) – 10.19*(RP xwOBA) + 0.004*(DRS/150)

Rollie's Mustache

Awesome stuff, Jeff. I’ve been eagerly awaiting your release of this (Twitter follower) and appreciate all the work put in.