playerElo: Factoring Strength of Schedule into Player Analysis
*Note: All numbers updated to August 12th, 2019*
Introduction
Consider the following comparison between Freddie Freeman (29) and Carlos Santana (33). Both players were starters for the 2019 All-Star teams of their respective leagues, and both are enjoying breakout seasons beyond their usual high production level, with nearly identical statistics across the board.
Player | PA | wOBA | xwOBA | wRC+ |
Freeman, 1B | 533 | 0.400 | 0.398 | 146 |
Santana, 1B | 503 | 0.390 | 0.366 | 142 |
However, I argue that an underlying factor makes Santana’s success less impressive and Freeman’s worthy of MVP consideration: the quality of the pitchers each has faced. The Atlanta Braves’ division, the NL East, contains the respectable pitching of the Mets (13th league-wide in ERA), Nationals (15th), Marlins (16th), and Phillies (19th). Contrast this with the Cleveland Indians’ competition in the AL Central: the Twins (ninth), White Sox (22nd), Royals (24th), and Tigers (28th). Over 503 plate appearances, Santana has faced a top-15 pitcher (ranked by FIP) just 15 times, compared to 46 times for Freeman over 533 plate appearances. wRC+ controls for park effects and the current run environment, and xwOBA takes quality of contact into account, but these widely used metrics do not address the fact that Freeman and Santana have posted near-equal statistics against widely different qualities of competition. Thus, I present the playerElo modeling system.
Methodology
Inspired by Arpad Elo’s rating system for zero-sum games like chess, as well as FiveThirtyEight’s use of an Elo modeling scheme for MLB team ratings and season-wide predictions, playerElo treats all at-bats as events and maintains a running power ranking of all MLB batters and pitchers. The system uses expected run values over the 24 possible base-out states. Run values are calculated for each at-bat event by subtracting the run expectancy of the beginning state from that of the ending state and adding the runs scored.
The following run expectancy matrix presents the expected runs scored for the remainder of the inning, given the current run environment, baserunners, and number of outs. Data is sourced from all at-bats from 2016-2018, and expected run values are rounded to the second decimal place. For example, a grand slam hit with one out would shift the run expectancy from 1.54 to 0.27 and score four runs, so the run value of the play would be 2.73.
1B | 2B | 3B | 0 outs | 1 out | 2 outs |
— | — | — | 0.51 | 0.27 | 0.11 |
1B | — | — | 0.88 | 0.52 | 0.22 |
— | 2B | — | 1.15 | 0.69 | 0.32 |
— | — | 3B | 1.39 | 0.97 | 0.36 |
1B | 2B | — | 1.45 | 0.93 | 0.44 |
1B | — | 3B | 1.77 | 1.20 | 0.48 |
— | 2B | 3B | 1.97 | 1.40 | 0.56 |
1B | 2B | 3B | 2.21 | 1.54 | 0.75 |
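The run value calculation can be sketched in a few lines of Python. The run expectancy values are copied from the matrix above; the function and variable names are illustrative assumptions, not the authors' code, and treating a third out as zero remaining run expectancy is the standard convention assumed here.

```python
# Run expectancy keyed by (runner on 1B, runner on 2B, runner on 3B).
# Each tuple holds the expected runs with 0, 1, and 2 outs,
# copied from the 2016-2018 matrix in the article.
RUN_EXPECTANCY = {
    (False, False, False): (0.51, 0.27, 0.11),
    (True, False, False): (0.88, 0.52, 0.22),
    (False, True, False): (1.15, 0.69, 0.32),
    (False, False, True): (1.39, 0.97, 0.36),
    (True, True, False): (1.45, 0.93, 0.44),
    (True, False, True): (1.77, 1.20, 0.48),
    (False, True, True): (1.97, 1.40, 0.56),
    (True, True, True): (2.21, 1.54, 0.75),
}

EMPTY = (False, False, False)
LOADED = (True, True, True)


def run_expectancy(bases, outs):
    """Expected runs for the rest of the inning from a base-out state."""
    if outs >= 3:
        # Assumption: the inning is over, so no further runs are expected.
        return 0.0
    return RUN_EXPECTANCY[bases][outs]


def run_value(start_bases, start_outs, end_bases, end_outs, runs_scored):
    """Ending run expectancy minus beginning run expectancy, plus runs scored."""
    return (
        run_expectancy(end_bases, end_outs)
        - run_expectancy(start_bases, start_outs)
        + runs_scored
    )


# The article's example: a grand slam with one out (1.54 -> 0.27, four runs score).
grand_slam = run_value(LOADED, 1, EMPTY, 1, 4)
print(round(grand_slam, 2))  # 2.73
```

The same function handles negative plays: an inning-ending out with the bases loaded and two outs is worth 0 - 0.75 + 0 = -0.75 runs.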
The model begins with a calibration year of 2018, in which every player started at 1000. For 2019, players begin with their previous season’s ending playerElo, regressed slightly to the mean. A player with no plate appearances or batters faced in 2018, such as Vladimir Guerrero Jr. or Chris Paddack, is assigned a baseline playerElo of 1000. For every at-bat, given the current base-out state, an expected run value is calculated for both the batter and the pitcher, based on quadratic formulas of the historic performance of players of that caliber in the given situation. The dependence of the Elo formula on the base-out state makes the model context-dependent: it incorporates the fact that a bases-loaded double is far more valuable than a double with the bases empty, but it also accounts for the fact that runs were more likely to score in the former situation than in the latter.
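As a rough illustration of the preseason step, the sketch below carries a rating over from one season to the next. The article only says ratings are regressed "slightly", so the regression fraction here is a placeholder assumption, as is the choice to regress toward the 1000 baseline; the function name is hypothetical.

```python
BASELINE_ELO = 1000.0       # from the article: rating for players with no 2018 data
REGRESSION_FRACTION = 0.25  # ASSUMPTION: the article does not state the actual amount


def preseason_elo(previous_elo=None, mean_elo=BASELINE_ELO,
                  fraction=REGRESSION_FRACTION):
    """Return a starting rating for the new season.

    Players without a prior rating start at the baseline; everyone else
    is pulled part of the way back toward the mean.
    """
    if previous_elo is None:
        return BASELINE_ELO
    return previous_elo + fraction * (mean_elo - previous_elo)


rookie_start = preseason_elo()          # no 2018 data, so the baseline applies
veteran_start = preseason_elo(1200.0)   # hypothetical 2018 ending rating
```

With these placeholder settings, a player who ended the prior season 200 points above the mean would begin the next one 150 points above it.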
It is important to note that playerElo is a raw batting statistic and does not evaluate overall production, meaning stolen bases are not factored into the ranking system. Additionally, while the model does not take defense into account, it also does not count stolen bases or passed balls against a pitcher, and likewise does not credit a batter for changes in game state due to wild pitches.
Once an expected run value is synthesized from the current state and the playerElo of the batter and the pitcher, park factor and home-field advantage adjustments (if applicable) are made, and the expected run value of the play is compared to the true run value of the outcome. The playerElo of both the batter and the pitcher is then updated according to the difference between the true run value and the expected run value. For example, if an excellent pitcher strikes out a mediocre batter, the batter will not lose much Elo, and the pitcher will not gain much Elo. Conversely, if a below-average batter does extremely well against a top pitcher, there will be a far greater change in the Elo of both players. Errors are also taken into account: a positive run value that results from an error does not count against the pitcher or in favor of the batter.
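To make the update step concrete, here is a minimal sketch of an Elo-style update on run values. It is not the authors' implementation: the article uses quadratic formulas plus park and home-field adjustments, whereas this sketch substitutes a simple linear expectation in the rating gap and omits the adjustments. The constants `K_FACTOR` and `RUNS_PER_ELO_POINT`, the `state_baseline` parameter, and the zero-sum form of the update are all assumptions chosen for illustration.

```python
K_FACTOR = 10.0              # ASSUMPTION: Elo points exchanged per run of surprise
RUNS_PER_ELO_POINT = 0.0005  # ASSUMPTION: linear stand-in for the quadratic formulas


def expected_run_value(batter_elo, pitcher_elo, state_baseline=0.0):
    """Expected run value of the plate appearance, from the batter's side.

    state_baseline is a hypothetical average run value for the base-out
    state; the rating gap shifts the expectation up or down from it.
    """
    return state_baseline + RUNS_PER_ELO_POINT * (batter_elo - pitcher_elo)


def update_elos(batter_elo, pitcher_elo, true_run_value, state_baseline=0.0):
    """Move both ratings by the gap between the true and expected run value."""
    surprise = true_run_value - expected_run_value(
        batter_elo, pitcher_elo, state_baseline
    )
    delta = K_FACTOR * surprise
    # The batter gains what the pitcher loses (zero-sum, as in classic Elo).
    return batter_elo + delta, pitcher_elo - delta


# A strikeout leading off an inning: run expectancy falls from 0.51 to 0.27.
strikeout = -0.24

# Mediocre batter against an excellent pitcher: the out was largely expected.
weak_batter, strong_pitcher = update_elos(950.0, 1250.0, strikeout)

# Excellent batter against a mediocre pitcher: the same out is a bigger surprise.
strong_batter, weak_pitcher = update_elos(1250.0, 950.0, strikeout)
```

Under these assumptions the mediocre batter loses less rating for the strikeout than the excellent batter does, which is the behavior the paragraph above describes.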
Analysis
Elo Rankings
It is interesting to note that Charlie Blackmon does particularly well in the model, even with park factor adjustments. This can be attributed to the Rockies’ difficult schedule and, likely, to Blackmon’s strong contextual performance. The Rockies regularly face the powerhouse pitching of the Dodgers, as well as the respectable competition of the Diamondbacks, Padres, and Giants. Blackmon has faced top-15 pitchers (rated by FIP) 58 times thus far, fifth-most among batters with over 100 plate appearances. He also ranks 15th among batters in average playerElo of pitchers faced. His quality of contact does leave something to be desired; however, playerElo does not incorporate statistics like exit velocity and launch angle in its calculations, and thus the model is a better reflection of on-field performance than of underlying swing metrics. Conversely, Cody Bellinger, while he still ranks high, is knocked down a few rungs to ninth overall among batters due to ease of competition (only 26 top pitchers faced).
Pitching-wise, the breakout performances of Hyun-Jin Ryu, Giovanny Gallegos, and Charlie Morton are all captured and supported by the playerElo model. The continued success of Max Scherzer, Josh Hader, and Kirby Yates looks to be sustainable. Mike Minor ranks particularly high (18th among SPs) after seeing success against the high-powered offenses of the Astros (fourth in runs scored), Athletics (10th), Angels (11th), and Mariners (14th), and would have been an impact pitcher if dealt at the deadline.
In contrast, playerElo has little faith in Trevor Bauer, one of the most discussed pitchers at the deadline, especially regarding the impact he will have on the Reds. Bauer is ranked 74th among SPs with more than 100 batters faced, with a currentElo of 1010.21 after beginning the season with a preseasonElo of 1111.30. He has struggled mightily against the relatively easy AL Central (apart from the Twins), regularly facing the Royals (24th in runs scored), White Sox (28th), and Tigers (30th), and has posted an xFIP of 4.29. However, the Reds play in a division only slightly tougher offensively than the Indians’, and given Bauer’s recent trend of quality starts (despite his impressive temper tantrum), it’s entirely possible he does indeed turn things around in the second half. We can visualize Bauer’s season-long playerElo trend within the context of the league with the following graphic. Displayed below are the playerElo trends for 2019, with specific batters and pitchers highlighted. The bold line on both graphs denotes the average playerElo.
The graph illustrates Mike Trout has dominated since the start of the year and Bryce Harper has consistently been great and continues to rise in playerElo (now ranked 10th among batters). Fernando Tatis Jr. elevated from a rookie with a playerElo of 1000 to approaching the top tier of players, and Joey Votto has slowly declined in value all season, with occasional flashes of his former ability. On the pitcher side, both Scherzer and Hader have maintained their elite status all season. Aaron Nola’s curve looks to be a long valley, but by his current trend, he can be expected to regain his former value and playerElo score.
Sabermetric Analysis
As playerElo only evaluates historical season performance to-date, and is therefore not a prediction statistic, comparing playerElo with underlying swing metrics can enable accurate forecasts as to potential second-half decliners, as well as identify first-half breakouts that are for real.
The heatmap-style table below displays the top 40 players ranked by Expected Weighted On-Base Average (xwOBA), per Baseball Savant, with currentElo, Expected Batting Average (xBA), and Expected Slugging (xSLG) also displayed. By evaluating the four statistics collectively, it is easy to identify players such as J.D. Davis, Alex Avila, Justin Smoak, Jason Castro, and C.J. Cron, whose quality of contact has been excellent but who have faced relatively easier pitching or performed worse contextually than other batters with similar quality of contact (as shown by the lighter shade of currentElo). Additionally, players such as Trout, Anthony Rendon, Christian Yelich, DJ LeMahieu, and Freeman have performed extremely well, and their level of competition backs up their success. Harper and Xander Bogaerts are likely due for excellent second halves, as they have continuously faced tough competition and thrived under the circumstances, all the while maintaining solid quality of contact.
Pitching-wise, the heat map is again ranked by xwOBA. Yates and Hader stand out among all pitchers, both with a high playerElo and extremely low quality of contact allowed. Gallegos, Scherzer, Gerrit Cole, and Jacob deGrom all have a high currentElo and excellent quality of contact numbers. In contrast, Yimi García, Hector Neris, and Adam Morgan have posted great numbers, but their low currentElo indicates they have struggled contextually or against tougher competition (or simply have not faced tough competition), perhaps a sign of second-half regression.
Teams
Aggregate teamElo power rankings can also be created from playerElo, with each individual playerElo weighted by the batter’s plate appearances or the pitcher’s batters faced. Recall that playerElo does not take into account defense or stolen bases, which will influence the ratings slightly. Speedster teams such as the Royals and Mariners will rank slightly lower than their true abilities, and conversely the White Sox and Tigers, who have had atrocious defensive performances thus far (second and third in the league in errors, respectively), will rank slightly higher than their true value. However, the teamElo rankings still give a fairly accurate view of the current state of baseball, and particularly highlight upstart teams such as the Braves, Nationals, and Athletics that could be even better than they have seemed thus far. Team Batting Elo is highly correlated with Runs Scored (r = 0.90), as is Team Pitching Elo with Runs Against (r = -0.91).
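The weighting described above amounts to a weighted average, sketched below. The function name and the sample ratings are hypothetical, and the article does not say how the batting and pitching sides are combined into one team rating, so this sketch aggregates one side only.

```python
def team_elo(players):
    """Weighted average playerElo for one side of a team.

    players is a list of (player_elo, weight) pairs, where the weight is
    plate appearances for batters or batters faced for pitchers.
    """
    total_weight = sum(weight for _, weight in players)
    if total_weight == 0:
        raise ValueError("at least one player needs a nonzero weight")
    return sum(elo * weight for elo, weight in players) / total_weight


# Hypothetical lineup: an everyday player counts for more than a part-timer.
batting_elo = team_elo([(1100.0, 300), (900.0, 100)])
```

Because the regular in this example has three times the plate appearances, the result lands three quarters of the way toward his rating rather than at the simple midpoint.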
Application
The playerElo system is able to reveal characteristics of the game and of player performance that current metrics miss. For example, the context-dependent nature and run value calculations of playerElo appropriately credit a reliever who gets three outs without allowing a run after the previous pitcher loaded the bases. Similarly, when those runs do score, the blame is split between the pitcher who loaded the bases and the reliever who failed to get the outs needed to end the inning. Additionally, playerElo does not assign extra value to players who simply come to the plate as part of high-powered offenses that often have runners on, nor does it dock players who, by nature of playing for a bad team, do not get the same opportunities as their peers to accumulate counting statistics such as RBIs.
What sets the playerElo rankings apart is that they are both run-context-dependent, treating at-bats differently depending on the base-out state, and otherwise context-free, in the sense that all at-bats are given equal weights of importance. The playerElo system accurately captures whether hitters consistently bat in favorable situations, and appropriately does not allow this to over-inflate their value.
But most importantly, the fundamental aspect of the playerElo system is that the quality of competition is factored into player analysis. Once again, consider the comparison between Freeman and Santana, who have nearly identical statistics but clearly face different levels of pitching competition.
Player | PA | wOBA | xwOBA | wRC+ | playerElo |
Freeman, 1B | 533 | 0.400 | 0.398 | 146 | 1303.78 |
Santana, 1B | 503 | 0.390 | 0.366 | 142 | 1153.24 |
There is now a distinguishing factor between the two players, reflected by Freeman’s second-best overall playerElo among batters versus Santana’s 22nd-overall playerElo.
playerElo was developed by Jacob Richey and Professor Abraham Wyner of the University of Pennsylvania. A Shiny App containing up-to-date rankings can be found here: https://jrichey.shinyapps.io/playerelo/, with a link to the corresponding GitHub.
This is very interesting! For what it’s worth, DRC+, which does factor in quality of competition, has Freeman at 143 and Santana at 145.
I’d never heard of DRC+ before, thanks for letting me know! I’ll take a look at the methodology for the statistic, and perhaps can revise playerElo with some new ideas.
👏🙌
This was very interesting and seems well-done! I’ve always been interested in strength of schedule adjustments as well as ELO ratings in general. However, just out of curiosity, why did you use 1000 instead of 1500 as the average?
No particular reason, 1000 just seemed like a natural reference point. The true average is roughly 980, meaning a player with a four-digit Elo (1000 or above) is a valuable contributor to his team. The Shiny App is also color-coded by 5% quantiles to make playerElo interpretation quick and simple (i.e., blue good, red bad).
Very, very interesting. Do you do any adjustments for players who have been injured? For example, I noticed when I clicked through to the Web site that Andrew McCutchen has gained on the other outfielders even though he has not played.
*Edit–nevermind. Saw that his graph wasn’t trending any further since his injury. My bad.
I’m wondering, if the goal is to evaluate 2019 performance, then shouldn’t each player for the purposes of their own Elo calculation start at 1000 for 2019? Or better yet, for the purposes of their own calculation, shouldn’t they always be considered to be at 1000? Especially if you’re trying to make a claim such as, “worth MVP consideration”.
For example, let’s say Freeman and Santana both have their first AB of the season in the exact same situation, against the same pitcher, and have the same outcome. You wouldn’t want to say that Freeman is more worthy of MVP consideration at that point. I’m also not sure at what point previous years should essentially stop impacting current year. By the end of this year, if Freeman and Santana have the exact same production against the exact same pitchers, how much does the fact that Freeman started with a higher Elo impact things?
I’m not sure if this is an issue with the metric itself, or the way it’s being framed, but something seems a bit off there. It almost seems as if it would be better to use 2018 values in the gain/loss calculations, but still start everybody at neutral.
The nature of an Elo rating system is to keep a running ranking of every player or team, and to use the differences in these rankings to calculate the expected outcomes of matchups (win, loss, or run value in this case). These expected outcomes are then compared to what truly happened, and the Elo rankings are updated accordingly.
Additionally, if every player began the season at 1000, then the first month’s worth of batter-pitcher matchups would likely simply be calibration for the model, and not a true reflection of relative skill levels.
To your point on Freeman and Santana, if both players have the exact same production against the exact same pitchers, and Freeman started with a higher Elo than Santana, then (assuming the overall Elo change is positive) Santana would rise in the rankings more than Freeman, because he was expected to do worse. In contrast, Freeman was expected to perform better, so if he simply performed at Santana’s exact level, he would not see an increase in his Elo of the same magnitude as Santana’s.