Projecting Risk in Major League Baseball: A Bayesian Approach

The following is an introduction to a new Bayesian projection system, which can be found here.

Introduction / Motivation

This project was partly inspired by a recent episode of the Driveline Baseball podcast that aired this offseason, in which Kyle Boddy, founder of Driveline, and Mike Rathwell, CEO of Driveline, discussed the topic that dominated baseball conversation for many months: whether teams should sign Manny Machado or Bryce Harper. Their initial reaction was disappointment that the debate had started in the first place. Machado and Harper, they argued, had almost nothing in common beyond belonging to the same free agent class. They play different positions (implying different replacement levels and, thus, entirely different markets) and, more importantly, carry entirely different amounts of uncertainty in their projections. Machado has been as reliable as they come, playing premium positions (SS and 3B) and improving his defensive abilities every year. Harper, on the other hand, was coming off what some have called one of the worst defensive seasons by an outfielder in recent history. However, he also had a 10-WAR season in 2015, a ceiling Machado hasn’t touched. This led into a broader discussion about how to compare contracts for players with different levels of risk. Specifically, they explained how the sabermetric community’s approach to valuing risk in Major League Baseball contracts has fallen short in three areas.

First, while many writers make note of the riskiness of certain assets, they fail to define that risk in precise terms. Most public projection systems output point estimates, and public researchers suggest that the output is at the upper limit of predictive accuracy and hence should be treated as a near certainty. It is worth noting that baseball is not the only field which has grown uncomfortable with uncertainty. Whether the decision is to buy a certain stock, to hire a particular company to ship your goods across the country, or to choose our next president, many analysts make the mistake of assuming a binary, discrete outcome is the result of a binary, discrete process. Instead, we posit that once we start to see the world as the outcome of several continuous, probabilistic processes, we can manipulate those processes in ways that give us an extreme competitive advantage (in baseball at the very least). Boddy explains:

“While a tweeter very fairly pointed out that sites like Baseball Prospectus and FanGraphs do mention upside and downside, rarely is it quantitatively actually approached in these articles. Rarely if ever, I should say. It’s very frustrating because just making a note that Tesla’s stock is more volatile than Microsoft’s is not enough. That wouldn’t be enough for a financial planner to be like ‘Oh, ok that’s a very deep analysis.’ It’s also not all downside, which is how a lot of these tweets [go].”

In other words, before describing the optimal mix of risky and safe players on a major league roster, there need to be accurate and reliable methods by which to describe that risk. There are very significant drawbacks to assuming too much downside, so carefully tracking exactly how uncertain you are of your team’s future performance as a whole is imperative. Also, as is the case with all science, precisely measuring levels of uncertainty and tracking resulting performance over time is the most reliable way to gain a deeper understanding of what exactly is uncertain about player projections and perhaps to eliminate some of that ambiguity in the future.

Second, much of the public’s focus on risk tends to be restricted to “downside” or “bust potential.” However, when risk is defined as deviation from expected value, taking on risk can have a positive connotation. Boddy again explains:

“Because a lot of the media or a lot of the fans’ attention goes to Bryce Harper’s not worth 300 million dollars and 10-year contracts don’t ever work out inside baseball. And those are very simplistic views on the one side. But then on the other side it’s like ‘Oh Machado and Harper are very good players and if you discount it it’s still not even the biggest free-agent contract, A-Rod got paid more from an inflation-adjusted perspective.’ There’s a lot of other stuff that doesn’t go into it. So, it’s frustrating that a team can prefer Machado or Harper assuming they have them at the exact same mean projection based on their win curves and where they’re at in their spending and payroll and prospect pipeline and to not pay attention to that stuff is very frustrating.”

The question should not be “how do I minimize risk on my roster,” but “how much extra utility do I require in exchange for accepting more risk on my roster.”

Because of diminishing marginal utility, the cost of a loss is often higher than the benefit of a reward of the same magnitude (this is the basis of “risk aversion,” and is closely related to what is known as “loss aversion”). That means that in order to accept an equal probability of a win and a loss of the same magnitude, a risk-averse individual will require extra utility in addition to their expected winnings (what is called a “risk premium”). This is exactly how insurance exists as an industry. The annual premiums you pay to your health insurer are a reflection of your willingness (or in this case unwillingness) to assume the risk of being saddled with the hefty cost of a hospital bill. A corollary is that because general managers tend to be risk averse, you can acquire a player with a higher expected value for the exact same amount of money as a player with a lower expected value, as long as you assume additional risk (free money).
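To make the risk premium concrete, here is a minimal sketch of the calculation for a 50/50 gamble. The square-root utility function and the two payoffs are our own illustrative assumptions, not anything estimated by the projection system; any concave utility would show the same effect.

```python
import math

def utility(value):
    # Assumed concave utility (diminishing marginal utility); square root is illustrative only.
    return math.sqrt(value)

def inverse_utility(u):
    # Inverse of the square-root utility above.
    return u ** 2

def risk_premium(outcomes, probabilities):
    """Expected value minus the certain amount that gives the same expected utility."""
    expected_value = sum(p * x for x, p in zip(outcomes, probabilities))
    expected_utility = sum(p * utility(x) for x, p in zip(outcomes, probabilities))
    certainty_equivalent = inverse_utility(expected_utility)
    return expected_value - certainty_equivalent

# Hypothetical gamble: equal chance of a payoff of 64 or 144 (expected value 104).
premium = risk_premium([64.0, 144.0], [0.5, 0.5])
print(premium)  # prints 4.0
```

Under these assumptions the decision-maker would accept a guaranteed 100 in place of a gamble worth 104 on average, and the difference of 4 is the premium paid for certainty.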

There are, on the other hand, situations in which a general manager may wish to acquire players with a higher variance of outcomes, especially as this includes the possibility of great performance (i.e., players with significant upside potential) along with a greater possibility of a bust. In particular, the downside performance risk is limited by the expected performance of the best available replacement player, making a high-risk player a potentially beneficial gamble.

Going a step further, one might imagine the benefits of diversifying the risk. One might identify areas of correlated risk (a sharp increase in the number of league-wide home runs, changes in the ball causing blisters, disruption of high-tech player development tools, etc.) and build a team with several high-reward players who would only be affected by one or two of those eventualities. Doing so would minimize team-wide risk while maximizing expected returns.

Boddy also mentions win curves as a consideration for whether a manager is willing to take on excess risk. For example, suppose you are the manager of a returning playoff team projected for 90 wins and the second-place team in the division is projected at 85 wins. The value of wins 85-90 far outweighs the value of wins 91-100. In this case, it may not make sense to play a high-risk, high-reward player because the benefits of a breakout are minimal while the costs of a bust are substantial.
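The win-curve argument can be sketched numerically. The logistic playoff curve, its 87-win midpoint, and the two player outcome distributions below are all hypothetical choices made for illustration; the point is only that the same risky player can hurt a contender and help an underdog.

```python
import math

def playoff_probability(wins, threshold=87.0, scale=2.0):
    # Hypothetical logistic win curve: steep near the threshold, flat far from it.
    return 1.0 / (1.0 + math.exp(-(wins - threshold) / scale))

def expected_playoff_probability(baseline_wins, war_outcomes):
    # war_outcomes maps each possible WAR outcome to its probability.
    return sum(p * playoff_probability(baseline_wins + war)
               for war, p in war_outcomes.items())

# Two made-up players with the same mean (2 WAR) but different spread.
safe_player = {2.0: 1.0}
risky_player = {-2.0: 0.5, 6.0: 0.5}

# A contender sitting at 88 wins before the signing, and an underdog at 80.
contender_safe = expected_playoff_probability(88.0, safe_player)
contender_risky = expected_playoff_probability(88.0, risky_player)
underdog_safe = expected_playoff_probability(80.0, safe_player)
underdog_risky = expected_playoff_probability(80.0, risky_player)

print(round(contender_safe, 3), round(contender_risky, 3))
print(round(underdog_safe, 3), round(underdog_risky, 3))
```

With these assumed numbers the contender does better with the safe player and the underdog does better with the risky one, even though both players have the same expected WAR.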

The corollary to that is that a small-payroll team in a division with several high-payroll powerhouses (I don’t want to use the name of an actual team, so for the sake of argument we’ll just call them the “Rays”) may not make the playoffs unless two or three of their cheaper players “break out” in a significant fashion. In this case, these “Rays” will take their free money and hope one or two dice-rolls turn out in their favor. In the event they bust, it’s a low-cost scenario because they would not have made the playoffs anyway had they opted for the low-risk players.

Third, there is not enough commentary on how to make risk analysis practical. In particular, suppose you could precisely quantify every single free agent’s distribution of outcomes for the upcoming year. What is a general manager supposed to do with that information? On this point, Boddy says:

“I think probably more teams are on the Machado side, just because they tend to be risk-averse, which I think is a huge mistake. I think only some teams are actually using risk in the free agent market anyway, correctly. They’re weaponizing it by getting discounts and actually signing risky players and making it a strategy. A big one would be the Angels. They sign a lot of injured players and a lot of players who are coming off bad years and they sign a lot of them. The other organization who really started this would be the Dodgers. They sign a lot of players to what are, potentially, using a simplistic calculation, not worth the dollars they were given at the time of the signing and potentially at the time of evaluation after the contract expires, but that’s not the point. The point was to bolster depth for reasons that Doug Fearing has talked about and that they famously used to reach the World Series multiple times, because there was so much turnover by design on those rosters that they were able to wield that risk.”

Borrowing cues from Modern Portfolio Theory in finance, Boddy is arguing that there exists an optimal mix of high- and low-risk players that allows teams to take significant discounts on player contracts. For some teams like the Dodgers, it always seems like it’s a different player that steps up at the right time and helps the team win. From Justin Turner and Corey Seager in 2017 to Max Muncy and Walker Buehler in 2018 and now Joc Pederson and Kiké Hernandez in 2019, having just a small subset of your players perform at an elite level is enough to power a team to the playoffs year after year, even if you miss significant time from Clayton Kershaw and Seager. It’s this kind of thinking that allows teams to sustain a championship team across years even when their young core hits arbitration and demands more money.

In this article, we will attempt to address all three. First, we will demonstrate how uncertainty around a player’s future statistics can be inferred from past performance using Bayesian inference. Second, we will visualize not only the bust potential of each player, but their breakout potential as well by simulating 3,000 seasons according to the parameters previously estimated. And third, we will translate our projected intervals into a risk-adjusted Win value using the Sharpe ratio from quantitative finance. While there are entire literature bases focusing on each of these questions individually, we hope our unified treatment of them, which by no means is comprehensive, can shift public attention toward the question of risk in baseball.

Projecting Risk

The variables used in the projection were all taken from the FanGraphs seasonal leaderboard. These include “standard” (G, AB, 1B), “advanced” (K%, ISO, wRC), “batted ball” (GB%, Pull%, Soft%), “pitch type” (FB%, FBv), “pitch value” (wFB), and “plate discipline” (O-Swing%, Z-Contact%) statistics. All of these were collected over the three years preceding the year of prediction, and variables that were perfect linear combinations of others (like K%-BB%) were removed. Playing time, injury prediction, and Statcast data were not included in this application. In particular, plate appearances were estimated using a combination of previous major league performance and plate appearances from the previous three years. As a result, the model does not project accurately for recent call-ups. In an attempt to remedy this without manually and arbitrarily altering plate appearances, a “pro-rate” button was added which linearly transforms the plate appearance distribution to be centered around 600 plate appearances.
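As a rough illustration of the “pro-rate” idea, the sketch below rescales a set of simulated plate appearance draws so that their mean is 600. Rescaling is only one possible linear transform (a constant shift would be another), and the function name and the call-up distribution are hypothetical; this is not necessarily how the app implements the button.

```python
import random
import statistics

def pro_rate(pa_draws, target_mean=600.0):
    # Multiply every draw by one constant so the distribution is centered on target_mean.
    # Assumption: a pure rescaling, which also stretches the spread proportionally.
    scale = target_mean / statistics.fmean(pa_draws)
    return [pa * scale for pa in pa_draws]

# Hypothetical recent call-up whose simulated PA are centered near 150.
random.seed(0)
call_up_pa = [max(1.0, random.gauss(150, 30)) for _ in range(3000)]

pro_rated_pa = pro_rate(call_up_pa)
print(round(statistics.fmean(call_up_pa), 1), round(statistics.fmean(pro_rated_pa), 1))
```

Because every downstream statistic is simulated from plate appearances, re-centering this one distribution changes the whole projection without hand-editing any individual draw.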

Missing data were then imputed with K-Nearest Neighbors (KNN) imputation, using the 10 most comparable players from 2010-18. Because Bayesian inference is computationally strenuous and our computational resources were limited, principal components analysis (PCA) was used to obtain a lower-dimensional representation of the data, and only the first 10 principal components were used in our computation. A similar approach was used to prepare the data for pitchers; however, the size of the matrix made KNN imputation infeasible, so the Expectation Maximization (EM) algorithm was used to fill in missing values.
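For readers unfamiliar with KNN imputation, here is a minimal standard-library sketch of the idea: a missing value is replaced by the average of that statistic among the most similar players. The distance measure, the toy data, and the use of 2 neighbors instead of 10 are our own simplifications, and the PCA step is not shown. In practice the columns would be put on a common scale first.

```python
import math

def knn_impute(rows, k=10):
    """Fill None entries with the column mean over the k nearest rows."""
    def distance(a, b):
        # Root-mean-square difference over the columns both rows have observed.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum(shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that actually have column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Toy data with made-up (K%, ISO) pairs; the last player is missing ISO.
players = [[0.20, 0.150], [0.21, 0.160], [0.35, 0.250], [0.205, None]]
filled = knn_impute(players, k=2)
print(filled[3])
```

The last player's K% is closest to the first two players, so his missing ISO is filled with the average of their ISO values rather than being pulled toward the high-strikeout slugger.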

The mathematical calculations cannot be carried out exactly, so the variance of a player’s projected performance was estimated with a hierarchical Poisson regression fit in “Just Another Gibbs Sampler” (JAGS), called from R. Parameter estimation was performed using Markov Chain Monte Carlo (MCMC) simulation, and states were accepted using the Metropolis-Hastings (MH) algorithm. The estimated outcomes were then linearly combined using the Wins Above Replacement formula to create an estimate of WAR together with its uncertainty.
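The sketch below shows the flavor of Metropolis-Hastings sampling for a Poisson rate. It is deliberately much simpler than the model described above: a single player, a single (non-hierarchical) Poisson rate with a gamma prior, no regression covariates, and plain Python in place of JAGS. The home run counts, prior parameters, and step size are all hypothetical.

```python
import math
import random
import statistics

def log_posterior(log_rate, counts, prior_shape, prior_rate):
    # Unnormalized log posterior for the log of a Poisson rate.
    rate = math.exp(log_rate)
    log_prior = (prior_shape - 1.0) * log_rate - prior_rate * rate  # gamma prior on the rate
    log_jacobian = log_rate                                         # change of variables to log scale
    log_likelihood = sum(y * log_rate - rate for y in counts)       # Poisson, constants dropped
    return log_prior + log_jacobian + log_likelihood

def metropolis_hastings(counts, prior_shape=2.0, prior_rate=0.1,
                        draws=3000, burn_in=1000, step=0.2, seed=1):
    rng = random.Random(seed)
    current = math.log(statistics.fmean(counts))
    current_lp = log_posterior(current, counts, prior_shape, prior_rate)
    samples = []
    for i in range(draws + burn_in):
        # Symmetric random-walk proposal, so the MH ratio is just the posterior ratio.
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, counts, prior_shape, prior_rate)
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(math.exp(current))
    return samples

# Hypothetical home run totals from the three preceding seasons.
hr_counts = [24, 31, 28]
rate_samples = metropolis_hastings(hr_counts)
print(round(statistics.fmean(rate_samples), 1), round(statistics.stdev(rate_samples), 1))
```

This toy model happens to have an exact answer (a gamma posterior with mean 85/3.1, about 27.4), which makes it a convenient check on the sampler; the full hierarchical model has no such closed form, which is why simulation is needed there.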

To estimate each count statistic (BB, HBP, 1B, etc.), we began by estimating plate appearances for 3,000 different simulations and then used that distribution of values in the calculation for three true outcomes (TTO) and balls in play (BIP), and then used those values in the calculations for the next statistic and so on. The graphs for this structure were adapted from Bauer and Zimbalist (2014) and are pictured below.
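A minimal sketch of this chained structure follows: each simulated season draws plate appearances first, then splits them into balls in play and three true outcomes, then draws outcomes within each branch. The normal PA distribution and every probability below are made-up constants rather than estimates from the model, and the node names are only a small subset of the full graph.

```python
import random
import statistics

def binomial(rng, n, p):
    # Number of successes in n independent trials with success probability p.
    return sum(1 for _ in range(n) if rng.random() < p)

def simulate_season(rng):
    # Parent node: plate appearances (assumed roughly normal for illustration).
    pa = max(0, round(rng.gauss(600, 60)))
    # Child nodes: each PA is either a ball in play or a three-true-outcome event.
    bip = binomial(rng, pa, 0.65)
    tto = pa - bip
    # Grandchild nodes, drawn conditional on their own parent.
    hits_in_play = binomial(rng, bip, 0.30)
    home_runs = binomial(rng, tto, 0.12)
    return {"PA": pa, "BIP": bip, "TTO": tto, "H_BIP": hits_in_play, "HR": home_runs}

rng = random.Random(42)
seasons = [simulate_season(rng) for _ in range(3000)]

hits = [s["H_BIP"] for s in seasons]
print(round(statistics.fmean(hits), 1), round(statistics.stdev(hits), 1))
```

Because every child is drawn from its parent's simulated value rather than from a fixed total, uncertainty in plate appearances flows down the graph into every statistic below it.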

I won’t go through the entire model specification (I will leave it in the appendix for the more math-notation inclined readers), but I will mention one important feature. Not only did we use plate appearances to estimate opportunities to put a ball in play, but we looked at the effect of plate appearances on the probability that a ball is put in play for each plate appearance. We repeated this process for every “parent” node (appearing above another outcome in the graph) on the estimation of the “child” node (directly below and connected to its “parent”). This is what separates this projection system from the publicly available ones and allows us to correlate risk across statistics to get an accurate representation of uncertainty.

Existing projection systems treat each performance statistic as if it were independent of the others, which can result in significantly underestimating the uncertainty in their projections. Nate Silver details the importance of this approach in his article Why FiveThirtyEight Gave Trump A Better Chance Than Almost Anyone Else:

“The single most important reason that our model gave Trump a better chance than others is because of our assumption that polling errors are correlated. No matter how many polls you have in a state, it’s often the case that all or most of them miss in the same direction. Furthermore, if the polls miss in one direction in one state, they often also miss in the same direction in other states, especially if those states are similar demographically.”

To give an example, there is some risk that a player will hit many fewer doubles than expected. This might happen because they sustain an injury and therefore have fewer plate appearances, or it might happen because more of their line drives in the gap end up going over the fence as home runs. So the risk associated with the projected number of doubles cannot be understood in isolation from the other statistics.
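The effect of ignoring this dependence can be shown with a small simulation. Doubles and home runs below share a common plate appearance total; we then compare the spread of their sum when the two are kept paired with the spread when the pairing is broken by shuffling, which mimics treating them as independent. All rates and the PA distribution are hypothetical, and only the PA-driven correlation is modeled here.

```python
import random
import statistics

def binomial(rng, n, p):
    # Number of successes in n independent trials with success probability p.
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(7)
doubles, homers = [], []
for _ in range(3000):
    pa = max(0, round(rng.gauss(600, 80)))   # shared parent: plate appearances
    doubles.append(binomial(rng, pa, 0.05))  # assumed doubles rate per PA
    homers.append(binomial(rng, pa, 0.05))   # assumed home run rate per PA

# Joint simulation: both statistics come from the same simulated season.
joint_total = [d + h for d, h in zip(doubles, homers)]

# "Independent" version: same marginal distributions, pairing destroyed.
shuffled_homers = homers[:]
rng.shuffle(shuffled_homers)
independent_total = [d + h for d, h in zip(doubles, shuffled_homers)]

print(round(statistics.pstdev(joint_total), 2),
      round(statistics.pstdev(independent_total), 2))
```

Each statistic looks identical on its own in both versions, but the combined total is noticeably more variable when the shared dependence on playing time is preserved, which is the sense in which independent projections understate uncertainty.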

Valuing Risk

There exist many different formulas from the field of finance with which to adjust the value of an investment according to its riskiness. For this app in particular, we chose the ex-ante Sharpe ratio as the method by which to evaluate risk, due to its interpretability. The Sharpe ratio is simply the calculation of excess return per unit of risk taken on by the investor. In other words, the higher the Sharpe ratio, the more you are getting paid to accept additional risk. In order to calculate this, one needs to know the “risk-free rate of return,” or the amount of money one can expect to make from the market without assuming any risk, and the standard deviation of our investment. We should note that the Sharpe ratio is commonly used on a group of assets; however, the question of how risk is correlated across players requires much more extensive treatment than is presented here. The formula for calculating the Sharpe ratio is as follows:
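Written in the standard ex-ante form, with $E[R]$ the expected return of the investment, $R_f$ the risk-free rate of return, and $\sigma_R$ the standard deviation of the return (the symbol names are our own choice):

```latex
\[
  S \;=\; \frac{E[R] - R_f}{\sigma_R}
\]
```

In the baseball setting, $R$ would be a player's simulated WAR, so the numerator is expected WAR in excess of what can be had without risk and the denominator is the spread of the simulated seasons.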

For the purposes of this application, we use our calculated league average of 1.3 Wins Above Replacement as the risk-free rate of return. This is not a trivial statement, because technically 0 Wins Above Replacement is truly “replacement level”. However, we are not using “replacement level” and “risk free” as synonyms. We are implicitly assuming that a major league team can, without assuming any risk in outcomes, acquire a player capable of producing at 1.3 WAR (and if not, you are comparing production to the best available minor league player, who is often above replacement level).
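As a sketch of how this could be computed from simulated seasons, the function below applies the Sharpe ratio with 1.3 WAR as the risk-free rate. The two players and their normal WAR distributions are invented for illustration; the app's actual code may differ.

```python
import random
import statistics

RISK_FREE_WAR = 1.3  # league-average WAR, used in this article as the risk-free rate

def sharpe_ratio(war_draws, risk_free=RISK_FREE_WAR):
    # Expected WAR in excess of the risk-free level, per unit of standard deviation.
    excess = statistics.fmean(war_draws) - risk_free
    return excess / statistics.stdev(war_draws)

# Two hypothetical players with the same expected WAR but different spread.
rng = random.Random(3)
steady_player = [rng.gauss(3.0, 0.8) for _ in range(3000)]
volatile_player = [rng.gauss(3.0, 1.6) for _ in range(3000)]

print(round(sharpe_ratio(steady_player), 2), round(sharpe_ratio(volatile_player), 2))
```

With equal expected WAR, the steadier player earns roughly twice the ratio here, since the same excess over 1.3 WAR comes with half the standard deviation.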

Settling the Harper vs. Machado Debate

Bryce Harper’s and Manny Machado’s projected batting WAR distributions are plotted below.

The first thing we notice is that Harper’s expected batting WAR is much higher according to our model than Machado’s. We can call that the “risk premium” of choosing Harper. That is, a manager who chooses Machado over Harper is taking a 2 batting WAR hit in expected value to get what is commonly perceived as the lower-variance player.

The second, and more interesting, fact we can see from the graphs below is that Machado’s batting WAR distribution has a much higher standard deviation than Harper’s. Much of the discussion of variance in WAR centers on defensive value, so we might expect the batting distributions to be closer than many suggested over the offseason. However, this still stands in stark contrast to the public narrative of Harper as a much riskier offensive player. Here, the benefits of precisely defining and quantifying risk are clear.

The third fact we can extract from this is the shape of Harper’s distribution. It looks as though there is a sharp peak around the mean at 3.16 batting WAR, while the tails are incredibly sharp on the edges. This feature describes the “kurtosis” of the curve, or how extreme the values on the tails of a curve are in comparison to the average. In comparison, you can see Machado’s batting WAR curve as having a much rounder peak. This suggests that the values on the edge of the curve are not as extreme in relation to his expected value.
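Kurtosis can be computed directly from simulated draws. The sketch below uses the common “excess kurtosis” convention, under which a normal distribution scores about 0, and compares normal draws with draws from a more sharply peaked, heavier-tailed Laplace distribution. Both samples are synthetic and stand in for no particular player.

```python
import random
import statistics

def excess_kurtosis(values):
    # Fourth standardized moment minus 3, so a normal distribution is near 0.
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    fourth_moment = statistics.fmean([(v - mean) ** 4 for v in values])
    return fourth_moment / variance ** 2 - 3.0

rng = random.Random(11)
# Rounded peak, thin tails.
normal_draws = [rng.gauss(0.0, 1.0) for _ in range(3000)]
# Sharp peak, heavy tails: an exponential magnitude with a random sign (Laplace).
laplace_draws = [rng.expovariate(1.0) * rng.choice([-1.0, 1.0]) for _ in range(3000)]

print(round(excess_kurtosis(normal_draws), 2), round(excess_kurtosis(laplace_draws), 2))
```

A higher value means more of the variance comes from rare, extreme seasons instead of from routine year-to-year wobble around the mean.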

As mentioned above, when incorporating fielding into the WAR calculation, we would expect Harper’s left tail (downside) to be slightly larger. However, just looking at batting WAR, even with the larger kurtosis, the probability that Harper outperforms Machado according to our model is still around 90%.
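A probability like this can be read straight off two sets of simulated seasons by counting how often one player's draw beats the other's. The sketch below assumes the two players' draws are independent of each other and uses two invented normal distributions, so its output is not the Harper and Machado figure quoted above.

```python
import random

def prob_a_beats_b(draws_a, draws_b):
    # Share of paired simulated seasons in which player A out-produces player B.
    # Assumes equally many draws per player, with no dependence between players.
    wins = sum(1 for a, b in zip(draws_a, draws_b) if a > b)
    return wins / len(draws_a)

rng = random.Random(5)
# Hypothetical batting WAR simulations for two made-up players.
player_a = [rng.gauss(3.0, 1.0) for _ in range(3000)]
player_b = [rng.gauss(2.0, 1.0) for _ in range(3000)]

print(round(prob_a_beats_b(player_a, player_b), 2))
```

Note that a one-win gap in expected WAR translates to only about a three-in-four chance of finishing ahead under these assumed spreads, which is why the full distributions matter more than the point estimates.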

We can now answer the question of which hitter we would prefer to have on our team if we are a risk-averse manager.

We can see that even with the larger tails, the risk-adjusted value of Harper is still much larger than that of Machado. This means that purely in terms of offense, it might make sense for a risk-averse manager to consider acquiring Harper over Machado, because they are making 1.5 Wins more per unit of risk than they would otherwise.

Conclusion

This article and application are an attempt to precisely infer the uncertainty surrounding future player performance using data from the past three years and to visualize it in an insightful and easily interpretable fashion for all people interested in baseball, from the most Bayes-literate to those just starting their probability journey. We argue that the Sharpe ratio, while only one way of translating probability distributions to a single risk-adjusted number, is an effective method for quickly comparing players across all levels of the risk spectrum. Finally, we made our best case for the practicality of weaponizing risk in the player pool. We believe this application can serve as a valuable proof-of-concept for more extensive Bayesian front office tools.

Appendix / Full Model Specification:





2 Comments
Tommy
4 years ago

Great article

danielcalzada
4 years ago

Interesting article, thank you. It does seem that some of the PA predictions are skewed a little high. For example, Kris Bryant is showing approximately the same likelihood to have 675 PA as he is to have 800 PA (more than the single-season record), even though he only had 457 PA last year. Is this an artifact of constraining the PA total to follow a Poisson distribution (which is not exactly what this type of event would follow) or something else? Or maybe I’m reading too much into specific probabilities?