## Analyzing Underlying Factors Impacting Tickets Sold for Major League Baseball Games

I. Introduction

In 2017, Major League Baseball exceeded 10 billion dollars in total revenue for the first time. Ticket sales were a major component, making up 29.84 percent of this revenue (Statista.com). Because fans continue to spend money once inside the stadium, 29.84 percent is merely a lower bound on the revenue that attendance generates. For example, the average 2017 ticket price was 31 dollars; however, once inside the stadium, fans spent an average of 16 additional dollars on food (Statista.com).

II. Data

The data for this project are in an unbalanced panel format and contain 60,705 observations from 35 teams spanning from 1992 to 2017. Other than the 2017 season data, which I collected myself from baseballreference.com, the data from 1990 to 2016 were scraped from baseballreference.com by Troy Hepper, a consultant at Morgan Franklin Consulting, and shared on his github.com page.

Descriptive statistics of my game-by-game data are displayed in Table 1. The dependent variable is the percentage of tickets sold relative to a stadium’s capacity (PERCENTSOLD). PERCENTSOLD ranges widely, from a little under 2 percent to over 150 percent, with a mean of around 66 percent. PERCENTSOLD is sometimes greater than 100 percent because ticket sales for certain important games exceed stadium capacity; however, only 76 out of 60,705 observations exceed 110 percent, and these outliers have almost no effect on the estimated coefficients in the models.

The explanatory variables in this model are designed to control for the time effects of when a baseball game was played, the quality of the home team, and the quality of the opponent. To control for the time that a game was played, indicators for the month and year are included in the model. To control for day of the week and whether or not the game was played at night or during the day, four dummy variables were created indicating whether or not a game was a night game during the week (NIGHTWEEKDAY), a day game during the week (DAYWEEKDAY), a night game during the weekend (NIGHTWEEKEND), or a day game during the weekend (DAYWEEKEND). Due to the immense popularity of the first game of the season, an indicator variable for Opening Day is also used.

The quality of the home team is assessed using both information on payroll and playoff chances. Better teams have better players and since players are paid based on skill and production, better teams consistently have higher payrolls. The payroll variable created here is the percentage deviation from league average payroll (HOMEDEVIATION). The minimum percentage deviation is a little under 20 percent of the league average while the maximum is over 280 percent of the league average. A standard deviation of a little under 40 percentage points shows the consistent variability of team payroll throughout the data. The playoff chances of a team are weighted by the number of games back or up they are on the guaranteed divisional playoff spot.

The quality of the visiting team is assessed using information on payroll and the opponent’s relationship with the home team. Fans want to come to the park to see good teams play so more attractive visiting teams will consistently have higher payrolls. The visiting team’s payroll variable (AWAYDEVIATION) is constructed the same way as the home team’s payroll discussed above. Because fans want to see their teams make the playoffs and the best way to do this is by beating the teams in your division, an indicator variable to assess the draw of a divisional game is used as well.

III. Regression Specification and Results

To better understand the relationship between the explanatory variables and the long-run demand for tickets, the data were analyzed using three panel data estimation techniques: one-way fixed effects, two-way fixed effects, and random effects models. For these data, a fixed effects model is the better fit because the unobserved metric of fan loyalty, which is constant over time, correlates very strongly with the two explanatory variables that control for payroll. Fan loyalty can be treated as constant over time because some teams, like the Chicago Cubs, are deeply ingrained in the culture of their cities and their fan bases remain loyal no matter what. On the other hand, the fan bases of certain teams, like the Oakland Athletics, consistently disregard their teams and never become engaged. Because loyal fans spend more money and demand higher quality teams, owners of these teams must spend more on players. For this reason, payroll is highly correlated with the omitted variable, fan loyalty, making the use of fixed effects essential for unbiased coefficient estimates.

The results of the three separate panel estimation techniques are recorded in Table 2; however, this paper will focus on the results of the following two-way fixed effects model:
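As a sketch only, a two-way fixed effects specification consistent with the variables described in the Data section could be written as follows. OPENINGDAY, GAMESBACK, DIVISIONAL, and MONTH are placeholder names for variables the text describes but does not name, and the exact functional form that was estimated may differ.

$$
\begin{aligned}
\text{PERCENTSOLD}_{TSG} ={}& \beta_0 + \beta_1\,\text{OPENINGDAY}_{TSG} + \beta_2\,\text{NIGHTWEEKDAY}_{TSG} + \beta_3\,\text{DAYWEEKDAY}_{TSG} + \beta_4\,\text{DAYWEEKEND}_{TSG} \\
&+ \beta_5\,\text{HOMEDEVIATION}_{TSG} + \beta_6\,\text{GAMESBACK}_{TSG} + \beta_7\,\text{AWAYDEVIATION}_{TSG} + \beta_8\,\text{DIVISIONAL}_{TSG} \\
&+ \sum_{m \neq \text{April}} \gamma_m\,\text{MONTH}^{m}_{TSG} + \alpha_T + \lambda_S + \varepsilon_{TSG}
\end{aligned}
$$

Under this reading, $\alpha_T$ is the team fixed effect that absorbs time-invariant fan loyalty, $\lambda_S$ is the season (year) fixed effect, and April and NIGHTWEEKEND are the omitted base categories.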

In this model, T represents the team, S represents the season, and G represents the gth home game for each season. An interesting result is that, except in the case of DAYWEEKEND, the fixed and random effects estimations yield the same sign and approximately the same magnitude for each coefficient.

In the two-way fixed effects model, all variables except the time fixed effect for 1996 are significant at any standard level. The largest coefficient is that of the Opening Day dummy, which is associated with an estimated 38.7 percentage point increase in the percentage of tickets sold. Interestingly, the year dummy variable shows an approximate 11 percentage point drop in PERCENTSOLD in 1995 in comparison to 1994. This drop is most likely due to the disdain fans developed towards baseball following the players’ strike of 1994. Another interesting league-wide trend is the approximate 4 percentage point drop in PERCENTSOLD from 2007 to 2009 during the Great Recession. For the average sized stadium, a drop of this size would result in a decrease of a little over 1,700 fans per game. According to statista.com, the average ticket price in 2009 was 26.6 dollars. Thus, losing 1,700 fans paying 26.6 dollars per game over the course of 81 home games would cost around 3.7 million dollars. According to the Hardball Times, league average revenue in 2007 was 171 million dollars, so for the average team, a 3.7 million dollar drop would amount to a decline of around two percent of total revenue from ticket sales alone. This is economically significant for a profit-maximizing firm like a baseball team.
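The recession arithmetic above can be checked in a few lines. This is only a back-of-the-envelope sketch that uses the figures quoted in the text; the implied stadium capacity is derived here and is not a number reported in the article.

```python
# Figures quoted in the text.
fans_lost_per_game = 1_700        # "a little over 1,700 fans per game"
avg_ticket_price_2009 = 26.6      # dollars, per statista.com as cited
home_games = 81
league_avg_revenue_2007 = 171e6   # dollars, per the Hardball Times as cited

# Lost ticket revenue over a full home schedule.
lost_revenue = fans_lost_per_game * avg_ticket_price_2009 * home_games
share_of_revenue = lost_revenue / league_avg_revenue_2007

# Derived, not reported: the capacity at which a 4 point drop equals 1,700 fans.
implied_capacity = fans_lost_per_game / 0.04

print(f"Lost ticket revenue: ${lost_revenue:,.0f}")
print(f"Share of 2007 average team revenue: {share_of_revenue:.1%}")
print(f"Implied average capacity: {implied_capacity:,.0f} seats")
```

The loss works out to roughly 3.7 million dollars, or a little over two percent of the 171 million dollar average.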

Using April as the base case, the coefficients of all other month dummies are positive. This indicates that the first month of the season is the weakest month for PERCENTSOLD. Notably, July and August lead the way with an estimated 13 to 14 percentage point increase in PERCENTSOLD in comparison to April. Economically, maximizing games played in July and August while scheduling off days during April would increase revenue; however, if three more games were scheduled in July and August, the additional fans paying the 2017 average price of 31 dollars per ticket would generate a little over 500,000 dollars in added revenue, an economically insignificant increase of about 0.2 percent.

The indicator variables designed to control for game time and game placement during the week also shed light on what type of games maximize PERCENTSOLD. In the model, NIGHTWEEKEND was left out as the base case and the coefficients of the other three dummies were negative. This tells us that weekend games played at night are the most popular. DAYWEEKEND has the smallest effect, decreasing PERCENTSOLD by around 1 percentage point, while NIGHTWEEKDAY has the largest, decreasing PERCENTSOLD by 14 percentage points.

The coefficient of HOMEDEVIATION implies that a 50 percentage point increase in a team’s payroll deviation is associated with a 14 percentage point increase in PERCENTSOLD. The other measure of home team quality, games back from the playoffs, predicts that a team with a five-game lead in its division will see an approximate 2.5 percentage point increase in PERCENTSOLD, while a team with a ten-game deficit will see a 5 percentage point decrease. This variable is particularly effective because on Opening Day every team is 0 games back, so it has no effect, but as the season continues and the games back variable moves further from zero, its growing effect is naturally weighted in the model.

AWAYDEVIATION has a smaller coefficient than HOMEDEVIATION, but it is also positive and statistically significant. The effect of the opponent also shows up in the divisional game dummy, which tells us that if an opponent is in a team’s division, the percentage of tickets sold increases by a little under 1 percentage point. Although the divisional dummy is statistically significant, even if MLB had scheduled 40 more games against divisional opponents for each team in 2017, this change would have added under 500,000 dollars in revenue and increased total revenue by less than 0.2 percent, which is an economically insignificant change.
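The two scheduling counterfactuals discussed above (three extra July or August games, and 40 extra divisional games) rest on the same arithmetic, sketched below. Three inputs are assumptions rather than figures from the article: an average capacity of 42,500 seats (implied by a 4 point drop equalling about 1,700 fans), a per-team revenue of 10 billion dollars split evenly over 30 teams, and 0.9 as a stand-in for "a little under 1" percentage point.

```python
AVG_CAPACITY = 42_500           # assumption: implied by "4 points is about 1,700 fans"
TICKET_PRICE_2017 = 31.0        # dollars, from the introduction
TEAM_REVENUE_2017 = 10e9 / 30   # assumption: league revenue split evenly over 30 teams


def extra_ticket_revenue(pp_gain, games):
    """Added ticket revenue when `games` games each gain `pp_gain` percentage points."""
    extra_fans_per_game = AVG_CAPACITY * pp_gain / 100
    return extra_fans_per_game * games * TICKET_PRICE_2017


# Midpoint of the 13 to 14 point July/August effect, applied to three games.
summer = extra_ticket_revenue(pp_gain=13.5, games=3)
# Assumed 0.9 point divisional effect, applied to 40 games.
divisional = extra_ticket_revenue(pp_gain=0.9, games=40)

for label, value in [("summer shift", summer), ("divisional shift", divisional)]:
    share = value / TEAM_REVENUE_2017
    print(f"{label}: ${value:,.0f} ({share:.2%} of team revenue)")
```

Under these assumptions both changes land in the neighborhood of half a million dollars, which is why neither is economically meaningful despite the statistically significant coefficients.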

Overall, the data seem to tell the story that one would expect; however, it is always nice to attempt to quantify these relationships. For further information, the author can be contacted at marinojc@kenyon.edu.

## Optimizing Launch Angles Using Simulation and K-Nearest Neighbors

Although posted by Jack Marino, this was a truly collaborative effort by Grant Carr, Justin Clark, Jake Fisher, Jack Marino, and Noah Nash.

The introduction of Statcast technology in 2015 has allowed analytics departments around the MLB to quantify aspects of the game that until the last few years were impossible to measure. One of the previously unanswerable questions that Statcast has allowed us to examine is the optimal launch angle for each hitter in the MLB. If the free agent market this winter has told us anything, it is that teams are now becoming more sabermetrically savvy with their checkbooks and are understanding the value a player adds to their roster in a far more analytical sense. For example, Mike Moustakas may have hit 38 bombs last year, but the fact of the matter is that he is a two WAR player with a below average glove and minimal range. Moustakas’s late signing for just \$5.5 million plus incentives after declining a \$17.4 million qualifying offer indicates that the market seems to have a much better understanding of his value than it has in years past. Since optimizing launch angle is defined as adding the greatest possible value per at bat, finding the right launch angle is undoubtedly a smart decision for a player trying to put himself in the best possible position to break the bank during free agency.

What makes this optimization problem so difficult is that simply knowing the launch angle of a ball in play very rarely tells us anything definitive about the outcome of that ball. The reason is that batted ball outcomes are extremely dependent on other variables, such as exit velocity and the positioning of the opposing team’s defense. For example, a 25° launch angle hit above 100 mph is in most cases a home run; however, a ball hit at that same angle at 80 mph is almost surely a flyout. The following visuals help illustrate this relationship, although much of it also makes intuitive sense.

Never shying away from a challenge, we decided to dive into this problem and see what sort of algorithm we could develop to take a hitter’s batted ball data in 2017 and calculate an optimal launch angle for that hitter in 2018. The data we used for this project are from baseballsavant.com.

Since calculating an optimal launch angle will most likely result in an adjustment of a player’s swing, it is important to understand the possible repercussions of that change. For example, to increase launch angle, one definitely will need to adjust swing path to a more uppercut swing, which could in theory lead to a higher strikeout rate. For this reason, before recommending any changes, we wanted to make sure we understood the relationship between launch angle and strikeout rate. Using players with over 100 at bats during the 2017 season, we constructed the following plot and built a linear regression model trying to predict strikeout rate from launch angle. What we found was an R-squared value of approximately .05, meaning that only 5% of the variability in strikeout rate was accounted for by launch angle.
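For readers who want to reproduce this kind of check, here is a minimal sketch of a one-variable least-squares fit and its R-squared. The data points are hypothetical and exist only to make the example runnable; they are not the 2017 player data used above.

```python
def r_squared(x, y):
    """R-squared of a one-variable least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variance in y left unexplained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical (average launch angle, strikeout rate) pairs, for illustration only.
launch_angle = [8.0, 10.5, 12.0, 14.2, 16.0, 18.5]
strikeout_rate = [0.21, 0.15, 0.24, 0.18, 0.26, 0.19]
print(f"R-squared: {r_squared(launch_angle, strikeout_rate):.3f}")
```

An R-squared near zero, like the .05 reported above, means the fitted line explains very little of the spread in strikeout rate.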

Following this conclusion, it seemed fair to move on and continue our analysis under the assumption that any tweaks we make to a player’s swing will not cause a drastic change in strikeout rate or quantity of balls put in play.

We think at this point, major strides have already been made in understanding launch angle, especially the possibly unexpected result of our linear model above. However, the problem still has not been solved and our methodology for solving it has not yet been revealed!

The method we came up with has four steps. First, we use simulation to increase the sample size of exit velocities, drawing from a distribution fit to the batted ball data of our hitter and his calculated comparable players. Second, we pair these simulated exit velocities with a fixed launch angle. Third, we use k-nearest neighbors on the batted balls of our hitter and his comparable players to get a likely outcome for each simulated batted ball. Finally, we see which launch angle maximizes the hitter’s expected weighted On Base Average (wOBA) given the simulated distribution of exit velocities and the k-nearest neighbor outcomes.

That may be a lot to throw at a reader all at once, so let’s walk through a case study using San Francisco Giants outfielder Andrew McCutchen. In 2017, McCutchen had an average exit velocity of 88.4 mph and an average launch angle of 14.2°. Optimal launch angle is extremely player specific, so the first thing we have to do is gain a complete understanding of McCutchen’s batted ball profile. The chart below does an excellent job of helping us do exactly this. For example, it appears McCutchen never surpassed an exit velocity of 110 mph in 2017, had a pocket of home runs between 23° and 30° and between 95 and 110 mph, had a band of doubles at similar exit velocities but lower launch angles, and had a group of singles at low launch angles across an even wider range of exit velocities. This is a great plot for understanding comparable players, but there are far too many players to compare on a plot-by-plot basis.

To combat this problem, we first narrowed down the field to players who took over 100 at-bats during the 2017 season and then used Principal Component Analysis (PCA) to narrow down the field of comparable players even further. For the variables in our PCA, we chose many different metrics using the Baseball Reference Play Index, including home runs, triples, doubles, and singles per at-bat, fly ball rate, ground ball rate, WPA, RC, and oWAR, among others. After completing the analysis, we kept the first four principal components, which accounted for 76% of the variability in the original variables. For each player, we squared the differences between his first four principal component scores and McCutchen’s, summed them, and created a list of the 20 players with the smallest summed squared distances. From here, we removed players who did not bat right-handed to account for the lefty/righty splits a right-handed batter like McCutchen may have. Then we went plot by plot, trying to match the pattern of hits and exit velocities to McCutchen’s plot above. After this qualitative piece of our analysis was complete, we settled on Adrian Beltre, Alex Bregman, Brian Dozier, and Eugenio Suarez as our four comparable players. Their distributions of hits, graphed with McCutchen’s, can be found below and are remarkably similar.
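The distance step of this screen can be sketched as follows. The sketch assumes the principal component scores have already been computed; the player labels and scores are hypothetical placeholders, and the function names are ours rather than anything from the original analysis.

```python
def squared_distance(a, b):
    """Sum of squared differences between two players' component scores."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def closest_players(target, scores, top_n=20):
    """Names of the `top_n` players whose scores are nearest to the target's."""
    others = [name for name in scores if name != target]
    others.sort(key=lambda name: squared_distance(scores[target], scores[name]))
    return others[:top_n]


# Hypothetical first-four principal component scores, for illustration only.
pc_scores = {
    "McCutchen": [1.2, -0.4, 0.3, 0.1],
    "Player A": [1.1, -0.5, 0.2, 0.0],
    "Player B": [-2.0, 1.5, 0.9, -0.7],
    "Player C": [1.0, 0.1, 0.4, 0.3],
}
print(closest_players("McCutchen", pc_scores, top_n=2))
```

With real data, `top_n=20` would give the shortlist that was then filtered by handedness and compared plot by plot.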

When we considered how to create this optimal launch angle, we knew we wanted to somehow incorporate different areas of the strike zone, as the optimal launch angle on a ball up and in is likely not the same as on a ball down and away. To address this, we divided the strike zone into 9 sections and created the following heat maps for McCutchen alone and for McCutchen together with his comparable players. To understand these heat maps, it is important to note that the first number in each zone is the average launch angle on balls in play for that player or group of players in that zone during the 2017 season, while the second number is the average exit velocity on balls in that zone.

Looking at McCutchen’s heat map, we saw clear variation in exit velocity, launch angle, and offensive outcome (in this case wOBA) by zone, which confirmed our belief that we would have to take zone-specific differences into account. We decided to find the optimal angle for each of our nine zones, planning eventually to combine those angles into a single, optimal number unique to McCutchen. Looking at zone-specific data for McCutchen and his comparable players, we ran into the same challenge that motivated finding those comps in the first place: lack of data. There was simply not enough data on launch angle, exit velocity, and wOBA between McCutchen and his comps to perform the kind of verifiable analysis that comes with a larger sample size.

To overcome this challenge, we turned to simulation. Specifically, we searched for a distribution that would allow us to generate reasonable launch velocities for a given zone. With this distribution, we could test possible combinations of launch angle and exit velocity to explore which zone-specific angles might be optimal. Looking at a histogram of launch velocities for McCutchen and his comps, we observed a pronounced left skew across all nine zones. With this trend in mind, the Weibull distribution made sense for its flexibility in modeling real-life processes that feature multiple varieties of skew. Implementing maximum likelihood estimation on the zone-by-zone data used to generate the heat maps gave us the parameters for nine Weibull distributions that closely characterized the trends in exit velocity we observed for each zone. For example, the fit of our Weibull distribution in zone 1 shows the clear left skew, but also the excellent job of the flexible Weibull to fit the data.
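A minimal sketch of this step is shown below, assuming a two-parameter Weibull (shape and scale) fit by maximum likelihood. The article does not say which parameterization or software was used, so the helper `fit_weibull` is illustrative, and the "observed" velocities are themselves simulated stand-ins rather than real batted ball data.

```python
import math
import random


def fit_weibull(samples, lo=0.1, hi=100.0, iterations=100):
    """Maximum likelihood fit of a two-parameter Weibull; returns (shape, scale)."""
    top = max(samples)
    scaled = [x / top for x in samples]   # rescale so large powers stay in range
    mean_log = sum(math.log(x) for x in scaled) / len(scaled)

    def score(k):
        # Likelihood equation for the shape parameter; its root is the MLE.
        powers = [x ** k for x in scaled]
        weighted = sum(p * math.log(x) for p, x in zip(powers, scaled))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iterations):           # bisection: score is increasing in k
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2
    scale = top * (sum(x ** shape for x in scaled) / len(scaled)) ** (1 / shape)
    return shape, scale


random.seed(0)
# Hypothetical stand-in for one zone's exit velocities in mph.
# random.weibullvariate takes (scale, shape) in that order.
observed = [random.weibullvariate(92.0, 12.0) for _ in range(2000)]
shape, scale = fit_weibull(observed)

# Draw as many new exit velocities as needed from the fitted distribution.
simulated = [random.weibullvariate(scale, shape) for _ in range(1000)]
print(f"shape={shape:.2f}, scale={scale:.2f}")
```

A large shape parameter is what produces the left skew described above: most simulated velocities cluster near the scale value, with a longer tail of weakly hit balls.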

In all, this process allowed us to generate any number of exit velocities for each zone that might reasonably approximate the kind of speeds we see on actual batted balls, leaving us with finding a range of launch angles that could be optimal for a given zone. While looking at the distribution of launch angles for McCutchen and his comparable players, we decided to consider only the launch angles between the 25th and 75th percentile for each zone. This gave us a number of discrete angles to test in conjunction with each zone’s launch velocity distribution for optimal offensive performance.

For each possible angle within a given zone, we generated 1000 exit velocities from that zone’s respective Weibull launch velocity distribution. Next, we used k-nearest neighbors to assign a wOBA value to every launch angle, exit velocity pair by examining similar pairs of launch angles and exit velocity and their associated wOBA within the McCutchen and comps dataset. This procedure gave us 1000 wOBA values for every launch angle that might be observed in a particular zone. By taking the mean of those wOBA values for each possible launch angle, we gained a more complete sense of what kind of offensive performance might be associated, on average, with the various launch angles for each zone. To identify which angle in each zone was optimal, we simply chose the launch angle with the highest associated wOBA.
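The search within a single zone can be sketched as below. Several details are assumptions because the article does not state them: the value of k, the use of unscaled Euclidean distance in (degrees, mph), and averaging the neighbors' wOBA values. The batted ball data come from a toy rule invented for this example.

```python
import math
import random


def knn_woba(angle, velocity, batted_balls, k=5):
    """Mean wOBA value of the k observed batted balls nearest in (angle, velocity)."""
    nearest = sorted(
        batted_balls,
        key=lambda ball: math.hypot(ball[0] - angle, ball[1] - velocity),
    )[:k]
    return sum(ball[2] for ball in nearest) / len(nearest)


def best_angle(candidate_angles, velocities, batted_balls, k=5):
    """Return the angle with the highest mean simulated wOBA, plus all the means."""
    expected = {}
    for angle in candidate_angles:
        values = [knn_woba(angle, v, batted_balls, k) for v in velocities]
        expected[angle] = sum(values) / len(values)
    return max(expected, key=expected.get), expected


random.seed(1)
# Toy data: (launch angle, exit velocity, wOBA value) from an invented rule.
batted_balls = []
for _ in range(300):
    a = random.uniform(-10, 45)
    v = random.weibullvariate(92.0, 12.0)
    if 20 <= a <= 32 and v >= 95:
        value = 2.0      # stands in for a home run
    elif 5 <= a <= 20:
        value = 0.9      # stands in for a single
    else:
        value = 0.0      # stands in for an out
    batted_balls.append((a, v, value))

# 1000 simulated exit velocities, as in the text, tested against a few angles.
velocities = [random.weibullvariate(92.0, 12.0) for _ in range(1000)]
angle, expected = best_angle(range(8, 21, 3), velocities, batted_balls)
print(f"best angle in this toy zone: {angle} degrees")
```

Because angle and velocity are on different scales, a real implementation would likely want to standardize both before measuring distance; the sketch leaves them raw to stay short.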

Now that we had our nine optimal launch angles in each of the nine zones, we wanted to come up with a way to get to one optimal launch angle. When coming up with this angle, we knew it would be important to incorporate how often a player faces pitches in each zone as well as some measure of his talent level in each zone. To incorporate these two factors into our analysis, we started in zone one and took the product of the proportion of pitches McCutchen saw in zone one and his contact percentage in zone one, then repeated this process for the other eight zones and took the proportion of each of these products to create linear weights. Once we had our linear weights, we simply multiplied each zone’s weight with our previously calculated optimal launch angle in that zone and took the sum of these products. A visualization of this process can be seen below:
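In code, the weighting step might look like the following sketch. All nine-zone inputs are hypothetical and chosen only to make the example runnable; they are not McCutchen’s actual pitch shares, contact rates, or zone-level optimal angles.

```python
def combine_zone_angles(pitch_share, contact_rate, optimal_angle):
    """Weight each zone's optimal angle by (share of pitches seen) x (contact rate)."""
    products = [p * c for p, c in zip(pitch_share, contact_rate)]
    total = sum(products)
    weights = [prod / total for prod in products]   # normalized to sum to 1
    overall = sum(w * a for w, a in zip(weights, optimal_angle))
    return overall, weights


# Hypothetical inputs for zones 1 through 9, for illustration only.
pitch_share = [0.08, 0.10, 0.09, 0.11, 0.14, 0.12, 0.10, 0.13, 0.13]
contact_rate = [0.78, 0.82, 0.75, 0.85, 0.90, 0.84, 0.80, 0.86, 0.79]
optimal_angle = [16, 15, 17, 13, 12, 14, 10, 11, 12]

overall, weights = combine_zone_angles(pitch_share, contact_rate, optimal_angle)
print(f"overall optimal launch angle: {overall:.1f} degrees")
```

Because the weights sum to one, the combined angle always falls between the smallest and largest zone-level angles, with zones the hitter sees and makes contact in most often pulling hardest.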

To finalize our findings, while Andrew McCutchen finished his 2017 regular season with an average launch angle of 14.2°, our advice based off our model is that he lower his average launch angle to 13.0°.

Well, there’s our methodology, not saying it’s perfect, but we’re certainly happy with our results.

About the Authors:

Grant Carr is a mathematics and economics double major at Kenyon College.

Justin Clark is a mathematics major at Kenyon College.

Jake Fisher is a history major at Kenyon College.

Jack Marino is a mathematics and economics double major at Kenyon College.

Noah Nash is an English major and art history minor at Kenyon College.

The group can be contacted at marinojc@kenyon.edu with any further questions.

## Estimating Team Surplus in Jose Altuve’s 2013 Deal

In July of 2013, Jose Altuve agreed to a four-year contract extension with two club options for the 2018 and 2019 seasons that guaranteed \$12.5 million, including a \$750,000 signing bonus. Now that the Astros have picked up the club option at \$6 million for the 2018 season and will pick up the club option at \$6.5 million for the 2019 season, the deal will end up totaling 6 years for \$25.075 million (this figure includes \$75,000 accrued in incentives). What is important to note about this extension is that the deal bought out all three of Altuve’s arbitration-eligible years (2015-2017) at \$2.5, \$3.5, and \$4.5 million respectively and, through the club options, controls his first two years of free agency.

In 2013, Altuve was coming off a 2012 All-Star campaign in which he slashed .290/.340/.740 (batting average/on-base percentage/OPS). However, at the time the deal was signed on July 13, 2013, Altuve seemed to have regressed slightly, hitting .280/.317/.671 through 86 games. Following the 2013 season, Altuve’s stock soared, culminating in an MVP season in 2017, when he led the majors with 8.3 WAR.

###### (Image from sportingnews.com)

Looking back on this contract, the deal obviously resulted in a large team surplus; however, contracts like this have had mixed results in the past (think Allen Craig), so I must stress that the point of this article is not to bash Altuve for signing the deal or to laud Jeff Luhnow and the Astros for getting it done when they did. The point of this article is to ask: if we hold Altuve’s past performance constant, and had he gone the more traditional route of three years of arbitration followed by free agency instead of signing this extension, how much could he have expected to make along the way? This article could go in many different directions, such as trying to assess the level of risk Altuve signed away in July of 2013 or, similarly, trying to quantify the risk the Astros took on by signing the deal (a risk magnified by the fact that they started the 2013 season with a payroll of just \$22 million). Knowing what transpired over the last 6 years, however, I believe calculating the surplus on this deal is the most interesting way to proceed.

For the purpose of this article, I am assuming that Altuve worked on one-year contracts through the 2017 season when he would have hit free agency. I am also assuming that during each arbitration year, any of the three possible arbitration outcomes (player victory, team victory, or prior settlement before a hearing) could have occurred. To assist in keeping track of the many numbers presented in the remainder of this article, I have prepared the following table:

2013/2014 Offseason

Since Altuve would have been under team control, 2014 is the easiest season for which to estimate surplus, and it was clearly the most player-friendly year for Altuve. For 2014, Altuve’s pay jumped from the \$510,000 he most likely would have received under team control to a \$1.25 million base salary plus a \$750,000 signing bonus and \$25,000 in incentives. Hence, during the 2014 season, Altuve saw approximately \$1.515 million in surplus on this deal. For a player under team control, this number is pretty much unheard of; however, after 2014 is where the deal’s problems started.

2014/2015 Offseason

The 2014 season was clearly a breakout campaign for Altuve who hit .341/.377/.830 with 56 stolen bases and 225 hits. Since we are talking about his first arbitration-eligible offseason, it is important to keep in mind that Altuve’s .341 batting average and 225 hits both led the league. These more traditional statistics may not impress the sabermetrically savvy readers of FanGraphs, but the truth of the matter is that arbitration salaries are still very much reliant on these traditional metrics.

In the 2014/2015 offseason, the largest contracts awarded to hitters who were arbitration eligible for the first time went to Chris Carter (\$4.175 million), Trevor Plouffe (\$4.8 million), and Dayan Viciedo (\$4.4 million; however, Viciedo was released at the end of Spring Training of 2015). None of these players are great comps for Altuve, who only hit 7 home runs in 2014; however, they clearly set the market for top first-time arbitration-eligible hitters. While someone like Carter may have hit 37 home runs and used this traditional measure of player value to push his deal up, Altuve did smack 47 doubles during the 2014 campaign and actually had an OPS that was 31 points higher than Carter’s (.830 vs. .799). I think a fair claim to make from this is that both sides would have seen tremendous value in Altuve and that my estimate of a \$5.0 million salary for a first-time arbitration-eligible player is fair. This gives the Astros \$2.475 million in surplus on the 2015 contract for Altuve, who in reality received \$2.5 million plus \$25,000 in awarded incentives. This turns the tide and brings the total surplus of the contract through 2015 to \$960,000 in favor of the Astros.

2015/2016 Offseason

In 2015, Altuve was named to his third All-Star team and continued to impress, leading the league in hits and stolen bases for the second year in a row on his way to finishing in the top ten in MVP voting and winning a Gold Glove Award at second base. Salaries tend to jump considerably in the second year of arbitration, so after backing up his 2014 campaign with an even stronger 2015 season, a fair estimate for his 2016 salary is \$11 million. Since Altuve ended up earning \$3.525 million in 2016, assuming the \$11 million salary is correct, the Astros’ surplus in 2016 was \$7.475 million, making the total surplus of the deal through 2016 \$8.435 million in favor of the Astros.

2016/2017 Offseason

It should come as no surprise that Altuve followed up his All-Star 2015 campaign by leading the league in batting, increasing his home run total to 24, and finishing third in 2016 MVP voting. These are exactly the type of numbers that jump out of a presentation to an arbiter, and given the precedent of great value placed on high-impact position players in their third year of arbitration, it is fair to assume that the 2016/2017 market would have been high on Altuve. Further evidence can be found in the 2017/2018 market, which saw Josh Donaldson set the arbitration record by reaching a 2018 salary of \$23 million. In addition, Bryce Harper settled on a 2018 salary of \$21.625 million in May of 2017 after a massive step back in production from 2015 to 2016 that saw him hit 18 fewer home runs and his on-base percentage drop 87 points (Harper did rebound strongly at the beginning of 2017, right before this deal was signed). For these reasons, it is safe to assume that Altuve could easily have expected a 2017 salary of \$20 million, leaving \$15.5 million in team surplus for the year and bringing the total surplus of the deal to \$23.935 million.

2017/2018 Offseason: Free Agency!

Finally, a lot has been made of the free agent market this year, but the fact of the matter is that this past offseason Altuve would have been a 27-year-old reigning AL MVP hitting free agency. If the five-year, \$151 million extension he signed yesterday is any indication of the deal he would have received, I think 8 years at \$225 million, paying \$23 million in 2018 and \$25 million in 2019, is a conservative estimate. Based solely on this past year’s market, people may scoff at the length and dollar value of this deal, but the comparison between Altuve’s hypothetical situation and the situations of this year’s free agent class is not a strong one, because the three main reasons this class struggled do not pertain to Altuve. The first reason this past year’s free agent class struggled is that teams now have a better understanding of how players age and are no longer offering long contracts to players hitting free agency at 29 or 30 years old. The second reason is that not enough teams are trying to compete for a playoff spot in 2018 due to the many rebuilds that are taking place. Because of this, there were fewer teams in the market submitting bids and driving up prices. The third reason is that lower WAR players who excel at putting up big numbers in traditional statistics like home runs (e.g., Mike Moustakas) are finally being more fairly valued.

As one of the top players in the league at just 27 years old, Altuve would have faced none of these problems, even if only one team (say Milwaukee) had competed with the reigning World Champion Houston Astros to sign him. The presence of just one other team in the market would have been enough to drive up the price and push the Astros to pull the trigger on this big deal.

For the purpose of this article, the only two years we are worried about are the first two years (2018-2019), which are covered under Altuve’s actual contract at \$6 million and \$6.5 million. Thus, the Astros will receive \$17,000,000 in surplus in 2018 and \$18,500,000 in surplus in 2019, totaling \$59,435,000 in surplus during the length of this 6-year contract. Talk about a team friendly deal!
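The year-by-year tally can be reproduced with a short script. Every figure below comes from the sections above; the "estimated" column holds this article’s own salary estimates, not actual arbitration or free agent results.

```python
# (season, estimated salary without the extension, actual pay), in millions of dollars.
seasons = [
    (2014, 0.510, 1.25 + 0.75 + 0.025),   # base salary + signing bonus + incentives
    (2015, 5.0, 2.5 + 0.025),             # salary + incentives
    (2016, 11.0, 3.525),
    (2017, 20.0, 4.5),
    (2018, 23.0, 6.0),                    # club option
    (2019, 25.0, 6.5),                    # club option
]

running = 0.0
for season, estimate, actual in seasons:
    surplus = estimate - actual           # positive values favor the Astros
    running += surplus
    print(f"{season}: team surplus {surplus:+.3f}M, cumulative {running:+.3f}M")

total_paid = sum(actual for _, _, actual in seasons)
print(f"total actually paid: {total_paid:.3f}M")
```

The cumulative column starts negative in 2014, the one season that favored Altuve, and climbs to the \$59.435 million total by 2019.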

###### (Image from clickhole.com)

Lastly, a quick note on what transpired yesterday. In my opinion, Altuve signing his second extension is a double down on the risk-averse behavior he first exhibited in 2013. Had he waited the two years until his current deal expired and hit free agency at age 29, teams once again may have shown the pattern of this past offseason and may not have been willing to give a 29-year-old more than a 5-year deal. Thus, Altuve’s contract way back in 2013 may have dealt his hand yesterday and forced him to sign this contract. The 2013 team friendly deal once again works perfectly for the Astros who were able to extend Altuve for five years while getting to take him off the books by his age 34 season.

The details on player contracts were taken from spotrac.com.