When I am thinking about buying a ticket to a baseball game, often my first question is “Who’s pitching?” I have always felt that the most enjoyable type of game is one in which a great starter is on the mound. Is this feeling common among fans or do they buy tickets regardless of the starting pitcher?
To answer this question, I trained random forest models to predict attendance for games based on situational factors (not including the starting pitcher). Then I considered how the quality of starting pitchers relates to whether the models overestimate or underestimate the attendance. If the models consistently underestimate attendance when star pitchers are on the mound, it would suggest more tickets are sold because of the starter.
A separate random forest model to predict attendance was built for each season from 1938 to 2018. The cutoff point of 1938 was chosen because of sample size constraints.
The independent variables for the model are:
- Opposing team
- Day-of-week type (Monday-Thursday or Friday-Sunday)
- Time of day (day or evening)
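As a rough illustration of the modeling step, the sketch below fits one random forest to a single season using only the three variables listed above. The article does not say which software or settings were used, so scikit-learn's `RandomForestRegressor`, the one-hot encoding, the hyperparameters, the column names, and the made-up attendance figures are all assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical game log for one season; every value here is invented.
games = pd.DataFrame({
    "opponent": ["BOS", "BOS", "NYY", "NYY", "TBR", "TBR", "BOS", "NYY"],
    "day_type": ["Mon-Thu", "Fri-Sun", "Mon-Thu", "Fri-Sun",
                 "Mon-Thu", "Fri-Sun", "Fri-Sun", "Mon-Thu"],
    "time_of_day": ["evening", "day", "evening", "day",
                    "evening", "evening", "evening", "day"],
    "attendance": [24100, 33500, 29800, 38200, 15400, 21900, 35100, 27600],
})

FEATURES = ["opponent", "day_type", "time_of_day"]


def fit_season_model(season_games, seed=0):
    """Fit a random forest on one season's games (assumed setup)."""
    # One-hot encode the categorical variables so the forest can split on them.
    X = pd.get_dummies(season_games[FEATURES])
    y = season_games["attendance"]
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X, y)
    return model, X


model, X = fit_season_model(games)
# In-sample predictions, used here only to keep the sketch short.
games["predicted"] = model.predict(X)
print(games[["opponent", "day_type", "attendance", "predicted"]])
```

Note that the starting pitcher is deliberately left out of `FEATURES`, so any systematic gap between `attendance` and `predicted` can later be compared against pitcher quality.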
The differential was calculated between the actual attendance and the predicted attendance using the formula:
Differential = 100 × (Actual – Predicted) / Predicted
Note that this is the percent error formula where the predicted value of attendance is the reference and the actual attendance is the approximation. Since the actual attendance is the metric influenced by the starting pitcher, that value is being tested. Thus a positive differential indicates that the actual attendance is higher than the predicted number. A percent error, rather than an absolute difference, is necessary because overall park attendance has increased over time, so the same absolute gap means less in recent seasons than in early ones.
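The formula can be written as a small function. This is only a sketch: the function name and the attendance figures are hypothetical, chosen to show why a percentage is used instead of a raw difference.

```python
def attendance_differential(actual, predicted):
    """Percent difference of actual attendance relative to the prediction."""
    if predicted <= 0:
        raise ValueError("predicted attendance must be positive")
    return 100 * (actual - predicted) / predicted


# The same 1,500-fan gap is a much bigger deal for a small expected crowd.
print(attendance_differential(31_500, 30_000))  # 5.0
print(attendance_differential(11_500, 10_000))  # 15.0
print(attendance_differential(28_500, 30_000))  # -5.0 (model overestimated)
```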
Next, the average differential was computed for each starting pitcher, considering only pitchers who made at least 10 starts in the given season. To compare differentials to pitcher performance, fWAR was used as a proxy for pitcher quality. It seemed to be the best choice, as it captures both the quantity and quality of a pitcher’s performance. The pitchers were separated into four groups:
- Best: fWAR at or above the 90th percentile
- Above_avg: fWAR between the 50th and 90th percentiles
- Below_avg: fWAR between the 10th and 50th percentiles
- Worst: fWAR below the 10th percentile
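The averaging and grouping steps might look something like the sketch below. The function names and the sample data are hypothetical, and a few details are assumptions because the article does not spell them out: percentiles are computed with `statistics.quantiles` using the inclusive method, they are taken over whichever pitchers are passed in, and a pitcher sitting exactly on the 10th or 50th percentile goes into the higher group.

```python
from collections import defaultdict
from statistics import mean, quantiles

MIN_STARTS = 10  # threshold stated in the article


def average_differential_by_pitcher(starts):
    """Average per-game differentials for each pitcher with enough starts.

    `starts` is an iterable of (pitcher, differential) pairs, one per start.
    """
    per_pitcher = defaultdict(list)
    for pitcher, differential in starts:
        per_pitcher[pitcher].append(differential)
    return {
        pitcher: mean(values)
        for pitcher, values in per_pitcher.items()
        if len(values) >= MIN_STARTS
    }


def assign_groups(fwar_by_pitcher):
    """Label each pitcher by where his fWAR falls among the given pitchers."""
    # Nine decile cut points; indexes 0, 4 and 8 are the 10th, 50th and 90th.
    cuts = quantiles(fwar_by_pitcher.values(), n=10, method="inclusive")
    p10, p50, p90 = cuts[0], cuts[4], cuts[8]
    groups = {}
    for pitcher, fwar in fwar_by_pitcher.items():
        if fwar >= p90:
            groups[pitcher] = "Best"
        elif fwar >= p50:  # assumption: exactly at the 50th counts as Above_avg
            groups[pitcher] = "Above_avg"
        elif fwar >= p10:  # assumption: exactly at the 10th counts as Below_avg
            groups[pitcher] = "Below_avg"
        else:
            groups[pitcher] = "Worst"
    return groups


# Hypothetical season: 11 pitchers with fWAR values 0.0 through 10.0.
fwar = {f"P{i}": float(i) for i in range(11)}
groups = assign_groups(fwar)

# Hypothetical differentials; P5 has only 3 starts and is dropped.
starts = [("P10", 3.0)] * 10 + [("P0", -2.0)] * 10 + [("P5", 1.0)] * 3
averages = average_differential_by_pitcher(starts)
print(groups)
print(averages)
```

Averaging the per-pitcher values within each label then gives one differential per group for the season.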
The average differentials were then compared across these four groups.
The following graphs show the average differential by group over time.
The Best group consistently has a larger attendance than predicted, the Worst and Below_avg groups have a consistently lower attendance, and the Above_avg group is around the predicted level. All four groups tend to move closer to the predicted level over time. These relationships are also clear in the following summary table, which aggregates the data by decade:
Although the magnitude ebbs and flows, the sign of the differentials remains consistent within groups. The R-squared value shown is the average R-squared of the models in the decade, and it increases over time as the average number of pitchers who made at least 10 starts increases. Overall, the models do a pretty good job fitting the data, especially with the larger sample sizes in recent years. Also note that the models tend to overestimate attendance more than underestimate it, as all of the groups other than Best (which accounts for just the 90th percentile and above) average a negative differential. This makes it even more convincing that the pitchers in the Best group are bringing in more fans.
In more concrete terms, the average differential of 2.80 for 2010-2018 means that if a pitcher in the Best group is starting, and the model predicts 30,000 people will come to the game, then about 840 additional fans should be expected.
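That arithmetic is just the differential formula run in reverse. A quick sketch follows; the function name is hypothetical, and the 2.80 and 30,000 figures are the ones quoted above.

```python
def expected_extra_fans(predicted_attendance, differential_pct):
    """Convert a percent differential back into a head count."""
    return predicted_attendance * differential_pct / 100


extra = expected_extra_fans(30_000, 2.80)
print(round(extra))  # 840
```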
It appears that the quality of the starting pitcher does in fact affect ticket sales, and in particular that star pitchers tend to put more fans in seats. This relationship is consistent over time, with no clear linear trend that would suggest starters are having more or less of an impact now than in past years.