The Effectiveness of the Speed and Movement of a Four-Seam Fastball

December 21, 2014

Introduction

A few weeks ago I posted a proposal for a regression analysis for an econometrics class I am taking, with the promise that I would post the full analysis when it was complete. Well, it’s been completed, and here is the full analysis, as promised. It’s a lot of words, so if you don’t care much for how a Probit model works or how to perform a t-test, I will go ahead and tell you my findings now. I found that the speed of a four-seam fastball does help determine the outcome of the pitch: the faster the pitch, the lower the quality of contact. I also found that the movement of a four-seam fastball is statistically insignificant: a four-seam fastball can have zero movement and the outcome will be the same for that pitch. This could be because a four-seam fastball just doesn’t move that much relative to other pitches; I’m not sure, though. Also, the model I created has very low goodness of fit measures, which means speed and movement of a four-seam fastball only play a small part in determining the outcome of the pitch. This makes sense: baseball is a complicated game and a lot of variables go into determining an outcome. Without adding even more words to this post, below is the paper in its entirety.

It could easily be said Major League Baseball is in an arms race. Teams have been putting a greater emphasis on finding and developing pitchers who can throw a baseball faster than their peers. Indeed, the average velocity of a fastball has increased every year from 2004 to 2013, with a slight downtick in 2014. From 1990 to 1999, 37 pitchers threw 25 percent or more of their fastballs at 95 MPH or faster; in 2013, 149 pitchers did so. From 2003 to 2008, seven pitchers threw a fastball 100 MPH or faster 20 or more times in a season; from 2009 to 2013, 38 pitchers did so. Teams are trying to find flame-throwers because they believe the faster a ball travels towards home plate, the harder it is for a hitter to make the type of contact resulting in a hit. On the other hand, other factors that are not emphasized, such as the amount of movement of a fastball, may play a role. When a pitcher throws a fastball, it moves. Just as some pitchers can throw a fastball with more velocity, some pitchers can throw a fastball with more movement than others. The relationship between movement and contact should mirror the one between velocity and contact: the more movement there is, the harder it is to make good contact.

Due to this assumed relationship between velocity, movement, and outcome, I would like to answer the following questions: is it more difficult to hit a fast-moving four-seam fastball than one moving more slowly? Also, is it more difficult to hit a four-seam fastball the more movement it has? My hypothesis is therefore twofold: a faster pitch will be more difficult to hit than a slower pitch, and the more movement a pitch has, the harder it will be to hit. A pitch is considered difficult to hit if the batter swings and fails to make contact, which results in a strike, or if the contact made is poor and the ball put in play results in an out.

The body of this paper is organized into six categories: the economic model, the econometric model, the data, the procedures of estimation and inference, the empirical results, and the conclusion. The economic model section explains the composition of the independent variables, the dependent variables, and the error term. It also explains the assumptions as well as provides a general framework for the type of model required for the estimation. The econometric model lays out the functional form of the economic model by formalizing the variables and creating the equations; it also establishes a method to test the statistical significance of the independent variables. The data section explains how the data was gathered, any issues that had to be resolved, and any hesitations about the quality of the data. The procedures of estimation and inference section describes the tools, software, and the specific models chosen to derive the results, why they were chosen, and the characteristics of the model. The empirical results section reports the means of the independent variables, the discrete profile for the outcomes, the parameter estimates, interval estimates, the value of the test statistics, and the goodness of fit measures; it also puts the parameters into the equations. Finally, the conclusion section analyzes the implications from the empirical results and offers possible explanations for the results.

The Economic Model

Independent Variables

A pitcher can throw many types of pitches. The pitcher can try to deceive the batter by throwing a pitch that has a lot of movement, such as a curveball or slider, or a pitch that is slower than it looks like it will be when it leaves the pitcher’s hand, such as a change-up. But no pitcher tries to deceive a hitter when throwing the four-seam fastball. When a pitcher throws a four-seam fastball he is simply trying to throw it as hard and as accurately as he can. And this is what teams are searching for: the maximum velocity of a pitcher’s four-seam fastball, and the higher the velocity, the better. Even though a pitcher is not trying to induce movement when he throws a four-seam fastball, the ball still moves either horizontally or vertically, which can affect the outcome of the pitch, just as velocity can. This means there will be two independent variables: velocity, measured in MPH, and total movement, which is horizontal plus vertical movement, measured in inches.

Dependent Variables

The dependent variables will be all of the possible per-pitch outcomes that involve the batter attempting to hit the pitch by swinging his bat; this excludes pitches an umpire calls a strike or a ball. These two outcomes are excluded because the batter did not swing his bat, which means the speed or movement of the pitch having any effect on avoiding contact, or inducing poor contact, cannot be discerned.

In addition, because the outcomes are per-pitch, walks and strikeouts are excluded because those outcomes are already accounted for. More specifically, if the batter walks, then he did not swing at the pitch, and it is therefore excluded. If the batter strikes out by swinging and missing, which is accounted for with the swinging-strike outcome, or by being called out by the umpire, then it is excluded because the batter did not swing his bat.

The included outcomes are: swinging strike, foul ball, ground-out, pop-out, fly-out, line-out, single, double, triple, and home run. The difference between a pop-out and a fly-out is who catches the ball: if an infielder catches a ball in the air then it is a pop-out, if an outfielder catches a ball in the air then it is a fly-out. Many types of outs have been included because each type of out can indicate what type of contact was made. For example, if the contact was poor, then the result will either be a ground-out or a pop-out. If the contact was solid, but the batter still made an out, then the result will be a line-out or a fly-out. If the contact did not result in an out, then it will be assumed the contact was good.

From a pitcher’s perspective the outcomes are, from most to least desirable: swinging strike, pop-out, ground-out, fly-out, line-out, foul, single, double, triple, and home run. This ranking also reflects a continuous spectrum of contact from softest to hardest. An argument can be made that a swinging strike does not belong on the spectrum because no contact was made. But no contact is still a type of contact; it is the absence of contact, which is the lowest quality of contact and the lowest point on the contact spectrum.

Error Term

The error term will capture the sequencing of the previous pitches, the count, the base-out state, the location of the pitch, and the quality of the defense.

Each pitch will be context neutral; the pitches that preceded it will not be accounted for. This can affect the outcome of the pitch because the absolute speed of the pitch may not matter as much if the previous pitches that a batter has seen in an at bat have been much slower than the four-seam fastball.

The count of the at bat can affect the outcome of the pitch because batters know, in some counts, pitchers are more likely to throw a four-seam fastball. In this case, the batter may be anticipating the four-seam fastball, which will give the batter an advantage. The base-out state can affect the outcome of the pitch because it can dictate what pitch a pitcher is more likely to throw. The location can also affect the outcome of the pitch because some locations are more difficult for a batter to reach with his bat when he swings. Also, pitchers generally know there are certain locations where most hitters of a certain handedness have difficulty hitting a four-seam fastball if thrown in the particular location, and the location is less sensitive to speed and movement.

The quality of the defense can affect the outcome of the pitch as well because it can turn hits into outs, if the defense is good, or it can turn outs into hits, if the defense is poor. This can cause the ranking of outcomes to be less predictive of the type of contact made for each outcome. For example, a ground ball that gets past an infielder is a single. But the contact made was the type of contact consistent with the contact for a ground-out, not a single. Since a ground-out is ranked third and a single is ranked seventh, the difference in quality of contact between the two outcomes is substantial.

Estimation Methods

Since the dependent variable can take only one of ten possible values the relationship between the independent and dependent variable is not linear and the Ordinary Least Squares model would not be appropriate for our purposes. The best type of model to predict one of the possible outcomes for a pitch given an initial value of velocity and movement is a Limited Dependent Variable model. A Limited Dependent Variable model is used when the value of the dependent variable is restricted to a range of possible outcomes that can be ranked in a meaningful manner. The estimation of the relationship between the independent and dependent variable requires the method to take into account the restriction and ranking. This model was chosen because the range of possible outcomes is restricted and the values are discrete—each pitch can only result in one of ten possible outcomes—and the outcomes are ordered by their value to the pitcher. Also, the relationship between velocity, movement, and the outcome of the pitch requires the ranking of the outcomes to be accounted for because it is assumed velocity and movement influence the type of outcome.

Since the outcomes are also ranked by type of contact, an outcome occurs only if the contact for a particular outcome is greater than the contact required for the outcome located below it and less than the contact required for the outcome located above it. For example, if the contact made was greater than the contact required for a ground-out, but less than the contact required for a line-out, the outcome would most likely be a fly-out.

This type of reasoning implies interval estimates will need to be created for each outcome. Each interval estimate will have a lower limit and an upper limit; if the value the model calculates, given an initial value of velocity and movement, lies between the upper and lower limit, then the outcome the interval estimate represents will be the outcome to most likely occur.

The Econometric Model

Regression Equation

Formalizing the independent variables, dependent variables, and error term results in the following equations:

O_i= β₁ + β₂*V + β₃*M + ε (1)

Where ε ~ (0, σ²) (2)

The left side of equation 1 contains the dependent variable, outcome, and the subscript i represents the type of outcome. The right side of equation 1 has two parts: a structural component and a random component. The structural component contains the independent variables, where β₁ is the intercept, β₂ is the estimated coefficient for velocity, V is velocity in MPH, β₃ is the estimated coefficient for movement, and M is horizontal movement plus vertical movement in inches. The random component is the error term, ε; it is the part of the outcome that cannot be explained by the variables in the model. The error term is assumed to have a standard normal distribution, which is indicated by equation 2.

Interval Estimates

If the value of equation 1 is greater than the upper limit of the outcome ranked immediately below a particular outcome, and less than the lower limit of the outcome ranked immediately above it, the result is the outcome located between those two outcomes. This can be said in terms of quality of contact as well: if the quality of contact a particular amount of velocity and movement is likely to induce falls between the upper limit of the outcome immediately below and the lower limit of the outcome immediately above, the result is the outcome whose interval lies between those two limits. This means equation 1 can be used to create an interval estimate for a particular outcome:

L_Of < β₁ + β₂*V + β₃*M + ε < -L_Oc= O_i (3)

L_Of is the upper limit for outcome f and -L_Oc is the lower limit for outcome c. Outcome f is ranked immediately below outcome i, so its upper limit marks the bottom of the interval for outcome i; outcome c is ranked immediately above outcome i, so its lower limit marks the top of that interval. With that being said, interval estimates can be created for all of the outcomes and can be written as:

O_SS if O_i > L_PO	(4)
O_PO if -L_SS > O_i > L_GO	(5)
O_GO if -L_PO > O_i > L_FO	(6)
O_FO if -L_GO > O_i > L_LO	(7)
O_LO if -L_FO > O_i > L_FL	(8)
O_FL if -L_LO > O_i > L_SG	(9)
O_SG if -L_FL > O_i > L_DB	(10)
O_DB if -L_SG > O_i > L_TP	(11)
O_TP if -L_DB > O_i > L_HR	(12)
O_HR if -L_TP > O_i	(13)

To make sense of equations 4 through 13, the outcomes have been assigned the following categorical values and subscripts in Table 1.

Table 1: Categorical Values & Subscripts

Outcome	Value	Subscript
Swinging Strike	10	SS
Pop Out	9	PO
Ground Out	8	GO
Fly Out	7	FO
Line Out	6	LO
Foul	5	FL
Single	4	SG
Double	3	DB
Triple	2	TP
Home Run	1	HR

Using equation 3 and equations 4 through 13, the interval estimates can be derived for each outcome. Those equations are:

L_PO < β₁ + β₂*V + β₃*M + ε = O_SS (13)

L_GO < β₁ + β₂*V + β₃*M + ε < -L_SS = O_PO (14)

L_FO < β₁ + β₂*V + β₃*M + ε < -L_PO = O_GO (15)

L_LO < β₁ + β₂*V + β₃*M + ε < -L_GO = O_FO (16)

L_FL < β₁ + β₂*V + β₃*M + ε < -L_FO = O_LO (17)

L_SG < β₁ + β₂*V + β₃*M + ε < -L_LO = O_FL (18)

L_DB < β₁ + β₂*V + β₃*M + ε < -L_FL = O_SG (19)

L_TP < β₁ + β₂*V + β₃*M + ε < -L_SG = O_DB (20)

L_HR < β₁ + β₂*V + β₃*M + ε < -L_DB = O_TP (21)

-L_TP > β₁ + β₂*V + β₃*M + ε = O_HR (22)
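The interval logic in the equations above amounts to finding which pair of adjacent limits a value falls between. Here is a minimal Python sketch of that lookup, assuming the nine limits are supplied in ascending order. The function name `classify` and the evenly spaced limits are hypothetical and are used only for illustration; they are not estimates from this paper.

```python
from bisect import bisect_right

# Outcomes ordered from lowest to highest categorical value (Table 1).
OUTCOMES = ["Home Run", "Triple", "Double", "Single", "Foul",
            "Line Out", "Fly Out", "Ground Out", "Pop Out", "Swinging Strike"]


def classify(latent, limits):
    """Return the outcome whose interval contains the latent value.

    `limits` holds the nine boundaries between adjacent outcomes in
    ascending order; limits[k] separates OUTCOMES[k] from OUTCOMES[k + 1].
    """
    if len(limits) != len(OUTCOMES) - 1:
        raise ValueError("expected one limit between each pair of outcomes")
    # bisect_right counts how many limits the latent value has reached.
    return OUTCOMES[bisect_right(limits, latent)]


# Hypothetical, evenly spaced limits used only to exercise the function.
example_limits = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(classify(0.5, example_limits))  # below every limit
print(classify(8.5, example_limits))  # between the last two limits
print(classify(9.5, example_limits))  # above every limit
```

A value that lands exactly on a limit is assigned to the higher-valued outcome here; that tie-breaking choice is arbitrary and does not matter for a continuous error term.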

Hypothesis Testing

Once the estimates for the coefficients are reported, their level of significance can be tested. To do this a null and alternative hypothesis was created:

H₀: β₂ = 0, β₃ = 0 (23)

H₁: β₂ ≠ 0, β₃ ≠ 0 (24)

Equation 23 is the null hypothesis and it states the coefficients for velocity and movement equal 0. This means if one of the coefficients is 0, a change in that variable will not change the predicted outcome or quality of contact. Equation 24 is the alternative hypothesis and it states the coefficients for velocity and movement are not equal to 0. This means the coefficients do influence the outcome and quality of contact. The next step in hypothesis testing is calculating a test statistic. Since the error terms are assumed to have a standard normal distribution and to be homoscedastic (all of the error terms have the same variance), the t-test will be used for the test statistic. The next step is to establish a rejection region. Because the alternative hypothesis is “not equal to,” a two-tail test needs to be used. This is done with the following equation:

t_(α/2, N-3) < t < t_(1-α/2, N-3) (25)

Where α is the level of significance, N is the number of observations, and N-3 is the degrees of freedom; 3 is subtracted because 3 degrees of freedom have been used by the two coefficients and the intercept. The rejection region has two parts: one located in the lower tail of the curve, the other located in the upper tail of the curve. The space to the left of t_(α/2, N-3) is the lower tail and the space to the right of t_(1-α/2, N-3) is the upper tail. Equation 25 describes the acceptance region between the two critical values, so the null hypothesis can be rejected for two reasons: if t is less than t_(α/2, N-3), or if t is greater than t_(1-α/2, N-3). If either of these is true, the test statistic is located beyond a critical value in one of the rejection regions, which means the null hypothesis can be rejected in favor of the alternative hypothesis. But if both of these are false, the test statistic is located between the critical values in the acceptance region, which means the null hypothesis cannot be rejected and the coefficient being tested could be 0; in that case the coefficient is statistically insignificant.
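The two-tail decision rule can be sketched in a few lines of Python. This is only an illustration and not the SAS procedure used for the paper. It assumes the degrees of freedom are large enough that the t distribution can be approximated by the standard normal, which lets it rely on the standard library alone. The function name `two_tailed_decision` and the example t value are hypothetical.

```python
from statistics import NormalDist


def two_tailed_decision(t_stat, alpha=0.05):
    """Return (lower, upper, reject) for a two-tail test.

    Assumption: with very large degrees of freedom, the t critical
    values are approximated by standard normal quantiles.
    """
    upper = NormalDist().inv_cdf(1 - alpha / 2)
    lower = -upper
    # Reject when the statistic falls in either tail.
    reject = t_stat < lower or t_stat > upper
    return lower, upper, reject


lower, upper, reject = two_tailed_decision(2.5)  # hypothetical t value
print(f"critical values: {lower:.3f}, {upper:.3f}; reject H0: {reject}")
```

With small samples the normal approximation would understate the critical values, so an exact t quantile would be needed instead.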

Data

The data was collected from www.BaseballSavant.com. This website maintains the PITCHf/x database, which contains data on every pitch thrown from the 2008 to 2014 seasons, recorded by high-speed cameras located in every Major League ballpark. Since the data from 2008 to 2009 has some classification issues, those years are excluded from the data sets; thus the data sets are from seasons 2010 to 2014. Each data set has approximately 21,000 observations. Since there are five data sets, the total number of observations is approximately 105,000.

The website allows for many types of filters to be used when searching for data, but the filters used for our purposes are pitch type, pitch result, batted ball result, and at-bat result. The filters for pitch result do not include the type of outcome resulting from the ball being put in play. To get those results the filters for at-bat result had to be used. This resulted in the inclusion of data that was supposed to be excluded. For example, if a four-seam fastball was thrown during an at-bat, but the batter did not swing, then it needs to be excluded, but if the at-bat ended with one of the selected at-bat filters then it was included in the data set. All lines of data containing this type of issue had to be removed from the data sets.

Also, the data on movement came in two components: horizontal movement and vertical movement. Some of the values for horizontal and vertical movement were negative and some were positive. From the catcher’s perspective, horizontal movement is positive if the pitch moves towards the right side of home plate, and negative if the pitch moves towards the left side of home plate. Vertical movement is positive if the pitch drops less than it would from gravity alone, and negative if the pitch drops more than it would from gravity alone. If a pitch had one type of movement that was positive and another that was negative, the two values would partially cancel when added together and would not properly reflect total movement. To prevent this from occurring, the absolute value was taken for each type of movement and the two absolute values were then added together.

Since a Limited Dependent Variable model is being used, a new variable had to be created. This variable captures the ranking of each outcome by assigning a numerical value to each type of outcome. Since each outcome was ranked from least to most desirable from the perspective of the pitcher, the least desirable outcome, a home run, was assigned the value of one, and the most desirable outcome, a swinging strike, was assigned the value of ten. Also, a variable had to be created indicating the year from which the data originated. Since there are five years’ worth of data, the variable could take on one of five possible values—1 through 5. This was done because all of the data was combined when put into the program. Having a variable indicating year allowed for a dummy variable to be created in the program so different data sets could be created and regressions could be run on each data set, and then all the data sets combined.
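The derived variables described in this section are simple to express in code. The sketch below is in Python rather than the SAS used for the paper, and the function names, the dictionary keys, and the example pitch are all hypothetical. It only illustrates the three transformations: total movement from absolute values, the 1 to 10 outcome ranking, and the 1 to 5 year index.

```python
# Ranking of outcomes from least (1) to most (10) desirable for the pitcher.
OUTCOME_VALUE = {
    "Home Run": 1, "Triple": 2, "Double": 3, "Single": 4, "Foul": 5,
    "Line Out": 6, "Fly Out": 7, "Ground Out": 8, "Pop Out": 9,
    "Swinging Strike": 10,
}


def total_movement(horizontal_in, vertical_in):
    """Add the absolute horizontal and vertical movement, in inches."""
    return abs(horizontal_in) + abs(vertical_in)


def year_index(season, first_season=2010):
    """Map seasons 2010 through 2014 to the values 1 through 5."""
    return season - first_season + 1


# Hypothetical pitch: 4.5 inches to the left, 9.0 inches of "rise".
pitch = {"outcome": "Ground Out", "horizontal": -4.5, "vertical": 9.0,
         "season": 2012}

print(pitch["horizontal"] + pitch["vertical"])                # signs cancel
print(total_movement(pitch["horizontal"], pitch["vertical"]))  # signs ignored
print(OUTCOME_VALUE[pitch["outcome"]], year_index(pitch["season"]))
```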

Procedures of Estimation and Inference

The program used to run the regression was SAS, version 9.3. The procedure used to estimate the mean, standard deviation, and the minimum and maximum values for the independent variables was the MEANS procedure. The procedure used to estimate the intercept, coefficients, and interval estimates was the QLIM procedure. The QLIM procedure estimates Limited Dependent Variable models, and can use either the Binary Probit or Logit model, or the Ordinal Probit or Logit model. The Binary Probit or Logit model is used when the dependent variable assumes only one of two values. Since the dependent variable has ten possible values, the Binary model was not appropriate for our purposes. The Ordinal Probit or Logit model allows a dependent variable to assume more than two values that can be ranked in either ascending or descending order, which was most appropriate for our purposes. The difference between the two is that the Ordinal Logit model assumes the error term has a standard Logistic distribution, and the Ordinal Probit model assumes the error term has a standard Normal distribution. Error terms can be assumed to have a standard normal distribution if the dependent variable is influenced by an unobserved continuous variable and the possibilities for the unobserved continuous variable are infinite, even if the possibilities are bounded between a minimum and maximum value.

The outcome of a pitch can be thought of as a proxy for quality of contact—the softer the contact the better the outcome for the pitcher and vice versa. Even though the model has ten dependent categorical ordinal outcomes—which by definition means it is not continuous—it measures a single variable at a distance, which is quality of contact. Quality of contact can be thought of as being continuous: it is a spectrum of infinite possibilities bounded between two values—no contact and perfect contact. Even though perfect contact is a nebulous concept, it still acts as a boundary that cannot be surpassed. This means quality of contact meets the criteria for having error terms that have a standard normal distribution, which means the Ordinal Probit model is the model most appropriate for our purposes.

The purpose of the Ordinal Probit model is to estimate the probability an observation will fall into one of the categorical outcomes. The central idea behind the Ordinal Probit model is there is an unobserved continuous variable underlying the dependent variable, which influences the ordering of the dependent variable. The unobserved continuous variable is quality of contact, which is assumed to determine the outcome, and it is assumed velocity and movement of a pitch influence quality of contact.

The Ordinal Probit model creates upper and lower threshold values that partition the continuous variable into a series of regions, each corresponding to one of the ordinal categories. These upper and lower thresholds create intervals; each interval corresponds to the range of contact required for a particular type of outcome. Quality of contact lies on a continuous spectrum from no contact to perfect contact, and each outcome occupies a region along that spectrum. Each outcome has two threshold values. If the quality of contact worsens and passes the upper threshold of a particular outcome, the result will be the outcome ranked immediately below the outcome whose upper threshold was passed; this is a lower limit. If the quality of contact improves and passes the lower threshold of a particular outcome, the result will be the outcome ranked immediately above the outcome whose lower threshold was passed; this is an upper limit.

The Ordinal Probit model relaxes the constraint that the effect of the independent variables is constant across different predicted values of the dependent variable. The model assumes an S-shaped curve. In each tail section of the curve the dependent variable responds slowly to changes in the independent variables, and as it moves closer towards the middle of the curve, the dependent variable responds faster. This implies that as the probability of a particular outcome occurring approaches 0.5, changes in velocity and movement cause relatively large changes in the probability of that outcome occurring. As the probability of a particular outcome occurring approaches 0 or 1, changes in velocity and movement induce relatively small changes in the probability of that outcome occurring.

This cascading effect on outcome probabilities has some intuition behind it: if the probability of an outcome occurring approaches 0, the probability of the outcomes furthest away (either below its lower limit or above its upper limit, depending on the type of contact) must be approaching 1. This means as the probability of a particular outcome decreases by a particular amount, that amount is allocated disproportionately among the outcomes in a particular direction, in descending order: the outcome ranked immediately above or immediately below receives the biggest increase in probability of occurrence, and the outcome furthest away receives the smallest. Another way to put it is that as velocity and movement change, contact moves along its spectrum, changing the probability of each outcome occurring; some probabilities increase and some decrease. If the probability of an outcome decreases, the probability of the outcome located immediately below or above it increases the most, and the probability of the outcome located the furthest away increases the least, with the probabilities of all the intermediate outcomes changing according to their distance from the original outcome.

For example, a home run and swinging strike are on opposite ends of the contact spectrum. If the probability of a home run occurring approaches 0, and the probability of a swinging strike occurring approaches 1, the amount of velocity and movement (and therefore contact) required for the two outcomes is substantially different because the probability of anything occurring in between must be approaching 0, but not at the rate at which the probability of a home run is approaching 0. As velocity and movement change towards the amount of velocity and movement required to induce the type of contact resulting in a home run, the probabilities of the outcomes located between swinging strike and home run will increase, with the probability of the outcome located immediately below swinging strike, pop-out, increasing the most, and the outcome located immediately below pop-out, ground-out, increasing the second most, and so on, with the probability of a home run occurring increasing the least. As velocity and movement continue to change and contact moves along its spectrum towards the type of contact required for a home run, the probabilities of each outcome change with the outcomes closest to a swinging strike increasing the most until, eventually, the allocation of probability is reversed and the probability of a home run occurring approaches 1 and the probability of a swinging strike occurring approaches 0.
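The S-shaped response described in this section can be checked numerically. The sketch below, in Python, assumes a probit relationship P = Φ(z) with z linear in velocity, so the change in P per one MPH is the standard normal density at z times the velocity coefficient. Strictly, this applies to a cumulative probability, meaning the probability of landing on one side of a single threshold. The function name `marginal_effect` is hypothetical, and the coefficient is the velocity estimate reported later in Table 4.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def marginal_effect(probability, coefficient):
    """Change in a probit probability per unit change in the regressor.

    For P = Phi(z) with z linear in x, dP/dx = phi(z) * coefficient,
    where phi is the standard normal density.
    """
    z = STD_NORMAL.inv_cdf(probability)
    return STD_NORMAL.pdf(z) * coefficient


B_VELOCITY = 0.017939  # velocity estimate from Table 4

for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(f"P = {p:.2f}: change per MPH = {marginal_effect(p, B_VELOCITY):.6f}")
```

The printed effect is largest at P = 0.50 and shrinks symmetrically towards the tails, which is the behavior the paragraph on the S-shaped curve describes.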

Empirical Results

Discrete Response Profile & Means

Table 2 is the discrete response profile for seasons 2010 to 2014. It reports the frequency of each outcome and the percent the frequency represents of all the outcomes.

Index	Outcome	Frequency	% of Total
1	Home Run	6	0.01%
2	Triple	196	0.18%
3	Double	2,252	2.09%
4	Single	8,112	7.52%
5	Foul	50,835	47.12%
6	Line Out	2,891	2.68%
7	Fly Out	10,435	9.67%
8	Ground Out	12,198	11.31%
9	Pop Out	4,072	3.77%
10	Swinging Strike	16,881	15.65%

Table 3 contains the number of observations for each variable, the mean, standard deviation, and the minimum and maximum values for seasons 2010 to 2014.

Variable	N	Mean	Std Dev	Min	Max
Velocity	107,880	91.9668966	2.9209241	78	104.1
Movement	107,880	13.1161436	3.2585848	0.29	44.41

Parameter Estimates

Table 4 contains the parameter estimates for data from the 2010 to 2014 seasons. It contains the estimates, standard errors, t values, and p values for each of the parameters. The standard error indicates how precisely the estimate represents the population value. The t and p values test for statistical significance. They both assume the null hypothesis, that the parameter equals 0, is true. The t value measures how many standard errors the estimate is from 0; the larger the t value in absolute terms, the stronger the evidence against the null hypothesis and the more likely the parameter is significantly different from 0. The p value is the probability of observing a t value at least as extreme as the one reported if the null hypothesis were true. The lower the p value, the stronger the evidence that the null hypothesis is false and the parameter is significantly different from 0.

Parameter	Estimate	S.E.	t Value	Pr > |t|
Intercept	2.175107	0.134481	16.17	< .0001
Velocity	0.017939	0.001114	16.1	< .0001
Movement	-0.001804	0.000998	-1.81	0.0706
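The t and p values in Table 4 can be reproduced from the estimates and standard errors. The Python sketch below is a check on the arithmetic, not the SAS output itself. It assumes the sample is large enough to use the standard normal distribution in place of the t distribution, so the p values may differ from the table in the last decimal place. The function name `t_and_p` is hypothetical.

```python
from statistics import NormalDist

# (estimate, standard error) pairs copied from Table 4.
ESTIMATES = {
    "Intercept": (2.175107, 0.134481),
    "Velocity": (0.017939, 0.001114),
    "Movement": (-0.001804, 0.000998),
}


def t_and_p(estimate, std_error):
    """Return the t value and a two-sided p value (normal approximation)."""
    t = estimate / std_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


for name, (estimate, std_error) in ESTIMATES.items():
    t, p = t_and_p(estimate, std_error)
    print(f"{name}: t = {t:.2f}, p = {p:.4f}")
```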

Hypothesis Testing

Since the estimates, standard errors, and t values have been reported, the significance of the coefficients can be tested. Using the null and alternative hypotheses from equations 23 and 24 and a significance level of 5 percent, equation 25 can be written as:

t_(0.025, 107,877) < 16.10 < t_(0.975, 107,877), i.e. -1.960 < 16.10 < 1.960 (26)

t_(0.025, 107,877) < -1.81 < t_(0.975, 107,877), i.e. -1.960 < -1.81 < 1.960 (27)

Equation 26 is the test for velocity. Since 16.10 is greater than 1.960, the inequality does not hold: the t value for velocity is located to the right of the critical value in the upper tail of the curve, in the rejection region. The null hypothesis can therefore be rejected at the 5 percent level of significance, which means velocity is significantly different from 0 and influences the quality of contact and the outcome of the pitch, holding movement constant.

Equation 27 is the test for movement. Since -1.81 is greater than -1.960 and less than 1.960, the t value for movement is located between the two critical values, which is not in either rejection region, and the null hypothesis cannot be rejected. This means that, at the 5 percent level of significance, movement is not significantly different from 0. If the coefficient for movement were 0, a change in movement would not change the quality of contact or the outcome of the pitch, holding velocity constant. This means movement can be removed from equations 1, 3, and 13 through 22.

Regression Equation

Since estimates for the parameters have been calculated and their level of significance has been determined, the values can be plugged into equation 1 to get:

O_i = 2.175107 + 0.017939*V + ε (26)

Movement has been removed because its effect on the outcome is not statistically different from 0. Also, the error term remains unknown because its precise value cannot be determined using a Limited Dependent Variable model. The error term takes on a range of values depending on the value of the independent variables and the value of the upper and lower limit of the outcome.

Interval Estimates

Table 5 contains the interval estimates for seasons 2010 to 2014 for each type of outcome. It gives the lower limit, upper limit, standard error, t value, and p value, and the upper limit minus the lower limit, which gives the size of the interval.

Parameter	Home Run	Triple	Double	Single	Foul	Line Out	Fly Out	Ground Out	Pop Out	Swinging Strike
Lower Limit	-	0.900493	1.798627	2.175107	2.5067	3.975195	4.043806	4.304515	4.664113	4.811196
Upper Limit	0.900493	1.798627	2.175107	2.5067	3.975195	4.043806	4.304515	4.664113	4.811196	-
S.E. (upper limit)	0.086475	0.088013	-	0.088104	0.088125	0.088165	0.08818	0.088204	0.088235	-
t Value (upper limit)	10.41	20.44	-	28.45	45.11	45.87	48.81	52.88	54.53	-
Pr > |t| (upper limit)	< .0001	< .0001	-	< .0001	< .0001	< .0001	< .0001	< .0001	< .0001	-
Upper – Lower	-	0.898134	0.37648	0.331593	1.468495	0.068611	0.260709	0.359598	0.147083	-

Movement can be removed from equations 13 through 22 and the values from Tables 4 and 5 can be plugged into the equations to get:

4.811196 < 2.175107 + .017939*V + ε = O_SS(27)

4.664113 < 2.175107 + .017939*V + ε < 4.811196 = O_PO(28)

4.304515 < 2.175107 + .017939*V + ε < 4.664113 = O_GO(29)

4.043806 < 2.175107 + .017939*V + ε < 4.304515 = O_FO(30)

3.975195 < 2.175107 + .017939*V + ε < 4.043806 = O_LO(31)

2.506700 < 2.175107 + .017939*V + ε < 3.975195 = O_FL(32)

2.175107 < 2.175107 + .017939*V + ε < 2.506700 = O_SG(33)

1.798627 < 2.175107 + .017939*V + ε < 2.175107 = O_DB(34)

0.900493 < 2.175107 + .017939*V + ε < 1.798627 = O_TP(35)

0.900493 > 2.175107 + .017939*V + ε = O_HR(36)
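To make the interval logic in equations 27 through 36 concrete, here is a minimal Python sketch that maps a fitted value to the outcome whose interval contains it. The coefficients and limits are copied from the equations above; the function names and the default error term of zero are assumptions made for the illustration, not part of the original analysis.

```python
from bisect import bisect_right

INTERCEPT = 2.175107
VELOCITY_COEF = 0.017939

# Interval boundaries from equations 27-36, in ascending order.
LIMITS = [0.900493, 1.798627, 2.175107, 2.506700, 3.975195,
          4.043806, 4.304515, 4.664113, 4.811196]

# Outcomes ordered from the lowest interval to the highest.
OUTCOMES = ["home run", "triple", "double", "single", "foul",
            "line out", "fly out", "ground out", "pop out",
            "swinging strike"]


def fitted_value(velocity_mph, error=0.0):
    """Right-hand side of the regression equation; the error term is unknown,
    so it defaults to 0 here (an assumption for the sketch)."""
    return INTERCEPT + VELOCITY_COEF * velocity_mph + error


def outcome_for(value):
    """Return the outcome whose interval contains the fitted value."""
    return OUTCOMES[bisect_right(LIMITS, value)]


print(outcome_for(fitted_value(91.96)))
```

Because the error term is set to zero, this only shows where the deterministic part of the equation lands; an actual pitch can fall in any interval once the error term is included.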

Goodness of Fit Measures

Goodness of fit measures describe how well the model fits the observations. The measures typically summarize the discrepancy between observed values and the expected values in the model. Since the linear regression model was not used, the goodness of fit measures are not those that are typically expected, such as the coefficient of determination, R². Table 6 contains the reported measures for the data from the 2010-2014 seasons.

Measure	Value
Likelihood Ratio (R)	259.29
Upper Bound of R (U)	350699
Aldrich-Nelson	0.0024
Cragg-Uhler 1	0.0024
Cragg-Uhler 2	0.0025
Estrella	0.0024
Adjusted Estrella	0.0022
McFadden’s LRI	0.0007
Veall-Zimmermann	0.0031
McKelvey-Zavoina	0.0027

The most useful of these measures is McFadden’s LRI because it is analogous to R². It is bounded between 0 and 1 and, in theory, can equal 1, meaning the model is a perfect fit for the data, although most models considered a good fit fall in the range of .2 to .4 (vii). All of the other measures except the Likelihood Ratio (R) and the Upper Bound of R (U) are similar to McFadden’s LRI: they are attempts to simulate R².
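Several of the values in Table 6 can be approximately reproduced from R and U alone. The sketch below assumes commonly used definitions of these pseudo-R² measures, which may differ in detail from what the software behind the paper computes. It also assumes the sample size N is close to the 107,877 degrees of freedom quoted in the t tests, since the exact number of observations is not restated here.

```python
import math

R = 259.29     # likelihood ratio, from Table 6
U = 350699.0   # upper bound of R, from Table 6
N = 107_877    # ASSUMPTION: sample size approximated by the t-test degrees of freedom

# Commonly used definitions (assumed, not confirmed by the paper).
mcfadden = R / U
aldrich_nelson = R / (R + N)
cragg_uhler_1 = 1 - math.exp(-R / N)
cragg_uhler_2 = cragg_uhler_1 / (1 - math.exp(-U / N))
veall_zimmermann = R * (U + N) / (U * (R + N))

for name, value in [("McFadden's LRI", mcfadden),
                    ("Aldrich-Nelson", aldrich_nelson),
                    ("Cragg-Uhler 1", cragg_uhler_1),
                    ("Cragg-Uhler 2", cragg_uhler_2),
                    ("Veall-Zimmermann", veall_zimmermann)]:
    print(f"{name}: {value:.4f}")
```

Under these assumptions, McFadden's LRI is simply the ratio R/U, which makes it easy to see why it is so small here: the likelihood ratio is tiny relative to its upper bound.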

Conclusion

Since the estimated coefficient for velocity is positive, the greater the amount of velocity the lower the quality of contact, meaning a desirable outcome for the pitcher is likely to occur. This supports the first part of the hypothesis. But the estimate for movement was not significantly different from 0, which does not support the second part of the hypothesis. A pitcher is not trying to induce movement when he throws a four-seam fastball and the movement that does occur is relatively little compared to pitches in which a pitcher is trying to induce movement. Indeed, a four-seam fastball rotates backwards, which keeps the ball straight and limits the movement. This relatively small amount of movement may not do much to deceive a hitter and cause him to either swing and miss or make poor contact. It would be interesting to see if the amount of movement in pitches in which a pitcher is trying to induce movement leads to lower quality of contact.

According to Table 2, it appears to be difficult for a pitcher to get a hitter to swing and miss at a four-seam fastball. Hitters make contact 84.35 percent of the time, and swing and miss 15.65 percent of the time. It also appears to be difficult for a hitter to make the type of contact required to avoid an out: only 9.8 percent of the outcomes resulted in a hit. When a hitter does make an out, the contact is mostly poor: 54.97 percent of the outs are ground-outs and pop-outs. The outs requiring more solid contact, line-outs and fly-outs, make up the remaining 45.03 percent. The most frequent outcome is a foul. A foul is good for the pitcher when it results in a strike, which happens only if the count has fewer than two strikes. With two strikes, a foul is good for the hitter because he gets to see another pitch.

Since the interval for the foul is the largest and the intercept is the lower limit for the outcome immediately above it—the single—it is easy to see the model predicts the most likely outcome to be a foul. This makes sense because it was the outcome that occurred most often by a wide margin. But given the ambiguity of the foul in terms of value to the pitcher and hitter, and the quality of contact required to cause a foul, any type of positive analysis will be ambiguous. A statement cannot be made about the value of this outcome except the value changes from the pitcher to the hitter depending on the count.
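The claim that the foul has the largest interval can be checked directly from the limits. The following Python sketch recomputes the interval sizes for the outcomes that have both a lower and an upper limit. The limits are copied from equations 28 through 35; the dictionary layout and names are choices made only for this illustration.

```python
# Lower and upper limits for the bounded outcomes, from equations 28-35.
INTERVALS = {
    "triple": (0.900493, 1.798627),
    "double": (1.798627, 2.175107),
    "single": (2.175107, 2.506700),
    "foul": (2.506700, 3.975195),
    "line out": (3.975195, 4.043806),
    "fly out": (4.043806, 4.304515),
    "ground out": (4.304515, 4.664113),
    "pop out": (4.664113, 4.811196),
}

# Interval size = upper limit minus lower limit.
widths = {name: upper - lower for name, (lower, upper) in INTERVALS.items()}
widest = max(widths, key=widths.get)

for name, width in widths.items():
    print(f"{name}: {width:.6f}")
print("widest interval:", widest)
```

Home run and swinging strike are left out because each is open on one side, so an interval size is not defined for them.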

Since the goodness of fit measures are rather low, the model is not a good fit for the data. This result does not mean the model is not predictive. Rather, it means there are other variables influencing the quality of contact and the outcome of the pitch that are not included in the model. In some ways, this makes sense: baseball is a complicated game and the outcome of a four-seam fastball depends on much more than just velocity and movement. Factors such as the location of the pitch, the sequencing of the previous pitches, the handedness of the pitcher and batter, the base/out state, and the count play a large part in determining the outcome of the pitch. If some of these variables were included in the model, its predictive power and goodness of fit would most likely have increased.

Taking the average fastball velocity from Table 3, 91.96 MPH, plugging it into the regression equation (equation 26), and ignoring the error term gives a value of 3.82, which falls in the interval for a foul, as expected. For velocity alone to produce a swinging strike, however, the pitch would need to travel around 147 MPH, or 19 standard deviations above the mean. This does not fit reality well: no pitcher will ever throw a pitch at 147 MPH, and plenty of hitters swing and miss at four-seam fastballs with velocity around the mean. Velocity is not the only determinant; it has only a small influence over the outcome of the pitch. This supports the conclusion that the model does not fit the data very well and that the error term is probably large relative to the estimated coefficient for velocity.
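The two figures in this paragraph follow from simple arithmetic on the regression equation. A short Python sketch, with the error term set to zero as in the text and with variable names chosen only for the example:

```python
INTERCEPT = 2.175107
VELOCITY_COEF = 0.017939
MEAN_VELOCITY = 91.96             # MPH, average four-seam velocity from Table 3
SWINGING_STRIKE_LIMIT = 4.811196  # lower limit of the swinging-strike interval

# Fitted value at the mean velocity, ignoring the error term.
value_at_mean = INTERCEPT + VELOCITY_COEF * MEAN_VELOCITY

# Velocity at which the fitted value reaches the swinging-strike limit:
# solve INTERCEPT + VELOCITY_COEF * V = SWINGING_STRIKE_LIMIT for V.
required_velocity = (SWINGING_STRIKE_LIMIT - INTERCEPT) / VELOCITY_COEF

print(f"value at mean velocity: {value_at_mean:.2f}")
print(f"velocity needed for a swinging strike: {required_velocity:.1f} MPH")
```

The second result is just the regression equation solved for V, which shows how flat the velocity slope is relative to the distance between the limits.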

In the extremely competitive environment of Major League Baseball, where teams seek out the smallest advantage to give them an edge over their competitors, it makes sense for them to put a greater emphasis on velocity. It does have an influence on generating favorable outcomes for the pitcher. Therefore the trend in baseball is likely to continue and velocity is going to continue to increase.

A Proposal for Regression Analysis of a Four-Seam Fastball

by DFendlason

November 2, 2014

Hello, I am new to this, and this is my first post. I think I should introduce myself first. My name is Daniel Fendlason, and I am a first year graduate student at Tulane University, New Orleans, Louisiana, and I am studying Economics, which is very fun stuff. I did my undergraduate studies at Northeastern University, Boston, Massachusetts, which is where I majored in Finance and minored in Economics.

Ok, now on to the point of doing this in the first place. I am taking Econometrics this semester, and it requires a research paper on a topic we find interesting. Since I am interested in baseball, I decided to do my research paper on baseball. A proposal is due in a few days, and below is that proposal. Please read and tell me what you think. I will follow up and submit the full paper when it is due, which is in December. So, without further digressions, enjoy.

Proposed Title: “The effectiveness of the speed and movement of a four-seam fastball”

In my investigation, I would like to better understand the sport of baseball by answering the following questions: is it more difficult to hit a faster moving four-seam fastball than one that is slower moving? Also, is it more difficult to hit a four-seam fastball if it is moving in a more horizontal manner or a more vertical manner? My hypothesis is twofold: if a pitch is faster, it will be more difficult to hit, and if a pitch moves more, it will be more difficult to hit. A pitch counts as difficult to hit if a skilled batter swings and either misses, which results in a strike, or makes poor contact that results in an out when he puts the ball in play.

Independent Variables

A pitcher can throw many types of pitches. The pitcher can try to deceive the batter by throwing a pitch that has a lot of movement, like a curveball or slider, or a pitch that is slower than it looks when the ball leaves the pitcher’s hand, like a change-up. But the four-seam fastball is the only pitch with which the pitcher is not intentionally trying to deceive the hitter through movement or deception of speed. When a pitcher throws a four-seam fastball he is simply trying to throw it as hard, and as accurately, as he can.

Even though a pitcher is not trying to induce movement when he throws a four-seam fastball, the ball still moves. In fact, the ball can move horizontally, vertically, or both. This unintended movement affects the batter’s ability to make contact, which means that there will be three independent variables: speed, vertical movement plus horizontal movement, and total movement plus speed. Since there are three independent variables, three models will need to be created. This should not be difficult, as all that has to change is the variable on the right side of the equation; the dependent variables will remain the same for each model.

Dependent Variables

The dependent variables will be all of the possible per-pitch outcomes that involve the batter attempting to hit the pitch by swinging his bat; this excludes pitches that an umpire calls a strike or a ball. These two outcomes are excluded because the batter did not swing, so it cannot be discerned whether the speed or movement of the pitch had any effect on avoiding contact or inducing poor contact.

In addition, because the outcomes are per-pitch, the walks and strikeouts are excluded, because those outcomes are already accounted for. More specifically, if the batter walks, then he did not swing at the pitch and is therefore excluded. If the batter strikes out, then he swung and missed, which is accounted for with the swinging-strike outcome, or the umpire calls him out which is excluded, because the batter did not swing his bat.

The included outcomes are: swinging strike, foul ball, ground-out, infield fly-out, outfield fly-out, line-out, single, double, triple, and home run. I’ve included many types of outs, because each type of out can tell us what type of contact was made. For example, if the contact was poor, then the result will either be a ground-out or an infield fly-out. If the contact was solid, but the batter still made an out, then the result will be a line-out, or an outfield fly-out. If the contact did not result in an out, then it will be assumed that the contact was solid.

Error Term

The error term will include the sequencing of the previous pitches, the count, the base-out state, the location of the pitch, and the quality of the defense.

Each pitch will be context neutral; the pitches that preceded it will not be accounted for. This can affect the outcome of the pitch, because the absolute speed of the pitch may not matter as much if the previous pitches that a batter has seen in an at bat have been much slower than that of the four-seam fastball.

The count of the at bat can affect the outcome of the pitch, because batters know that, in some counts, pitchers are more likely to throw a four-seam fastball. In this case, the batter may be anticipating the four-seam fastball, which will give the batter an advantage. The base-out state can affect the outcome of the pitch, because it can dictate what pitch a pitcher is more likely to throw. The location can affect the outcome of the pitch, because some locations are more difficult for a batter to reach with his bat when he swings. The quality of the defense can affect the outcome of the pitch as well, because it can turn hits into outs, if the defense is good, or it can turn outs into hits, if the defense is poor.

Data

The data will be collected from www.baseballsavant.com. This website contains data on every pitch thrown from the seasons of 2008 to 2014. The website allows the user to apply filters, which means that the data can be filtered by pitch type, and pitch outcome.

The data will include every four-seam fastball that was thrown in seasons 2008 to 2014. Statistics for the fastballs will include speed, horizontal movement, vertical movement, and all outcomes except walks, strikeouts, called strikes, and balls. Since the outcomes are not numerical values, a numerical code will need to be assigned to each outcome. Table 1 illustrates the numerical code that will be used in this study.
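As an illustration of what assigning a numerical code to each outcome could look like, here is a small Python sketch. The codes and their ordering below are purely hypothetical (running from best for the hitter to best for the pitcher); the coding actually used in the study is the one given in Table 1.

```python
# HYPOTHETICAL coding for illustration only; the study's own codes are in Table 1.
OUTCOME_CODES = {
    "home run": 1,
    "triple": 2,
    "double": 3,
    "single": 4,
    "foul ball": 5,
    "line-out": 6,
    "outfield fly-out": 7,
    "ground-out": 8,
    "infield fly-out": 9,
    "swinging strike": 10,
}


def encode(outcomes):
    """Translate a list of outcome labels into their numerical codes."""
    return [OUTCOME_CODES[outcome] for outcome in outcomes]


sample = ["single", "foul ball", "swinging strike"]
codes = encode(sample)
print(codes)
```

An ordered coding like this is what allows the outcomes to be treated as ranked categories rather than unrelated labels.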

Each year’s worth of data contains approximately 50,000 lines. Given a sample this large, the initial assumption is that the data are normally distributed and linear. Since there are seven years of data, each model can be run seven different times. This should yield a less biased estimate of the coefficient for each independent variable.
