A Proposal for Regression Analysis of a Four-Seam Fastball

Hello, I am new to this, and this is my first post, so I should introduce myself first. My name is Daniel Fendlason, and I am a first-year graduate student at Tulane University in New Orleans, Louisiana, where I am studying Economics, which is very fun stuff. I did my undergraduate studies at Northeastern University in Boston, Massachusetts, where I majored in Finance and minored in Economics.

Ok, now on to the reason for doing this in the first place. I am taking Econometrics this semester, and it requires a research paper on a topic that we find interesting. Since I am interested in baseball, I decided to write my paper on baseball. A proposal is due in a few days, and that proposal is below. Please read it and tell me what you think. I will follow up and submit the full paper when it is due, which is in December. So, without further digressions, enjoy.

Proposed Title: “The effectiveness of the speed and movement of a four-seam fastball”

In my investigation, I would like to better understand the sport of baseball by answering the following questions: is a faster four-seam fastball more difficult to hit than a slower one? Also, is a four-seam fastball more difficult to hit if it moves more horizontally or more vertically? My hypothesis is twofold: if a pitch is faster, it will be more difficult to hit, and if a pitch moves more, it will be more difficult to hit. A pitch counts as difficult to hit if a skilled batter swings his bat and either does not make contact with the ball, which results in a strike, or makes poor contact, which results in an out when he puts the ball in play.

Independent Variables

A pitcher can throw many types of pitches. The pitcher can try to deceive the batter by throwing a pitch that has a lot of movement, like a curveball or slider, or a pitch that is slower than it looks when the ball leaves the pitcher’s hand, like a change-up. But the four-seam fastball is the only pitch with which the pitcher is not trying to deceive the hitter, either through movement or through a misleading speed. When a pitcher throws a four-seam fastball he is simply trying to throw it as hard, and as accurately, as he can.

Even though a pitcher is not trying to induce movement when he throws a four-seam fastball, the ball still moves. In fact, the ball can move horizontally, vertically, or both horizontally and vertically. This unintended movement has an effect on the batter's ability to make contact, which means that there will be three sets of independent variables: speed, vertical movement plus horizontal movement, and total movement plus speed. Since there are three sets of independent variables, three models will need to be created to analyze this situation. This should not be difficult, as all that has to change is the set of variables on the right side of the equation; the dependent variable on the left side will remain the same for each model.
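The three specifications could be sketched in Python roughly as follows. This is only an illustration under assumptions that are not part of the proposal: the sample pitches are made up, the dependent variable is a hypothetical 0/1 swing-and-miss indicator, total movement is assumed to be the straight-line combination of horizontal and vertical movement, and `ols` is a small helper written for this example rather than a library routine.

```python
import math

def ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] from the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda idx: abs(A[idx][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Made-up pitches: speed in mph, movement in inches, whiff = 1 for a swing and miss.
pitches = [
    {"speed": 91.0, "h_mov": -4.0, "v_mov": 8.0, "whiff": 0},
    {"speed": 93.5, "h_mov": -6.5, "v_mov": 9.5, "whiff": 0},
    {"speed": 95.0, "h_mov": -3.0, "v_mov": 10.5, "whiff": 1},
    {"speed": 96.5, "h_mov": -7.0, "v_mov": 11.0, "whiff": 1},
    {"speed": 92.0, "h_mov": -5.5, "v_mov": 7.0, "whiff": 0},
    {"speed": 97.0, "h_mov": -2.5, "v_mov": 9.0, "whiff": 1},
]

# The dependent variable is the same in every model (hypothetical choice).
y = [p["whiff"] for p in pitches]

# Only the regressors change from one model to the next.
designs = {
    "speed": [[p["speed"]] for p in pitches],
    "vertical_horizontal": [[p["v_mov"], p["h_mov"]] for p in pitches],
    "total_speed": [[math.hypot(p["h_mov"], p["v_mov"]), p["speed"]] for p in pitches],
}

fits = {name: ols(X, y) for name, X in designs.items()}
for name, coefs in fits.items():
    print(name, [round(c, 4) for c in coefs])
```

Each entry of `fits` holds the intercept followed by one coefficient per regressor, so the three models can be compared side by side.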

Dependent Variables

The dependent variables will be all of the possible per-pitch outcomes that involve the batter attempting to hit the pitch by swinging his bat; this excludes pitches that an umpire calls a strike or a ball. These two outcomes are excluded because the batter did not swing his bat, which means it cannot be discerned whether the speed or movement of the pitch had any effect on avoiding contact or inducing poor contact.

In addition, because the outcomes are per-pitch, walks and strikeouts are excluded, because those outcomes are already accounted for. More specifically, if the batter walks, then he did not swing at the final pitch, and it is therefore excluded. If the batter strikes out, then either he swung and missed, which is accounted for by the swinging-strike outcome, or the umpire called him out, which is excluded because the batter did not swing his bat.

The included outcomes are: swinging strike, foul ball, ground-out, infield fly-out, outfield fly-out, line-out, single, double, triple, and home run. I’ve included many types of outs, because each type of out can tell us what type of contact was made. For example, if the contact was poor, then the result will either be a ground-out or an infield fly-out. If the contact was solid, but the batter still made an out, then the result will be a line-out, or an outfield fly-out. If the contact did not result in an out, then it will be assumed that the contact was solid.
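The grouping described above can be written down as a small lookup. This is a sketch only: the function name and the label strings are my own for illustration, and foul balls are left in their own group because the rules above do not assign them a contact quality.

```python
# Outcome groups taken from the rules stated in the text above.
POOR_CONTACT = {"ground-out", "infield fly-out"}
SOLID_CONTACT_OUT = {"line-out", "outfield fly-out"}
HITS = {"single", "double", "triple", "home run"}

def contact_quality(outcome):
    """Map one included per-pitch outcome to a contact-quality label."""
    if outcome == "swinging strike":
        return "no contact"
    if outcome == "foul ball":
        return "foul"  # contact was made, but its quality is not classified
    if outcome in POOR_CONTACT:
        return "poor"
    if outcome in SOLID_CONTACT_OUT:
        return "solid (out)"
    if outcome in HITS:
        return "solid (hit)"
    # Called strikes, balls, walks and strikeouts are not in the study.
    raise ValueError(f"outcome not included in the study: {outcome!r}")

for example in ["ground-out", "line-out", "double", "swinging strike"]:
    print(example, "->", contact_quality(example))
```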

Error Term

The error term will include the sequencing of the previous pitches, the count, the base-out state, the location of the pitch, and the quality of the defense.

Each pitch will be treated as context neutral; the pitches that preceded it will not be accounted for. Sequencing can affect the outcome of the pitch, because the absolute speed of the pitch may matter less than its speed relative to the previous pitches the batter has seen in the at bat, for example when those pitches were much slower than the four-seam fastball.

The count of the at bat can affect the outcome of the pitch, because batters know that, in some counts, pitchers are more likely to throw a four-seam fastball. In this case, the batter may be anticipating the four-seam fastball, which will give the batter an advantage. The base-out state can affect the outcome of the pitch, because it can dictate what pitch a pitcher is more likely to throw. The location can affect the outcome of the pitch, because some locations are more difficult for a batter to reach with his bat when he swings. The quality of the defense can affect the outcome of the pitch as well, because it can turn hits into outs, if the defense is good, or it can turn outs into hits, if the defense is poor.

Data

The data will be collected from www.baseballsavant.com. This website contains data on every pitch thrown in the 2008 through 2014 seasons. The website allows the user to apply filters, which means that the data can be filtered by pitch type and pitch outcome.

The data will include every four-seam fastball that was thrown in seasons 2008 to 2014. Statistics for the fastballs will include speed, horizontal movement, vertical movement, and all outcomes except walks, strikeouts, called strikes, and balls. Since the outcomes are not numerical values, a numerical code will need to be assigned to each outcome. Table 1 illustrates the numerical code that will be used in this study.
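As a rough sketch of the filtering and coding step, the snippet below keeps only the included outcomes and attaches a numeric code to each. The codes here are placeholders that simply number the outcomes in the order listed earlier; they are not the codes from Table 1, and the row layout (a dict with an `outcome` key) is an assumption about how the downloaded data might be loaded.

```python
# The ten included outcomes, in the order they are listed in the proposal.
INCLUDED_OUTCOMES = [
    "swinging strike", "foul ball", "ground-out", "infield fly-out",
    "outfield fly-out", "line-out", "single", "double", "triple", "home run",
]

# Placeholder codes 1..10 (NOT the codes from Table 1).
OUTCOME_CODE = {name: i for i, name in enumerate(INCLUDED_OUTCOMES, start=1)}

def encode(rows):
    """Drop excluded outcomes and add an 'outcome_code' field to the rest."""
    return [
        dict(row, outcome_code=OUTCOME_CODE[row["outcome"]])
        for row in rows
        if row["outcome"] in OUTCOME_CODE
    ]

# Made-up rows for illustration only.
sample = [
    {"speed": 94.1, "outcome": "called strike"},  # excluded: no swing
    {"speed": 95.3, "outcome": "swinging strike"},
    {"speed": 92.8, "outcome": "ball"},           # excluded: no swing
    {"speed": 93.6, "outcome": "single"},
]
encoded = encode(sample)
print(encoded)
```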

Each season contains approximately 50,000 rows of data. Given a sample of this size, the initial assumption is that the data are normally distributed and that the relationships are linear. Since there are seven seasons of data, each model can be run seven times, once per season. Comparing the seven sets of estimates should give a more reliable coefficient for each independent variable.
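Running a model once per season could look something like the sketch below, which fits the simplest case (one outcome regressed on speed alone) separately for each season and reports how much the slope varies. The rows are made up, the 0/1 outcome is a hypothetical stand-in for whichever dependent variable is used, and `slope_by_season` is a helper name invented for this example.

```python
from collections import defaultdict

def slope_by_season(rows):
    """Return {season: OLS slope of outcome on speed} for (season, speed, outcome) rows."""
    grouped = defaultdict(list)
    for season, speed, outcome in rows:
        grouped[season].append((speed, outcome))

    slopes = {}
    for season, pairs in sorted(grouped.items()):
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        # Simple-regression slope: covariance of x and y over variance of x.
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
        slopes[season] = sxy / sxx
    return slopes

# Made-up rows: (season, speed in mph, hypothetical 0/1 outcome).
rows = [
    (2013, 91.0, 0), (2013, 94.0, 0), (2013, 97.0, 1),
    (2014, 90.5, 0), (2014, 93.0, 1), (2014, 96.0, 1),
]
season_slopes = slope_by_season(rows)
spread = max(season_slopes.values()) - min(season_slopes.values())
print(season_slopes, "spread:", round(spread, 4))
```

A small spread across seasons would suggest the estimate is stable; a large one would be a reason to look more closely at individual seasons.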





6 Comments
Matthias
9 years ago

Is it possible to filter for handedness of batter and pitcher? If so, a dummy variable could be used to indicate a same-handed at bat. That would also necessitate multiple linear regression. Is that on the table?

A. R.
9 years ago

It’s a little unclear to me what exactly your dependent variables are going to be. Are you creating separate regression models for each outcome, or just one model where the dependent variable is essentially “quality of contact?”

Peter Jensen
9 years ago

Daniel – The effectiveness of speed and movement on fastballs has been studied many times, so you are not going to be doing any groundbreaking work here. Given that, I would limit the number of variables in your study. Look just at the 3-ball, 2-strike count and just at right-handed pitchers and right-handed batters. Also, start your study with 2010 data instead of 2008, because there was much less consistency in the labeling of pitches in 2008 and 2009. I don’t know how Baseball Savant is classifying pitches. You may want to look at Dan Brooks’s site. Dan and his partner Harry Pavlidis use an improved pitch classification algorithm which I believe is consistent from year to year. They also may do some error correction to the raw PITCHf/x data.

I don’t know why you would think that non-swings are not affected by speed and movement. Increased speed gives the batter less time to decide whether to swing, which should increase the number of mistakes the batter makes. Increased movement should also create more batter errors in deciding whether to swing. Ideally, one would have HITf/x data on hit ball speed to judge the effectiveness of contact on struck balls. Since that is not available to you, using hit ball types (line drive, fly ball, pop up, ground ball) is probably a better option than using actual outcomes as you are proposing. If you determine league average run values for those four hit ball types, plus league average run values for strikes and balls, both swinging and called (but not foul balls, which would result in a do-over at 3 and 2), then you can just use runs as a single dependent variable.

Limiting to 3 and 2 means that the pitcher has a high motivation to throw a good pitch and the batter has a high motivation to swing at any good pitch. Limiting to right hander versus right hander means that the pitch is always going to move up and in and the only variables will be how much of each and speed.

Jonah Pemstein
9 years ago
Reply to Peter Jensen

The Brooks Baseball data isn’t available for mass download, unfortunately. Baseball Savant data is.

Thanks, Comcast
9 years ago

I imagine you’re not looking for any more batted ball outcomes, but I think batted ball direction (in the case of fly balls especially) is significant when investigating fastball velocity. In other words, a fly ball to the pull side has a significant positive effect on run expectancy (I fully expect to HTMLfail on this link, for the record), while those hit to the opposite field and center third do not.

Rob Mains
9 years ago

Hey Daniel, I’m sure this study is going to be a lot more entertaining for your prof than the umpteenth analysis of the impact of monetary policy on the broad economy (“Hey, maybe I can use this paper to land a job at the Fed!”) or the impact of various company or stock events on share prices (“Hey, maybe I can use this paper to land a job with a quant fund!”).

When I started reading this, the first thing I thought about was count–there’s a lot of difference between a four-seamer on a three-ball count and one when the pitcher’s ahead, so I’m glad you’re including that in your analysis. My other thought is that, as you suggest, defense has an impact on batted ball outcomes. Think of Game Six of the Series, in which the Royals blooped and bled the Giants to a 10-0 victory–not a lot of solid contact, but balls just went where fielders weren’t. Or Aoki’s screamer down the left field line that Juan Perez caught in Game Seven–the batter nailed the ball, but it became an out because of good positioning and a good catch. Basically, watch out for BABIP fluctuations. I’d include a lot more caveats regarding contact than I would regarding whiffs, foul balls, and home runs.

Good luck with this, I’ll look forward to your conclusions.