Comprehensive Contact Quality Model Using MLBAM Batted-Ball Data (Version 0.0)

Contact quality is a recurring sabermetric theme.  Much discussion over the last decade has centered around how we interpret Voros McCracken’s groundbreaking analysis, where he showed that the majority of variance in a pitcher’s ERA was driven by the rates at which he recorded strikeouts, walks, and home runs allowed.  This led to the conclusion by many that the batting average on balls in play (excluding homers) was largely outside of a pitcher’s control, and further research has probed the influence of team defense, home ballpark, and other outside factors on differences in BABIP.

Nevertheless, pitchers like Dallas Keuchel and Chris Young seem to have above-average success in “pitching to contact,” even after allowing for outside factors.  To better understand such outliers from the standard fielding-independent pitching model, I have developed a new bottom-up framework to analyze the quality of contact allowed, using the newly available batted-ball data from MLB Advanced Media (via Baseball Savant).  This model takes all batted balls (including homers) and calculates the expected run value based upon how hard the ball was hit (“exit velocity”) and the estimated angle at which it left the bat (“vertical angle”).  Alongside the contact quality model, I’ve developed a parallel model to estimate the defense-independent expected run value from batted-ball data (yes, contact quality and defense-independent run value are two different things).

Relationship to FIP

The key difference between the Comprehensive Contact Quality Model and FIP is the integration of expected home runs allowed into the analysis.   Various metrics such as xFIP have attempted to account for the volatility in HR% by normalizing this rate as a fixed percentage of fly balls allowed.  A different perspective is to treat home runs as one extreme in a broad spectrum of contact quality:

           Swinging strike < Foul tip < Weakly-hit fair ball < Well-hit fair ball

This spectrum ranks how well the hitter has “squared up” on the ball, with better-struck balls further to the right. Home runs can be considered a subset of well-hit fair balls, where the likelihood of actually becoming a four-bagger depends primarily upon the distance travelled, which itself is a function of exit velocity, vertical angle, and a host of other factors.   So, when we talk about a pitcher’s ability to limit the long ball, what we’re really talking about is his ability (if any) to prevent the ball from being hit hard at an optimum angle to leave the park.

With that brief introduction, let’s outline the framework for valuing the contact quality on any batted ball.  First, for balls hit in the air:

Step 1  – Estimate the Probability of a Home Run

For this first iteration of the model, I made the following simplifying assumptions:

  • Exactly 1/30 of all outfield fly balls are hit in each MLB ballpark
  • The direction of these balls is distributed 20% LF to LC, 30% LC to CF, 30% CF to RC, and 20% RC to RF
  • Outfield dimensions are as currently posted in Wikipedia

Since distance in the MLBAM data is measured to the assumed landing point, we also need to adjust for the height of the outfield wall.   To do this, I used Dr. Alan Nathan’s excellent trajectory calculator to estimate the complete distance traveled by a ball that is W feet above the ground when it passes over the outfield wall, where W is the height of the wall.   Note that this distance will be farther for line drives than for high flies, so the distance needed for a home run depends upon both the listed distance to the wall and the vertical angle of the batted ball.

[Caution – next section is somewhat technical; you can safely skip and not miss the gist of this article]

One problem with the MLBAM data found on Baseball Savant is that batted-ball angles are only available for home runs.  For other batted balls, we can use the fact that we have both the batted-ball velocity and distance to back-solve for the vertical angle:

1.  Make grid of distance = f(exit vel, angle), using the default settings in Dr. Nathan’s trajectory calculator:

(Key values shown below – distance in feet; columns are vertical angle in degrees, rows are exit velocity in MPH)

MPH      0    5   10   15   20   25   30   35   40   50   60
 60     49   79  111  138  159  173  182  186  185  169  137
 65     54   91  129  159  182  198  207  210  208  188  152
 70     60  105  148  182  207  223  232  234  231  208  166
 75     66  120  169  207  233  249  258  259  254  227  180
 80     72  136  192  233  260  276  284  284  277  246  194
 85     79  155  217  260  288  304  311  309  301  265  207
 90     87  175  244  289  317  332  338  334  324  283  220
 95     95  198  272  318  346  361  365  360  347  302  233
100    105  223  302  349  376  389  392  385  370  320  245
105    115  249  332  380  406  418  419  410  393  338  256
110    127  277  363  411  436  446  445  434  415  355  268
115    141  307  394  442  466  474  471  458  437  371  278
120    156  338  426  472  495  502  497  482  458  387  288

2.  Distance peaks at a certain “optimal” vertical angle then decreases.  This means that there are 2 possible solutions for the vertical angle when doing a lookup based upon distance and exit velocity.  Lacking any other information, I used the batted-ball type recorded by the Baseball Scoresheet stringers to guide which value to use:

Line drives (LD) use the lower of the two angles, pop-ups (PU) use the higher of the two, and fly balls (FB) use the mean of the two.

This becomes our estimate of vertical angle on the batted ball.
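The two-solution lookup can be sketched in a few lines of Python. This is only an illustration, not the code behind the model: it uses just the 90 MPH row of the grid above, interpolates linearly between the tabulated angles (an assumption, since the article does not say how values between grid points were handled), and the function names are hypothetical.

```python
# Vertical angles (degrees) and the 90 MPH row of the distance grid (feet).
ANGLES = [0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60]
DIST_90_MPH = [87, 175, 244, 289, 317, 332, 338, 334, 324, 283, 220]


def candidate_angles(distance, angles=ANGLES, dists=DIST_90_MPH):
    """Angles where the piecewise-linear distance curve equals `distance`."""
    hits = []
    for i in range(len(angles) - 1):
        d0, d1 = dists[i], dists[i + 1]
        # Skip flat segments and segments that do not bracket the distance.
        if d0 == d1 or not (min(d0, d1) <= distance <= max(d0, d1)):
            continue
        frac = (distance - d0) / (d1 - d0)
        angle = angles[i] + frac * (angles[i + 1] - angles[i])
        # A value sitting exactly on a grid point matches two segments.
        if not hits or abs(angle - hits[-1]) > 1e-9:
            hits.append(angle)
    return hits


def estimate_angle(distance, batted_ball_type):
    """Pick an angle using the stringer's batted-ball type (LD, FB or PU)."""
    hits = candidate_angles(distance)
    if not hits:
        raise ValueError("distance is outside the range of this grid row")
    low, high = hits[0], hits[-1]
    if batted_ball_type == "LD":
        return low
    if batted_ball_type == "PU":
        return high
    if batted_ball_type == "FB":
        return (low + high) / 2
    raise ValueError("expected batted_ball_type of 'LD', 'FB' or 'PU'")
```

For a 90 MPH ball that traveled 283 feet, this gives roughly 14 degrees for a liner, 50 degrees for a pop-up, and 32 degrees for a fly ball.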

[End of technical note]

Now, for each of the 30 MLB ballparks, we can use the combination of distance and vertical angle to estimate the probability of a homer, given the directional mix assumed above (note – version 0.0 of this model does not reflect the actual direction of each batted ball).  After averaging across all ballparks, we get a grid of home run probabilities for any outfield fly ball (rows are distance in feet, columns are vertical angle in degrees):

Dist       0      5     10     15     20     25     30     35     40     50     60  Actual
300     0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%    0.0%
310     0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.1%    0.0%
320     0.0%   0.0%   0.0%   0.0%   0.1%   0.1%   0.1%   0.1%   0.2%   0.2%   0.3%    0.1%
330     0.0%   0.0%   0.1%   0.2%   0.3%   0.4%   0.5%   0.7%   1.0%   1.4%   1.7%    0.6%
340     0.0%   0.0%   0.2%   0.6%   1.3%   2.5%   3.9%   5.0%   6.0%   7.4%   8.2%    1.7%
350     0.0%   0.0%   0.9%   4.4%   8.1%  11.1%  13.2%  14.7%  15.9%  17.6%  18.5%    3.2%
360     0.0%   0.1%   6.0%  13.4%  18.7%  22.1%  24.3%  25.9%  27.1%  28.8%  29.8%   11.1%
370     0.0%   0.4%  15.3%  24.0%  30.4%  34.0%  36.3%  38.0%  39.2%  41.0%  42.0%   15.3%
380     0.0%   2.7%  25.7%  36.8%  43.0%  46.3%  48.3%  49.9%  51.0%  52.8%  53.8%   29.1%
390     0.0%  10.0%  36.1%  50.0%  55.2%  58.7%  61.3%  63.2%  64.8%  67.0%  68.3%   40.7%
400     0.0%  19.0%  47.8%  63.5%  70.1%  74.5%  77.5%  79.8%  81.5%  84.0%  85.3%   54.1%
410     0.0%  28.1%  61.8%  80.1%  86.6%  90.0%  91.7%  92.9%  93.7%  94.8%  95.4%   76.3%
420     0.0%  37.6%  78.0%  92.2%  95.2%  96.4%  97.0%  97.5%  97.9%  98.2%  98.3%   90.3%
430     0.0%  50.1%  88.9%  96.6%  97.7%  98.3%  98.7%  98.8%  98.9%  99.0%  99.1%   94.8%
440     0.0%  64.6%  93.5%  98.0%  99.0%  99.4%  99.5%  99.6%  99.7%  99.7%  99.8%   96.7%
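The averaging behind a grid like this can be sketched as follows. Everything specific here is an assumption for illustration: the two parks and their fence distances are made up (the model uses all 30 parks with their posted dimensions), each direction zone is reduced to a single fence distance, and extra_carry is a hypothetical stand-in for the wall-height adjustment, which in the model depends on the vertical angle.

```python
# Share of outfield flies by direction zone, from the assumptions in Step 1:
# LF to LC, LC to CF, CF to RC, RC to RF.
ZONE_WEIGHTS = [0.20, 0.30, 0.30, 0.20]

# MADE-UP fence distances (feet) per zone for two illustrative parks.
EXAMPLE_PARKS = [
    [350, 390, 390, 350],
    [370, 410, 410, 370],
]


def hr_probability(distance, parks=EXAMPLE_PARKS, extra_carry=0.0):
    """Average home run chance with equal weight on every park.

    extra_carry is the additional landing distance (feet) needed to clear
    the wall; it is a hypothetical constant in this sketch.
    """
    total = 0.0
    for zone_fences in parks:
        for weight, fence in zip(ZONE_WEIGHTS, zone_fences):
            if distance >= fence + extra_carry:
                total += weight
    return total / len(parks)
```

With these made-up parks, a 360-foot fly clears only the corner zones of the smaller park, so its home run probability is the average of 40% and 0%.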

Step 2 – Estimate BABIP if Not a Home Run

One big benefit of hitting the ball over the fence is that there is virtually no chance of making an out.  For balls hit in the air to the outfield, however, there are typically three guys whose goal is to catch the ball and get the batter out.  Now while a little bit of extra loft on a hard-hit OF fly can improve the chance of a dinger, for balls that stay in play the relationship between BABIP and vertical angle is essentially linear (using first-half 2015 data):

    BABIP if hit in the air to OF = .9698 – .0256 * MIN(37.5, angle)

We will use this in conjunction with the next step to determine the run value of a non-homer fly/popup/liner.
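As a quick check on the formula, here it is as a small Python function. The function name is hypothetical; the coefficients and the 37.5 degree cap are taken directly from the equation above.

```python
def babip_air_to_of(angle):
    """Estimated BABIP for a non-HR ball hit in the air to the outfield.

    angle is the vertical angle in degrees; the estimate stops falling
    once the angle reaches 37.5 degrees.
    """
    return 0.9698 - 0.0256 * min(37.5, angle)
```

At 37.5 degrees and above the estimate bottoms out just under .010, so the cap keeps the value from going negative on very high flies.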

Step 3 – Estimate Expected Run Value If A Hit (Non-HR)

For balls not caught by the outfielder, the chances for an extra-base hit vary by vertical angle and also increase for higher exit velocities.  Regressing the first-half 2015 data (using hits to the outfield only) results in this estimate:

RV if hit to OF =  -1.06 + 0.0206*velocity – 0.00006*velocity^2 + 0.0223*angle – 0.000318*angle^2

We can now calculate the contact-quality run value as:

     CQRV = (1.38 x HR Probability) + (RV if hit to OF x (1 – HR Probability))
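The two equations above translate directly into code. This sketch follows the formulas exactly as printed; the function names are hypothetical, the home run probability is passed in rather than looked up from the grid, and because the CQRV equation as written does not show where the Step 2 BABIP enters, that term is left out here.

```python
HR_RUN_VALUE = 1.38  # run value of a home run used in the CQRV equation


def rv_if_hit_to_of(velocity, angle):
    """Regression estimate of run value for a non-HR hit to the outfield.

    velocity is exit velocity in MPH, angle is vertical angle in degrees.
    """
    return (-1.06
            + 0.0206 * velocity - 0.00006 * velocity ** 2
            + 0.0223 * angle - 0.000318 * angle ** 2)


def cqrv_air(velocity, angle, hr_probability):
    """Contact-quality run value for a ball hit in the air."""
    return (HR_RUN_VALUE * hr_probability
            + rv_if_hit_to_of(velocity, angle) * (1 - hr_probability))
```

For example, cqrv_air(100, 25, 0.5) blends the 1.38 home run value with the in-play estimate for a 100 MPH, 25 degree ball in equal parts.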

Contact Quality Run Values for Ground Balls

For ground balls, the expected run value increases with increasing exit velocity.  We can estimate the CQRV directly from the following regression equation:

CQRV = 0.35-0.0174*velocity+0.00014*velocity^2, if velocity > 65; else CQRV = -0.19

Note that the expected run value is set to -0.19 for velocity less than 65 MPH.  This is because the run expectancy actually improves for grounders hit at a very low speed (basically dribblers and slow rollers).  Because this is a model of contact quality, we are not going to penalize the pitcher for poor batted-ball luck when the actual quality of contact is low.
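The ground-ball rule is a quadratic with a floor, which is easy to express in code. The function name is hypothetical; the coefficients, the 65 MPH threshold, and the -0.19 floor come from the equation above.

```python
def cqrv_ground(velocity):
    """Contact-quality run value for a ground ball (velocity in MPH)."""
    if velocity > 65:
        return 0.35 - 0.0174 * velocity + 0.00014 * velocity ** 2
    # Dribblers and slow rollers are held at the floor value.
    return -0.19
```

The quadratic evaluates to about -0.19 at 65 MPH, so the floor joins the curve without a jump.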

This leads us to a discussion of the last key feature of the model….

Contact Quality vs. Expected Batted-Ball Result

The CQ model is designed to produce higher run values for better quality of contact.   However, as discussed in Tony Blengino’s enlightening series on batted-ball outcomes, real-life BABIP doesn’t improve continuously with higher batted-ball velocity, but instead actually decreases over the stretch between balls hit relatively shallow and balls hit to the deeper parts of the outfield.  The CQ model calculates BABIP as a function of vertical angle in order to avoid rewarding pitchers for the better-struck balls that fall into the “donut hole” near the depths where outfielders normally position themselves.

I chose vertical angle to model BABIP for the CQ framework because of its close relationship to hang time, which in turn is a key component of the likelihood of the outfielder making the putout.  In reality, batted-ball location also plays an important role in determining whether a fielder can range into position to catch the ball.  To model this more realistic BABIP, I estimated what proportion of balls hit a certain distance would be reachable by one of the three outfielders, given a certain amount of hang time (note – hang time can be estimated by Dr. Nathan’s trajectory calculator based upon exit velocity and vertical angle).    For example, an arc 320 feet from home plate is roughly 502 feet long from foul line to foul line.   If we assume that each outfielder can cover 52 feet in 3.0 seconds, then we can draw a circle with a 52 foot radius from each fielder’s initial position and estimate the overlap between the arc and these circles to be about 237 feet.  So we assign a 47% chance (237 divided by 502) of catching a fly ball hit 320 feet with a 3.0 second hang time.  If we increase the hang time to 4.0 seconds, the coverage circles now have an 87 foot radius, and 479 feet of the arc are covered, for a 95% chance of an out.
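The geometry of that example can be sketched in Python. The fielder starting positions below are hypothetical (the article does not list the ones it used), so the coverage fractions this produces will not match the 47% and 95% figures above; the point is only to show how the overlap between an arc and three coverage circles can be computed.

```python
import math

# HYPOTHETICAL starting positions: (feet from home plate, degrees from the
# left-field line). These are assumptions, not the article's values.
FIELDERS = [(300.0, 20.0), (320.0, 45.0), (300.0, 70.0)]
FAIR_ANGLE = math.pi / 2  # the foul lines are 90 degrees apart


def arc_length(distance):
    """Length in feet of the fair-territory arc at a given distance."""
    return distance * FAIR_ANGLE


def covered_fraction(distance, reach, fielders=FIELDERS):
    """Fraction of the arc lying within `reach` feet of any fielder."""
    intervals = []
    for depth, theta_deg in fielders:
        theta = math.radians(theta_deg)
        # Law of cosines: angular half-width of the arc inside the circle.
        cos_half = (distance ** 2 + depth ** 2 - reach ** 2) / (2 * distance * depth)
        if cos_half > 1:
            continue  # this fielder cannot reach the arc at all
        half = math.acos(max(cos_half, -1.0))
        lo, hi = max(0.0, theta - half), min(FAIR_ANGLE, theta + half)
        if hi > lo:
            intervals.append((lo, hi))
    # Merge overlapping intervals so shared ground is counted once.
    covered, end = 0.0, 0.0
    for lo, hi in sorted(intervals):
        lo = max(lo, end)
        if hi > lo:
            covered += hi - lo
            end = hi
    return covered / FAIR_ANGLE
```

arc_length(320) reproduces the roughly 502-foot arc from the example, and covered_fraction(320, 87) comes out larger than covered_fraction(320, 52), which is the hang-time effect the table below captures.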

Here is how the more realistic BABIP varies based upon both batted-ball distance and hang time (rows are distance in feet, columns are hang time in seconds).  Note the “donut hole” for balls hit around 300 feet with hang times in the neighborhood of 4 seconds.

           1.0            1.5            2.0            2.5            3.0            3.5            4.0            4.5            5.0
200    1.000    1.000    1.000    1.000    1.000    1.000    0.711    0.400    0.005
210    1.000    1.000    1.000    1.000    1.000    0.925    0.589    0.318          –
220    1.000    1.000    1.000    1.000    1.000    0.761    0.523    0.217          –
230    1.000    1.000    1.000    1.000    0.889    0.666    0.485    0.161          –
240    1.000    1.000    1.000    0.960    0.772    0.618    0.353    0.136          –
250    1.000    1.000    0.971    0.828    0.696    0.528    0.254    0.061          –
260    1.000    0.932    0.857    0.757    0.646    0.403    0.180          –          –
270    0.919    0.863    0.802    0.717    0.555    0.314    0.120          –          –
280    0.886    0.838    0.783    0.678    0.468    0.258    0.073          –          –
290    0.884    0.834    0.762    0.598    0.419    0.217    0.035          –          –
300    0.918    0.823    0.721    0.579    0.413    0.218    0.038          –          –
310    0.956    0.853    0.741    0.588    0.414    0.211    0.020          –          –
320    0.941    0.916    0.857    0.663    0.470    0.263    0.059          –          –
330    0.943    0.919    0.891    0.807    0.556    0.330    0.104          –          –
340    0.962    0.936    0.908    0.869    0.714    0.444    0.205    0.029          –
350    1.000    0.967    0.931    0.883    0.830    0.576    0.315    0.118          –
360    1.000    1.000    0.985    0.911    0.843    0.726    0.434    0.212    0.043
370    1.000    1.000    1.000    0.977    0.870    0.783    0.559    0.317    0.144
380    1.000    1.000    1.000    1.000    0.933    0.799    0.691    0.428    0.248
390    1.000    1.000    1.000    1.000    1.000    0.856    0.712    0.525    0.339
400    1.000    1.000    1.000    1.000    1.000    0.956    0.759    0.603    0.420
410    1.000    1.000    1.000    1.000    1.000    1.000    0.866    0.716    0.487
420    1.000    1.000    1.000    1.000    1.000    1.000    1.000    0.749    0.574
430    1.000    1.000    1.000    1.000    1.000    1.000    1.000    0.827    0.637
440    1.000    1.000    1.000    1.000    1.000    1.000    1.000    0.944    0.704

This neatly explains why fly balls hit at 85 MPH often result in an out, while line drives hit that hard are most often base hits.

(85 MPH exit velocity)
Angle            0        5       10       15       20       25       30
Distance        79      155      217      260      288      304      311
Hang Time      0.7      1.4      2.2      2.9      3.5      4.1      4.5
BABIP                          1.000    0.669    0.229    0.032        –

If we substitute the hang-time based BABIP for the vertical-angle based BABIP used in the CQ model, we obtain a batted-ball-data expected run value that is more realistic and truly fielder-independent.  Unfortunately, this metric (let’s call it BBRV) doesn’t do as well as CQRV in measuring the actual quality of contact, since it rewards a pitcher allowing an 85 MPH/25 degree fly (.032 expected BABIP) more than a pitcher who gives up a 75 MPH/25 degree bloop (.537 expected BABIP).

In short, we can see that fielding-independent pitching consists of two parts:  contact quality allowed, and batted-ball luck.

Some Actual Results…

Well, with all that said, what does CQRV version 0.0 tell us about pitchers so far in 2015?

First, let’s look at the actual run expectancy above average allowed on batted balls (using linear weights).  Here are the top 5 and bottom 5 through the first half of 2015, with parentheses marking negative values (runs better than average):

Sonny Gray         (18.9)
Zack Greinke         (18.8)
Dallas Keuchel          (16.3)
Jacob deGrom          (12.1)
Chris Young          (11.5)
Ian Kennedy            20.4
CC Sabathia            21.5
Kyle Lohse            21.7
Kyle Kendrick            22.2
James Shields            23.0

No real surprises for those who’ve followed this year’s FIP/BABIP outliers (though Greinke’s never been this successful on batted balls – maybe he’s the guy who’s heisted Kyle Lohse’s secret formula for contact management.)

Now, let’s look at CQRV:

Pitcher               CQRV    Expected Run Value    Actual Run Value
Sonny Gray           (8.8)                 (9.2)              (18.9)
Brad Ziegler         (7.1)                 (6.5)              (11.2)
Clayton Kershaw      (6.6)                 (3.2)                7.3
Brandon Maurer       (6.4)                 (6.1)              (10.0)
Alex Wilson          (6.0)                 (6.9)               (3.3)
Kyle Lohse            13.7                  12.9                21.7
Jerome Williams       14.3                  16.0                18.7
Phil Hughes           16.9                  16.4                17.5
Josh Collmenter       17.6                  18.6                14.0
Kyle Kendrick         23.3                  22.5                22.2
The only mildly interesting name in the bottom five is Phil Hughes, who has returned to allowing a high HR% after conquering the gopher ball in 2014.  In the top five, we see saber-fave Brad Ziegler, whose ridiculous .177 BABIP/0.45 HR/9 combo is driven far more by low contact quality than by batted ball/defensive luck.  We also see two very surprising names at #4 and #5.   Brandon Maurer has allowed a .238 BABIP along with just 1 HR in 44 innings, thanks to a career high 27% soft hit percentage alongside a career low 21% hard hit percentage.  Alex Wilson has likewise improved his contact management numbers (25% soft hit/21% hard hit) to drive a .270 BABIP with just 2 longballs allowed.

Finally, it’s interesting to note Clayton Kershaw’s numbers.  Despite having a BABIP north of .300 for the first time since his rookie season, Kershaw has been well above average in terms of stifling contact quality.  But, between having fewer fly balls than average dying in the outfield “donut holes” (3 runs) and other batted-ball/defensive factors (10 runs), Kershaw has been a few runs worse than average on balls in play. (Not that he needs any help to remain brilliant).
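The run figures quoted for Kershaw fall out of the three columns in the table. A small sketch of that arithmetic, with a hypothetical function name and the assumption that parenthesized table values are negative:

```python
def decompose(cqrv, expected_rv, actual_rv):
    """Split the gap between CQRV and actual run value into two parts."""
    return {
        # Expected (hang-time based) value minus contact-quality value.
        "donut_hole": round(expected_rv - cqrv, 1),
        # Actual results minus the expected value: defense, park, luck.
        "other_factors": round(actual_rv - expected_rv, 1),
    }


# Kershaw's row: CQRV (6.6), Expected Run Value (3.2), Actual Run Value 7.3
kershaw = decompose(-6.6, -3.2, 7.3)
```

Here the two pieces come to 3.4 and 10.5 runs, which the text rounds to 3 and 10.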

Conclusion

I have chosen to call this version 0.0 of the CCQM framework because in essence this is as much a “proof of concept” as a potential tool.   Two key areas will require continuous research and review to fully power up this model.

First, the raw data used to develop the model is new and evolving.  As more MLBAM data becomes publicly available, there will be a more robust historical track record of the fundamental physical stats behind every play made, which will improve the reliability of the model.

Second, the framework itself needs to be tested further to make sure that any variables that truly affect contact quality are considered.  For example, I consciously chose to not include batted-ball direction as a factor for this first version of the model in order to avoid extra complexity.  In effect, this was equivalent to a null hypothesis that pitchers cannot influence batted-ball direction.  It would be foolish not to test the validity of this assumption for future iterations of the model to see if there are pitchers who consistently show the ability to improve their performance by influencing the batted-ball direction, all other factors being equal.

My hope is that the CCQM sparks a fresh round of discussions on the whole notion of contact quality, leveraging the new generation of metrics at our disposal.





tz posts stuff in Fangraphs comments section from time to time. He has taken to putting his TLDR posts into the Community Research section to spare other commenters.

12 Comments
evo34
8 years ago

Great stuff.

Ryan
8 years ago

Awesome job

redsman
8 years ago

Really cool. I look forward to version 1.0, or maybe 0.1?

Alan Nathan
8 years ago

I am puzzled how you arrived at

“BABIP if hit in the air to OF = .9698 – .0256 * MIN(37.5, angle)”

Do you have non-HR angle data from which you derived the relationship? Somehow basing BABIP only on angle and not also on exit speed doesn’t seem right to me. I am interested in what you have to say about that.

tz
8 years ago
Reply to  Alan Nathan

For the non-HR angle data, I made an estimate based upon the exit speed, distance traveled, and the batted-ball type classification in the data, using your trajectory calculator to develop the relationships. For example, a ball with a 90 MPH exit velocity that traveled 283 feet would have to be hit at either a 14 degree or 50 degree angle, based upon the trajectory calculator (using the default settings for all the other parameters). If the stringer classified the ball as a pop-up, I used 50 degrees for the estimated angle, if it was classified as a line drive I used 14 degrees for the angle, and if it was classified as a fly I used the mean of the two (32 degrees). I then used these estimates on the current 2015 batted-ball data to derive the relationships in the model.

I fully agree that BABIP does depend on both angle and exit speed, and I basically used both of these to calculate the expected BABIP for my model of expected batted-ball run value (BBRV). However, for the contact-quality model only, I wanted the value to be strictly increasing for increasing exit velocity, which of course doesn’t happen in real life (adding a few MPH to a Texas Leaguer turns it into a routine OF flyout).

So, I decided to build the contact-quality value from two components which individually satisfy that criterion:

1. The expected run value, conditional on the ball being a hit. In essence, this is the run value you’d have if catching the ball in the air were no longer an out. This value would be higher the longer it takes for the outfielder to field the ball and throw it back to the infield, which in turn is higher the farther the ball is hit and the longer it is in the air.

2. A BABIP based upon vertical angle only. By not including exit velocity for this component, we are effectively smoothing out the impact of the “donut holes” in the graph of BABIP vs. distance and hang-time that we have in the BBRV (realistic expected run value) model.

gwosdz69
8 years ago

Glad I took the time to read this. Nice job.

Umpire Weekend
8 years ago

I didn’t see any new version of FIP in your results. Is there any way you can compare this with FIP?

tz
8 years ago
Reply to  Umpire Weekend

It shouldn’t be too difficult to do. The formula for FIP implicitly assumes the same run value for any non-HR ball in play, so you’d have to replace the sum of that and the run value of the HRs allowed with the CQRV.

I can add that to the next list. It would be interesting to see how the CQRV-based version of FIP compares to FIP and xFIP.

Alan Nathan
8 years ago

I will have to think long and hard about your article and response to my comment. Regarding BABIP, I would take a different approach. Given the available information (exit speed and distance), I would set up a matrix much like you have done with distance and time, and populate each cell with the actual league-average BABIP. In effect, distance is a surrogate for vertical launch angle (as you have argued). I would then apply some smoothing algorithm (e.g., loess), which then gets us a smooth 2D distribution of BABIP as a function of exit speed and distance. Then I would start to investigate individual pitchers to see how their actual BABIP (averaged over all pitches) compares with expected BABIP based on the smooth 2D function. The same thing can be done for individual batters.

Perhaps some variation of that is what you did, but I will have to read your article more carefully to figure that out. I prefer using measured quantities (exit speed and distance) rather than calculated quantities (angle and hang time). I think the latter is what you did.

Let me not end this comment w/o mentioning that I very much like the type of analysis you have done. I want to understand it better, but that should not detract from the important contribution you have made. Nice work!

tz
8 years ago
Reply to  Alan Nathan

I did use the calculated angle and hang time to develop the various components of the model, in effect using them as proxies for data which will hopefully become available soon, and because they better describe the theory behind the bottom-up framework.

But I haven’t been fully comfortable using the estimated values because of the potential range of error within them. For example, I played around with the backspin parameter in your trajectory calculator and saw very substantial differences if I moved to say 1000 or 2000 rpm*. So I have to agree that using the measured quantities is a more solid foundation for estimating the model’s parameters.

Thanks for your suggestions. I’ll look into some of the options for smoothing to develop the contact-quality metric.

Alan Nathan
8 years ago

tz: I am embarrassed to admit that I don’t know who you are. Please send me an e-mail so we can start up a dialogue (a-nathan at illinois dot edu). Thx.

sanderbubble
8 years ago

You mean James Shields at the bottom of your list isn’t a surprise? I know he hasn’t been great this year but I never expected him to be last.