Predicting Pitcher Breakouts from Small Sample Sizes
Most FanGraphs readers know that even the fastest-stabilizing statistics take almost a quarter of a season to mean anything. With the availability of PITCHf/x data, we can look at individual pitch data, which can give us hundreds of data points for an individual pitcher just from one start. Instead of waiting until near the All-Star break to see if Aaron Sanchez has really made a leap forward or if the league has adjusted to Dallas Keuchel, we can use statistics that stabilize quickly (both “approach” stats and “results” stats) to guide these decisions.
The “results” stats that I used are:
- Zone Contact%
- Zone Whiff%
- Zone Take%
- Out-of-Zone Contact%
- Out-of-Zone Whiff%
- Out-of-Zone Take%
- First Pitch Strike%
First, I used a regression model to create a formula that used only these statistics to produce an expected ERA (or SIERA, actually, as I wanted to filter out any BABIP and HR/FB luck).
The formula ended up as: -3.11 + (12.48 * Z-Con%) + (3.08 * Z-Take%) + (11.96 * O-Con%) – (14.19 * O-Whiff%) + (13.06 * O-Take%) – (3.46 * F-Strike%)
Using 2015 data (and only pitchers who threw more than 1,500 pitches), I get an r-squared of 0.68. I’m going to call this statistic “PD-SIERA” since it uses only plate-discipline data to produce an expected SIERA.
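For anyone who wants to plug numbers into the formula above, here is a minimal sketch. The function name `pd_siera`, the dictionary keys, and the assumption that every input is a fraction (0.25 rather than 25) are mine, not part of the original write-up. Zone Whiff% is in the list of stats but does not appear in the formula, so the sketch leaves it out as well.

```python
# Coefficients copied from the PD-SIERA formula in the article.
# ASSUMPTION: each rate is passed as a fraction (e.g. 0.25 for 25%).
INTERCEPT = -3.11
COEFFICIENTS = {
    "z_con": 12.48,     # Zone Contact%
    "z_take": 3.08,     # Zone Take%
    "o_con": 11.96,     # Out-of-Zone Contact%
    "o_whiff": -14.19,  # Out-of-Zone Whiff%
    "o_take": 13.06,    # Out-of-Zone Take%
    "f_strike": -3.46,  # First Pitch Strike%
}


def pd_siera(stats):
    """Return the expected SIERA for a dict of plate-discipline rates."""
    return INTERCEPT + sum(coef * stats[name] for name, coef in COEFFICIENTS.items())


# Hypothetical input, made up purely to show the call; not a real pitcher.
example = {
    "z_con": 0.25, "z_take": 0.17, "o_con": 0.10,
    "o_whiff": 0.06, "o_take": 0.38, "f_strike": 0.61,
}
print(round(pd_siera(example), 2))
```

Keeping the coefficients in one dictionary makes it easy to swap in a refit of the regression without touching the function.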
The PD-SIERA leaders for 2015 were:
- Clayton Kershaw, 2.47
- Chris Sale, 2.75
- Max Scherzer, 2.78
- Carlos Carrasco, 2.78
- Chris Archer, 2.92
The r-squared is good enough, and those names pass the sniff test, so I’m pretty comfortable that this produces a good approximation of pitcher performance.
I will use this to calculate a Results Change% = (year2_PD-SIERA - year1_PD-SIERA) / (year2_PD-SIERA), where year2 is the previous season and year1 is the current sample. For example, Drew Smyly had a 3.73 PD-SIERA in 2014 (year2) and a 2.33 PD-SIERA in April of 2015 (year1). The calculation would then be: (3.73 - 2.33) / (3.73) = +37.5%
[This number can be positive or negative: a positive value means PD-SIERA dropped (better results), and a negative value means it rose (worse results)]
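The Smyly arithmetic can be reproduced with a short sketch. The function name `results_change` and its argument names are hypothetical. It divides by the previous season's PD-SIERA, which is what the worked example does.

```python
def results_change(prev_pd_siera, curr_pd_siera):
    """Fractional change in PD-SIERA; positive means PD-SIERA went down."""
    return (prev_pd_siera - curr_pd_siera) / prev_pd_siera


# Drew Smyly: 3.73 PD-SIERA in 2014, 2.33 in April 2015 (figures from the article).
smyly = results_change(3.73, 2.33)
print(f"{smyly:+.1%}")  # prints +37.5%
```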
Now, just looking at the plate discipline statistics isn’t enough. We need to see if there was a reason for a pitcher to have a better or worse PD-SIERA than he had the previous year. PITCHf/x to the rescue again, as we can look at what I will call “approach” stats: a pitcher’s pitch mix and velocity. Since these are things almost completely under the pitcher’s control, they should stabilize quickly.
In order to calculate a pitcher’s “Approach Change%,” I calculate the change in his pitch mix + the percentage of velocity change from the previous year. An example of the calculation is below:
- Drew Smyly, 2014 (full): 89.9 mph, 51.9% FB, 15.9% CT, 28.5% CB, 3.8% CH
- Drew Smyly, 2015 (April): 90.2 mph, 46.4% FB, 30.1% CT, 23.5% CB, 0.0% CH
Velocity change = (year1_velo – year2_velo)/(year2_velo) = (90.2-89.9)/89.9 = 0.3%
[If this value ended up negative, we would use the absolute value, as we are only interested in the amount of change, not positive/negative change]
Pitch Mix change = -5.5% FB, +14.2% CT, -5.0% CB, -3.8% CH = (take the absolute value of all of these changes and then divide by two) = (28.5%) / 2 = 14.3%
[Dividing by two makes sure that each percentage change is only counted once: a 1% increase in FB% combined with a 1% decrease in CH% equals only a 1% change in pitch mix]
Approach Change% = Velocity change + Pitch mix change = 14.3% + 0.3% = 14.6%
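The same steps can be sketched in code, using Smyly's numbers from the example. The function names and the choice to store the pitch mix as fractions in a dictionary are my own assumptions for illustration. A pitch type missing from one year is treated as 0% usage.

```python
def velocity_change(prev_velo, curr_velo):
    """Absolute velocity change as a fraction of the previous velocity."""
    return abs(curr_velo - prev_velo) / prev_velo


def pitch_mix_change(prev_mix, curr_mix):
    """Half the sum of absolute usage changes, so each shift is counted once."""
    pitches = set(prev_mix) | set(curr_mix)
    total = sum(abs(curr_mix.get(p, 0.0) - prev_mix.get(p, 0.0)) for p in pitches)
    return total / 2


def approach_change(prev_velo, curr_velo, prev_mix, curr_mix):
    """Velocity change plus pitch mix change, both as fractions."""
    return velocity_change(prev_velo, curr_velo) + pitch_mix_change(prev_mix, curr_mix)


# Drew Smyly, 2014 (full season) vs. 2015 (April), as listed in the article.
mix_2014 = {"FB": 0.519, "CT": 0.159, "CB": 0.285, "CH": 0.038}
mix_2015 = {"FB": 0.464, "CT": 0.301, "CB": 0.235, "CH": 0.000}

smyly = approach_change(89.9, 90.2, mix_2014, mix_2015)
print(f"{smyly:.1%}")  # prints 14.6%
```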
In order to see if this formula would work for 2016, we can look backwards to see how it would have done predicting 2015 breakouts/blow-ups.
Looking at the data from 2014 (full season) to 2015 (April only), we can multiply Approach Change% * Results Change% to see if we can identify early-season breakout/blow-up candidates. The three highest rated “breakout” candidates in April 2015 were:
- Drew Smyly: 14.6% Approach Change%, +37.5% Results Change%… Improved SIERA from 3.69 (2014) to 3.25 (2015)
- Chris Archer: 13.7% Approach Change%, +36.1% Results Change%… Improved SIERA from 3.80 (2014) to 3.08 (2015)
- Dillon Gee: 13.4% Approach Change%, +36.6% Results Change%… SIERA increased slightly from 4.30 to 4.41 (groin injury in May, lost his rotation spot, and ended up in the minors for most of the second half)
Not bad – two of the clear top three breakout candidates actually improved their SIERA by over 10% from 2014. How about the bottom of the list? We have a clear top four:
- Homer Bailey: 14.2% Approach Change%, -34.7% Results Change%… SIERA jumped from 3.60 to 5.65 (injured after two starts)
- Jake Peavy: 21.9% Approach Change%, -14.9% Results Change%… SIERA increased slightly from 4.11 to 4.33
- Tyler Matzek: 23.9% Approach Change%, -13.6% Results Change%… SIERA jumped from 4.08 to 6.45 (injured after five starts)
- Wade Miley: 10.2% Approach Change%, -31.5% Results Change%… SIERA jumped from 3.67 to 4.24
Bailey and Matzek were both headed for season-ending injury (maybe this formula is a good predictor of an aching arm?), Miley went from above-average to below-average, and Peavy got a bit worse.
To show why we need both the Approach and Results Change%, consider these two pitchers:
- James Shields: 5.5% Approach Change%, +26.5% Results Change%… SIERA increased slightly from 3.59 to 3.72
- Edinson Volquez: 5.2% Approach Change%, +23.5% Results Change%… SIERA increased slightly from 4.20 to 4.35
Both pitchers had significantly better results in April of 2015 than they did in 2014, but their approach barely changed at all. As the change in results was not backed by any change in approach, they both ended up being essentially the same pitcher for the remainder of 2015 as they had been in 2014.
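To make the ranking concrete, here is a rough sketch that multiplies the two percentages quoted above for each pitcher and sorts the results. The function name `change_score` and the choice to multiply the values as percentages (rather than fractions) are assumptions; the article does not say which scale the score uses, and the ordering is the same either way.

```python
# (Approach Change%, Results Change%) as quoted in the article, in percent.
candidates = {
    "Drew Smyly": (14.6, 37.5),
    "Chris Archer": (13.7, 36.1),
    "Dillon Gee": (13.4, 36.6),
    "Homer Bailey": (14.2, -34.7),
    "Jake Peavy": (21.9, -14.9),
    "Tyler Matzek": (23.9, -13.6),
    "Wade Miley": (10.2, -31.5),
    "James Shields": (5.5, 26.5),
    "Edinson Volquez": (5.2, 23.5),
}


def change_score(approach_pct, results_pct):
    """Product of the two changes; large positive = breakout, large negative = blow-up."""
    return approach_pct * results_pct


# Highest score first, so breakout candidates lead and blow-up candidates trail.
ranked = sorted(candidates, key=lambda name: change_score(*candidates[name]), reverse=True)
for name in ranked:
    print(f"{name:16s} {change_score(*candidates[name]):8.1f}")
```

Shields and Volquez land well below the top three despite similar Results Change% values, because their small Approach Change% shrinks the product.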
I’ve run the numbers for the first week of 2016, but will wait until we get about a month’s worth of data before releasing the actual numbers. For those who would like a sneak peek (caution: most of these are using ONE game’s worth of data!):
Breakout candidates: Alfredo Simon, Wade Miley, Jose Fernandez, Jacob deGrom, Noah Syndergaard, Aaron Sanchez
Blow-up candidates: Dallas Keuchel, Stephen Strasburg, Jerad Eickhoff, Chris Sale, Taijuan Walker, Masahiro Tanaka, James Shields
Just re-ran the numbers after yesterday’s games for anyone interested (only looking at SP with > 2 GS in both 2015 & 2016):
Breakout:
Noah Syndergaard (+1 mph, +21.5% SL… +7.7% Whiff%, -6.5% O-Take%)
Drew Pomeranz (-1.5 mph, +9.5% CB… +6.3% O-Whiff%, -6% O-Con%)
Alfredo Simon (-1.1 mph, +13% FB… +5.4% Z-Whiff%, -7.6% Contact%)
Wade Miley (-0.4 mph, +7.8% SL… +4% Z-Whiff%, -8.4% O-Take%)
Brandon Finnegan (-0.4 mph, +17.9% CH… -4.7% O-Con%, +3.2% O-Whiff%)
Vincent Velasquez (-0.5 mph, +4.9% CB… +7.2% Z-Whiff%, +7.8% Z-Take%, -7.8% O-Con%)
Blowup:
Collin McHugh (-1.1 mph, +45.5% CT… +6.9% Z-Con%, -3.5% O-Whiff%)
Chris Sale (-1.7 mph, +8.0% FB… +5.2% Z-Con%, -4.8% O-Con%)
James Shields (-0.5 mph, +5.6% CT… +12.8% Z-Con%, -5.5% O-Whiff%)
Dallas Keuchel (-1.9 mph, +16.1% CT… +4.1% Z-Take%, -5.0% O-Whiff%)
Shelby Miller (-1.1 mph, +8.4% CH… +6.9% O-Take%)
Syndergaard’s ‘breakout’ score is more than double anyone else’s, and McHugh (thanks to his 0.1 IP, 6 ER doozy) has a ‘blowup’ score almost three times higher than anyone else’s.
I’m wondering if the velocity component should be weighted a bit more. The other stats rely so heavily on the qualities of the opponent.
Also, in developing the regression formula for Results Change, wouldn’t it be better to take only April stats as your inputs and have May-Sep SIERA as your output? You’d have to look at several seasons to overcome the limited size of the dataset.
Increasing the significance of velocity is something that I’m still looking at. Comparing it to only last year’s April data would be ideal, but also limits the data set.
I’m looking at ways to add in contact data (GB/LD/FB, Pull/Cent/Oppo, Hard/Med/Soft), but it just takes so much longer to stabilize.
The regression formula is actually done with raw statistics, not deltas… PD-SIERA essentially gives you the ERA that the pitcher with those plate discipline statistics SHOULD have had, assuming that he had average results on contact.
Looking forward to the results at the end of the month.
Nice call on Sale. LOL. That 9-inning, 1-hit shutout tonight will really make him a bomb candidate.
The FB usage was up, the velocity was down… and batters were making contact on pitches IN the zone instead of OUT of the zone.
I haven’t seen the PITCHf/x data for tonight, but I’d say it looks like he’s ironed things out!
Funny thing… I re-ran Sale’s numbers after his gem… and he looks WORSE!
Keep in mind, this is comparing him to the 2015 version of himself, not to a league-average pitcher. His 2016 PD-SIERA is still good (3.66; it was 2.75 last year).
Approach: velocity down 1.6 mph… +7.9% FB, +2.7% SL, -10.6% CH
Results: +6.8% Z-Con%, -8% F-Strike%!
Personally I think he’ll be fine… but he HAS had one of the biggest jumps in Zone-Contact allowed and biggest drops in First-Pitch-Strikes… will be interesting to see what comes!
Forgot to mention… my first thought was that maybe he was getting better at managing contact, but his contact profile is pretty close to last season’s. He’s an odd case!
Keep working on it! There is potential.
Yes! Excited to see how this looks at the end of April.