Predicting Pitcher Breakouts from Small Sample Sizes

Most FanGraphs readers know that even the fastest-stabilizing statistics take almost a quarter of a season to mean anything. With the availability of PITCHf/x data, we can look at individual pitch data, which can give us hundreds of data points for an individual pitcher just from one start. Instead of waiting until near the All-Star break to see if Aaron Sanchez has really made a leap forward or if the league has adjusted to Dallas Keuchel, we can use statistics that stabilize quickly (both “approach” stats and “results” stats) to guide these decisions.

The “results” stats that I used are:

  • Zone Contact%
  • Zone Whiff%
  • Zone Take%
  • Out-of-Zone Contact%
  • Out-of-Zone Whiff%
  • Out-of-Zone Take%
  • First Pitch Strike%

First, I used a regression model to create a formula that used only these statistics to produce an expected ERA (or SIERA, actually, as I wanted to filter out any BABIP and HR/FB luck).

The formula ended up as: -3.11 + (12.48 * Z-Con%) + (3.08 * Z-Take%) + (11.96 * O-Con%) – (14.19 * O-Whiff%) + (13.06 * O-Take%) – (3.46 * F-Strike%)

Using 2015 data (and only pitchers who threw more than 1,500 pitches), I get an r-squared of 0.68. I’m going to call this statistic “PD-SIERA” since it uses only plate-discipline data to produce an expected SIERA.
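For anyone who wants to plug in their own numbers, here is a minimal sketch of the formula above as a Python function. The coefficients are copied from the article; the names `PD_SIERA_COEFFS` and `pd_siera` are made up for illustration. The article does not say what each rate is measured against, so the sketch simply assumes the inputs are fractions (0.25 rather than 25) defined the same way as the rates the regression was fitted on.

```python
# Coefficients copied from the formula in the article.
# Assumption: every rate is passed as a fraction, on the same scale and with
# the same definition as the data the regression was fitted on.
PD_SIERA_INTERCEPT = -3.11
PD_SIERA_COEFFS = {
    "z_contact": 12.48,
    "z_take": 3.08,
    "o_contact": 11.96,
    "o_whiff": -14.19,
    "o_take": 13.06,
    "f_strike": -3.46,
}


def pd_siera(rates):
    """Return the expected SIERA for a dict of plate-discipline rates."""
    missing = set(PD_SIERA_COEFFS) - set(rates)
    if missing:
        raise KeyError(f"missing rates: {sorted(missing)}")
    # Intercept plus the weighted sum of the six rates.
    return PD_SIERA_INTERCEPT + sum(
        coeff * rates[name] for name, coeff in PD_SIERA_COEFFS.items()
    )
```

Keeping the coefficients in one dict makes it easy to swap in a refit (for example, one built on several seasons) without touching the function.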

The PD-SIERA leaders for 2015 were:

  1. Clayton Kershaw, 2.47
  2. Chris Sale, 2.75
  3. Max Scherzer, 2.78
  4. Carlos Carrasco, 2.78
  5. Chris Archer, 2.92

The r-squared is good enough, and those names pass the sniff test, so I’m pretty comfortable that this produces a good approximation of pitcher performance.
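As a reminder of what that 0.68 measures, here is a small sketch of one common definition of r-squared: one minus the ratio of the squared prediction errors to the total squared variation around the mean. The function name `r_squared` is hypothetical, and the article does not say which tool produced its figure, so treat this as an illustration rather than the exact calculation used.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with 2+ values")
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after applying the model.
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always guessing the mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    if ss_total == 0:
        raise ValueError("actual values are all identical")
    return 1 - ss_residual / ss_total
```

Here `actual` would be each pitcher's real SIERA and `predicted` his PD-SIERA.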

I will use this to calculate a Results Change%: (year2_PD-SIERA – year1_PD-SIERA) / (year2_PD-SIERA), where year2 is the previous season and year1 is the current sample. For example, Drew Smyly had a 3.73 PD-SIERA in 2014 (year2) and a 2.33 PD-SIERA in April of 2015 (year1). The calculation would then be: (3.73 – 2.33) / (3.73) = +37.5%

[This number can be positive or negative: positive means PD-SIERA dropped (better results), negative means it rose (worse results)]
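The Smyly calculation can be sketched in a few lines of Python. The function name `results_change` and its argument names are hypothetical; the arithmetic follows the worked example, dividing by the previous season's PD-SIERA.

```python
def results_change(previous_pd_siera, current_pd_siera):
    """Fractional drop in PD-SIERA relative to the previous season.

    Positive means PD-SIERA went down (better results).
    """
    if previous_pd_siera <= 0:
        raise ValueError("previous PD-SIERA must be positive")
    return (previous_pd_siera - current_pd_siera) / previous_pd_siera


# Drew Smyly: 2014 full season vs. April 2015.
smyly = results_change(3.73, 2.33)
print(f"{smyly:+.1%}")  # +37.5%
```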

Now, just looking at the plate discipline statistics isn’t enough. We need to see if there was a reason for a pitcher to have a better or worse PD-SIERA than he had the previous year. PITCHf/x to the rescue again, as we can look at what I will call “approach” stats: a pitcher’s pitch mix and velocity. Since these are things almost completely under the pitcher’s control, they should stabilize quickly.

In order to calculate a pitcher’s “Approach Change%,” I add the change in his pitch mix to the percentage change in his velocity from the previous year. An example of the calculation is below:

  • Drew Smyly, 2014 (full): 89.9 mph, 51.9% FB, 15.9% CT, 28.5% CB, 3.8% CH
  • Drew Smyly, 2015 (April): 90.2 mph, 46.4% FB, 30.1% CT, 23.5% CB, 0.0% CH

Velocity change = (year1_velo – year2_velo)/(year2_velo) = (90.2-89.9)/89.9 = 0.3%

[If this value ended up negative, we would use the absolute value, as we are only interested in the amount of change, not positive/negative change]

Pitch mix change: -5.5% FB, +14.2% CT, -5.0% CB, -3.8% CH. Take the absolute value of each change, add them up, and divide by two: (5.5% + 14.2% + 5.0% + 3.8%) / 2 = 28.5% / 2 = 14.3%

[Dividing by two makes sure that each percentage change is only counted once – a 1% increase in FB% combined with a 1% decrease in CH% equals only a 1% change in pitch mix]

Approach Change% = Velocity change + Pitch mix change = 14.3% + 0.3% = 14.6%
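Here is a rough Python sketch of the Approach Change% steps above, using the Smyly numbers. The function names are made up for illustration, and pitch mix is passed as fractions (0.519 for 51.9%) so that it sits on the same scale as the velocity change.

```python
def velocity_change(previous_mph, current_mph):
    """Absolute fractional change in average velocity."""
    return abs(current_mph - previous_mph) / previous_mph


def pitch_mix_change(previous_mix, current_mix):
    """Half the sum of absolute usage changes across all pitch types."""
    pitch_types = set(previous_mix) | set(current_mix)
    total_shift = sum(
        abs(current_mix.get(p, 0.0) - previous_mix.get(p, 0.0))
        for p in pitch_types
    )
    # Every point of usage added to one pitch is taken from another,
    # so halve the total to count each shift once.
    return total_shift / 2


def approach_change(previous_mph, current_mph, previous_mix, current_mix):
    """Velocity change plus pitch mix change, both as fractions."""
    return velocity_change(previous_mph, current_mph) + pitch_mix_change(
        previous_mix, current_mix
    )


# Drew Smyly: 2014 full season vs. April 2015.
mix_2014 = {"FB": 0.519, "CT": 0.159, "CB": 0.285, "CH": 0.038}
mix_2015 = {"FB": 0.464, "CT": 0.301, "CB": 0.235, "CH": 0.000}
smyly = approach_change(89.9, 90.2, mix_2014, mix_2015)
print(f"{smyly:.1%}")  # 14.6%
```

Pitch types missing from one season are treated as 0% usage, so a brand-new pitch counts in full.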

In order to see if this formula would work for 2016, we can look backwards to see how it would have done predicting 2015 breakouts/blow-ups.

Looking at the data from 2014 (full season) to 2015 (April only), we can multiply Approach Change% * Results Change% to see if we can identify early-season breakout/blow-up candidates. The three highest rated “breakout” candidates in April 2015 were:

  1. Drew Smyly: 14.6% Approach Change%, +37.5% Results Change%… Improved SIERA from 3.69 (2014) to 3.25 (2015)
  2. Chris Archer: 13.7% Approach Change%, +36.1% Results Change%… Improved SIERA from 3.80 (2014) to 3.08 (2015)
  3. Dillon Gee: 13.4% Approach Change%, +36.6% Results Change%… SIERA increased slightly from 4.30 to 4.41 (groin injury in May, lost his rotation spot, and ended up in the minors for most of the second half)

Not bad – two of the clear top three breakout candidates actually improved their SIERA by over 10% from 2014. How about the bottom of the list? We have a clear top four:

  1. Homer Bailey: 14.2% Approach Change%, -34.7% Results Change%… SIERA jumped from 3.60 to 5.65 (injured after two starts)
  2. Jake Peavy: 21.9% Approach Change%, -14.9% Results Change%… SIERA increased slightly from 4.11 to 4.33
  3. Tyler Matzek: 23.9% Approach Change%, -13.6% Results Change%… SIERA jumped from 4.08 to 6.45 (injured after five starts)
  4. Wade Miley: 10.2% Approach Change%, -31.5% Results Change%… SIERA jumped from 3.67 to 4.24

Bailey and Matzek were both headed for season-ending injury (maybe this formula is a good predictor of an aching arm?), Miley went from above-average to below-average, and Peavy got a bit worse.

To show why we need both the Approach and Results Change%, consider these two pitchers:

  • James Shields: 5.5% Approach Change%, +26.5% Results Change%… SIERA increased slightly from 3.59 to 3.72
  • Edinson Volquez: 5.2% Approach Change%, +23.5% Results Change%… SIERA increased slightly from 4.20 to 4.35

Both pitchers had significantly better results in April of 2015 than they did in 2014, but their approach barely changed at all. As the change in results was not backed by any change in approach, they both ended up being essentially the same pitcher for the remainder of 2015 as they had been in 2014.
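To make the ranking concrete, here is a short Python sketch that multiplies the two percentages for the nine pitchers listed in this article and sorts the results. The variable and function names are hypothetical; the inputs are the Approach Change% and Results Change% figures quoted above.

```python
# (Approach Change%, Results Change%) for 2014 -> April 2015, from the article.
pitchers = {
    "Drew Smyly": (14.6, 37.5),
    "Chris Archer": (13.7, 36.1),
    "Dillon Gee": (13.4, 36.6),
    "Homer Bailey": (14.2, -34.7),
    "Jake Peavy": (21.9, -14.9),
    "Tyler Matzek": (23.9, -13.6),
    "Wade Miley": (10.2, -31.5),
    "James Shields": (5.5, 26.5),
    "Edinson Volquez": (5.2, 23.5),
}


def breakout_score(approach_pct, results_pct):
    """Product of the two changes; the sign comes from Results Change%."""
    return approach_pct * results_pct


# Highest score first: breakout candidates at the top, blow-ups at the bottom.
ranked = sorted(
    pitchers, key=lambda name: breakout_score(*pitchers[name]), reverse=True
)

for name in ranked:
    print(f"{name:16s} {breakout_score(*pitchers[name]):8.1f}")
```

Shields and Volquez land well below the top three despite similar-looking results gains, because their small Approach Change% shrinks the product.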

I’ve run the numbers for the first week of 2016, but will wait until we get about a month’s worth of data before releasing the actual numbers. For those who would like a sneak peek (caution: most of these are using ONE game’s worth of data!):

Breakout candidates: Alfredo Simon, Wade Miley, Jose Fernandez, Jacob deGrom, Noah Syndergaard, Aaron Sanchez

Blow-up candidates: Dallas Keuchel, Stephen Strasburg, Jerad Eickhoff, Chris Sale, Taijuan Walker, Masahiro Tanaka, James Shields


I’m wondering if the velocity component should be weighted a bit more. The other stats rely so heavily on the qualities of the opponent.

Also, in developing the regression formula for Results Change, wouldn’t it be better to take only April stats as your inputs and have May-Sep SIERA as your output? You’d have to look at several seasons to overcome the limited size of the dataset.


Looking forward to the results at the end of the month.


Nice call on Sale. LOL. That nine-inning, one-hit shutout tonight will really make him a bomb candidate.


Keep working on it! There is potential.


Yes! Excited to see how this looks at the end of April.