Do Catchers Influence Pitcher Performance? The Story of Spanky and Sluggo

From Opening Day to April 20th, Red Sox pitchers posted a 7.14 ERA when Jarrod Saltalamacchia was behind the plate versus a 2.40 ERA when Jason Varitek started. The resulting hubbub about this split made one fact extremely clear: when comparing the influence of different catchers, sample size is critically important.

Already by June 24th, Varitek and Salty’s split had been greatly reduced, with pitchers now throwing a 3.44 ERA to the veteran captain and a 4.36 ERA to the new guy. I would bet that these numbers will continue to converge as the season drags on, but even after 162 games it’s unlikely that either catcher will have enough innings to statistically test whether one is calling a better game. This is the difficulty of assessing catcher performance: comparing catchers between teams is nearly impossible (because the pitching staffs are different), and comparing catchers within teams is difficult (because sample sizes are small and different pitchers use different catchers). Nevertheless, many still believe that catchers do influence pitcher performance. Where can we find the data to support this hypothesis?

Enter Jim Leyland. From 1990 to 1992, Leyland’s Pirates deployed one of the longest-running catcher platoons in baseball history on their way to three straight NLCS losses. The main catcher was Mike “Spanky” LaValliere, a strong defender with a cannon arm (career 0.992 fielding percentage, league-leading 45% CS rate and Gold Glove in 1987) and mediocre offensive production (.269/.355/.342 through 1992, good for a 97 OPS+). Backing him up was Don “Sluggo” Slaught, who had the reputation of a horrible defender (led the league in SB allowed in 1986, errors in 1988) but hit for more power (.280/.331/.417 through 1992) and crushed lefties (.301/.358/.446 career). Leyland split time between Spanky and Sluggo about 60/40, and during this three-year span five Pirates starters – Doug Drabek, Bob Walk, Randy Tomlin, Zane Smith, and John Smiley – logged over 100 IP with each catcher. In total, these five pitched 1,347 innings in games that LaValliere started, and 821.2 innings in games that Slaught started. If pitchers did indeed throw differently to Spanky and Sluggo, this should be a large enough sample to observe it.

Thanks to Baseball Reference, I pulled up 339 games between 1990 and 1992 where one of these pitchers and one of these catchers started. Basic box score stats were recorded (IP, H, ER, BB, SO, HR, Pit, Str, GB, FB, and LD), and from these some more interesting metrics were calculated (ERA, SO/9, BB/9, Str%, H/9, WHIP, HR/9, GB%, FB%, LD%, HR/FB, and BABIP). Finally, one advanced sabermetric stat was considered (RE24). These stats were then organized to compare the performance of each pitcher with each of the catchers, and standard t-tests were run to assess whether pitcher performance in these areas significantly changed depending on whether Spanky or Sluggo was starting.
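As a rough illustration of how the rate stats follow from a box score line, here is a minimal Python sketch. The function names are hypothetical, the sample line is made up, and BABIP uses the common (H - HR) / (AB - SO - HR + SF) definition, which assumes at-bats and sacrifice flies are also available; the exact formulas used for this study may have differed.

```python
def innings_to_outs(ip):
    """Convert box-score innings (821.2 means 821 innings plus 2 outs) to outs."""
    whole, extra_outs = divmod(round(ip * 10), 10)
    return whole * 3 + extra_outs


def rate_stats(ip, h, er, bb, so, hr, ab, sf=0):
    """Derive per-nine rates, WHIP, and BABIP from one line of counting stats.

    ab and sf are assumed inputs; they are not in the list of recorded stats.
    """
    innings = innings_to_outs(ip) / 3  # true innings, e.g. 7.1 -> 7.333...
    return {
        "ERA": 9 * er / innings,
        "SO/9": 9 * so / innings,
        "BB/9": 9 * bb / innings,
        "H/9": 9 * h / innings,
        "HR/9": 9 * hr / innings,
        "WHIP": (bb + h) / innings,
        "BABIP": (h - hr) / (ab - so - hr + sf),
    }


# Made-up start: 7.1 IP, 6 H, 2 ER, 1 BB, 5 SO, 1 HR, 27 AB.
line = rate_stats(ip=7.1, h=6, er=2, bb=1, so=5, hr=1, ab=27)
for name, value in line.items():
    print(f"{name}: {value:.3f}")
```

Converting innings to outs first matters because the ".1" and ".2" in box-score innings are thirds of an inning, not tenths.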

First off, it is interesting to see the parts of pitching that the two catchers apparently didn’t influence. Differences in SO/9, BB/9, H/9, HR/9, HR/FB, IP/game, Str%, GB%, FB%, and LD% were statistically insignificant between catchers, although for some reason Zane Smith seemed to get jacked up when Slaught was catching for him (0.89 HR/9 vs. 0.42 for LaValliere). However, there were some pitching performance indicators that showed very real differences depending on who was catching.

ERA

Doug Drabek started 52 games with LaValliere and 43 games with Slaught during this span, pitching a 3.25 ERA to Spanky and a 2.42 ERA to Sluggo. This 0.83-run difference is statistically suggestive (p=0.09 in a t-test, meaning a gap at least this large would turn up only about 9% of the time if the catcher made no difference). Randy Tomlin and John Smiley also had a noticeably lower ERA with Slaught behind the plate (p=0.28 and 0.32), while Zane Smith pitched better with LaValliere (p=0.15). Bob Walk was essentially a push, with his slight edge towards LaValliere being statistically meaningless (p=0.73).
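For readers who want to try this kind of comparison themselves, here is a minimal sketch in standard-library Python. It swaps in a permutation test for the t-test used above, because it needs no statistics package and makes the meaning of a p-value concrete: how often a random relabeling of the starts produces a gap at least as large as the observed one. The function name and the per-start numbers are made up for illustration; they are not the real game logs.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign starts to catchers at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_shuffles + 1)


# Made-up earned runs per start, NOT the real Drabek game logs.
with_catcher_a = [3, 2, 4, 1, 5, 3, 2, 4, 3, 2]
with_catcher_b = [2, 1, 3, 0, 2, 4, 1, 2, 3, 1]

p = permutation_p_value(with_catcher_a, with_catcher_b)
print(f"approximate two-sided p-value: {p:.3f}")
```

A permutation test and a t-test will usually land in the same neighborhood on data like this, but they are different procedures, so the sketch should not be expected to reproduce the p-values quoted in this article.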

RE24

ERA is a flawed statistic in many ways, and more advanced sabermetric calculations exist that more specifically pinpoint the contribution the pitcher makes to his team during the course of a game. One such stat is RE24, which measures the change in Run Expectancy (RE) before and after every play. A pitcher’s RE24 for a start effectively measures how many net runs he either prevented or allowed in that start.
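To make the bookkeeping concrete, here is a small Python sketch of RE24 from the pitcher's side: for each play, credit the pitcher with the run expectancy before the play, minus the run expectancy after it, minus any runs that scored. The table values are round illustrative numbers, not the actual 1990-1992 run expectancy matrix, and the names are hypothetical.

```python
# Illustrative run expectancies keyed by (runners, outs). These are rough
# placeholder values for a handful of states, NOT a real RE matrix.
RUN_EXPECTANCY = {
    ("---", 0): 0.50,
    ("1--", 0): 0.90,
    ("---", 1): 0.27,
    ("1--", 1): 0.53,
    ("---", 2): 0.10,
}


def expectancy(bases, outs):
    """Run expectancy of a state; once there are three outs it is zero."""
    return 0.0 if outs >= 3 else RUN_EXPECTANCY[(bases, outs)]


def pitcher_re24(plays):
    """Sum runs prevented over plays given as (before, after, runs_scored)."""
    total = 0.0
    for (bases0, outs0), (bases1, outs1), runs in plays:
        total += expectancy(bases0, outs0) - expectancy(bases1, outs1) - runs
    return total


# One made-up inning: single, double play, solo homer, strikeout.
inning = [
    (("---", 0), ("1--", 0), 0),
    (("1--", 0), ("---", 2), 0),
    (("---", 2), ("---", 2), 1),
    (("---", 2), ("---", 3), 0),
]
print(f"pitcher RE24 for the inning: {pitcher_re24(inning):+.2f}")
```

Over a complete inning the play-by-play terms telescope, so the total is just the starting expectancy minus the runs allowed (here 0.50 - 1 = -0.50). The stat becomes more informative than ERA when a pitcher enters or leaves in the middle of an inning.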

For each pitcher, RE24 tracks very closely with ERA. Drabek prevented an additional 0.73 runs when Slaught was catching for him (p=0.08), while Smiley prevented an additional 0.85 runs per start (p=0.14). Tomlin also showed some improvement with Slaught behind the plate (0.65 prevented runs, p=0.28), while Smith still preferred LaValliere but not with the same statistical strength as with ERA (0.50 prevented runs, but p=0.46). Bob Walk again showed little statistical significance in his result (p=0.58).

BABIP

Both ERA and RE24 sum up a pitcher’s performance, but neither does a great job of identifying how a pitcher is being successful. Because SO/9 and BB/9 rates were essentially the same between the two catchers for all pitchers, what’s most interesting to us is what happened to the balls that did make it into play: BABIP.

Drabek and Smiley had lower batting averages on balls in play with Slaught behind the plate (p=0.07 and 0.20), while Tomlin had a less significant difference (p=0.36), Smith very weakly favored LaValliere (p=0.49), and Walk continued to be statistically meaningless (p=0.65). BABIP is a frustrating statistic in that it is unclear how much fluctuation in the average is attributable to the pitcher, but some pitchers – by increasing their GB% and lowering their LD% – are able to lower their BABIP in a very real way. Unfortunately that is not the case with any of these starters, as GB/FB/LD% were almost identical between the two catchers (varying by less than 2% in almost all cases).

ANOVA Analysis

None of the analysis above calculates Slaught’s or LaValliere’s total ERA or BABIP; the 339 starts are split into five different samples by pitcher instead of being thrown into one giant group, as analysts have done with Varitek and Saltalamacchia. The reason for this is that the Pirates’ best pitchers – Drabek and Smiley – pitched to Slaught in 45% of their starts while the other three only pitched to him 33% of the time. Averaging all the starts together would bias the results towards Slaught, rewarding him for catching more often for the 1990 Cy Young winner and the 1991 National League wins leader.
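The pooling problem is easy to see with a toy calculation. In the Python sketch below, the 45% and 33% shares come from the paragraph above, but the ERAs and inning totals are invented, and the catcher is given no effect at all: each group of pitchers has the same ERA no matter who catches. The pooled numbers still favor Slaught.

```python
# Hypothetical setup: catcher has NO effect on either group of pitchers.
# Each entry is (ERA regardless of catcher, innings caught by each catcher).
groups = {
    "aces": (2.80, {"Slaught": 450, "LaValliere": 550}),    # 45% Slaught
    "others": (4.00, {"Slaught": 330, "LaValliere": 670}),  # 33% Slaught
}


def pooled_era(catcher):
    """Innings-weighted ERA across all pitchers thrown to this catcher."""
    weighted_runs = sum(era * ip[catcher] for era, ip in groups.values())
    innings = sum(ip[catcher] for _, ip in groups.values())
    return weighted_runs / innings


slaught = pooled_era("Slaught")
lavalliere = pooled_era("LaValliere")
print(f"pooled ERA with Slaught:    {slaught:.2f}")
print(f"pooled ERA with LaValliere: {lavalliere:.2f}")
```

The gap in the pooled figures comes entirely from who was pitching, which is why the comparisons here are made pitcher by pitcher.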

Luckily, there is a more complex statistical test known as an ANalysis Of VAriance or ANOVA (more specifically a General Linear Model ANOVA in this case). An ANOVA test allows me to simultaneously examine how two different independent variables (pitcher and catcher) influence a particular dependent variable (such as SO/9). To oversimplify, if games where Smiley started and Slaught caught resulted in an average of 3.13 ER, and games where Smith started and LaValliere caught resulted in an average of 3.89 ER, an ANOVA test helps me compare how much of that difference was caused by switching pitchers and how much was caused by switching catchers. As in the t-test, a low p-value for a particular stat indicates that the pitcher or catcher had a significant amount of influence or control over that metric, while a high p-value indicates that there was not a significant amount of influence. I ran GLM ANOVA tests for all the non-box-score stats I collected, with results below:

First of all, it is interesting to note how consistent this data is with what we already know about pitcher performance. These five pitchers showed significant control over the lengths of their starts, their SO and BB rate, their Strike %, their HR rate, and their GB and FB rates (p < 0.2), while having less control over factors such as ERA, RE24, H/9, WHIP, HR/FB, BABIP, all stats that rely heavily on fielding and a healthy dose of random chance.

The catcher results are more intriguing. For starters, catchers had no influence over the strike % (p=0.920), which may surprise some who would expect that either LaValliere or Slaught was better at framing pitches. The only two categories where the catcher appeared to have at least some control were RE24 and BABIP.

So this is the odd conclusion that we come to with this data analysis. There was at least a somewhat significant difference in pitcher performance between the two catchers, but surprisingly it was the offensive Slaught that appeared to catch a slightly better game. More surprisingly, Slaught’s strength seemed to be that he somehow exerted control over the BABIP of those that pitched to him without significantly influencing the ratio of groundballs, flyballs, and line drives that went into play. This requires a great deal more research, but I would extend the hypothesis that catchers do influence pitcher performance, in that different catchers call different games that result in balls being more weakly put into play. It’s possible that Jim Leyland recognized this skill in Slaught (releasing LaValliere before the 1993 season), but it’s unlikely. Even with a large sample size it is difficult to actually parse out the influence of the catcher, so whenever you read any article that prattles on about catcher ERA, you’d better take it with a huge grain of salt.

And in case you were wondering, Saltalamacchia’s Catcher BABIP for the Red Sox this year is 0.273. Varitek’s? 0.279. Not a big difference there.

We hoped you liked reading Do Catchers Influence Pitcher Performance? The Story of Spanky and Sluggo by Rhubarb35!

Frank
Guest

With such a large number of categories, isn’t it statistically likely to see p values in the range of 0.10-0.20 even if indeed there is no difference?

If you were to repeat the analysis a couple more times with different catchers (yet ones in a similar situation) I might be more inclined to believe that these results are any more than noise.

evo34
Guest

Would be pretty interested in seeing what the avg. BABIP skill level was of opposing batters in games Slaught caught vs. games LaValliere caught. The two catchers would never have a truly random sample of the Pirates overall schedule, as opposing teams with a lot of LHP starters would draw Slaught usually, for example. If these LHP-heavy teams happened to have a high-BABIP offense, the apparent correlation between catcher and BABIP could show up. Also, the analysis assumes that both catchers had team-average defense around them. This is probably not the case if there were any other frequent platoons on…

zenbitz
Guest

Yes, none of those p-values are meaningful. Even with a marginal one of 0.07 – 0.09 (Drabek) you would expect that in 10 tests.

zenbitz
Guest

I agree that the weak BABIP correlation (which is all Drabek) is probably a defense issue. The RE/24 is also just the BABIP.

J. Cross
Guest

If you’d normally require p < 0.05 with one comparison, with 14 comparisons you might require p < .0036. If you'd use p < .2 then with 14 comparisons you'd need p < .016 — although .2/14 will get you a rough estimate I think it's probably better to do 1 – (1-.2)^(1/14) .

So, if I'm interpreting your numbers correctly, there's really no evidence of a catcher effect.
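The thresholds quoted in this comment can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, the count of 14 comparisons is taken from the comment, and both corrections treat the tests as independent, which the stats in this article (ERA, RE24, BABIP and so on) clearly are not.

```python
def bonferroni(alpha, m):
    """Per-test threshold from the simple alpha / m rule."""
    return alpha / m


def sidak(alpha, m):
    """Per-test threshold from 1 - (1 - alpha)^(1/m), exact under independence."""
    return 1 - (1 - alpha) ** (1 / m)


def chance_of_any_hit(alpha, m):
    """Probability that at least one of m independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** m


m = 14  # number of comparisons, as counted in the comment
print(f"Bonferroni, alpha=0.05: {bonferroni(0.05, m):.4f}")
print(f"Bonferroni, alpha=0.20: {bonferroni(0.20, m):.4f}")
print(f"Sidak,      alpha=0.20: {sidak(0.20, m):.4f}")
print(f"P(at least one p < 0.20 by chance): {chance_of_any_hit(0.20, m):.2f}")
```

The last line is the point several commenters are making: with this many categories, seeing at least one p-value under 0.2 is close to guaranteed even if the catcher has no effect.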

jose
Guest

Anyone here read this latest study on Strike Three: Do MLB Umpires Express Racial Bias in Calling Balls and Strikes? Daniel Hamermesh 07/01/2011 | 12:33 pm Our paper on discrimination in baseball has finally been published (the June issue of the American Economic Review). While it received a lot of media and scholarly comment in draft, the final version contained a whole new section. The general idea is that those discriminated against will alter their behavior to mitigate the impacts of discrimination on themselves. But while reducing the impacts, these changes are not costless. For example, if you’re an Hispanic…

GiantHusker
Guest

That was a lot of work to prove nothing. How did this make it to the Big Blog?

Jon
Guest

“That was a lot of work to prove nothing. How did this make it to the Big Blog?” Statistically/scientifically “disproof” is often just as important as “proof.” This really shows that catchers don’t have much effect on pitchers. Those p-values are very high given the number of comparisons made, as was pointed out before. The only thing significant in these tests are the pitcher differences (which is to be expected). Only concern is this: You had a large sample of games with two very different catchers. But what you didn’t have was a large sample of pitchers. I still think…

Dann M
Guest

Jim Leyland is a lover of platoons. In 1990, for example, there were three L/R platoons in use by Pittsburgh. Slaught and LaValliere split time behind the plate, starting 61 and 87 games, respectively. Likewise, first basemen Sid Bream (L/L) and Gary Redus (R/R) started 100 and 58 games, respectively. And at the hot corner, Wally Backman (L/R) split time with Jeff King (R/R) 68 games to 86. As one would expect, the LaValliere/Bream/Backman and Slaught/Redus/King trios appear more often than not in the 1990 Pirates defensive lineups (http://www.baseball-reference.com/teams/PIT/1990-lineups.shtml). It should also be noted that Redus was pulled late in…

J. Welderson
Guest

There have been a couple of great articles recently on THT that use PitchFX data (which is much more granular than what you’re doing here) and found that a catcher with good strike-framing abilities can have an impact of up to a full additional win above replacement over the course of a season (which is shorter for catchers). Before you ask, yes, they normalized for pitcher strikezone, batter strikezone, and umpire strikezone.

evo34
Guest

Jon: this article doesn’t prove or disprove anything. As referred to above, here is an example of how the article should have been done: http://www.hardballtimes.com/main/article/evaluating-catchers-framing-pitches-part-3.

Bill Sweet
Member

I thought it was an interesting read. Not sure where the haterade is coming from.

biscuit pants
Guest

As mentioned above, those catcher p-values are all insignificant. I don’t mean to “pour on the haterade,” because I really thought this was an interesting read, but the reason they’re insignificant is rather subtle, and many people with serious training in statistics often mess this up. Here’s a basic explanation of the mistake: http://xkcd.com/882/