Do Catchers Influence Pitcher Performance? The Story of Spanky and Sluggo
by Rhubarb35
July 1, 2011

From Opening Day to April 20th, Red Sox pitchers posted a 7.14 ERA when Jarrod Saltalamacchia was behind the plate versus a 2.40 ERA when Jason Varitek started. The resulting hubbub about this split made one fact extremely clear: when comparing the influence of different catchers, sample size is really, really important. By June 24th, Varitek and Salty’s split had already shrunk considerably, with pitchers now throwing a 3.44 ERA to the veteran captain and a 4.36 ERA to the new guy. I would bet that these numbers will continue to converge as the season drags on, but even after 162 games it’s unlikely that either catcher will have enough innings to statistically test whether one is calling a better game.

This is the difficulty of assessing catcher performance: comparing catchers between teams is nearly impossible (because the pitching staffs are different), and comparing catchers within teams is difficult (because sample sizes are small and different pitchers use different catchers). Nevertheless, many still believe that catchers do influence pitcher performance. Where can we find the data to support this hypothesis?

Enter Jim Leyland. From 1990 to 1992, Leyland’s Pirates deployed one of the longest-running catcher platoons in baseball history on their way to three straight NLCS losses. The main catcher was Mike “Spanky” LaValliere, a strong defender with a cannon arm (career .992 fielding percentage, a league-leading 45% caught-stealing rate and a Gold Glove in 1987) and mediocre offensive production (.269/.355/.342 through 1992, good for a 97 OPS+). Backing him up was Don “Sluggo” Slaught, who had the reputation of a horrible defender (led the league in stolen bases allowed in 1986 and in errors in 1988) but hit for more power (.280/.331/.417 through 1992) and crushed lefties (.301/.358/.446 career).
Leyland split time between Spanky and Sluggo about 60/40, and during this three-year span five Pirates starters – Doug Drabek, Bob Walk, Randy Tomlin, Zane Smith, and John Smiley – logged over 100 IP with each catcher. In total, these five pitched 1,347 innings in games that LaValliere started, and 821.2 innings in games that Slaught started. If pitchers did indeed throw differently to Spanky and Sluggo, this should be a large enough sample to observe it.

Thanks to Baseball Reference, I pulled up 339 games between 1990 and 1992 where one of these pitchers and one of these catchers started. Basic box score stats were recorded (IP, H, ER, BB, SO, HR, Pit, Str, GB, FB, and LD), and from these some more interesting metrics were calculated (ERA, SO/9, BB/9, Str%, H/9, WHIP, HR/9, GB%, FB%, LD%, HR/FB, and BABIP). Finally, one advanced sabermetric stat was considered (RE24). These stats were then organized to compare the performance of each pitcher with each of the catchers, and standard t-tests were run to assess whether pitcher performance in these areas changed significantly depending on whether Spanky or Sluggo was starting.

First off, it is interesting to see the parts of pitching that the two catchers apparently didn’t influence. Differences in SO/9, BB/9, H/9, HR/9, HR/FB, IP/game, Str%, GB%, FB%, and LD% were statistically insignificant between catchers, although for some reason Zane Smith seemed to get taken deep more often when Slaught was catching for him (0.89 HR/9 vs. 0.42 for LaValliere). However, there were some pitching performance indicators that showed very real differences depending on who was catching.

ERA

Doug Drabek started 52 games with LaValliere and 43 games with Slaught during this span, pitching to a 3.25 ERA with Spanky and a 2.42 ERA with Sluggo. This 0.83-run difference is fairly strong statistically (p=0.09 in a t-test, meaning a gap this large would show up by chance only about 9% of the time if the catcher made no difference).
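As a rough illustration of the per-pitcher comparison described above, here is a minimal Python sketch of a two-sample t statistic on per-start earned runs. The article does not say which flavor of t-test was run, so Welch’s unequal-variance version is an assumption, as is the choice to compare earned runs per start rather than ERA. The earned-run lists are invented for illustration (they are not real game logs), and the name welch_t is hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df

# Invented earned runs per start for one pitcher with each catcher
er_with_lavalliere = [3, 2, 5, 0, 4, 3, 1, 6]
er_with_slaught = [2, 1, 3, 0, 2, 4, 1]

t_stat, df = welch_t(er_with_lavalliere, er_with_slaught)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Turning t and df into a p-value takes a t-distribution lookup; a stats package such as SciPy (scipy.stats.ttest_ind with equal_var=False) handles both steps in one call.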
Randy Tomlin and John Smiley also had lower ERAs with Slaught behind the plate, though less convincingly (p=0.28 and 0.32), while Zane Smith pitched better with LaValliere (p=0.15). Bob Walk was essentially a push, with his slight edge towards LaValliere being statistically meaningless (p=0.73).

RE24

ERA is a flawed statistic in many ways, and more advanced sabermetric calculations exist that more precisely pinpoint the contribution a pitcher makes to his team during the course of a game. One such stat is RE24, which measures the change in Run Expectancy (RE) before and after every play. A pitcher’s RE24 for a start effectively measures how many net runs he either prevented or allowed in that start.

For each pitcher, RE24 tracks very closely with ERA. Drabek prevented an additional 0.73 runs per start when Slaught was catching for him (p=0.08), while Smiley prevented an additional 0.85 runs per start (p=0.14). Tomlin also showed some improvement with Slaught behind the plate (0.65 prevented runs, p=0.28), while Smith still preferred LaValliere, but not with the same statistical strength as with ERA (0.50 prevented runs, but p=0.46). Bob Walk again showed little statistical significance in his result (p=0.58).

BABIP

Both ERA and RE24 sum up a pitcher’s performance, but neither does a great job of identifying how a pitcher is being successful. Because SO/9 and BB/9 rates were essentially the same between the two catchers for all pitchers, what’s most interesting to us is what happened to the balls that did make it into play: BABIP. Drabek and Smiley had significantly lower batting averages on balls in play with Slaught behind the plate (p=0.07 and 0.20), while Tomlin had a less significant difference (p=0.36), Smith very weakly favored LaValliere (p=0.49), and Walk continued to be statistically meaningless (p=0.65).
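For readers who want the arithmetic, BABIP is conventionally defined as (H - HR) / (AB - SO - HR + SF). The Python sketch below uses that standard definition. The article does not say exactly which inputs were used for its BABIP figures, so treating at-bats and sacrifice flies as available is an assumption, and the stat line in the example is made up.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies=0):
    """Batting average on balls in play, using the standard definition."""
    # Balls in play: at-bats that did not end in a strikeout or a home run,
    # plus sacrifice flies (which are not counted as at-bats).
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play

# Made-up stat line, purely for illustration
example = babip(hits=30, home_runs=3, at_bats=110, strikeouts=20, sac_flies=1)
print(f"BABIP = {example:.3f}")
```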
BABIP is a frustrating statistic in that it is unclear how much of its fluctuation is attributable to the pitcher, but some pitchers – by increasing their GB% and lowering their LD% – are able to lower their BABIP in a very real way. Unfortunately that is not the case with any of these starters, as GB/FB/LD% were almost identical between the two catchers (varying by less than 2% in almost all cases).

ANOVA Analysis

None of the analysis above calculates Slaught’s or LaValliere’s total ERA or BABIP; the 339 starts are split into five different samples by pitcher instead of being thrown into one giant group, as analysts have done with Varitek and Saltalamacchia. The reason for this is that the Pirates’ best pitchers – Drabek and Smiley – pitched to Slaught in 45% of their starts while the other three only pitched to him 33% of the time. Averaging all the starts together would bias the results towards Slaught, rewarding him for catching more often for the 1990 Cy Young winner and the 1991 National League wins leader.

Luckily, there is a more complex statistical test known as an ANalysis Of VAriance or ANOVA (more specifically a General Linear Model ANOVA in this case). An ANOVA test allows me to simultaneously examine how two different independent variables (pitcher and catcher) influence a particular dependent variable (such as SO/9). To oversimplify, if games where Smiley started and Slaught caught resulted in an average of 3.13 ER, and games where Smith started and LaValliere caught resulted in an average of 3.89 ER, an ANOVA test helps me compare how much of that difference was caused by switching pitchers and how much was caused by switching catchers. As with the t-test, a low p-value for a particular stat indicates that the pitcher or catcher had a significant amount of influence or control over that metric, while a high p-value indicates that there was not a significant amount of influence.
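To make the pooling problem concrete, here is a minimal Python sketch under stated assumptions. It compares a naive pooled catcher gap with a catcher effect adjusted for pitcher, using an additive model (earned runs = pitcher effect + catcher effect + noise) with no pitcher-by-catcher interaction. That is a simplification and not necessarily the GLM ANOVA setup actually run for this article. The starts are invented so that the catcher makes no difference for either pitcher, and the function names are hypothetical.

```python
from collections import defaultdict

# Invented starts: (pitcher, catcher, earned runs). Within each pitcher the
# two catchers have identical averages, but Slaught catches the ace more often.
starts = (
    [("Ace", "LaValliere", er) for er in (2, 3)]
    + [("Ace", "Slaught", er) for er in (2, 3, 2, 3)]
    + [("Other", "LaValliere", er) for er in (4, 5, 4, 5)]
    + [("Other", "Slaught", er) for er in (4, 5)]
)

def mean(values):
    return sum(values) / len(values)

def naive_gap(starts):
    """Slaught average minus LaValliere average, ignoring who pitched."""
    by_catcher = defaultdict(list)
    for _, catcher, er in starts:
        by_catcher[catcher].append(er)
    return mean(by_catcher["Slaught"]) - mean(by_catcher["LaValliere"])

def adjusted_gap(starts):
    """Catcher effect and its F statistic after removing pitcher averages."""
    by_pitcher = defaultdict(list)
    for pitcher, catcher, er in starts:
        by_pitcher[pitcher].append((1.0 if catcher == "Slaught" else 0.0, er))
    s_cc = s_cy = s_yy = 0.0
    for rows in by_pitcher.values():
        c_bar = mean([c for c, _ in rows])
        y_bar = mean([y for _, y in rows])
        for c, y in rows:
            # Deviations from this pitcher's own averages
            s_cc += (c - c_bar) ** 2
            s_cy += (c - c_bar) * (y - y_bar)
            s_yy += (y - y_bar) ** 2
    beta = s_cy / s_cc                   # adjusted Slaught-minus-LaValliere gap
    ss_catcher = beta * s_cy             # sum of squares explained by catcher
    df_resid = len(starts) - len(by_pitcher) - 1
    f_stat = ss_catcher / ((s_yy - ss_catcher) / df_resid)
    return beta, f_stat

print(f"naive pooled gap: {naive_gap(starts):+.2f} ER per start")
beta, f_stat = adjusted_gap(starts)
print(f"pitcher-adjusted gap: {beta:+.2f} ER per start, F = {f_stat:.2f}")
```

In this toy data the pooled comparison flatters Slaught by about two-thirds of a run per start purely because of who he caught, while the pitcher-adjusted gap is zero. A full ANOVA would go on to convert the F statistic into a p-value.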
I ran GLM ANOVA tests for all the non-box-score stats I collected, with results below:

First of all, it is interesting to note how consistent this data is with what we already know about pitcher performance. These five pitchers showed significant control over the lengths of their starts, their SO and BB rates, their strike %, their HR rate, and their GB and FB rates (p < 0.2), while having less control over ERA, RE24, H/9, WHIP, HR/FB, and BABIP, all stats that rely heavily on fielding and a healthy dose of random chance.

The catcher results are more intriguing. For starters, the catchers had no influence over strike % (p=0.920), which may surprise those who would expect that either LaValliere or Slaught was better at framing pitches. The only two categories where the catcher appeared to have at least some control were RE24 and BABIP.

So this is the odd conclusion that we come to with this data analysis. There was at least a somewhat significant difference in pitcher performance between the two catchers, but surprisingly it was the offense-first Slaught who appeared to catch a slightly better game. More surprisingly, Slaught’s strength seemed to be that he somehow exerted control over the BABIP of those who pitched to him without significantly influencing the ratio of groundballs, flyballs, and line drives that went into play. This requires a great deal more research, but I would offer the hypothesis that catchers do influence pitcher performance, in that different catchers call different games that result in balls being put into play more weakly.

It’s possible that Jim Leyland recognized this skill in Slaught (releasing LaValliere before the 1993 season), but it’s unlikely. Even with a large sample size it is difficult to actually parse out the influence of the catcher, so whenever you read an article that prattles on about catcher ERA, you’d better take it with a huge grain of salt.
And in case you were wondering, Saltalamacchia’s Catcher BABIP for the Red Sox this year is 0.273. Varitek’s? 0.279. Not a big difference there.