Peer Learning Among MLB Umpires

A growing group of social scientists is researching peer learning, looking to answer the question “does an individual learn from their network?” In this post, I’ll present some evidence that MLB umpires “learn” from their peers in their assigned crews.

To quantify this, I calculate “call quality” for each umpire in each season from 2008 to 2019. Call quality is determined in a similar way to many umpire score card measures: I take PITCHf/x data for each game that a given umpire was assigned to home plate, subset to all called strikes and called balls, and overlay the true strike zone to calculate the proportion of correct calls.
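The per-game calculation above boils down to a simple proportion. Here is a minimal sketch with invented pitch records (the real analysis uses PITCHf/x pitch locations overlaid on the true strike zone):

```python
# Hypothetical pitch-level records from one umpire's games behind the plate.
# Each entry is (call made, true classification from the strike-zone overlay).
calls = [
    ("strike", "strike"),
    ("ball", "ball"),
    ("strike", "ball"),   # a miss: a true ball called a strike
    ("ball", "ball"),
]

# Call quality = share of called pitches where the call matches the overlay.
call_quality = sum(made == true for made, true in calls) / len(calls)
print(call_quality)  # 0.75
```

In the full analysis this proportion is computed game by game and then averaged into one call-quality number per umpire per season.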

I’m specifically interested in whether an umpire’s call quality is driven by the call quality of umpires they have been assigned to work with in the past.

Crew Assignment and Potential for Learning

MLB umpires present an excellent testbed for investigating peer learning. To study peer learning, you need large amounts of data in order to be able to:

  1. Track individuals over time and across many different teams / peer networks.
  2. Identify individual-level mistakes or quality.

Umpiring in the majors has both: umpires are assigned to crews at the start of the season and are frequently “shuffled” within seasons (due to vacations, injuries, illnesses, etc.), and we can use pitch tracking to determine the quality of every call made.

There are also a few different channels through which peer learning might work with umpires. Although the final strike or ball call is made by a single home-plate umpire, the umps in a crew travel together, review game footage and calls together, and are encouraged by the league to work as a unit to officiate games. This creates a setting in which umpires (particularly umpires new to the majors) might pick up on tools of the trade from high-quality peers and generally improve their accuracy.

As a particular example, experienced, high-quality umpires might have a lot of knowledge about pitch-framing and how catchers can try to manipulate the call. A new umpire paired with a good group of senior umpires might learn more about how to deal with pitch-framing and hence make better calls in the future.


As I mentioned above, my measure of umpire call quality is the fraction of “correct” calls (true strikes called strikes or true balls called balls) when a given umpire is behind the plate, based on PITCHf/x data. I calculate this game-by-game and then aggregate into an average call quality for a given umpire in a given season. Below is a histogram of this call quality, with 1,047 umpire-seasons in the data and an average season-level call quality of 0.885 (i.e., the average umpire gets 88.5% of strike/ball calls correct across the season).

I’m interested in whether there is a relationship between call quality of a given umpire and the average call quality of the network of umpires that they worked with in games in the previous season. There’s one main issue to take care of — a general trend of improvement of umpires’ calls over time. To account for this, I convert the umpire call quality measure into a z-score by season so that in each season the average umpire has a call quality of 0 with a standard deviation of 1.
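The within-season standardization can be sketched as follows, using made-up umpire-season numbers. Standardizing within each season absorbs the league-wide improvement trend, so every season ends up with mean 0 and standard deviation 1:

```python
from statistics import mean, pstdev

# Hypothetical umpire-season call qualities, keyed by season.
seasons = {
    2018: {"A": 0.87, "B": 0.88, "C": 0.89},
    2019: {"A": 0.89, "B": 0.90, "C": 0.91},
}

# Convert each season's qualities to z-scores: subtract the season mean
# and divide by the season standard deviation.
z_scores = {}
for season, quals in seasons.items():
    mu, sigma = mean(quals.values()), pstdev(quals.values())
    z_scores[season] = {ump: (q - mu) / sigma for ump, q in quals.items()}
```

Note that 2019’s raw qualities are uniformly higher than 2018’s, but after standardizing, each umpire occupies the same relative position in both seasons.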

I then run a regression of the call quality of an umpire on the average call quality of the umpires they worked with in the previous season. I’d expect the regression coefficient to be positive if there is peer learning, and that is indeed what I find. The coefficient on past-network quality is 0.1442 (with a standard error of 0.0549, for a 95% CI of [0.03649, 0.25198]). In practical terms, that means that improving the average quality of an umpire’s network last season by one standard deviation raises call quality by 0.1442 standard deviations.

Is It All Just Noise and Mean Reversion?

While I think the results are interesting, there are some other things to consider. For example, it could always just be noise or mean reversion. Consider a hypothetical league in which there’s no peer learning among umpires, but umpires sometimes have good or bad seasons, and the league likes to match bad umpires with good umpires so that the average quality of a crew is roughly equal. In this situation, an umpire who has a bad year will be matched with umpires who had good years. If there is mean reversion, we would expect that in the following season, the umpire who did poorly will improve and hence we will see a (spurious) relationship between good umpires in the past and good umpires today. This is the main concern, and I can test for it in two ways.

Firstly, I can check whether there’s any call-quality relationship in assignment. Using information from Retrosheet and Steve O’s Baseball Umpire Resources, I can see the start-of-season crew assignments and in-season crew assignments (after “shuffles” to crews for various reasons). I can test for call-quality-based assignment by running a regression of an umpire’s last-season call quality on the average last-season call quality of umpires in the crew(s) that they are assigned to. If MLB assigns umpires based on call quality, there should be a relationship there.

But there isn’t. The coefficient is 0.00267 (with a standard error of 0.00233, for a 95% CI of [-0.001896, 0.0072415]), which means that when an umpire is assigned to a new crew, the quality of that crew is essentially unrelated to the umpire’s own past quality.

Secondly, I can do a placebo test by running a regression of the call quality of an umpire on the average call quality of umpires they worked with in the following season (instead of the previous season). If there is season-to-season mean reversion and quality-based crew assignment, then an umpire who does well one year will tend to be assigned to a crew with umpires who did poorly. If there is mean reversion, then we’d expect those who did poorly to improve, and so there will be a positive relationship between doing well one year and having a better crew next year. Of course, if the result is driven by peer learning, then we’d expect there to be no relationship — you can’t learn from people you haven’t worked with yet!
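The placebo simply swaps the lag for a lead: pair an umpire’s quality in season t with their peers’ quality in season t + 1. A minimal sketch of that construction, with invented crew rosters and z-scores:

```python
from statistics import mean

# Invented standardized call qualities: (umpire, season) -> z-score.
quality_z = {
    ("A", 2018): 0.5, ("B", 2018): -0.3, ("C", 2018): 0.1,
    ("A", 2019): 0.4, ("B", 2019): -0.2, ("C", 2019): 0.0,
}
# Invented crew rosters by season.
crews = {2019: [["A", "B", "C"]]}

def lead_network_quality(ump, season):
    """Average the *other* crew members' quality in season t+1 (placebo)."""
    peers = [
        p
        for crew in crews[season + 1] if ump in crew
        for p in crew if p != ump
    ]
    return mean(quality_z[(p, season + 1)] for p in peers)

print(lead_network_quality("A", 2018))  # mean of B's and C's 2019 z-scores
```

The main specification is identical except that it uses `season - 1` (peers actually worked with) instead of `season + 1`.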

And the regression shows there’s no relationship. Running this regression yields a coefficient on next season’s network quality of -0.00407 (with a standard error of 0.0594, for a 95% CI of [-0.12059, 0.11245]).


It looks as though umpires learn from their peers in their assigned crews: an umpire assigned to a crew that makes better strike and ball calls will tend to make better-quality calls in the future.

Jed Armstrong is currently working on a PhD in labor economics and writing up these findings into an academic article. If you have any suggestions or comments, or would like to see the draft, feel free to get in touch in the comments or on Twitter.
