Using Count Data To Find Unsustainable Performances

In this project, I set out to find the counts in which hitters were most successful during the 2019 season, and then to find the hitters who ended their at-bats in those counts most often. The goal was to identify players who may have been under- or overperforming, both in 2019 and going forward.

The data for this project was gathered by scraping Baseball Savant, and I used it to build a dashboard to assist my analysis. I could not analyze every individual outlier performance from 2019 in this post, but the visualization I created can be accessed here, and the GitHub repository for my project can be found here, so you can take a look for yourself!

As the chart above shows, MLB hitters performed their best in counts with one or no strikes and their worst in two-strike counts. Using this data, I explored individual performances in each count on the dashboard I had built, looking for outliers and for the hitters who ended at-bats in each count most often. Once players were identified, I investigated why their performances were outliers and whether they were sustainable. This post highlights two of the more interesting unsustainable cases I found: Paul DeJong and Javier Báez. Read the rest of this entry »


SEAM Methodology for Player Matchup Evaluations

Introducing SEAM Methodology

This article introduces the SEAM (Synthetic Estimated Average Matchup) method for describing batter-versus-pitcher matchups, both numerically and visually. We provide a Shiny app, available here, which you can use to follow along.

This app allows users to visualize synthetic spray chart distributions for any batter-pitcher matchup that has occurred or could have occurred in the past five years (the period for which Statcast data exists). Our app also reports performance metrics that are calculated directly from the displayed synthetic spray chart distribution. These include the expected number of singles, doubles, triples, and home runs, as well as the expected batting average on balls in play (xBABIP) and the expected bases on contact (xBsCON), which can be thought of as slugging percentage with a denominator of BIP + HR instead of AB. These matchup-dependent metrics allow any user to assess the expected performance of batters and pitchers when they face each other.
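To make the two rate metrics concrete, here is a minimal sketch of how they could be computed from expected hit counts. The function name `contact_metrics` and its arguments are hypothetical and not taken from the app. It also assumes that BIP excludes home runs and that xBABIP divides non-home-run hits by BIP, which the description above implies but does not state outright.

```python
def contact_metrics(x1b, x2b, x3b, xhr, bip):
    """Return (xBABIP, xBsCON) from expected hit counts.

    x1b, x2b, x3b, xhr: expected singles, doubles, triples, home runs.
    bip: balls in play, assumed here to exclude home runs.
    """
    if bip <= 0:
        raise ValueError("bip must be positive")
    # Assumed definition: hits that stay in the park, per ball in play.
    xbabip = (x1b + x2b + x3b) / bip
    # Slugging-style numerator, with BIP + HR as the denominator instead of AB.
    total_bases = x1b + 2 * x2b + 3 * x3b + 4 * xhr
    xbscon = total_bases / (bip + xhr)
    return xbabip, xbscon


# Hypothetical matchup: 20 singles, 6 doubles, 1 triple, 5 home runs on 95 BIP.
xbabip, xbscon = contact_metrics(20, 6, 1, 5, 95)
print(round(xbabip, 3), round(xbscon, 3))
```

The input counts are made up for illustration; in the app they would come from the synthetic spray chart distribution.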

The SEAM method estimates spray chart distributions in the form of heat maps that are smoothed versions of conventional spray charts. We construct these by combining separate batter spray chart distributions, one for each of the pitches that the pitcher throws. The final combination is weighted by the pitcher's usage of each pitch.
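The usage-weighted combination described above can be pictured as a mixture of per-pitch heat maps. The sketch below is only an illustration of that idea, not the SEAM implementation: the function name `combine_spray_charts`, the grid representation, and the pitch labels are all assumptions, and each per-pitch grid is assumed to be normalized to sum to 1.

```python
def combine_spray_charts(charts, usage):
    """Blend per-pitch density grids into one grid, weighted by pitch usage.

    charts: dict mapping pitch type -> 2D list of densities (same shape each).
    usage:  dict mapping pitch type -> how often the pitcher throws it.
    """
    total_usage = sum(usage[p] for p in charts)
    if total_usage <= 0:
        raise ValueError("usage weights must sum to a positive number")
    first = next(iter(charts.values()))
    rows, cols = len(first), len(first[0])
    combined = [[0.0] * cols for _ in range(rows)]
    for pitch, grid in charts.items():
        weight = usage[pitch] / total_usage  # renormalize so weights sum to 1
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += weight * grid[i][j]
    return combined


# Hypothetical two-pitch pitcher on a tiny 2x2 grid.
charts = {
    "FF": [[0.5, 0.5], [0.0, 0.0]],
    "SL": [[0.0, 0.0], [0.5, 0.5]],
}
usage = {"FF": 0.75, "SL": 0.25}
combined = combine_spray_charts(charts, usage)
```

Because each input grid sums to 1 and the weights are renormalized, the combined grid is again a valid distribution.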

One challenge to this approach is the sparsity of some batter-pitcher matchup data. We alleviate this concern by developing synthetic batters and pitchers with characteristics similar to those of the batter and pitcher under study. Our synthetic player creation methodology is inspired by the notion of similarity scores like those motivating PECOTA and Bill James’s work. However, unlike the similarity scores presented in the past, we construct similarity scores using a nearest neighbor approach that is based on the underlying batter and pitcher characteristics of the players under study instead of observed statistics. Read the rest of this entry »


Evaluating the Mechanics of a Grievance Filed by the MLBPA

On June 23, it was reported that Major League Baseball and the MLB Players Association had agreed to terms to resume play in 2020 following the sport’s suspension due to the COVID-19 pandemic. This agreement came on the heels of the now-infamous March 26th agreement that was the subject of debate and contention between both sides of the bargaining table. Among other things, the agreement does not foreclose the right of the MLBPA to file a grievance and seek financial damages as it relates to the interpretation of the Agreement.

Specifically, the players may look to challenge whether the league did in fact negotiate in good faith as it relates to how many games were to be played in the abbreviated 2020 season. Further, the agreement also states that the Office of the Commissioner’s effort to issue a schedule for the 2020 season shall only be performed to the extent it is “…practicable and economically feasible.”

The term “economically feasible” is likely another point of dispute, as the league did not reveal any financial data, supposedly requested by the union, that would help to justify its claim that a season without fans would be a detriment to its bottom line. As the season now rolls along, we can explore the process by which the MLBPA may file a grievance to have its claims heard and adjudicated.

What Is A Grievance?

Article XI of the parties’ 2017–2021 Basic Agreement (the CBA) sets forth the terms and conditions of grievance procedure. As defined, a “grievance” is “a complaint which involves the existence or interpretation of, or compliance with, any agreement, or any provision of any agreement, between the [MLBPA] and the Club…or between a Player and a Club.” Presumably, the March 26th Agreement would fall within this definition. Read the rest of this entry »


Reworking and Improving the Outcome Machine

This post was inspired by a couple of articles that I remembered reading from Jonah Pemstein back in 2014. The intention of those posts was to predict the result of any given batter/pitcher matchup with a tool dubbed the “Outcome Machine.” Have you ever wondered what the probability is that Mike Trout strikes out when he steps into the box against Justin Verlander? Of course, there are variables specific to any plate appearance (umpires/situation/stadium/etc.) that are harder to quantify, but the Outcome Machine set out to predict the outcome in a vacuum: Trout vs. Verlander and nothing else. (For the record, in 2020, I would estimate the answer is about 27.5%.)

Being able to predict the outcomes in sports would take most of the fun out of being a spectator, sure, but I still found myself coming back to those articles. While reading and re-reading in an attempt to understand the logic and fool around with the equations, I came to a few questions of my own:

  • With all of the hubbub of juiced balls and increased launch angles, do equations that were based on data from 2003-13 still apply to the game today?
  • The regression equations were composed of the at-bat result and the stats of the batter and pitcher from the same year. This stuck out to me as an issue because it means the player’s performance later in the season, say in July, influences the prediction of an at-bat in May, and to a lesser extent, the result of that specific at-bat is already baked into that season’s performance. Shouldn’t you use data exclusively before a given at-bat to predict the outcome? Hindsight is 20/20, after all.
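The second concern amounts to avoiding look-ahead leakage: each at-bat should only see statistics accumulated before it happened. Here is a minimal sketch of that idea, assuming the at-bats arrive in chronological order. The function name `prior_only_rates` and the tuple layout are hypothetical and are not taken from the original articles or from my spreadsheet.

```python
from collections import defaultdict


def prior_only_rates(at_bats):
    """Pair each at-bat with the batter's strikeout rate BEFORE that at-bat.

    at_bats: iterable of (batter_id, is_strikeout) in chronological order.
    Returns a list of (batter_id, prior_k_rate, is_strikeout), where
    prior_k_rate is None when the batter has no earlier at-bats.
    """
    seen = defaultdict(int)        # at-bats observed so far, per batter
    strikeouts = defaultdict(int)  # strikeouts observed so far, per batter
    rows = []
    for batter, is_k in at_bats:
        n = seen[batter]
        prior = strikeouts[batter] / n if n else None
        rows.append((batter, prior, is_k))
        # Update the running totals only after the row has been recorded,
        # so the current result never leaks into its own predictor.
        seen[batter] += 1
        strikeouts[batter] += int(is_k)
    return rows


# Made-up sequence of at-bats for two hypothetical batters.
sample = [("a", True), ("a", False), ("b", False), ("a", True)]
rows = prior_only_rates(sample)
```

A training set built this way keeps July out of May: the predictor for each row reflects only what was known at the time.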

Eventually curiosity got the best of me and I decided to emulate the original exercise. Before I really start to nerd out on the inner workings, you can find this iteration of the Outcome Machine as a Google Sheet here. You can either select a pitcher/batter combination through the dropdown or hard key in the rates in a custom, hypothetical matchup below that. League average is set by default to projections for 2020 but can be updated as desired in the custom matchup. I would note that the preset statistics in this tool are total projections for 2020 but not broken out into L/R splits, as to my knowledge that data is currently behind a paywall. Read the rest of this entry »


The Rise of the Hit by Pitch

With the current trend of thinking in the front offices in MLB, we are seeing many aspects of the game at all-time highs and lows. Strikeouts, home runs, and a lack of stolen bases get a lot of content created about them, but there is also something else at an all-time high: hit by pitches. We saw 1,984 hit by pitches in 2019, the most in MLB history, surpassing the previous high of 1,922 in 2018. There has not been this rate of hit by pitches per game since 1900, and baseball is very different from how it was then.



You may say that there are more pitches thrown in games than ever before, so the rate per game may be rising because of that. But when we account for that and look at the rate per pitch (which we can do from 2008 onward), you can still see a sharp increase in the last two seasons.



What has gone on here? There should be something responsible for this increase in hit by pitches. Is it the pitchers? Do we have guys who can throw hard but have less command, so they are hitting more batsmen? Are they throwing inside more often? Is it the hitters? Do we have guys who are getting tighter to the plate or players who are just more willing to take the hit to get on-base?

Let’s start with the pitchers. Thanks to the PITCHf/x and TrackMan data we have the location of every pitch since 2008. I will be using the Statcast zones to bucket the data. Read the rest of this entry »


Modeling Strikeout Rate with Plate Discipline Part 1: Hitters

Strikeout and walk rates are perhaps the most popular and widely used peripheral statistics, particularly for pitchers. However, with pitch level data, these statistics now have “peripherals” of their own. I was curious if I could create an accurate-yet-interpretable model using FanGraphs’ plate discipline metrics that could offer insight on what drives the differences in strikeout and walk rates between players.

While many have noted individual correlations between a single statistic and strikeout rate, I have not seen many unifying models that incorporate several plate discipline metrics. For the first part of this study, I will focus on hitter strikeout rate, but I also intend to look at walk rate and, later on, pitchers’ strikeout and walk rates.

If you are not a fan of mathematical details, feel free to skim or skip these next few sections to get to my overall conclusions.

Methodology

Plate Discipline Flash Card 12-29-15

Note: I used BIS discipline statistics rather than PITCHf/x. I do not think this made a significant difference, but I think it is important to keep in mind.

FanGraphs gives us nine plate discipline statistics to work with. However, several of them can be removed as they can be derived using the other statistics. In a regression setting, this phenomenon is called perfect multicollinearity, which is when an explanatory variable can be perfectly formulated by other explanatory variables. With a high degree of multicollinearity, it can be extremely difficult to tell which particular variable is responsible for a change in the response variable, which is problematic for inference. Using some basic dimensional analysis, I found formulas for all three of these: Read the rest of this entry »


Finding Ray Fagan: A Minor League Mystery

Sometimes numbers tell a story. Sometimes that story is a mystery.

I came across the Baseball-Reference page for Raymond Fagan and was stunned by what I saw. It says Fagan went 13-0 with a 1.16 ERA for the Class D Oklahoma City Senators in 1915. Now the stunning part – it says it was his only professional season. Despite those dominant results, it appears Fagan never pitched again.

What happened to Raymond Fagan? Did he suffer a career-ending injury? Did he get into legal trouble and change his name? A Google search yielded no answers. This mystery required a deeper dive. Read the rest of this entry »


Third Grader Hits the Library: An Origin Story

We usually think of origin stories as the province of fictional superheroes or the real-life super rich. It could be an ordinary boy bitten by a radioactive spider or a refugee arriving on Earth from an annihilated planet. Perhaps we think of a nearly destitute J.K. Rowling toiling away at her first novel in a coffee shop, or Jeff Bezos creating an empire from scratch on a computer in his living room. Yet many of us who came from humble origins and went on to live simple, unremarkable lives also have a narrative that informs who we became. Mine happened in third grade.

I am a husband, a father, and a teacher. To these three descriptors of my identity I would add one more, just slightly less central. I am a baseball fan.

I am not one of the true obsessives who grew up playing Strat-O-Matic and graduated to planning his whole calendar around the SABR conference or spending countless hours with multiple fantasy leagues (two is my limit). But I have been a fantasy league commissioner since 1992, and the majority of text messages that my adult son and I exchange have some connection to the top Atlanta Braves prospects for the coming year. I also get to sleep most nights not by counting sheep, but by silently reciting World Series winners backward from 1970.

Baseball, its present and its past, is deeply ingrained in my outlook on life. My bookshelf is 70% baseball, 30% history and politics.

Baseball on the field was part of my youth, first as a fourth-rate Little League catcher and then as a minor league batboy for the Class A Lynchburg Mets.

Family vacations have often included trips to Baltimore or Atlanta for games. My son’s youth and high school games with me as spectator, coach, or scorekeeper were part of the rhythm of our family life for over a decade. Our baseball bond defines our relationship.

As the immortal lyric of David Byrne plaintively asks, “well, how did I get here?” Read the rest of this entry »


Why Most Fans Blame the Players

I imagine most people reading this have a favorite team. And over time, you’ve likely had numerous players on that team whom you particularly enjoyed watching play. But when push comes to shove, who receives your greatest loyalty, the team or the players?

I’m a Cardinals fan, and I greatly enjoyed Albert Pujols’ contributions to the Redbirds’ success during his 11 years wearing the birds on the bat. Since he’s left St. Louis? Sure, I’ve been happy for him when he’s done well (getting his 3,000th hit as well as his 500th and 600th home runs), but it’s not the same. He’s an Angel now, not a Cardinal, so I’m simply not as invested in his accomplishments.

Most of you probably feel the same way, and understandably so. Teams are (mostly) eternal, while players are ephemeral. Can I name the starting eight position players for the 2011 Cardinals? Probably not, but I still know they won the World Series that year.

Things flip, however, when we go off the playing field and into the negotiating room. When the owners and players are battling over matters of the game (particularly the divvying up of the loot), I largely stand behind the players. The owners become the faceless, monolithic corporations that extort billion-dollar ballparks from their communities and work extremely hard to give the players as small a portion of the pot as possible, while the players have short careers and are positioning themselves to take care of their families as much as possible before their careers end.

Of course, it’s not that cut-and-dried. Both sides have their virtuous and unseemly characteristics. Each group is willing to put their interests before others.

But regardless of who sticks it to whom for their own benefit, it’s largely the players who suffer the vitriol of the fans and media when the two sides clash. The question is, why is that? The answers actually make a lot of sense — even if they really don’t. Read the rest of this entry »


All Stolen Bases Were Not Created Equal

Fielding percentage is often criticized for the selection bias introduced by a player’s range (good defenders attempt more difficult plays, leading to more errors). A similar selection bias is present in stolen bases. On any given pitch, it is at the sole discretion of the runner whether he will attempt a steal. Naturally, the runner will only attempt a stolen base when he believes he has an advantage over the pitcher and catcher.

Ivan Rodriguez caught 46% of base-stealers throughout his career, topping out at a 60% caught stealing rate in his prime and leading the league in CS% in nine seasons. Knowing that stealing against Pudge is little more than a pipe dream for most, only the best baserunners would dare to attempt a steal. If this assumption holds, Rodriguez’s CS% would in fact be far more impressive than initially reported due to the level of competition he faces relative to a typical catcher.

To adjust for selection bias in stolen-base attempts, I developed an Elo model. For those unfamiliar, Elo ratings (named for their creator, Arpad Elo) are a method of calculating the relative skill levels of players in zero-sum games. You might recognize Elo from chess rankings or FiveThirtyEight’s sports prediction models. These ratings can be used to directly estimate the probability of winning a match between two individuals or teams. The ratings change after each match, rewarding a win by an underdog more than a win by the favorite.
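For readers who have not seen the mechanics, here is a minimal sketch of the standard two-player Elo calculation as it is commonly used in chess. It is not the stolen-base model itself, which has to account for more than two participants. The function names, the 400-point scale, and the K-factor of 20 are conventional choices used here as assumptions.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a, rating_b, a_won, k=20.0):
    """Return the new (rating_a, rating_b) after one head-to-head result."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    # The winner gains exactly what the loser gives up (zero-sum), and the
    # size of the swing grows with how surprising the result was.
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


# An underdog win moves the ratings more than a favorite's win does.
underdog_gain = update(1400, 1600, a_won=True)[0] - 1400
favorite_gain = update(1600, 1400, a_won=True)[0] - 1600
```

With a 200-point gap, the favorite is expected to win roughly three times out of four, so an upset shifts the ratings about three times as much as the expected result does.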

On a stolen-base attempt, the runner, pitcher, and catcher all play a major role in the outcome of the play. An argument could also be made for the importance of the fielder receiving the throw, especially when considering the select few who can make tags like this: Read the rest of this entry »