Batter Performance vs. Pitcher Clusters
by Jordan Fromm
January 24, 2019

Managers are always attempting to optimize their lineups for success. Whether they make in-game decisions like double-switches and lefty-righty matchups, or change things up based on recent or historical performance, every move is meant to give their team a competitive advantage. What if they also made alterations based on pitcher groupings? In this article, I attempt to determine whether batter performance is affected by pitcher clusters that are organized by pitch speed and pitch proportion.

The parameters used to cluster pitchers are below:

Proportion of Pitch Thrown
Average Pitch Speed

These statistics were calculated for the following pitch types:

Changeup
Curveball
Eephus
Cutter
Four-seam fastball
Sinker
Two-seam fastball
Knuckle-curve
Knuckleball
Slider
Splitter

*All data in this study is from 2010-July 2017 (MLB Gameday).

Optimal Number of Clusters

In order to properly cluster pitchers, I first had to determine which factors I would use to group them. I chose both pitch speed and pitch proportion because speed alone may represent a pitch that a pitcher threw only once, whereas a pitch thrown consistently and effectively over a career could affect a batter’s mindset when they face that pitcher multiple times.

Using a hierarchical clustering tree, I was able to visually determine the similarity between pitchers. Below, I’ve included a random sample of 30 players. All of these players are somehow related, even though their selection was completely random. As you navigate down the tree, pitchers become more similar.

However, I wanted to be sure that I didn’t arbitrarily select a number of clusters to represent all of the pitchers in my dataset. Therefore, I used the silhouette and elbow methods to determine the optimal number of clusters. The elbow method uses the total within-cluster sum of squares to measure how compact the clusters are.
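
As a rough illustration of these two checks, here is a minimal Python sketch. It is not the project's code. It uses synthetic two-dimensional points in place of the real pitch speed and pitch proportion features, and it clusters them with a plain k-means with random restarts, which is an assumption, since the article only shows a hierarchical tree. All function names are hypothetical. For each candidate number of clusters it reports the total within-cluster sum of squares (the elbow quantity) and the average silhouette width, which compares each point's average distance to its own cluster with its average distance to the nearest other cluster.

```python
import math
import random


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def within_ss(points, labels):
    """Total within-cluster sum of squares (the elbow-method quantity)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = centroid(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Average silhouette width over all points; needs at least 2 clusters."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        def avg_dist(c):
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            return sum(d) / len(d) if d else None

        a = avg_dist(labels[i])
        if a is None:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        b = min(avg_dist(c) for c in clusters if c != labels[i])
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


def kmeans(points, k, restarts=10, iters=100, seed=0):
    """Lloyd's k-means (an assumed stand-in); keeps the most compact restart."""
    rng = random.Random(seed)
    best_labels, best_wss = None, math.inf
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(iters):
            labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
            new_centers = []
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                # Keep the old center if a cluster ends up empty.
                new_centers.append(centroid(members) if members else centers[c])
            if new_centers == centers:
                break
            centers = new_centers
        wss_now = within_ss(points, labels)
        if wss_now < best_wss:
            best_labels, best_wss = labels, wss_now
    return best_labels


# Synthetic stand-in for standardized pitcher features: three tight blobs.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (6.0, 0.0), (0.0, 6.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in blob_centers for _ in range(20)]

wss, sil = {}, {}
for k in range(2, 7):
    labels = kmeans(points, k)
    wss[k] = within_ss(points, labels)
    sil[k] = mean_silhouette(points, labels)
    print(f"k={k}  within-SS={wss[k]:8.2f}  silhouette={sil[k]:.3f}")

best_k = max(sil, key=sil.get)
print("k with the highest average silhouette width:", best_k)
```

On this toy data the within-cluster sum of squares should drop sharply up to three clusters and the silhouette width should peak there. Real pitcher data is rarely this clean, which is one reason to compare both methods as the article does.
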
You will notice a bend in the graph, which indicates the optimal number of clusters using this method. The silhouette method measures the quality of the clustering; the goal is to maximize the average silhouette width.

As you can see from the graphs above, the elbow method suggests 4 or 8 clusters, and the silhouette method suggests 3, 5, 8, or 10. As eight was the only overlap, I chose to separate the pitchers in my dataset into eight groups. Additionally, while it is generally understood that the more clusters you use, the more similar the pitchers within each cluster are, I thought it was necessary to have enough pitchers in each cluster that I could easily compare them. Having only two or three pitchers per cluster to compare against each other would be pointless.

Pitcher Clustering

The 1,574 pitchers were split into eight clusters. Examples of the clusters are included below:

Crafty Starters (Cluster 1)
Slider-Heavy Relievers (Cluster 2)
Hard-Throwing Starters (Cluster 3)
Starters with a Cutter (Cluster 4)
Fastball-Heavy Relievers (Cluster 7)
Sinker-Heavy Relievers (Cluster 8)

Standout Batters

In order to identify batters who outperform their standard career performance, I graphed their mean wOBA (against all clusters) against their range of wOBA (best performance vs. a cluster minus worst performance vs. a cluster). Then, based on percentiles, I separated the graph into sections. The top left area showcases batters who have performed consistently well throughout their careers and have low variability in their wOBA against all clusters. These batters are perennial All-Stars and Hall of Famers. On the other hand, the batters in the lower right box of the graph have extremely high variability, yet have below-average career numbers.
The graph is included below:

High Variance, Low Mean Batters

While the top left is important because it identifies great batters, the bottom right area of this graph represents players that a manager can platoon depending on the cluster to which the opposing pitcher belongs. They shouldn’t be everyday players, as they usually underperform, but when utilized in the right matchup, they provide great value to a team. The parameters that define this section can be expanded; however, I used the top and bottom 10% of the data for range and mean in order to maintain consistency.

Furthermore, it is important to identify pitcher clusters that stand out among the others. Cluster 4 (Starters with a Cutter) has the lowest overall wOBA against. Of the top 13 pitchers (according to FanGraphs WAR) in 2016, four (Max Scherzer, Johnny Cueto, Madison Bumgarner, and David Price) appear in Cluster 4. Moreover, historically great batters perform very well against this cluster. Therefore, it can be inferred that these batters are great because they perform exceptionally well regardless of the opposition.

This is a very exciting project, as more variables can always be added to the clusters. Eventually, I hope to incorporate pitcher handedness into the grouping process to see how the clusters are affected. Feel free to let me know if you have any suggestions regarding the future of this project.

The code for this project is located on my GitHub.
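
For readers who want to experiment with the mean-versus-range split described in the Standout Batters section, here is a minimal Python sketch. It is not the project's code. The batter names and wOBA values are made up, the function names are hypothetical, and it assumes that "mean wOBA" is the simple average of a batter's per-cluster wOBA values. The 10% cutoffs follow the top and bottom deciles mentioned above.

```python
from statistics import mean, quantiles

# Hypothetical wOBA against clusters 1-8 for ten made-up batters.
woba_by_cluster = {
    "Batter A": [0.390, 0.385, 0.395, 0.380, 0.388, 0.392, 0.386, 0.391],
    "Batter B": [0.345, 0.355, 0.350, 0.340, 0.352, 0.348, 0.356, 0.351],
    "Batter C": [0.320, 0.340, 0.335, 0.325, 0.330, 0.338, 0.322, 0.331],
    "Batter D": [0.310, 0.330, 0.300, 0.325, 0.335, 0.315, 0.320, 0.328],
    "Batter E": [0.300, 0.320, 0.290, 0.315, 0.330, 0.305, 0.310, 0.312],
    "Batter F": [0.290, 0.340, 0.300, 0.330, 0.310, 0.320, 0.305, 0.325],
    "Batter G": [0.280, 0.330, 0.300, 0.310, 0.295, 0.320, 0.305, 0.300],
    "Batter H": [0.270, 0.330, 0.290, 0.310, 0.300, 0.315, 0.285, 0.300],
    "Batter I": [0.275, 0.320, 0.290, 0.300, 0.285, 0.310, 0.295, 0.285],
    "Batter J": [0.200, 0.380, 0.210, 0.230, 0.370, 0.220, 0.240, 0.250],
}


def summarize(data):
    """Map each batter to (mean wOBA, best-minus-worst wOBA) across clusters."""
    return {name: (mean(vals), max(vals) - min(vals))
            for name, vals in data.items()}


def split_by_decile(summary):
    """Return (consistent, platoon) batter lists using 10%/90% cutoffs."""
    mean_cuts = quantiles([m for m, _ in summary.values()], n=10)
    range_cuts = quantiles([r for _, r in summary.values()], n=10)
    mean_lo, mean_hi = mean_cuts[0], mean_cuts[-1]
    range_lo, range_hi = range_cuts[0], range_cuts[-1]

    # Top left of the graph: high mean, low range.
    consistent = [name for name, (m, r) in summary.items()
                  if m >= mean_hi and r <= range_lo]
    # Bottom right of the graph: low mean, high range.
    platoon = [name for name, (m, r) in summary.items()
               if m <= mean_lo and r >= range_hi]
    return consistent, platoon


summary = summarize(woba_by_cluster)
consistent, platoon = split_by_decile(summary)
print("Consistent performers:", consistent)
print("Platoon candidates:", platoon)
```

Requiring both cutoffs at once is strict, so only a handful of batters land in either corner. Loosening the deciles, as the article notes, would widen each section.
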