A PCA for Batter Similarity Scores (Part 1: Basic Methodology)

by Matt Malkus

March 2, 2015

This is the first in a series of pieces on a tool I’ve been working on. Admittedly, right now it’s quite raw, and probably needs some adjustments, which I’ll elaborate on towards the end of this post. It’s also quite lengthy – set it aside for when you have ample time to follow along, as there are some example calculations included to demonstrate the process.

Most of you are familiar with the “Similarity Scores” feature on Baseball Reference. If not, the explanation can be found here. The idea is to provide player comps using the player’s statistics. This has been around a while, and is based on a fairly simplistic “points-based” approach. Such an approach has the advantage of being easy to follow and intuitive, and as a quick tool to create fun conversation, it’s nice. However, it’s not very useful for purposes of projection for many reasons – not the least of which being that the points used are arbitrary and the statistics used are result statistics (hits, HRs, RBIs, etc) rather than being process-driven. It’s also intended to work on a player’s entire career. Some players have one or more drastic shifts in results over the course of their careers – and, to project a player in 2015 from his work in 2013-2014, we need to isolate data by season.

With the mountains of granular data available since Similarity Scores were first published, I thought it would be interesting to take a cut at creating something new in the same vein. My primary objectives were to create a similarity metric that (a) compared individual seasons rather than entire careers; (b) was based primarily on a hitter’s “process” or approach at the plate rather than strictly on results which are influenced heavily by luck; and (c) was mathematically defensible, in other words, non-arbitrary.

I downloaded batted ball and plate discipline data for all seasons 2002-2014 with 250+ plate appearances. This yielded 4,020 qualifying player seasons. I removed counting statistics (for example, number of infield hits), leaving only rate statistics in the dataset. I also removed any statistic which was derived from other statistics in the dataset (for example, GB/FB ratio, which of course is a ratio of the GB% and FB% statistics already in the dataset). Finally, I augmented the data with a few additional variables: K%, BB%, and ISO. Although these variables contain results which can be influenced by luck, they offer much-needed context used to interpret the ultimate results of the analysis, and tend to be more driven by a player’s underlying skill set over a 250+ PA sample.
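As a rough illustration of the filtering described above, here is a minimal sketch in plain Python. The field names, the rows, and the sets of "counting" and "derived" statistics are all made up for the example; they are not the actual download or the full list of columns that were dropped.

```python
# Hypothetical rows and field names, for illustration only.
seasons = [
    {"name": "Player A", "year": 2014, "PA": 610, "GB%": 0.45, "FB%": 0.33, "GB/FB": 1.36, "IFH": 12},
    {"name": "Player B", "year": 2014, "PA": 180, "GB%": 0.50, "FB%": 0.30, "GB/FB": 1.67, "IFH": 3},
    {"name": "Player C", "year": 2013, "PA": 402, "GB%": 0.38, "FB%": 0.41, "GB/FB": 0.93, "IFH": 7},
]

MIN_PA = 250                        # plate appearance cutoff from the text
COUNTING_STATS = {"IFH"}            # assumed example of a counting stat (number of infield hits)
DERIVED_STATS = {"GB/FB"}           # ratio of GB% and FB%, so redundant
ID_FIELDS = {"name", "year", "PA"}  # identifiers, not features

def prepare(rows):
    """Keep qualifying seasons and strip counting/derived columns."""
    dropped = COUNTING_STATS | DERIVED_STATS | ID_FIELDS
    prepared = []
    for row in rows:
        if row["PA"] < MIN_PA:
            continue  # too few plate appearances to qualify
        features = {k: v for k, v in row.items() if k not in dropped}
        prepared.append((row["name"], row["year"], features))
    return prepared

prepared = prepare(seasons)
```

Each surviving entry is a (name, year, features) tuple, where features holds only the rate statistics that would feed the PCA.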

I then performed a principal component analysis on the dataset. Without getting too far into the weeds on how PCA works, the best way to explain it is that it allows the data to speak for itself. Correlations between variables are taken into account by the process, so as to accurately represent the variability in the system. For example, K% and swinging strike % are highly correlated, and therefore shouldn’t be double-counted.

The great thing about PCA is that it creates a set of linear combinations of the variables in the dataset (eigenvectors) which explain the maximum amount of variation in the dataset. The first eigenvector explains the maximum amount of variation by any linear combination of variables, similar to a simple multivariate linear regression. The second eigenvector explains the maximum amount of variation by any linear combination after the variation explained by the first eigenvector has already been accounted for; and so forth, iteratively, until there are n linear combinations, where n is the number of variables being observed. These linear combinations can then be interpreted by the user. Ideally, each linear combination will be intuitive or explain some separate skill a hitter possesses, or some phenomenon a hitter endures.
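For readers who want to see the mechanics, here is a minimal sketch of a PCA using NumPy. It assumes the PCA is run on the correlation matrix (that is, on standardized variables), which is an inference from the fact that the eigenvalues in this post are expressed as shares of the number of variables; the data below is synthetic and the function name is made up.

```python
import numpy as np

def pca_from_correlation(X):
    """Return eigenvalues (descending) and matching eigenvectors (as columns)."""
    # Correlation matrix of the columns: PCA on standardized variables.
    R = np.corrcoef(X, rowvar=False)
    # eigh handles symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]

# Synthetic stand-in for the real dataset: 200 "player seasons", 4 variables.
# The first two columns share a common driver, so they are highly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(200, 1)),
    base + 0.3 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 2)),
])

eigenvalues, eigenvectors = pca_from_correlation(X)
print(eigenvalues)          # sums to the number of variables (4 here)
print(eigenvectors[:, 0])   # linear weights of the first component
```

Because the two correlated columns move together, the first component picks them up jointly rather than counting them twice, which is the same reason K% and SwStr% end up with large weights in the same column.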

Results of the PCA are summarized in the following table:

Variable	Vector 1	Vector 2	Vector 3	Vector 4	Vector 5	Vector 6	Vector 7
Eigenvalues	5.7517	3.3687	2.1167	1.7569	1.4742	1.0679	0.9719
BABIP	0.0014	0.0376	*-0.5039*	-0.0161	0.1951	-0.3996	-0.1602
LD%	-0.0621	-0.0076	-0.3039	-0.0077	0.4901	-0.3100	0.4311
GB%	-0.2000	0.1923	-0.3332	0.1313	-0.2881	0.3746	-0.2901
FB%	0.2191	-0.1803	0.4543	-0.1222	0.0562	-0.2194	0.0846
IFFB%	0.0204	0.0339	0.4500	0.1588	-0.1623	-0.2093	-0.0065
HR/FB	0.3150	-0.1494	-0.0847	-0.1163	0.1151	0.0464	-0.3903
IFH%	-0.0308	0.1262	-0.0635	0.1010	-0.3767	*-0.6465*	-0.3750
O-Swing%	0.0628	0.3926	0.0408	-0.4780	-0.0456	-0.0237	0.0132
Z-Swing%	0.2173	0.2649	0.0781	0.1262	0.3864	0.1856	-0.2030
Swing%	0.1383	0.4568	0.1221	-0.0124	0.2664	0.0666	-0.1362
O-Contact%	-0.2991	0.0578	0.0822	-0.4378	-0.0147	-0.0566	-0.0617
Z-Contact%	-0.3719	-0.0137	0.1083	-0.1036	0.1639	-0.0254	-0.1141
Contact%	-0.3915	-0.0543	0.1209	-0.0605	0.1612	-0.0335	-0.1290
Zone%	-0.0996	0.0025	0.1108	*0.6586*	0.1612	-0.0717	-0.0568
F-Strike%	0.0036	0.4160	0.0464	0.0673	-0.0562	-0.1479	0.1350
SwStr%	0.3795	0.1873	-0.0724	0.0668	-0.0654	0.0431	0.0758
BB%	0.0975	-0.4485	-0.1303	-0.0628	-0.0465	0.0831	-0.0357
K%	0.3280	0.0110	-0.1681	-0.0384	-0.3081	-0.0522	0.3201
ISO	0.2919	-0.2020	0.0451	-0.1407	0.2195	-0.0940	-0.4241

OK – an explanation of this table is in order.

Row 1 is the eigenvalue. A simple way of thinking about this is that the relative size of this number represents the share of variation explained by the linear weights in that column. The table is sorted by eigenvalue – the most important set of linear weights is on the left, representing about 30.3% (5.7517/19, where 19 is the number of variables) of the total variation in the dataset. The next column represents an additional 17.7% of the variation, after the first 30.3% is already accounted for. And so forth. The data above represents about 86.9% of all hitter variation.
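The percentages in the paragraph above can be reproduced directly from the eigenvalue row. This is just the arithmetic, using the seven eigenvalues from the table and 19 as the number of variables:

```python
eigenvalues = [5.7517, 3.3687, 2.1167, 1.7569, 1.4742, 1.0679, 0.9719]
n_variables = 19  # number of variables in the dataset

# Share of total variation explained by each component.
shares = [ev / n_variables for ev in eigenvalues]

# Running total across components.
cumulative = []
running = 0.0
for share in shares:
    running += share
    cumulative.append(running)

for i, (share, total) in enumerate(zip(shares, cumulative), start=1):
    print(f"Vector {i}: {share:6.1%} of variation, {total:6.1%} cumulative")
```

The first line of output shows 30.3%, the second 17.7%, and the cumulative figure after seven components is 86.9%.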

Going down each column are sets of linear weights assigned to each variable. Each column can be used to “score” a player. Let’s take Nick Markakis’ 2014 season as an example to build around. We would calculate the “score” on the first component by multiplying the weights in the first column of the first table by Markakis’ values on each variable. Starting from the top, Markakis had a 2014 BABIP of .299, a LD% of 19.6%… you get the point. So:

(0.299 * 0.0014) + (0.196 * -0.0621) + (0.459 * -0.2000) + … + (0.118 * 0.3280) + (0.111 * 0.2919) = -0.7176

We have a number. Great! What does that number mean? Well… nothing, really. It’s not in any sort of unit of measure we can comprehend. It’s just a number. To interpret it, we’ll need to know what the average score is for the dataset, and the variance of scores. Then we can see how far above or below average this score was in context.
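The scoring step is just a weighted sum (a dot product) of a column of weights and a player's values. Here is a minimal sketch; the function name is made up, and it uses only the five terms written out in the formula above, so the result is a partial sum and not the full score of -0.7176, which needs all 19 variables.

```python
def component_score(weights, values):
    """Weighted sum of a player's values using one column of PCA weights."""
    return sum(weights[stat] * values[stat] for stat in weights)

# Only the five terms shown explicitly in the formula; the other 14 are omitted.
weights_v1 = {"BABIP": 0.0014, "LD%": -0.0621, "GB%": -0.2000, "K%": 0.3280, "ISO": 0.2919}
markakis_2014 = {"BABIP": 0.299, "LD%": 0.196, "GB%": 0.459, "K%": 0.118, "ISO": 0.111}

partial = component_score(weights_v1, markakis_2014)
print(f"Partial score from five of 19 terms: {partial:.4f}")
```

Passing the full 19-entry dictionaries for a column and a player season would give the complete score for that component.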

We’ll also need to interpret what a high score for this metric means, and what a low score means. Take a look at the weights in the first column of the first table, which were used to compute this score. Numbers that are bolded or underlined carry a lot of weight in the score. In this case, to get a high score, a player would probably need to have:

A high HR/FB rate
A high whiff rate
A high strikeout rate
A low contact rate, particularly on pitches inside the strike zone (Z-Contact%).

To a lesser extent, the underlined values show that a high score would probably represent:

A high ISO
Poor contact outside the zone (O-Contact%), in addition to inside the zone.

What do these characteristics suggest? Interpretation can be tricky, but the combination in the lists above seems to suggest that high scorers are “selling out for power”: they are swinging hard and missing a lot, but hitting more homers because of it.

None of this really sounds like Markakis, so intuitively, we’d think that he should score pretty low here. Indeed, the average score was -0.491; Markakis’ 2014 season was about 1.60 standard deviations below average. By contrast, you might have just thought of someone like Chris Davis when you read that last paragraph. Indeed, Chris Davis’ 2014 season was 2.19 standard deviations above average, and Davis has never logged a season that wasn’t at least 1.92 standard deviations above average in his career. The two are different hitters, which is obvious watching them. But now we have systematic proof.
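Converting a raw score to "standard deviations from average" is the usual z-score calculation. The standard deviation of Vector 1 scores isn't quoted in this post, so the sketch below back-solves it from three numbers that are: Markakis' raw score (-0.7176), the average score (-0.491), and his z-score of -1.596 from the z-score table further down. Treat the resulting value as implied, not reported.

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score sits above (+) or below (-) the mean."""
    return (score - mean) / sd

markakis_score = -0.7176   # raw Vector 1 score from the worked example
dataset_mean = -0.491      # average Vector 1 score quoted in the text
markakis_z = -1.596        # Markakis' Vector 1 z-score from the z-score table

# Back-solve the standard deviation implied by those three figures.
implied_sd = (markakis_score - dataset_mean) / markakis_z
print(f"Implied standard deviation: {implied_sd:.3f}")

# Sanity check: plugging it back in recovers the quoted z-score.
print(f"Recovered z-score: {z_score(markakis_score, dataset_mean, implied_sd):.3f}")
```

The implied standard deviation comes out to roughly 0.142, subject to rounding in the quoted inputs.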

Going through the same process, we can come up with scores for each of the other columns in the first table as well. Again, we’ll need to examine what a “high score” means for each column, so that we can interpret the results. In my best judgment, I assigned names to each score/column. The description of each score is below, along with the highest scorer in each category for the 2014 season.

Vector 1: “Sell Out for Power” – already described above. Highest 2014 scorer: George Springer
Vector 2: “Impatient Hacker” – high scorers are swinging a ton consistently, and are walking quite a bit less than average. Highest 2014 scorer: Wilson Ramos
Vector 3: “Weak FB Hitter” – high scorers have very low BABIPs because they are popping up and hitting lots of weak flies instead of hitting liners and grounders. Highest 2014 scorer: Chris Heisey
Vector 4: “Pitchers Attack” – for some reason, high scorers are being thrown a ton of strikes. They don’t swing a lot when they are thrown balls. They have marginally lower power than average, so maybe pitchers just aren’t afraid of these guys. Highest 2014 scorer: George Springer (again)
Vector 5: “Balanced Masher” – high scorers are good all-around hitters. They swing at lots of strikes, mash line drives, and don’t strike out very much. Highest 2014 scorer: Freddie Freeman
Vector 6: “Slow GB Hitter” – high scorers are hitting a ton of ground balls, but they aren’t getting many infield hits. Bad combination. Highest 2014 scorer: Everth Cabrera
Vector 7: “Put On a Glove” – my favorite category name. High scorers are striking out a lot, and though they hit a lot of line drives when they connect, they aren’t getting on base much. They should probably go put on a glove. Highest 2014 scorer: Eugenio Suarez

Note that these vector names might not capture everything about what the vector represents. For example, no one is suggesting that Everth Cabrera is slow, necessarily – maybe he was just unlucky – but he did hit a whopping 66.9% of balls on the ground, and is over 60% career. Admittedly, these names could be better, and I’m rather open to other suggestions.

Now we can look at z-scores (+/- standard deviations from average score) on each of these 7 metrics and get an idea of what kind of hitter we have on our hands. Continuing with the Markakis and Davis examples…

Name	Year	Sell Out for Power	Impatient Hacker	Weak FB Hitter	Pitchers Attack	Balanced Masher	Slow GB Hitter	Put On a Glove
Nick Markakis	2014	-1.596	-0.293	0.344	-1.698	-0.603	-0.173	-0.131
Chris Davis	2014	*2.188*	0.005	-0.560	-0.616	-0.343	0.059	1.785

Markakis comes out looking like the patient, contact-oriented hitter he was, while Davis looks like a guy who was swinging from the heels and failing a lot. Promising start.
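One simple way to turn these z-score profiles into a single "how different are these two seasons" number is the straight-line (Euclidean) distance between them. To be clear, this post doesn't settle on a comparison metric, so the choice of Euclidean distance here is purely an assumption for illustration; the z-scores are the ones from the table above.

```python
import math

# Z-scores on the seven vectors, in table order.
markakis_2014 = [-1.596, -0.293, 0.344, -1.698, -0.603, -0.173, -0.131]
davis_2014 = [2.188, 0.005, -0.560, -0.616, -0.343, 0.059, 1.785]

def season_distance(a, b):
    """Euclidean distance between two z-score profiles (smaller = more similar)."""
    return math.dist(a, b)

distance = season_distance(markakis_2014, davis_2014)
print(f"Markakis 2014 vs. Davis 2014: {distance:.3f}")
```

Most of the gap comes from the first vector, where the two differ by nearly four standard deviations. An unweighted distance also treats all seven vectors as equally important, which may or may not be desirable given how different their eigenvalues are.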

A caveat – as I said, this is very rough at this point. One thing that I should do, which I did not do to this point, is to adjust the data by season and possibly also by ballpark so that different seasons are more comparable (along the same lines as OPS+). I anticipate that the ordering and even the interpretation of the vectors might change once I do this. Particularly, the “Pitchers Attack” score might be highly correlated with time – Zone% has been decreasing by nearly a full percentage point per year over the sample, whether due to a smaller strike zone or for some other reason that doesn’t immediately come to mind. I might consider removing steroid-era Barry Bonds from the dataset as an extreme outlier with his absurd 25-35% walk rates, as well.
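For a sense of what a season adjustment could look like, here is one minimal sketch: standardize a variable within each season, so a value is measured against that year's league and not the whole 2002-2014 pool. This is only one candidate approach, not a settled method, and the rows and field names below are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_by_season(rows, stat):
    """Add a '<stat>_z' field: the stat's z-score within its own season."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row[stat])

    adjusted = []
    for row in rows:
        values = by_year[row["year"]]
        mu, sigma = mean(values), pstdev(values)
        # Guard against a season where every value is identical.
        z = (row[stat] - mu) / sigma if sigma > 0 else 0.0
        adjusted.append({**row, stat + "_z": z})
    return adjusted

# Invented rows: Zone% is lower across the board in the later season.
rows = [
    {"name": "A", "year": 2002, "Zone%": 0.54},
    {"name": "B", "year": 2002, "Zone%": 0.50},
    {"name": "C", "year": 2002, "Zone%": 0.52},
    {"name": "D", "year": 2014, "Zone%": 0.46},
    {"name": "E", "year": 2014, "Zone%": 0.42},
    {"name": "F", "year": 2014, "Zone%": 0.44},
]
adjusted = standardize_by_season(rows, "Zone%")
```

In this toy data, players A and D end up with the same adjusted value even though their raw Zone% figures differ by eight points, because each is equally far above his own season's average.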

My next piece will either revolve around de-trending the data to standardize data by season, or how this system would be used to compare player seasons. Sure, Nick Markakis and Chris Davis might not be very similar, but who else are they similar to? The order I do this in probably depends on what sort of feedback I get, and how difficult I find the de-trending process.

I'm a statistician and a baseball junkie based in NYC.

14 Comments

Michael L Williams

10 years ago

Great Work. We appreciate all the insight. Especially the 2014 comparisons.

jim

10 years ago

Awesome idea – I think you should add even more categories to describe hitters in even more detailed ways, but this is a great start. It would be amazing to track this from year to year in a database somewhere–for instance, to see Chris Davis change from 1 category to another, would be quite useful!

Matt Malkus

10 years ago

Reply to jim

Thanks for the kind words. I do have this data based on the current version of scores described in the article. I’ve noticed that some hitters are very consistent and compare favorably to themselves over time (for example, Nelson Cruz 2014 is very close to every other Nelson Cruz season from 2011-2013). Other guys – especially younger guys – tend to have more drastic changes year to year.

There are a lot of potential applications and next steps to this, and one of them is to demonstrate that there is an extension of this model that has predictive power for future seasons, based on historical season data in combination with the player’s age, etc.

Frank

10 years ago

Cool stuff! I did a less ambitious version of this last year (linked below)—good to know there’s at least a couple of similarities here (e.g. our first components are qualitatively similar).

http://clownhypothesis.com/2014/02/09/principals-of-hitter-categorization/

Matt Malkus

10 years ago

Reply to Frank

Excellent! Yes, it’s precisely the same idea; it seems I’ve just included some more variables/components and evaluated player-seasons as separate observations. I’m glad others have thought of doing this, and I agree with what you wrote towards the top: exploratory data analysis is fascinating, particularly in explaining the complex skills a hitter needs to succeed.

tz

10 years ago

Although I’m very eager to see how players compare to each other, now that I’ve read through the article I think I’d prefer seeing the same analysis on “normalized” numbers (season and park-adjusted like OPS+), though I’d suggest not going too crazy with it (for example, only park-adjusting the batted-ball outcome numbers).

The reason I say this is that the vectors further on down the chain could become completely different based on the ripple effect of any changes in the first few. Intuitively, the first two categories look like logical starting points for a decision tree for describing a hitter (power vs. contact, aggressive vs. patient), but after that it’s less clear what should arise. (It’s also nice that Frank’s model has similarities on the first few components)

This is fascinating stuff. I’m a huge fan of similarity scores, and I love the approach you are taking with this. Can’t wait to see more!

Matt Malkus

10 years ago

Reply to tz

Thanks! In thinking through the right way to seasonally adjust the data, I’ve hit a couple snags. For example, O-Swing% is increasing over time, but I can produce a linear regression for O-Swing% on other variables where time becomes insignificant, and the R-squared of that regression is about .98. So, this leads me to conclude that O-Swing% isn’t changing based on some exogenous factor that isn’t in my model. On the other hand, I can’t do anything of the sort with Zone%, which is one of the strongest trending variables in the dataset. I’m going to look at available Pitch f/x data to see if I can identify the cause of this (if there’s work out there on this phenomenon anyone can point me to I would greatly appreciate it).

Bottom line though, I want to identify which variables are exogenous, caused by things beyond a batter’s control (the way umpires are calling the strike zone, or things that can logically be attributed to changes in pitching), and only adjust those variables by season – rather than bluntly adjusting every variable that moves.

Springer Comp Fail

10 years ago

Don’t think Springer was the best comp for Vector 4 – marginally lower power than average?

Matt Malkus

10 years ago

Reply to Springer Comp Fail

I probably shouldn’t have alluded to power there as ISO was negative but not a significant factor on the score. More importantly, Springer was in the bottom 10% of O-Swing%, had the third lowest O-Contact%, and was in the top 20% of Zone% for the league last year. In his case, rather than pitchers not being afraid to come inside the zone, he seemed to have a patient approach which likely led to favorable counts and better pitches to hit, but was whiffing much more than usual when he made mistakes and chased.

Vector 4 is the one that I had the most trouble with because it seems correlated with season – league average z-score for it was -0.85 which is way off from zero. So this is one reason I’m taking the detrending seriously. Springer’s z-score was +0.77 on that vector so it wasn’t a major identifier of who he profiled as, in the grand scheme of things.

That said, the closest comp to Springer’s 2014 was Justin Maxwell’s 2012. Springer is much younger and is expected to improve dramatically over that – but looking at those two seasons in a vacuum, the comp appears reasonable.

10 years ago

How did you decide on having seven vectors?

PAG

10 years ago

Also curious to know why you chose 7 factors. Judging by your eigenvalues, I’d say 3 would be most appropriate (although one might be enough).

And did you use a rotation or did you leave these orthogonally rotated?

Matt Malkus

10 years ago

Hi guys, good questions. 7 factors is admittedly arbitrary; there was a significant dropoff between 7 and 8 where the eigenvalue approximately halved. Of course the intuition gets fuzzy on some of the less important vectors when you use so many. My hope is that upon seasonally adjusting the data appropriately, the most important vectors will be both intuitive and represent more of the variation, so that using fewer vectors becomes an easy decision.

Regarding the rotation, they are orthogonally rotated.

prestomagnetic

10 years ago

Reply to Matt Malkus

Cool – is there any way I can email you about this? I have some thoughts I’d like to share that might improve this…

Grant Gates

10 years ago

I’m familiar with eigenvalues from linear algebra/differential equations type contexts. Can you point me to a resource to understand where your eigenvalues came from and what they represent?
