A PCA for Batter Similarity Scores (Part 1: Basic Methodology)

This is the first in a series of pieces on a tool I’ve been working on. Admittedly, it’s quite raw right now and probably needs some adjustments, which I’ll elaborate on towards the end of this post. The post is also quite lengthy – set it aside for when you have ample time to follow along, as it includes some example calculations to demonstrate the process.

Most of you are familiar with the “Similarity Scores” feature on Baseball Reference. If not, the explanation can be found here. The idea is to provide player comps using the player’s statistics. The feature has been around a while and is based on a fairly simplistic “points-based” approach. Such an approach has the advantage of being intuitive and easy to follow, and it works nicely as a quick tool for sparking fun conversation. However, it’s not very useful for projection, for many reasons, not the least of which is that the points used are arbitrary and the statistics used are result statistics (hits, HRs, RBIs, etc.) rather than process-driven ones. It’s also intended to work on a player’s entire career. Some players have one or more drastic shifts in results over the course of their careers, so to project a player in 2015 from his work in 2013-2014, we need to isolate data by season.

With the mountains of granular data that have become available since Similarity Scores were first published, I thought it would be interesting to take a cut at creating something new in the same vein. My primary objectives were to create a similarity metric that (a) compared individual seasons rather than entire careers; (b) was based primarily on a hitter’s “process” or approach at the plate rather than strictly on results, which are influenced heavily by luck; and (c) was mathematically defensible (in other words, non-arbitrary).

