Draft prep: Framing the problem
So you’re preparing for your fantasy draft. You’re caught up on FanGraphs, checked for recent injuries at Rotoworld, maybe skimmed a few headlines from your other top 11 baseball news sites. Maybe you’ve even downloaded the FanGraphs positional rankings, and are planning to keep the file open during the draft as a reality check against the pre-set rankings of the site your league uses.
But really, what do the guys at FanGraphs know? Sure, they know a lot about baseball, and statistics, and this year’s projections, and a handful of underlying stats that tend to predict future performance. But what they don’t know is whether your league uses OBP instead of AVG, or OPS, SLG, or batters’ strikeouts, or maybe holds and FIP and pitcher fielding percentage. If this is your situation, then I feel your pain. My fantasy league uses eight statistics for batters and pitchers, three each beyond the usual five. (In case you’re curious, the mysterious six are: Batter hits, K’s, & OPS; Pitcher holds, losses & complete games).
These differences matter. If your league uses OBP, Joey Votto turns from a fantasy player who’s solid in four categories (including average, where his impact is limited because he walks all the time) to a guy with a truly elite skill. Maybe it’s easy for you to account for the relative value of a Joey Votto, but how well can you project the 25th through 35th outfielders? Some might be much better or worse in your league. If you have batter strikeouts, as in my league, how do you value Mark Trumbo and his home run power against the elite contact skills of Norichika Aoki?
Generating your own rankings
One answer, and the one I opted for, is to generate rankings based on your own league’s stats. Now, this may sound a bit too work-intensive and time-consuming for most of you (especially those of you with relatively normal priorities), but in reality it wasn’t as time-consuming as I expected.*
First of all, there’s no need to reinvent the wheel. There are lots of projection systems out there that are available to the public, and some of them are quite good. I decided I would simply download all the projections listed on FanGraphs, and average them out. And then, after thinking for a little while about the costs and benefits of that approach, I decided I wouldn’t do that at all, and instead would use the results of just one projection system. But which one should I use? Luckily, that’s yet another bit of analysis we don’t need to bother with, because the Interwebs are full of crazy mathematicians who love baseball and have nothing better to do. After searching for a few articles that evaluate projection systems, like this one and this meta-one, I decided that the forecasts I trusted most (and were easiest to obtain) were Steamer for batters and FanGraphs fans for pitchers. (The high accuracy of the latter shocked me at first, but then I realized that fans assimilate the results of all the projection systems into their own player projections, departing from them only as dictated by common sense, inside scoop, and hope.)
Operationalizing the Solution
Here’s where it gets tricky. What advanced data manipulation packages and techniques are best for downloading reams of data from the FanGraphs site into your spreadsheet? Certainly there was no need for me to copy and paste the data 50 players at a time like someone living in the dark ages, was there? No, of course not. And I probably never really did that.
Instead – bear with me if you’re not technically inclined – I hit the gray “Export Data” button to the upper right of my chosen projection page. This involved a lot of loading the correct page, hovering my mouse over the text, and clicking, but in the end it was worth all the work, because 5 minutes of sweat, plus a beer, had finally paid off in spreadsheets full of data.
*If you’re not interested in these details, the fun stuff is posted in a couple of tables towards the end. (I like writing, so this is likely to go on for a while.)
Z-scoring your data points
Z-scoring batter projections is easy. The problem lies in determining what set of players to use in order to calculate means and standard deviations.
This is an important question, at least to the extent that any question in fantasy baseball is important. For example, if you use every hitter in the league, including the guys projected for 8 at-bats, you create the illusion that lots of players bat .220 or score only 4 runs, as opposed to your league’s reality in which .270 with 70 runs is pretty ordinary. For a little math fun, I compared the results generated using means and deviations 500 players deep (the equivalent of a 25-team league that rosters 20 position players) versus one with more reasonable assumptions. The deeper pool caused huge increases in the variance of runs and RBI’s, so a guy who drove in and scored 100 compared no better to the mean either way (~2+ standard deviations). It caused smaller increases in the variance of SB’s, HR’s, and OPS, which, together with the lower means, means the deeper pool overvalues guys who produce in these categories. Martin Prado and Torii Hunter were made sad, whereas Billy Hamilton was elevated to a demigod (or at least a top-40 hitter).
So how do you generate values that represent your player pool?
One method – and a very reasonable one – is to use the final statistics compiled by your league the previous year. With this data, it’s easy to generate per-slot averages based on last year’s performance, and to compare projected performance against it. But I did not choose this method. A more savvy number-cruncher might say that projection systems, while designed to be as accurate as possible for each player, may be systematically biased on the whole, and therefore determining the value of this year’s projections based on last year’s actual statistics is tantamount to comparing apples and oranges.
I was more worried about lazy owners. Any league can have a couple of careless owners who are in it just for fun (the gall!), or who keep BJ Upton when he can’t even see the Mendoza line, because of that one time his cousin shook BJ’s hand at a Jay-Z concert. I know of what I speak. If your goal is to win your league, you want to base your evaluation on the best players available, rather than the happenstance of which Atlanta outfielders spent the whole year on someone’s roster.
I generated means using very precise data, plus a random stab in the dark. First, I looked up the exact number of players at each position in my league from the previous year. Then I mostly ignored this data. Although it’s true that player values vary greatly between leagues depending on how many players start, and how many are rostered, this is the sort of thing you can keep track of during the draft. Don’t draft another first baseman if you already have three of them and no shortstop, and don’t draft a first baseman just because he’s ranked ahead of a shortstop if there are another seven first basemen ranked close behind.
My league rostered only 123 regulars last year. Not a deep league. I used a lot more than 123 in my calculations in an effort to lower the means a bit, to account for the existence of catchers and second basemen. I then haphazardly created sort variables so I could bring the best 150 to 180 players to the fore, with the goal of getting a fair representation of the quality of players in my league. I tried various formulas like [(HR+1) * R * RBI * (SB +1) * AVG * OPS] (adding 1’s so as not to exclude players projected for 0 HR’s or SB’s ) and PA * wOBA. Virtually every one of them produced a good representation of the best hitters projected for regular playing time. In the end, the best way to evaluate the sort is to look at the list and see if the guys near the cutoff are fringe players who are familiar from last year’s waiver wire.
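For the spreadsheet-averse, here is a minimal Python sketch of that sort step, using the first formula above. The function names, the dictionary layout, and the three players are all made up for illustration; none of them come from Steamer or from my actual spreadsheet.

```python
def sort_key(p):
    """Composite sort value: (HR+1) * R * RBI * (SB+1) * AVG * OPS.

    The +1 terms keep players projected for 0 HR or 0 SB from scoring zero.
    """
    return (p["HR"] + 1) * p["R"] * p["RBI"] * (p["SB"] + 1) * p["AVG"] * p["OPS"]


def reference_pool(players, size):
    """Return the top `size` players by the composite sort value."""
    return sorted(players, key=sort_key, reverse=True)[:size]


# Hypothetical projections, for illustration only.
players = [
    {"name": "Slugger", "HR": 30, "R": 90, "RBI": 100, "SB": 2, "AVG": 0.270, "OPS": 0.850},
    {"name": "Speedster", "HR": 0, "R": 80, "RBI": 40, "SB": 50, "AVG": 0.260, "OPS": 0.650},
    {"name": "Bench bat", "HR": 2, "R": 10, "RBI": 8, "SB": 0, "AVG": 0.230, "OPS": 0.620},
]

pool = reference_pool(players, 2)
pool_names = [p["name"] for p in pool]
```

In a real run the pool size would be something like 150 to 180, and the eyeball test on the guys near the cutoff still applies.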
Calculating projected player values
Once you determine which players you want to include, Excel is happy to instantaneously calculate averages and standard deviations for each stat. Once you have these values, you can re-include the entire player pool, or as much of it as you wish, and the formula for each player in each category is simply (his projected value – the average projected value)/standard deviation.
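If you would rather not do this in Excel, the same calculation can be sketched in a few lines of Python. This assumes the sample standard deviation (what Excel’s STDEV returns); the population version would work just as well as long as you use it consistently. The helper names and the tiny three-player pool are invented for the example.

```python
from statistics import mean, stdev


def pool_stats(pool, stat):
    """Mean and sample standard deviation of one stat over the reference pool."""
    values = [p[stat] for p in pool]
    return mean(values), stdev(values)


def z_score(value, avg, sd):
    """(projected value - average projected value) / standard deviation."""
    return (value - avg) / sd


# Hypothetical reference pool: three hitters projected for 10, 20, and 30 HR.
pool = [{"HR": 10}, {"HR": 20}, {"HR": 30}]
avg, sd = pool_stats(pool, "HR")

# A player outside the pool projected for 35 HR is scored against the pool's mean and deviation.
z = z_score(35, avg, sd)
```

The point of splitting it into two functions is the same as in the spreadsheet: the mean and deviation come from the trimmed pool, but the Z-score can then be computed for anyone.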
The next challenge is to generate ranks from the Z-scores. The simplest way is to add them together (being sure to subtract ones where lower scores are better, such as pitcher walks or batter strikeouts). But here, I discovered another issue. A potential superstar who might not have a full-time job could end up ranked about the same as or below a mediocre player who was guaranteed to start. If I wanted my draft rankings to make sense at a glance when I have just 90 seconds to pick a player while eating a sandwich, I needed to distinguish accumulators from guys with potential.
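As a rough sketch of the add-them-up step, here is how the sign flip might look in Python. The stat labels and the `NEGATIVE_STATS` set are assumptions for the example; substitute whichever lower-is-better categories your league counts.

```python
# Stats where a lower number is better, so the Z-score is subtracted.
# "K" (batter strikeouts) is just an example label.
NEGATIVE_STATS = {"K"}


def total_z(z_by_stat, negative=NEGATIVE_STATS):
    """Sum a player's Z-scores, subtracting the lower-is-better categories."""
    return sum(-z if stat in negative else z for stat, z in z_by_stat.items())


# Hypothetical hitter: good power, decent runs, strikes out a lot.
example = {"HR": 1.0, "R": 0.5, "K": 2.0}
total = total_z(example)
```

Here the strikeouts wipe out the power and then some, which is exactly the Trumbo problem.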
Ranking performance and potential
It matters whether a player is an okay guaranteed performer or an unpredictable potential star. If I find myself with no second basemen in the 22nd round, I might want to take the best guy who’s pretty much guaranteed 140 days in the starting lineup, like an Anthony Rendon or a Howie Kendrick. If my roster’s pretty much set, I might prefer a hitter who has a better chance to bust out and hit 45 home runs, like Chris Carter (unless I’m in my league, where his 80% strikeout rate falls 37 standard deviations below the mean).
What I decided to do was generate two rankings for each batter, one based on projected totals, and one based on projections per plate appearance. Luckily, Steamer has already done the work for us by projecting everyone in both ways. For instance, Everth Cabrera is projected as the 479th-best player by wOBA, with 74 runs and 45 stolen bases. At the other extreme, Colorado’s Kyle Parker is projected to be the 50th-best hitter in the league, just ahead of Dustin Pedroia, with a .279 batting average and .465 slugging percentage, despite getting only one plate appearance, and not getting a hit.
At this point, there are 2 sets of columns for each batter: 1 set of columns for his Steamer projections for each relevant stat, and 1 for the associated Z-scores. To this, I added 2 more sets of columns: 1 for per plate-appearance projections for each stat, and 1 for those associated Z-scores. (Dividing hits by plate appearances rather than at-bats feels unnatural, but that’s what you need to do if your league counts total hits.) Calculating per-PA quality is then easy, as you can just add the Z-scores (or subtract for negative statistics). But once you have projected rate statistics in your per-PA rankings, it becomes apparent that it doesn’t make sense to include the exact same values in your projected accumulated totals.
To handle this, I weighted the Z-scores for the rate stats. I multiplied the Z-score for AVG by projected AB’s/average projected AB’s, and you can do the same for OBP, using PA’s. My league uses OPS, a value generated by adding two fractions with different denominators (aka OBP & SLG), so to weight those Z-scores I multiplied them by projected (AB’s + PA’s)/average projected (AB’s + PA’s). I then added these weighted Z-scores to the other Z-scores for projected totals. The result of adding these weights is that a player who is one standard deviation above average in both AVG and OPS, and who has an average number of AB’s and PA’s, would get +2 from these categories in the variable used to rank projected totals. By the same lights, the aforementioned Kyle Parker’s AVG and OPS would essentially get no weighting at all, and have no effect at all on his projected totals, just as in real life his performance is not expected to have any effect at all on the rate stats of your team.
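The weighting is easier to see with numbers. Below is a small Python sketch of the two weights described above; the function names and the playing-time figures (550 AB and 600 PA as the pool averages) are made up for illustration.

```python
def weighted_avg_z(z_avg, ab, mean_ab):
    """Weight the AVG Z-score by projected AB / average projected AB."""
    return z_avg * ab / mean_ab


def weighted_ops_z(z_ops, ab, pa, mean_ab, mean_pa):
    """Weight the OPS Z-score by projected (AB + PA) / average projected (AB + PA)."""
    return z_ops * (ab + pa) / (mean_ab + mean_pa)


# Hypothetical pool averages for playing time.
MEAN_AB, MEAN_PA = 550, 600

# A regular with average playing time, one standard deviation above average in both stats.
regular = weighted_avg_z(1.0, 550, MEAN_AB) + weighted_ops_z(1.0, 550, 600, MEAN_AB, MEAN_PA)

# A cup-of-coffee player with the same rate Z-scores but a single plate appearance.
cameo = weighted_avg_z(1.0, 1, MEAN_AB) + weighted_ops_z(1.0, 1, 1, MEAN_AB, MEAN_PA)
```

The regular comes out at +2 from these two categories, while the one-PA player’s contribution is a rounding error, which is the behavior we want from the projected-totals ranking.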
The Fun Stuff
And that’s about it. Once you have Z-scores, it’s very easy to rank players, to change the formulas to rank them by different systems, or to sort players by certain categories to see who stands out the most.
Two common variations on the traditional 5 stats are to include OBP instead of AVG, or to play in a points league. (For a points league, just change the Z-score weighting to reflect the point system). Here are the top players in these alternate systems using this evaluation method (I threw my own league in too, just for kicks):
(Note: I evaluated points leagues the same way as the other leagues, generating both a points total and a points/PA score for each player. I scaled the two values to give them approximately equal weight, and ranked players by the mean of the two.)
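One way to do that scaling, sketched in Python: convert both the points totals and the points-per-PA values to Z-scores, then average the two. This is an assumption about what “approximately equal weight” could look like, not a record of my exact spreadsheet formula, and for brevity it standardizes over whatever list you pass in rather than over a separate reference pool. The point totals are invented.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert raw values to Z-scores using the sample standard deviation."""
    avg, sd = mean(values), stdev(values)
    return [(v - avg) / sd for v in values]


def blended_scores(points, plate_appearances):
    """Mean of the Z-scored points total and the Z-scored points per PA."""
    per_pa = [pts / pa for pts, pa in zip(points, plate_appearances)]
    z_total = standardize(points)
    z_rate = standardize(per_pa)
    return [(t + r) / 2 for t, r in zip(z_total, z_rate)]


# Hypothetical players: a star, a solid regular, and a part-timer with a good rate.
points = [400, 300, 50]
pas = [650, 600, 80]
scores = blended_scores(points, pas)
best = scores.index(max(scores))
```

The part-timer’s strong per-PA rate pulls him up, but not enough to pass the star once the totals get equal say.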
I expected Joey Votto to be a stud in OBP leagues, but in reality Joey Bats benefits more. Jason Heyward too. Meanwhile, CarGo is top 3 in every other system, but falls to the bottom half of the first round in a points league. In my own crazy league, Norichika Aoki projects as a contact-hitting top-40 stud, while Mark Trumbo’s contact deficiencies show up in strikeouts and hits, as well as AVG, and he drops to 82nd.
I also thought it would be cool to see which players project to be affected most under different scoring systems. Here are the players with the largest variation in ranks between systems (weighted to prefer higher-ranked and therefore more interesting players):
Billy Hamilton projects to be a one-category stud in any system that ranks stolen bases, but many people doubt whether he’ll be an especially good ballplayer in 2014, and the points system shares their skepticism. Carlos Santana will benefit enormously from any league using deeper measures than AVG, while Adam Dunn jumps from irrelevance to potential rosterability in OBP leagues only. A couple more notable players: Alex Rios is vastly more valuable in leagues with the standard five categories, and least valuable in points leagues, and Adam Jones follows a very similar, if somewhat less drastic, pattern.
And there you have it – the results of one approach to generating player values for leagues with alternative categories.