Shut the (Heck) Up About Sample Size
The analytics revolution in sports has profoundly changed how organizations think about their teams, how players play the game, and how fans consume the on-field product. Perhaps the best-known heuristic in sports analytics is sample size: the number of observations necessary to draw a reliable conclusion about some phenomenon. Everyone has a buddy who loves to make sweeping generalizations about stud prospects, always hedging his bets when the debate heats up: “Well, we don’t have enough sample size, so we just don’t know yet.”
Unfortunately for your buddy, sample size doesn’t tell the whole story. A large sample is a nice thing to have when we’re conducting research in a sterile lab, but in real-life settings like sports teams, willing research participants aren’t always in abundant supply. Regardless of the number of available data points, teams need to make decisions. Shrugging about a prospect’s performance, or a newly cobbled-together pitching staff, certainly isn’t going to help the bottom line, either in wins or in dollar signs.
So the question becomes: How do organizations answer pressing questions when they either a) don’t have an adequate sample size, or b) haven’t collected any data? Fortunately, we can use research methods from social science to get a pretty damn good idea about something — even in the absence of the all-powerful sample size.
Qualitative Data
Let’s say you’re a baseball scout for the Yankees watching a young college prospect from the stands. You take copious notes about the player’s poise, physical stature, his hitting, fielding, and running abilities, and the strength of his throwing arm. For instance, you might write things like, “good approach to hitting” and “lacks pure run/throw tool.”
All of these rich descriptions of the player are qualitative data. One game of observations is a sample size of 1, but you’ve got a helluva lot of data. You could look for themes that consistently emerge in your notes, creating an in-depth profile of the prospect; you could even standardize your observations on the 20-80 scouting scale. Your notes build a full story about the player’s profile, and the Yanks like the level of depth you bring to scouting.
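To make the idea concrete, here’s a toy sketch of coding those notes into themes and grades in Python. The notes, keyword tags, and 20-80 grades below are all invented for illustration; in practice the coding is done by a human reading the notes, not a keyword match, but the structure of the exercise is the same.

```python
from collections import Counter

# Hypothetical scouting notes from a single game (invented for illustration).
notes = [
    "good approach to hitting",
    "stays balanced against breaking stuff",
    "lacks pure run/throw tool",
    "arm plays below average from the hole",
]

# Step 1: tag each note with the tool(s) it speaks to.
# The keyword lists are assumptions, not a real scouting taxonomy.
theme_keywords = {
    "hit": ["approach", "balanced", "hitting"],
    "run": ["run"],
    "arm": ["arm", "throw"],
}

theme_counts = Counter()
for note in notes:
    for theme, keywords in theme_keywords.items():
        if any(word in note for word in keywords):
            theme_counts[theme] += 1

print(theme_counts)  # which tools keep coming up across your notes

# Step 2: once the themes are clear, assign 20-80 grades by hand.
grades = {"hit": 60, "run": 40, "arm": 45}  # placeholder grades
```

One game, one player, and still plenty of structure to analyze.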
Mixed Methodology
You’ve worked as a scout for a few years, and the Yankees decide to bring you into their analytics department. It’s the end of the 2011 season, and one of your top prospects, Jesus Montero, just raked (.328/.406/.590, in 69 PAs) in the final month of the season. The GM of the Yankees, Brian Cashman, knocks on your door and says that they’re considering trading him. What do you say?
You compile all of Montero’s quantitative stats from the last month of the season and the minors, as well as any qualitative scouting reports on him. Good job. You’ve mixed quantitative and qualitative data to tell a richer story from a small sample of only 69 PAs. You’ve also reached the holy grail of social science research, triangulation: you examined the same phenomenon from a different angle and, bingo, arrived at the same conclusion your preliminary performance metrics gave you. Montero is a bum. Trade him, Brian.
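If you wanted to sketch that cross-check in code, it might look like the toy below. Every metric and grade is a placeholder rather than Montero’s actual profile; the point is simply that two independent sources of evidence are being compared.

```python
# Toy triangulation check: do the performance metrics and the scouting
# reports point to the same conclusion? All values are placeholders.
metrics = {"wRC_plus": 95, "defensive_runs": -8.0}          # quantitative source
scouting = {"hit": 55, "power": 60, "catch_and_throw": 30}  # qualitative (20-80)

metrics_say_trade = metrics["wRC_plus"] < 100 and metrics["defensive_runs"] < 0
scouts_say_trade = scouting["catch_and_throw"] < 40  # bat-only profile, no position

if metrics_say_trade == scouts_say_trade:
    print("Triangulated: both sources point the same way.")
else:
    print("The sources disagree; keep digging before you advise Cashman.")
```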
Resampling Techniques
It’s four years later and Cashman knocks on your door again (he’s polite, so he waits for you to say, “come in”). It’s early October and you’ve just lost to the Houston Astros in a one-game playoff. Cashman asks you about one of the September call-ups, Rob Refsnyder, who he thinks is “pretty impressive.” You combine Refsnyder’s September stats (.302/.348/.512, in 46 PAs), minor league stats, and scouting reports, but the data don’t point to a consistent conclusion. You’re not satisfied.
A fancy statistical method that might help in this instance is called bootstrapping: you resample Refsnyder’s 46 PAs over and over, drawing with replacement so that every resampled “season” is a slightly different shuffle of the same plate appearances. Bootstrapping doesn’t create new information, but it shows how much a 46-PA line can swing purely by chance. Resample those PAs 1,000, 10,000, even 100,000 times, and you get a distribution of how he “would have” performed. The spread of that distribution is wide enough to convince you that Refsnyder’s numbers from last year are a bit inflated, but also that he’d fit nicely as a future utility guy.
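Mechanically, the resampling looks something like the sketch below. The plate-appearance mix is one breakdown consistent with a .302/.348/.512 line in 46 PAs, not Refsnyder’s actual game log.

```python
import numpy as np

# One plate-appearance breakdown consistent with .302/.348/.512 in 46 PA
# (the exact mix of singles, doubles, homers, and walks is an assumption).
pa_outcomes = (["single"] * 8 + ["double"] * 3 + ["hr"] * 2 +
               ["walk"] * 3 + ["out"] * 30)

total_bases = {"single": 1, "double": 2, "hr": 4, "walk": 0, "out": 0}
is_ab = {"single": 1, "double": 1, "hr": 1, "walk": 0, "out": 1}

rng = np.random.default_rng(42)
slg_samples = []
for _ in range(10_000):
    # Draw 46 PAs *with replacement* from the observed 46.
    sample = rng.choice(pa_outcomes, size=len(pa_outcomes), replace=True)
    at_bats = sum(is_ab[o] for o in sample)
    bases = sum(total_bases[o] for o in sample)
    slg_samples.append(bases / at_bats if at_bats else 0.0)

# The spread of the bootstrapped SLG shows how shaky 46 PAs really are.
lo, hi = np.percentile(slg_samples, [2.5, 97.5])
print(f"Observed SLG ~ .512; 95% bootstrap interval: {lo:.3f} to {hi:.3f}")
```

A wide interval is the honest answer: the same 46 PAs are consistent with both a September mirage and a legitimately useful bat.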
Non-Parametrics
Cashman, who’s still in your office, now wants to know about two pitching prospects who were also called up in the 2015 class: James Pazos (5 IP, 0 ER, 3 H, 3 BB, 5.4 K/9, 1.20 WHIP) and Caleb Cotham (9.2 IP, 7 ER, 14 H, 1 BB, 10.2 K/9, 1.56 WHIP). If the team can only keep one of these pitchers, which one should it be? Who is better?
Normally you’d use a t-test to compare the two pitchers, but with such a small sample of innings for each guy, the conclusions wouldn’t be reliable. Instead, you decide to use a Mann-Whitney U test (sketched below), a non-parametric cousin of the t-test that compares ranks rather than means, so it doesn’t lean on normality assumptions that a handful of innings can’t support. In fact, there’s a whole toolbox of statistical tests that are adept at handling small sample sizes: the Wilcoxon signed-rank test, Fisher’s exact test, Kendall’s tau, and McNemar’s test, among others. You conclude that Pazos is slightly better, and that Cotham is the more expendable arm. Cashman holds on to Pazos and deals Cotham to the Reds in the trade that brings Aroldis Chapman to the Yankees. You pat yourself on the back.
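As a rough sketch, here’s what that comparison might look like with scipy, assuming we line the two up on inning-by-inning earned runs. The inning-level breakdowns are invented; only the totals match the lines above.

```python
from scipy.stats import mannwhitneyu

# Hypothetical inning-by-inning earned runs, consistent with the season
# totals above (5 IP / 0 ER for Pazos, roughly 10 innings / 7 ER for Cotham).
# The inning-level split is an assumption made for illustration.
pazos_er = [0, 0, 0, 0, 0]
cotham_er = [0, 0, 3, 0, 1, 0, 2, 0, 1, 0]

# Mann-Whitney U compares the two samples by rank rather than by mean,
# testing whether Pazos's per-inning runs tend to be lower than Cotham's.
stat, p_value = mannwhitneyu(pazos_er, cotham_er, alternative="less")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```

With samples this tiny, even a rank-based test won’t hand you a slam-dunk p-value; its virtue is that it degrades gracefully where the t-test’s assumptions fall apart.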
Questions Need Answering
Having an adequate sample size brings confidence to many statistical conclusions, but it is not an all-or-nothing prerequisite for analysis. It’s easy for your buddy to let hindsight bias retroactively justify his wait-and-see approach; organizations, though, need to answer questions accurately and on time. As amateur analysts and spectators, let’s change the lexicon by changing our methods.