Big Data and Baseball Efficiency: the Traveling Salesman had Nothing on a Baseball Scout

by Kyle Serikawa
June 6, 2014

The MLB draft is coming up and with any luck I’ll get this posted by Thursday and take advantage of web traffic. I can hope! (ed. note: nope)

Anyway, Tuesday on FanGraphs I read a fascinating portrayal of the draft process, laying out the nuts and bolts of how organizations scout for the draft. The piece, written by Tony Blengino (whose essays are rapidly becoming one of my favorite parts of this overall terrific baseball site), describes all the behind-the-scenes work that happens to prepare a major league organization for the Rule 4 draft. Blengino describes the dedication scouts show in following up on all kinds of prospects at the college and high school levels, what they do, how much they need to travel, and especially how much ground they often need to cover to try and lay eyes on every kid in their area.

One neat insight for me was Blengino’s one-word description of most scouts as entrepreneurs. You could think of them almost as founders of a startup, with the kids they scout as the product the scouts are trying to sell to upper layers of management in the organization. As such, everything they can do to get a better handle on a kid’s potential can feed into the pitch to the scouting director.

I respect and envy scouts’ drive to keep looking for the next big thing, the next Jason Heyward or Mike Trout. As Blengino puts it, scouts play “one of the most vital, underrated, and underpaid roles in the game.” One might argue that in MLB, unlike the NFL or NBA, draft picks are typically years away from making a contribution, so how important can they be? Yet numerous studies have shown that the draft presents an incredible opportunity for teams in building and sustaining success.
In fact, given that so much of an organization’s success hinges on figuring out which raw kids will be able to translate tools and potential into talent, one could make (and others have made) the argument that scouting is a huge potential market inefficiency for teams to exploit. Although I’ll have a caveat later. But in any case, for a minor league system every team wants to optimize its incoming quality because, like we say in genomic data analysis, “garbage in, garbage out.”

As I was reading this piece, I started thinking about ways to try and create more efficiencies. And I started thinking about Big Data.

If you read the description of travel by scouts, it sounds crazy and it also sounds familiar. There’s a famous problem in computer science known as the Traveling Salesman problem. In this problem, one is presented with a salesman with a series of cities to visit and various constraints. In the Wikipedia description, the constraint is to visit each city exactly once and return to the starting city, and the goal is to minimize the distance traveled. But there are numerous variations and many applications of this problem to real-world efficiency.

To be honest, for all I know MLB offices already use this approach to plan the travel of their scouts and other personnel. And in itself, the Traveling Salesman problem doesn’t constitute a Big Data approach. However, I’ve been learning about the ways in which companies are hoping to employ real-time streaming information to inform business decisions. For example, SpaceCurve (where, full disclosure, I know people) is using its data platform to combine and analyze real-time streaming data from sources such as systems embodying the Internet of Things. When using a Big Data approach, rather than making decisions based on a specific set of information available at a specific time, decision making is treated as an ongoing process that is continually refined, updated and challenged as new information arrives.

How might this help baseball scouting?
The proximal goal of scouting, it seems, is to maximize the number of chances a scout has to see the prospects in his or her area. There are nuances; a scout also takes into account various factors such as how talented a kid is, whether he or she can optimize for high-quality pitcher-batter matchups, the organizational philosophy and needs of the parent team, and extraneous factors like the weather. Scouts figure out their travel plans and I’m sure do a great job (as I mentioned, they may already be using algorithmic computational optimization), but could it be better?

Let’s say a scout has the area described in Blengino’s article: the Northeastern US. This is a large area, and while there are potentially fewer key prospects to scout, there’s also a lot of ground to cover. From the parent organization there can be general mandates on the kinds of hitters and pitchers the organization values most, and those can form the starting point for the scout to build on. Now input schedules for all teams a scout might want to visit, names of all known prospects, their presumed level of quality, locations, potential travel routes, etc. This can be rolled into the creation of an initial plan.

Just like in war, however, these plans could go to hell in a second once the seasons start. Say a late-summer storm causes a rainout of a game in the Cape Cod League. Or a freeway accident leads to a miles-long pileup. Big Data analytical platforms can be set up to continually monitor all pertinent information that could get in the way of a scout maximizing his or her views of prospects, and to help the scout replan on the fly, either by finding an alternate route to the right place or by redirecting the scout to a secondary opportunity. Essentially, the platform would continually find the maximally efficient path for a scout, given ever-changing conditions.
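To make the replanning idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the venue names and coordinates are made up, `plan_route` and `replan` are hypothetical helpers, and the greedy nearest-neighbor rule is a quick heuristic I swapped in, not an optimal Traveling Salesman solver and certainly not anything a real team or platform is known to use.

```python
import math

# Hypothetical venues (name -> (x, y) position in arbitrary map units).
# These coordinates are invented for illustration, not real ballparks.
VENUES = {
    "Home": (0.0, 0.0),
    "Park A": (2.0, 1.0),
    "Park B": (5.0, 1.0),
    "Park C": (5.0, 4.0),
    "Park D": (1.0, 4.0),
}


def distance(a, b):
    """Straight-line distance between two named venues."""
    (x1, y1), (x2, y2) = VENUES[a], VENUES[b]
    return math.hypot(x2 - x1, y2 - y1)


def plan_route(start, to_visit):
    """Greedy nearest-neighbor heuristic: always go to the closest unvisited venue.

    This is a fast approximation, not an optimal Traveling Salesman solution.
    """
    route, current, remaining = [start], start, set(to_visit)
    while remaining:
        # Ties are broken alphabetically so the result is deterministic.
        nxt = min(remaining, key=lambda v: (distance(current, v), v))
        route.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return route


def replan(route, position_index, cancelled):
    """Keep the stops already made, drop cancelled venues, re-plan the rest."""
    done = route[: position_index + 1]
    remaining = [v for v in route[position_index + 1:] if v not in cancelled]
    # plan_route returns the start venue first, so skip it when appending.
    return done + plan_route(done[-1], remaining)[1:]


initial = plan_route("Home", ["Park A", "Park B", "Park C", "Park D"])
# Simulated event: the scout is at the first stop when Park B is rained out.
updated = replan(initial, 1, cancelled={"Park B"})
print(initial)  # ['Home', 'Park A', 'Park B', 'Park C', 'Park D']
print(updated)  # ['Home', 'Park A', 'Park D', 'Park C']
```

In this toy example the rainout doesn't just delete a stop, it changes the order of the remaining ones. A real system would presumably weigh prospect quality, game times and drive times rather than straight-line distance, but the shape of the loop (new event in, revised route out) would be the same, with the scout always on the best path currently available.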
You know, kind of the opposite of when you get into the cash-register line at the supermarket and whatever line you pick or jump to suddenly becomes the slowest.

So that’s one way Big Data could provide efficiencies. Are scouts already pretty good at doing this themselves? Certainly. So maybe this provides, at most, a few additional looks at a couple of players every season. Is that worth it? Maybe not.

But there are other things a real-time data collection system can help with. Capturing information from personal tracking devices, for one thing. I’ve written in the past about the potential for personal monitoring devices to help with understanding athletic performance. If prospects could be outfitted with devices to monitor elements of their performance such as heart rate, movement, and acceleration, this real-time information could be captured and sent back to the data management and analysis platform. The scout wouldn’t have to be there all the time and there would now be player-specific measurements for every game to go along with box scores and anecdotal reports. There could also be passive, ongoing measurements of temperature, humidity and wind speed, among other things, to overlay on the analysis. Add video capture, and suddenly every prospect has terabytes of data that can be correlated to their performance.

This could jibe really nicely with the work already being developed at the Major League level by MLB Advanced Media. The systems that MLBAM is hoping to roll out over the next few years in major league ballparks will use algorithmic analysis of video images to give teams a much better sense of the speed and movement of the ball and the players on the field. Why not push some elements of that into amateur scouting? Outfit two scouts with Google Glass and sit them at different positions in the same ballpark to get different views of prospect performance, as well as recording their immediate verbal impressions and running commentary.
A Big Data platform could take all those kinds of data inputs and use them to look for connections, correlations and insights that build on the acumen of the scouts.

And more data is better, right? Well, maybe. One of the problems with Big Data is that as the amount of data collected grows, so do the odds of finding a spurious correlation. I really urge people who haven’t seen it yet to go to this site, the source of many fine and completely random correlations among datasets. Or, as Lisa Suennen puts it succinctly, “data is not the same as information, much less knowledge…” Indeed, studies have shown that, paradoxically, having more information can mislead people into making the wrong decisions.

So let me back up a little and say that Big Data can certainly be helpful in specific things like optimizing your travel route, especially when real-time problems crop up. Whether it can help a team really find and rank the best prospects to draft is a harder question. Big Data can give you lots of hypotheses about how to measure quality and value. But it’ll be a while before we’d be able to collect the data to confirm that any of those hypotheses are true.
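For anyone who wants to see the spurious-correlation worry from above in action, here is a small Python simulation. It is only a sketch under stated assumptions: the "outcome" and every "metric" are pure random noise, the function names are my own, and the sample size of 30 players is arbitrary. Nothing here is real scouting data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strongest_chance_correlation(n_metrics, n_players=30, seed=0):
    """Largest |r| between a random 'outcome' and n_metrics unrelated random metrics.

    Every series is independent noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_players)]
    best = 0.0
    for _ in range(n_metrics):
        metric = [rng.gauss(0, 1) for _ in range(n_players)]
        best = max(best, abs(pearson(metric, outcome)))
    return best


# With a fixed seed, a larger pool of metrics contains the smaller pool,
# so the best spurious correlation can only stay the same or grow.
for n in (5, 50, 500):
    print(n, round(strongest_chance_correlation(n), 2))
```

Because the seed is fixed, each larger pool of metrics includes the smaller one, so the strongest correlation found can never shrink as you measure more things. None of these "metrics" has anything to do with the outcome, which is exactly the trap: track enough variables on enough prospects and something will look predictive by chance.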