Enhancing Prospect Outlooks Using Scouting Report Text
Wander Franco is the latest prospect to be discussed as a top player in the game before stepping on a major league field field. Vladimir Guerrero Jr. was likely the recipient of even more hype in 2018, though he has reminded us at times that there are no automatic superstars in baseball. Franco and Guerrero Jr. have the unique distinction as the only two players to be given the maximum “hit tool” score of 80 on MLB.com’s prospect rankings. Guerrero Jr. (in 2018) scored higher on “power” while Franco has the edge in running and fielding. They were both rated 70 overall and were the respective No. 1 prospects in baseball at the time.
When comparing the two players’ ratings, we might stop at this point and declare a virtual tie. The same could be said for any number of lower level prospects with similar ratings. However, there is still a significant amount of data available describing the players: the words used in the scouting reports. On MLB.com, below the numeric ratings, there is a blurb detailing the prospects’ exploits. At first glance, we might not think the text provides information that can separate players, as many of the writeups are similar in both style and substance. Yet there is a possibility that there are indicators in the text that are not obvious to a human reader (or at least a human reader with my minimal experience analyzing text).
To examine the importance of the scouting report text, I developed two models — one with the text data and one without — to predict whether a prospect has made his major league debut as of the end of the 2020 season. Both models use variables such as year, position, numerical skill ratings, etc. to account for all of the non-text information available on MLB.com. Thus, if there is a difference in model effectiveness, it will be a result of the text data adding information that is not captured by the other features. Read the rest of this entry »