Enhancing Prospect Outlooks Using Scouting Report Text
by Chris Russo
May 4, 2021

Wander Franco is the latest prospect to be discussed as a top player in the game before stepping on a major league field. Vladimir Guerrero Jr. was likely the recipient of even more hype in 2018, though he has reminded us at times that there are no automatic superstars in baseball. Franco and Guerrero Jr. share the distinction of being the only two players to be given the maximum “hit tool” score of 80 on MLB.com’s prospect rankings. Guerrero Jr. (in 2018) scored higher on “power” while Franco has the edge in running and fielding. They were both rated 70 overall and were the respective No. 1 prospects in baseball at the time.

When comparing the two players’ ratings, we might stop at this point and declare a virtual tie. The same could be said for any number of lower-level prospects with similar ratings. However, there is still a significant amount of data available describing the players: the words used in the scouting reports. On MLB.com, below the numeric ratings, there is a blurb detailing the prospects’ exploits. At first glance, we might not think the text provides information that can separate players, as many of the writeups are similar in both style and substance. Yet there is a possibility that there are indicators in the text that are not obvious to a human reader (or at least a human reader with my minimal experience analyzing text).

To examine the importance of the scouting report text, I developed two models, one with the text data and one without, to predict whether a prospect has made his major league debut as of the end of the 2020 season. Both models use variables such as year, position, numerical skill ratings, etc. to account for all of the non-text information available on MLB.com. Thus, if there is a difference in model effectiveness, it will be a result of the text data adding information that is not captured by the other features.

Data

The player data was collected from MLB.com’s archived prospect lists from 2014-2020. The 2014 rankings were the first to use the 20-80 scale. The “Top 30 by Team” lists were used for 2015-2020 and the “Top 20 by Team” lists for 2014, totaling 6,000 entries. For each player, we have the year, position(s), “Overall” rating, and scouting report text. For batters, the additional metrics are scores for the five tools: hit, power, run, arm, field. For pitchers, other variables include scores for each of their pitches as well as a score for their control. There are a few two-way players who have ratings in both areas, and a few players with missing data in some or all categories.

An issue with this data set is that many of the players are listed more than once across different years. Although these are not perfectly repeated observations (players may have different scores in different years), they are likely to be very similar. To solve this problem, the data is grouped by player and combined to create the following variables:

Position: primary position in the most recent season. Note that we count a few players listed as INF in the shortstop category and one listed as CF in the outfield category. This leaves eight positions: RHP, LHP, C, 1B, 2B, 3B, SS, OF.
Age: player age (in days) as of 1/1/21.
Year: average of the years of observation (e.g., for a player listed from 2015-18, the value is 2016.5).
Hit: average hit rating.
Power: average power rating.
Run: average run rating.
Arm: average arm rating.
Field: average field rating.
Fastball: average fastball rating.
Control: average control rating.
Secondary: highest average rating among secondary pitches.
Avg_secondary: average of the ratings of secondary pitches.
Num_secondary: average number of secondary pitches.
Overall: average overall rating (for two-way players, we average overall batting/pitching separately and take the higher value).
Text: combined scouting report text.
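
The grouping step described above could be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the code used for the article: the column names, the toy rows, and the `summarize` helper are all hypothetical, and only a handful of the variables (position, year, hit, overall, text) are shown.

```python
import pandas as pd

# Hypothetical yearly entries: one row per player per list appearance.
entries = pd.DataFrame({
    "player":   ["A", "A", "A", "A", "B", "B", "C"],
    "year":     [2015, 2016, 2017, 2018, 2019, 2020, 2020],
    "position": ["SS", "SS", "SS", "SS", "RHP", "RHP", "INF"],
    "hit":      [50, 55, 55, 60, None, None, 45],   # pitchers have no hit rating
    "overall":  [45, 50, 50, 55, 45, 50, 40],
    "text":     ["Quick hands.", "Improved approach.", "Steady glove.",
                 "Ready soon.", "Live arm.", "Better control.", "Raw tools."],
})

# Assumed mapping of the catch-all labels onto the eight positions.
POSITION_ALIASES = {"INF": "SS", "CF": "OF"}


def summarize(entries: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeated yearly entries into one row per player."""
    ordered = entries.sort_values(["player", "year"])
    summary = ordered.groupby("player").agg(
        position=("position", "last"),   # position in the most recent year
        year=("year", "mean"),           # average year of observation
        hit=("hit", "mean"),             # average rating across years
        overall=("overall", "mean"),
        text=("text", " ".join),         # combined scouting report text
    )
    summary["position"] = summary["position"].map(
        lambda p: POSITION_ALIASES.get(p, p)
    )
    # Ratings that do not apply (e.g. hit tool for a pitcher) become zero.
    summary["hit"] = summary["hit"].fillna(0)
    return summary


summary = summarize(entries)
print(summary)
```

In this toy data, player A is listed from 2015-18, so the averaged year comes out to 2016.5, matching the example in the variable list.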

After removing a few players with missing entries, we are left with 2,615 unique players, 851 of whom have played in the major leagues. The remaining entries without values (such as hit tool for a pitcher) are filled with zeroes.

Preprocessing

The Position column is transformed using one-hot encoding, meaning we replace the Position column with eight new columns (RHP, LHP, C, 1B, 2B, 3B, SS, OF), which take the value 1 if the prospect plays that position and 0 otherwise. The numerical columns, which are all of the rest aside from Text, are standardized.

For the Text column, we first remove all capitalized words, numbers, punctuation, and special characters. This eliminates “words” that are actually names, statistics, league names, schools, etc. The reason we do not want to consider these types of words is that they are tied to player-specific information that is not available for all prospects on MLB.com. For this exercise, the goal is to see whether seemingly generic descriptive words that are not unique to a particular player can tell us something about the player’s outlook. We also remove a group of words called “stop words,” a predefined set of words that are not significant to the meaning of the text, such as “a,” “the,” “if,” etc.

Using the condensed text, we build a “vocabulary” for the model. The vocabulary includes any single words or two-word phrases that occur in any of the text entries. We then narrow down the full vocabulary to those words/phrases that occur in at least 10 different entries. The remaining elements of the vocabulary become new columns in the data set, with the values in each column equal to the number of times the word/phrase occurs in that player’s text. The resulting total number of text indicator columns is 8,256.

With the preprocessing complete, the data is split into training and testing sets using an 80-20 split.
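
The text steps above (cleaning, building a vocabulary of single words and two-word phrases, filtering by the number of entries, and counting) can be sketched in plain Python. This is a rough illustration rather than the article's implementation: the stop-word list is a tiny hypothetical stand-in, tokens shorter than two letters are dropped as an extra assumption, the sample reports are made up, and the entry threshold is lowered from 10 to 2 so the toy example produces a non-empty vocabulary.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "at", "but", "has", "his", "if", "in", "is", "the"}


def clean_tokens(text):
    """Drop capitalized words, numbers, punctuation, and stop words."""
    text = re.sub(r"\b[A-Z]\w*", " ", text)   # names, leagues, schools, etc.
    text = re.sub(r"[^a-z\s]", " ", text)     # digits, punctuation, symbols
    return [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]


def ngrams(tokens):
    """Single words plus adjacent two-word phrases."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_vocabulary(token_lists, min_entries=10):
    """Keep terms that appear in at least `min_entries` different entries."""
    entry_counts = Counter()
    for tokens in token_lists:
        entry_counts.update(set(ngrams(tokens)))  # count once per entry
    return sorted(t for t, n in entry_counts.items() if n >= min_entries)


def count_matrix(token_lists, vocabulary):
    """One row per entry, one column per vocabulary term, values are counts."""
    rows = []
    for tokens in token_lists:
        counts = Counter(ngrams(tokens))
        rows.append([counts[term] for term in vocabulary])
    return rows


# Made-up reports, purely for illustration.
reports = [
    "Smith has a plus fastball and a plus arm in 2019.",
    "His plus fastball sits at 95 mph; the arm is average.",
    "Scouts like the quick bat, but the arm is fringy.",
]
token_lists = [clean_tokens(r) for r in reports]
vocabulary = build_vocabulary(token_lists, min_entries=2)
features = count_matrix(token_lists, vocabulary)

print(vocabulary)  # ['arm', 'fastball', 'plus', 'plus fastball']
print(features)    # [[1, 1, 2, 1], [1, 1, 1, 1], [1, 0, 0, 0]]
```

One side effect worth noting: because two-word phrases are formed after words are removed, a phrase can join words that were not adjacent in the original report. Removing capitalized words also drops the first word of every sentence.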

Initial Model

To get an idea of how the non-text variables relate to our outcome, consider a simple logistic regression model fit on the training data. It has a training accuracy of 84.0% and a testing accuracy of 83.7%. The following confusion matrix shows the breakdown of the performance on the test data, where 0 indicates the player has not played in the major leagues and 1 indicates that they have:

The model is more effective at predicting the 0’s, as 87.5% of the true 0’s are identified compared to 76.0% of the true 1’s. Note the coefficients:

Variable        Coefficient
Age                    2.26
Control                1.19
Overall                1.11
Hit                    0.70
Arm                    0.67
C                      0.20
3B                     0.16
SS                     0.16
OF                     0.06
Fastball               0.03
1B                    -0.03
LHP                   -0.05
Secondary             -0.10
Field                 -0.16
2B                    -0.17
Avg_secondary         -0.27
Num_secondary         -0.27
Year                  -0.31
RHP                   -0.33
Run                   -0.33
Power                 -0.58

The best indicators of a player having made their debut are a high age, control rating, overall rating, hit rating, and arm rating. Being a catcher is also measured as advantageous, which makes sense as there are fewer catchers than other positions. Conversely, being a right-handed pitcher, the most common position, indicates a lower likelihood of making the majors. In general, the coefficients make logical sense, as might be expected given the 83.7% testing accuracy.

Full Model and Results

We now consider four types of models: logistic regression, random forest, support vector machine, and gradient boosting. For each method, we tune hyperparameters using three-fold cross-validation and apply the best-performing model to the test data.

For the model trained without text, the best training accuracy achieved was 86.4% by a gradient boosting model. The testing accuracy was 83.9%, almost identical to the initial logistic regression model. However, the performance is more balanced:

Here we see that 84.4% of 0’s were correctly predicted compared to 83.0% of 1’s.
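
The tuning procedure described above (a hyperparameter search scored with three-fold cross-validation, followed by a single evaluation on the held-out test data) might look roughly like this with scikit-learn. The article does not name its library, grid, or random seeds, so everything here is an assumption: the data is synthetic from `make_classification`, the parameter grid is hypothetical, and the stratified split is a choice made for the sketch. Only gradient boosting is shown; the other model types would follow the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prospect features; not the article's data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.67, 0.33], random_state=0,
)

# 80-20 split; stratifying on the label is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid; the article does not list the values it searched.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# Three-fold cross-validation on the training data only.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the refit best model once on the held-out test data.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
# Per-class recall: share of true 0's and true 1's that were identified.
recall_0, recall_1 = recall_score(y_test, y_pred, average=None)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
print("share of 0's identified:", round(recall_0, 3))
print("share of 1's identified:", round(recall_1, 3))
```

The per-class recall values correspond to the "percent of true 0's and 1's identified" figures quoted in the text, which is how the balance between the two classes can be compared across models.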

Next we add in the text data and use the same methods, tuning, and cross-validation (with the exception of logistic regression, as we have many more variables than observations). Again, a gradient boosting classifier worked best, attaining a training accuracy of 90.4%. The testing accuracy was not quite as high, but we did see an improvement to 86.0%. The results remained balanced:

With this method, 86.6% of 0’s and 84.8% of 1’s were identified. Overall, 11 additional players were classified correctly compared with the model without text.

Discussion

The words used in the prospect scouting reports did give an indication, beyond what could be inferred from the other features, of whether the player has made it to the major leagues. This result leads to the question of what other ways written reports can be used for player evaluation from a statistical modelling perspective. Teams have access to much more in-depth reports than those found on MLB.com, and it would be interesting to see if a larger text sample to draw from could improve the results further.

This analysis could also be expanded upon by predicting the amount of time before reaching the majors, rather than just a simple yes/no. Another option would be to predict player performance, although that would require many more years of data.

One final note is that there is a bit of a “chicken or egg” issue here. We have found that the scouting reports give an indication of whether a player will reach the majors. However, could this be because better players get better evaluations and are more likely to get promoted, or is it because players who receive better evaluations are viewed more positively by the organization and are thus promoted? It could be neither, but it is important to keep in mind that an indication of playing in the majors does not necessarily reflect the player’s relative skill.