Author Archive

Enhancing Prospect Outlooks Using Scouting Report Text

Wander Franco is the latest prospect to be discussed as a top player in the game before stepping on a major league field. Vladimir Guerrero Jr. was likely the recipient of even more hype in 2018, though he has reminded us at times that there are no automatic superstars in baseball. Franco and Guerrero Jr. share the distinction of being the only two players to be given the maximum “hit tool” score of 80 on’s prospect rankings. Guerrero Jr. (in 2018) scored higher on “power” while Franco has the edge in running and fielding. They were both rated 70 overall and were the respective No. 1 prospects in baseball at the time.

When comparing the two players’ ratings, we might stop at this point and declare a virtual tie. The same could be said for any number of lower level prospects with similar ratings. However, there is still a significant amount of data available describing the players: the words used in the scouting reports. On, below the numeric ratings, there is a blurb detailing the prospects’ exploits. At first glance, we might not think the text provides information that can separate players, as many of the writeups are similar in both style and substance. Yet there is a possibility that there are indicators in the text that are not obvious to a human reader (or at least a human reader with my minimal experience analyzing text).

To examine the importance of the scouting report text, I developed two models — one with the text data and one without — to predict whether a prospect has made his major league debut as of the end of the 2020 season. Both models use variables such as year, position, numerical skill ratings, etc. to account for all of the non-text information available on Thus, if there is a difference in model effectiveness, it will be a result of the text data adding information that is not captured by the other features. Read the rest of this entry »
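As a rough illustration of the "with text" versus "without text" setup, the sketch below builds two feature vectors for the same prospect: one from the numeric ratings alone and one that appends simple word counts from the report. The field names, the made-up prospect, and the bag-of-words representation are all assumptions for illustration; the excerpt does not say how the text was encoded or which model was used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the report and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(reports, min_count=1):
    """Sorted list of words appearing in at least min_count reports."""
    counts = Counter(tok for report in reports for tok in set(tokenize(report)))
    return sorted(tok for tok, n in counts.items() if n >= min_count)


def featurize(prospect, vocabulary, use_text):
    """Numeric ratings, optionally followed by word counts from the report."""
    features = [prospect[k] for k in ("hit", "power", "run", "field", "overall")]
    if use_text:
        counts = Counter(tokenize(prospect["report"]))
        features += [counts[tok] for tok in vocabulary]
    return features


# Hypothetical prospect (ratings and report text are invented).
prospect = {
    "hit": 60, "power": 55, "run": 50, "field": 50, "overall": 55,
    "report": "Plus bat speed, plus power to all fields.",
}
vocabulary = build_vocabulary([prospect["report"]])

without_text = featurize(prospect, vocabulary, use_text=False)
with_text = featurize(prospect, vocabulary, use_text=True)
print(len(without_text), len(with_text))
```

Training the same kind of classifier on each feature set and comparing held-out accuracy would then isolate what the text contributes beyond the ratings.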

A Lineup Construction Experiment

Who should bat second? This question has been debated quite a bit in recent years, as the modern approach is to slot the best hitter in the 2-hole to increase their total plate appearances in a season. Others argue that the second hitter, like the leadoff man, should be a table-setter and that the goal should be to get the best hitters to the plate with runners on base. So which is more valuable: getting your best hitter to the plate with men on or getting them to the plate more often? A simple experiment suggests that we are wasting a lot of energy arguing either side, and that the time would be better spent thinking about other elements of lineup construction.


I created nine fictional players that will be referred to by position. I arbitrarily assigned each player probabilities for seven possible plate appearance outcomes: single, double, triple, homer, walk, hit by pitch, and out. To simulate the lineup playing a game, I used a simple base-to-base style (the runners on base move up the same number of bases as the batter). An oversimplification of play to be sure, but the goal is to get an approximation of potential lineups relative to each other. Each lineup “plays” 100,000 nine-inning games so that the run distribution is virtually identical across repeated simulations. Read the rest of this entry »
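A minimal sketch of this kind of base-to-base simulation is shown below. It is not the code behind the article: the outcome probabilities are invented, all nine hitters are given the same profile for brevity, and walks and hit-by-pitches are assumed to move every runner up one base, which is one reading of the rule described above.

```python
import random

OUTCOMES = ["single", "double", "triple", "homer", "walk", "hbp", "out"]
BASES_GAINED = {"single": 1, "double": 2, "triple": 3, "homer": 4, "walk": 1, "hbp": 1}


def advance(bases, n):
    """Move the batter and every runner up n bases; return (new_bases, runs)."""
    runs = 0
    new_bases = [False, False, False]  # first, second, third
    for i, occupied in enumerate(bases):
        if occupied:
            if i + n >= 3:
                runs += 1  # runner crosses the plate
            else:
                new_bases[i + n] = True
    if n >= 4:
        runs += 1  # batter scores on a homer
    else:
        new_bases[n - 1] = True
    return new_bases, runs


def simulate_game(lineup, rng, innings=9):
    """Play one game; lineup is a list of per-batter outcome weights."""
    runs, batter = 0, 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            weights = lineup[batter % len(lineup)]
            outcome = rng.choices(OUTCOMES, weights=weights)[0]
            if outcome == "out":
                outs += 1
            else:
                bases, scored = advance(bases, BASES_GAINED[outcome])
                runs += scored
            batter += 1  # the batting order carries over between innings
    return runs


def average_runs(lineup, games=10_000, seed=0):
    """Mean runs per game over many simulated games."""
    rng = random.Random(seed)
    return sum(simulate_game(lineup, rng) for _ in range(games)) / games


# Hypothetical hitter: weights for single, double, triple, homer, walk, hbp, out.
hitter = [0.15, 0.05, 0.005, 0.03, 0.08, 0.01, 0.675]
print(average_runs([hitter] * 9, games=2_000))
```

Comparing batting orders then amounts to calling `average_runs` on different permutations of nine distinct hitter profiles with the same seed and game count.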

Did the Baseballs Carry More in 2019?

As much as baseball fans would like a simple explanation for the astronomical increase in home runs in 2019, it is becoming clearer that many factors have played into the surge. Among the possible reasons are batters prioritizing hitting homers more than ever before, pitchers having difficulty gripping the seams of the baseball, and of course the famous “juiced balls.” Last month, a committee released initial results of a comprehensive study attempting to determine the driving forces behind the home run rate growth.

I am particularly interested in the idea that fly balls were supposedly carrying more in 2019. On multiple occasions throughout the year, I listened to announcers observe that outfielders seemed to be severely misjudging fly balls. For instance, the center fielder would be drifting back toward the wall, as if he had a bead on it, and the ball would end up 15 rows deep. Although this may seem like evidence for increased carry of the baseball, such observations can easily be driven by confirmation bias. There was a tendency this year to believe that every ball in the air would be a homer, so when a ball carried a lot, it fit with expectations and the belief continued to grow. It may simply have been the case that the wind was blowing out that day, or that the batter struck the ball in a particular way, and the carry had nothing to do with the ball itself. To determine if the perception was in fact reality, I focus on the following question: Did similarly struck balls travel farther in 2019 than in previous years? Read the rest of this entry »

Edwin Diaz’s Running Fastball

Edwin Diaz is having an absolutely miserable season. A year after posting a 1.96 ERA (208 ERA+), he currently holds an ERA of 5.32 (78 ERA+). He has already given up 10 homers in 44.0 innings, whereas last year he gave up just 5 in 73.1 frames. Some of his stats, such as his strikeout rate of 14.5% and walk rate of 3.3%, while less impressive than last year, are sitting at about his career averages. Mets manager Mickey Callaway has often cited his mechanics as the main problem, saying that when he throws more “sidearm,” it is a recipe for disaster. To get a visual of this difference, notice the release point on the following two pitches:

Notice how Diaz’s arm is much flatter in the first picture. The release point is a bit farther from his body and significantly lower. Pitches released in that way have too often resulted in a running fastball:

From this angle, however, it is difficult to see the exact difference in release point because one may be farther forward than the other. Consider the following table that tells a more detailed story: Read the rest of this entry »

Is the Baseball Actually Juiced?

Home runs are on the rise. We all know this. The number of homers per game is at an all-time high in 2019, and has increased by about 36% just since 2015:

Home Run Rate
Year HR/game
2015 1.01
2016 1.16
2017 1.26
2018 1.15
2019 1.37
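
The "about 36%" figure can be checked directly from the first and last rows of the table. The snippet below is just that arithmetic, using the table's rounded values.

```python
# HR/game values copied from the table above.
hr_per_game = {2015: 1.01, 2016: 1.16, 2017: 1.26, 2018: 1.15, 2019: 1.37}


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


increase = pct_change(hr_per_game[2015], hr_per_game[2019])
print(f"{increase:.1f}%")  # prints 35.6%
```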

What we do not know is exactly why.

Commissioner Manfred recently suggested that the current baseballs have less drag through the air, caused by the more perfect “centering of the pill” (the innermost part of the ball). It has basically become an operational fact that there is something going on with the baseballs. Manfred’s explanation implies that the flight of the baseball is the key difference.

To look at this more closely, I considered the distance traveled by balls in the air as a function of the exit velocity and launch angle at contact. If the average distance on similarly struck balls has increased over time, it would suggest that the ball itself is more aerodynamically efficient.

Pitch-by-pitch data for the 2015-2019 seasons was collected from Baseball Savant via the Statcast Search page. Two random forest models were built for each year, one using all fly balls and one using home runs. To account for a possible difference in flight due to the warm air in the summer months, only data through June of each year was used. (At the end of the season, the analysis can be applied to the full data set.) In both cases, the distance the ball traveled is the response variable and the exit velocity and launch angle are the explanatory variables. The models are applied to a test data set of various exit velocity/launch angle combinations. Read the rest of this entry »
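The idea of comparing "similarly struck" balls across years can be sketched without a random forest. The example below swaps in a much cruder technique: it groups batted balls into exit velocity/launch angle buckets and compares the mean distance per bucket between two seasons. The field names, bucket widths, and sample rows are assumptions made up for illustration, not Statcast column names or real data.

```python
from collections import defaultdict
from statistics import mean


def bucket(exit_velocity, launch_angle, ev_width=2.0, la_width=2.0):
    """Assign a batted ball to an (exit velocity, launch angle) bucket."""
    return (int(exit_velocity // ev_width), int(launch_angle // la_width))


def mean_distance_by_bucket(batted_balls):
    """Average distance of the batted balls that fall in each bucket."""
    groups = defaultdict(list)
    for ball in batted_balls:
        key = bucket(ball["exit_velocity"], ball["launch_angle"])
        groups[key].append(ball["distance"])
    return {key: mean(distances) for key, distances in groups.items()}


def carry_difference(earlier, later):
    """Later-minus-earlier mean distance for buckets present in both seasons."""
    before = mean_distance_by_bucket(earlier)
    after = mean_distance_by_bucket(later)
    return {key: after[key] - before[key] for key in before if key in after}


# Invented rows: mph, degrees, feet.
season_a = [
    {"exit_velocity": 100.5, "launch_angle": 28.3, "distance": 390.0},
    {"exit_velocity": 101.0, "launch_angle": 29.0, "distance": 394.0},
]
season_b = [
    {"exit_velocity": 100.2, "launch_angle": 28.9, "distance": 400.0},
]
print(carry_difference(season_a, season_b))
```

A positive difference in most buckets would point the same way as the model-based comparison: balls hit equally hard at the same angle traveling farther in the later season.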

Ballpark Attendance and Starting Pitchers

When I am thinking about buying a ticket to a baseball game, often my first question is “Who’s pitching?” I have always felt that the most enjoyable type of game is one in which a great starter is on the mound. Is this feeling common among fans or do they buy tickets regardless of the starting pitcher?

To answer this question, I trained random forest models to predict attendance for games based on situational factors (not including the starting pitcher). Then I considered how the quality of starting pitchers relates to whether the models overestimate or underestimate the attendance. If the models consistently underestimate attendance when star pitchers are on the mound, it would suggest more tickets are sold because of the starter.
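
The residual comparison described here can be sketched as follows. In place of a random forest, the example uses a deliberately crude baseline (mean attendance per home team and day of week) and evaluates it on the same games it was fit on, so it only illustrates the logic. The field names, the `star_starter` flag, and the sample games are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_baseline(games):
    """Mean attendance per (home_team, day_of_week); stand-in for the real model."""
    groups = defaultdict(list)
    for game in games:
        groups[(game["home_team"], game["day_of_week"])].append(game["attendance"])
    return {key: mean(values) for key, values in groups.items()}


def mean_residual_by_starter(games, baseline):
    """Average of (actual - predicted) attendance, split by the star_starter flag."""
    residuals = defaultdict(list)
    for game in games:
        predicted = baseline[(game["home_team"], game["day_of_week"])]
        residuals[game["star_starter"]].append(game["attendance"] - predicted)
    return {flag: mean(values) for flag, values in residuals.items()}


# Invented games for illustration only.
games = [
    {"home_team": "NYN", "day_of_week": "Sat", "attendance": 30000, "star_starter": False},
    {"home_team": "NYN", "day_of_week": "Sat", "attendance": 34000, "star_starter": True},
]
baseline = fit_baseline(games)
print(mean_residual_by_starter(games, baseline))
```

A consistently positive mean residual for games with a star starter would mean the situational model underestimates those crowds, which is the pattern the analysis is looking for.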


Information about each game was collected from Retrosheet’s game logs. In accordance with Retrosheet’s terms of use, please note the following statement: “The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at” Pitcher performance data was gathered from FanGraphs. In addition, the people.csv data set found here was used to match player ids from Retrosheet to FanGraphs. Read the rest of this entry »

Peter Alonso Has Adjusted — And Fast

Peter Alonso began 2019 by pummeling a belt-high fastball over the center field wall on the first pitch he saw in Spring Training. He has not stopped hitting since. Over his first 66 games in the regular season, he has slashed .254/.337/.596 with 22 homers and a .382 wOBA. The stats are impressive, but perhaps the most notable aspect of his success has been his ability to modify his approach in short order.

Over the first few weeks of the season, Alonso built an early reputation as a low-ball hitter. Even pitches well below the strike zone were getting sent over the fence. His slugging percentage per pitch by zone reflects this low-ball dominance:

Luckily for Alonso, pitchers had not yet caught on to his affinity for the low pitch. The pitch distribution chart below reveals that he was seeing a plurality of pitches at or below the middle of the zone.

This proved a lethal combination, as Alonso steamrolled his way through April. Read the rest of this entry »