How well do you think you can predict the future of a minor leaguer? My computer may be able to help. Towards the end of the regular season, I found the prospects page at FanGraphs and started experimenting with it. I have always had a lot of fun thinking about the future and predicting outcomes, so I decided to try to build a model to predict whether or not a prospect would make it to the majors. I had all the data I needed thanks to FanGraphs, and I had recently been looking into similar models built by others to figure out how I could accomplish this project. I realized that all these articles I was reading detailed the results of their models, but not the code and behind-the-scenes work that goes into creating them.
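With that in mind, I decided to figure it out on my own. I had a good idea of what statistics I wanted to use, but there were a few issues I needed to consider before I started throwing data around, among them that prospects can spend multiple years at a single level and that not every prospect plays at every level of the minors before reaching the majors.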
Prospects playing multiple years at a single level aren’t too difficult to deal with because I can just aggregate the stats from those seasons. The fact that not all prospects play at every level of the minor leagues before reaching the majors is tougher, because it leaves a lot of missing data that has to be handled before building the model. I decided to replace each missing value with the mean of the existing data, and I created indicator variables to record whether or not a player’s season stats for a particular level of the minor leagues were real. To make this model useful, I would also want to take out certain variables. For example, I figured I wouldn’t need or want Triple-A stats in the model because, typically, once a player has reached that level of the minors, you are more interested in how well they will do in the majors.
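The mean-plus-indicator approach described above can be sketched in a few lines. This is only an illustration in plain Python, not the code behind the model: the column names (`A_ops`, `AA_ops`), the `impute_with_flags` helper, and every number in the toy records are hypothetical.

```python
from statistics import mean

# Hypothetical toy records, one dict per prospect. None marks a level the
# player never appeared at. All names and values are made up for illustration.
prospects = [
    {"name": "Player A", "A_ops": 0.800, "AA_ops": 0.700},
    {"name": "Player B", "A_ops": 0.600, "AA_ops": None},
    {"name": "Player C", "A_ops": None, "AA_ops": 0.900},
]


def impute_with_flags(rows, columns):
    """Fill missing values with the column mean and add a 0/1 'real' flag.

    Assumes every column has at least one observed (non-None) value.
    """
    filled = [dict(row) for row in rows]  # shallow copies; input is untouched
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        col_mean = mean(observed)
        for row in filled:
            was_real = row[col] is not None
            row[f"{col}_real"] = 1 if was_real else 0
            if not was_real:
                row[col] = col_mean
    return filled


imputed = impute_with_flags(prospects, ["A_ops", "AA_ops"])
for row in imputed:
    print(row)
```

Keeping the flag next to the filled-in value lets a model tell a real league-average season apart from a placeholder for a level the player skipped.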
I recently began thinking about how teams can know whether they are spending their money efficiently, and where teams actually get the runs that they spend all that money on. With players signing massive contracts in the 2018-19 offseason, I began to wonder if any player was really worth that much money. The process begins with one big question: What is a run worth? I quickly realized that, in theory, every team needs to manufacture the same number of runs as every other team if it wants a better chance to make the postseason. What differs from team to team is budget. This means that a run is worth a different monetary value to each team, and that each team would be willing to pay a different amount of money for the same number of runs. The problem is that, from a player’s side, a run costs the same amount no matter which team is paying, which is why Billy Beane, played by Brad Pitt in the movie Moneyball, claims that “It’s an unfair game.”
Figuring out how each team values its runs would let me evaluate how efficient certain contracts were for each team and, furthermore, figure out where the most value in a team’s payroll comes from. First, I had to figure out how to convert a player’s basic statistics into the number of runs that player actually contributed to the team. I eventually came across the Estimated Runs Produced (ERP) statistic from the 1985 Bill James Baseball Abstract. Below is the calculation.
ERP = (2 x (TB + BB + HBP) + H + SB – (0.605 x (AB + CS + GDP – H))) x 0.16
This stat was created by Paul Johnson to be more accurate than Runs Created, a goal he achieved. I then fired up R and ran some tests on team statistics to see how well ERP lined up with the actual number of runs that each team scored. I first graphed ERP against runs scored for every team dating back to the beginning of the 30-team era in MLB.
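For readers who want to try the ERP calculation themselves, here is a minimal sketch in Python (the tests described above were run in R). It assumes the formula's adjacent terms are multiplied, that is, 2 times the on-base group, 0.605 times the outs group, and 0.16 times the whole expression. The function name and the sample team line are hypothetical, not real data.

```python
def estimated_runs_produced(tb, bb, hbp, h, sb, ab, cs, gdp):
    """Estimated Runs Produced, written with every multiplication explicit."""
    bases_gained = 2 * (tb + bb + hbp) + h + sb
    outs_made = ab + cs + gdp - h  # at-bat outs plus outs on the bases
    return (bases_gained - 0.605 * outs_made) * 0.16


# Made-up team season totals, used only to exercise the function.
erp = estimated_runs_produced(
    tb=2300, bb=550, hbp=60, h=1400, sb=90, ab=5500, cs=30, gdp=120
)
print(round(erp, 1))
```

Running the same function over a table of real team seasons and comparing the output with actual runs scored would reproduce the kind of check described above.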
With so many complex statistics out there, I wondered if there was an easier way to project winning percentage or runs, one that is simple, yet a bit more involved than Bill James’ classic Pythagorean Win Expectancy. To build a statistic like that, I would need one comprehensive stat for offense and one for pitching. Ultimately, I came up with the following and named them “Run Value” and “Pitching Run Value,” respectively.
RVal = ((TB + BB – SO) / 4) + RBI + HR
PRVal = (((H + BB – SO) / 4) + HR) x FIP
These two metrics are calculated for teams. For the batting stat, RVal, higher is better. I tried to get down to the pure number of runs that a player or team produces, using the very relaxed definition of a run as four bases. For the pitching stat, PRVal, lower is better. I did something very similar to the batting stat, again trying to get the pure run total. I then put the two stats into the win expectancy formula:
RVALWinExp = RVal^1.83 / ( RVal^1.83 + PRVal^1.83)
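The three formulas above translate directly into code. The sketch below is in Python rather than R. The function names and the sample team totals are hypothetical, and the only logic is what the formulas state.

```python
def rval(tb, bb, so, rbi, hr):
    """Batting Run Value: net bases divided by four, plus RBI and HR."""
    return (tb + bb - so) / 4 + rbi + hr


def prval(h, bb, so, hr, fip):
    """Pitching Run Value: the same idea for what the staff allowed, scaled by FIP."""
    return ((h + bb - so) / 4 + hr) * fip


def rval_win_exp(rv, prv, exponent=1.83):
    """Pythagorean-style win expectancy built from RVal and PRVal.

    Assumes both inputs are positive; a negative base with a fractional
    exponent would not give a real number.
    """
    return rv ** exponent / (rv ** exponent + prv ** exponent)


# Made-up team totals, used only to show the call pattern.
team_rval = rval(tb=2300, bb=550, so=1300, rbi=720, hr=200)
team_prval = prval(h=1400, bb=500, so=1400, hr=180, fip=4.0)
print(team_rval, team_prval, round(rval_win_exp(team_rval, team_prval), 3))
```

One thing the sketch makes visible is that a strikeout-heavy pitching staff can push `H + BB – SO` toward zero or below, so it is worth checking that PRVal stays positive before raising it to the 1.83 power.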
I then ran a program in R to see how closely this stat correlates with actual team winning percentage for all teams from the 1998 season through the 2018 season. In addition, I tested how well Bill James’ win expectancy formula correlates with team winning percentage over the same period.
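A comparison like this one comes down to computing each estimate per team season and correlating it with the actual winning percentage. The following is a minimal Python sketch of that workflow, not the R program used here. It assumes the classic form of the Pythagorean formula with an exponent of 2 (1.83 is a common alternative), uses Pearson correlation as the measure, and runs on three invented team seasons that stand in for real data.

```python
from math import sqrt


def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean win expectancy; exponent=2 is the classic version."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented seasons: (runs scored, runs allowed, actual winning percentage).
seasons = [(800, 700, 0.560), (700, 700, 0.500), (650, 750, 0.430)]

expected = [pythagorean_win_pct(rs, ra) for rs, ra, _ in seasons]
actual = [win_pct for _, _, win_pct in seasons]
print(round(pearson_r(expected, actual), 3))
```

Swapping the RVal-based expectancy in for `pythagorean_win_pct` and running both over the same seasons would give the two correlations side by side.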