With so many complex statistics out there, I wondered if there was an easier way to project winning percentage or runs, one that stays simple yet captures more detail than Bill James’ classic Pythagorean Win Expectancy. To create a statistic like that, I would have to create one comprehensive stat for offense and one for pitching. Ultimately, I came up with the following and named them “Run Value” and “Pitching Run Value,” respectively.
RVAL = ( ( TB + BB – SO )/4) + RBI + HR
PRVAL = ( ( ( H + BB – SO )/4 ) + HR) × FIP
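To make the arithmetic concrete, here is a minimal Python sketch of the two formulas. The function names and the round-number team totals are hypothetical and chosen only for illustration; they are not real team data.

```python
def rval(tb, bb, so, rbi, hr):
    """Batting Run Value: (TB + BB - SO) / 4 + RBI + HR."""
    return (tb + bb - so) / 4 + rbi + hr


def prval(h, bb, so, hr, fip):
    """Pitching Run Value: ((H + BB - SO) / 4 + HR) * FIP."""
    return ((h + bb - so) / 4 + hr) * fip


# Hypothetical season totals, for illustration only.
team_rval = rval(tb=2400, bb=500, so=1300, rbi=720, hr=200)
team_prval = prval(h=1400, bb=500, so=1300, hr=180, fip=4.0)

print(team_rval)   # 1320.0
print(team_prval)  # 1320.0
```

In the batting example, 1,600 net bases count as 400 "four-base" runs before RBI and home runs are added on top.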
These two metrics are used for teams. In the batting RVal formula, the higher the better. I tried to get down to the pure number of runs that a player or team produces by using the very relaxed definition of a run being four bases. In the pitching PRVal formula, the lower the better. I did something very similar to the batting stat by trying to get the pure run total. I then put the two stats into the win expectancy formula:
RVALWinExp = RVal^1.83 / ( RVal^1.83 + PRVal^1.83)
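The win expectancy step can be sketched the same way. The function name and the input values below are hypothetical. The classic Bill James formula uses runs scored and runs allowed with an exponent of 2, so the same function covers both versions by changing the inputs and the exponent.

```python
def win_expectancy(produced, allowed, exponent=1.83):
    """Pythagorean-style expectation: produced^e / (produced^e + allowed^e)."""
    p = produced ** exponent
    a = allowed ** exponent
    return p / (p + a)


# Hypothetical values: a team whose RVal exceeds its PRVal projects above .500.
rval_win_exp = win_expectancy(1400.0, 1300.0)

# Equal inputs always project to exactly .500, whatever the exponent.
even_team = win_expectancy(1320.0, 1320.0)
print(even_team)  # 0.5
```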
I then ran a program in R to see how closely this stat correlates to actual team win percentage for all teams from the 1998 season through the 2018 season. In addition, I tested to see how Bill James’ win expectancy formula correlates to team win percentage over the same period of time. The results are below.
Bill James’ expected W.L.% → Actual W.L.%
r squared = .88
standard error = .0247
RVal W.L.% → Actual W.L.%
r squared = .853
standard error = .0273
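For readers who want to reproduce this kind of comparison, here is a rough Python sketch of the calculation (my analysis was run in R). It assumes that the standard error reported is the residual standard error of a one-predictor linear fit, which is what R prints in the summary of an lm model. The function name and the five data points are made up for illustration.

```python
import math


def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, with r squared and
    residual standard error (n - 2 degrees of freedom)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    std_error = math.sqrt(ss_res / (n - 2))
    return slope, intercept, r_squared, std_error


# Made-up expected and actual winning percentages, for illustration only.
expected_wl = [0.40, 0.45, 0.50, 0.55, 0.60]
actual_wl = [0.42, 0.44, 0.51, 0.54, 0.61]

slope, intercept, r_squared, std_error = fit_line(expected_wl, actual_wl)
print(round(r_squared, 3), round(std_error, 4))
```

Running the same function once with Bill James’ expectation as x and once with the RVal expectation as x gives the two pairs of numbers being compared.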
Since I found that my stat had pretty good correlation to team win percentage, I decided to figure out if the offensive RVAL was better than pitching PRVAL or vice versa. So I tested RVAL against runs scored and PRVAL against runs against. The results are below:
RVAL → Runs Scored
r squared = .947
standard error = 22.8
PRVAL → Runs Against
r squared = .815
standard error = 94.7
The graphs and stats show that the offensive statistic is much more accurate than the pitching stat. Since I was getting such good results with the offensive stat, I decided to test it further, this time against Base Runs, a currently popular measure of offensive production. The results are pictured below:
Base Runs → Runs
r squared = .927
standard error = 21.9
My stat had a higher r squared value by .02 (.947 versus .927), although its standard error was slightly higher (22.8 versus 21.9), which indicates that my statistic is about as good as Base Runs. I realized that my stat must have some sort of significance to be this accurate in predicting runs. I then looked at a histogram of RVAL, compared it to a histogram of runs, and saw that RVAL was too high and too spread out to mirror the distribution of runs. This is what it looked like:
I then found the mean of the runs data and scaled RVAL down so that the centers of each distribution were the same. I then adjusted the data by shrinking the standard deviation to look similar to the histogram of runs and ended up with the following equation:
AdjustedRVAL = (RVAL/1.753685) – ( ( RVAL-745.1053 ) / 5)
This new formula first divides each value so that the mean of the data matches the mean of the runs data (745.1053), and then subtracts one-fifth of the difference between that scaled value and the mean. In other words, it moves each data point one-fifth of the way toward the mean, bringing the data closer together. Comparing the two histograms again, I found this:
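Here is a small Python sketch of that adjustment. It reads the second RVAL in the formula as the already-scaled value, which is an assumption based on the description above. If the raw value were used in the second term, the mean would no longer land on 745.1053. The function names and sample values are hypothetical.

```python
import statistics


def shrink_toward_mean(raw, scale, target_mean, divisor):
    """Rescale a raw value, then pull it 1/divisor of the way toward target_mean."""
    scaled = raw / scale
    return scaled - (scaled - target_mean) / divisor


def adjusted_rval(rval):
    # Constants are taken from the AdjustedRVAL formula above.
    return shrink_toward_mean(rval, scale=1.753685, target_mean=745.1053, divisor=5)


# Hypothetical raw RVAL values, for illustration only.
raw_values = [1150.0, 1300.0, 1450.0]
scaled_values = [v / 1.753685 for v in raw_values]
adjusted_values = [adjusted_rval(v) for v in raw_values]

# Moving every point one-fifth of the way to the mean leaves four-fifths of the spread.
spread_ratio = statistics.pstdev(adjusted_values) / statistics.pstdev(scaled_values)
print(round(spread_ratio, 3))  # 0.8
```

Under this reading, a value that scales exactly to the mean is left unchanged, so the center of the distribution stays put while the spread narrows.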
The new standard error for AdjustedRVAL was 18.2 and the r squared value remained the same. Now that these distributions look very similar to one another, I could see that my statistic pretty accurately predicts team runs scored. I then went back to the pitching statistic, PRVAL, because its results weren’t terrible, and tried to do something with it. I looked at the histogram:
I had already adjusted the mean of the data, which is why it appears centered in the mid-700s. I then did the same thing I had done with RVAL, bringing the data closer together, and ended up with the equation below:
AdjustedPRVAL = (PRVAL/2.2173) – ((PRVAL – 744.31)/1.6)
This equation gave me the following comparison:
The new standard deviation of AdjustedPRVAL was 35.5 and the r squared value remained the same. This depiction of the adjusted pitching statistic shows a pretty good relationship between it and runs against for teams dating back to 1998. The relationship is not as strong as the offensive one, given the r squared of .815, but it is not too shabby.
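The unchanged r squared is what you would expect, because the adjustment is a straight-line transformation and r squared does not change when the predictor is rescaled or shifted. The Python sketch below checks this on made-up numbers. As with the batting version, it assumes the second PRVAL in the formula is the scaled value. The function names and data are hypothetical.

```python
import math


def pearson_r_squared(x, y):
    """Squared Pearson correlation, which equals r squared for a one-predictor fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def adjusted_prval(prval):
    # Constants are taken from the AdjustedPRVAL formula above.
    scaled = prval / 2.2173
    return scaled - (scaled - 744.31) / 1.6


# Made-up PRVAL and runs-against values, for illustration only.
prval_values = [1500.0, 1600.0, 1650.0, 1750.0, 1800.0]
runs_against = [690.0, 720.0, 760.0, 770.0, 820.0]

before = pearson_r_squared(prval_values, runs_against)
after = pearson_r_squared([adjusted_prval(v) for v in prval_values], runs_against)
print(math.isclose(before, after))  # True
```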
After creating this statistic and testing it out, I have found that it doesn’t take a complicated metric to make accurate predictions, as even simple stats created by trial and error can have interesting outcomes. I plan to continue to use this statistic and test its accuracy, and I am also developing a version that can be applied specifically to position players.
I am a high school senior at Xaverian Brothers High School in Massachusetts. I am a sabermetrics enthusiast and Red Sox diehard. I love Moneyball and I have created my own custom statistics. I also use R to analyze baseball stats and am currently learning Java in AP Computer Science.