Created Statistic: Run Value

With so many complex statistics out there, I wondered if there was an easier way to project winning percentage or runs: something still simple, but a step more detailed than Bill James’ classic Pythagorean Win Expectancy. To do that, I needed one comprehensive stat for offense and one for pitching. Ultimately, I came up with the following and named them “Run Value” and “Pitching Run Value,” respectively.

RVAL = ( ( TB + BB – SO )/4) + RBI + HR  

PRVAL = ( ( ( H + BB – SO )/4 ) + HR) x FIP

Both metrics are computed at the team level. For the batting stat, RVAL, higher is better. I tried to get down to the pure number of runs that a player or team produces by using a very relaxed definition of a run as four bases. For the pitching stat, PRVAL, lower is better. It follows the same idea, estimating the pure run total allowed. I then put the two stats into the win expectancy formula:

RVALWinExp = RVAL^1.83 / ( RVAL^1.83 + PRVAL^1.83 )
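The three formulas above translate directly into code. What follows is a minimal sketch in Python rather than the R program used for the article; the function names and the season totals are made up for illustration and do not describe any real team.

```python
def run_value(tb, bb, so, rbi, hr):
    """RVAL = ((TB + BB - SO) / 4) + RBI + HR, from team batting totals."""
    return (tb + bb - so) / 4 + rbi + hr


def pitching_run_value(h, bb, so, hr, fip):
    """PRVAL = (((H + BB - SO) / 4) + HR) x FIP, from team pitching totals."""
    return ((h + bb - so) / 4 + hr) * fip


def rval_win_expectancy(rval, prval, exponent=1.83):
    """Pythagorean-style win expectancy built from RVAL and PRVAL."""
    return rval ** exponent / (rval ** exponent + prval ** exponent)


# Hypothetical season totals, chosen only to exercise the formulas.
team_rval = run_value(tb=2400, bb=550, so=1350, rbi=750, hr=200)
team_prval = pitching_run_value(h=1400, bb=500, so=1300, hr=180, fip=4.0)
team_win_exp = rval_win_expectancy(team_rval, team_prval)

print(team_rval, team_prval, round(team_win_exp, 3))
```

Because RVAL is a "higher is better" number and PRVAL is "lower is better," a team whose RVAL exceeds its PRVAL comes out above .500, just as a team that outscores its opponents does in the classic formula.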

I then ran a program in R to see how closely this stat correlates to actual team win percentage for all teams from the 1998 season through the 2018 season. In addition, I tested to see how Bill James’ win expectancy formula correlates to team win percentage over the same period of time. The results are below.

Bill James’ expected W.L.%  → Actual W.L.%

r squared = .88

standard error = .0247

RVal W.L.% → Actual W.L.%

r squared = .853

standard error = .0273
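For readers who want to reproduce this kind of comparison, here is a minimal sketch of how an r squared and a standard error can be obtained from a simple linear regression. It assumes the "standard error" reported here is the residual standard error that R prints for a fitted linear model, which the article does not state outright. The helper names and the five data points are hypothetical, not the 1998 to 2018 team data.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared_and_std_error(x, y):
    """Return (r squared, residual standard error) for the fitted line."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    # Assumption: residual standard error with n - 2 degrees of freedom.
    std_error = math.sqrt(ss_res / (len(y) - 2))
    return r2, std_error


# Made-up expected and actual winning percentages for five teams.
expected_wpct = [0.40, 0.45, 0.50, 0.55, 0.60]
actual_wpct = [0.42, 0.44, 0.51, 0.53, 0.61]

r2, std_error = r_squared_and_std_error(expected_wpct, actual_wpct)
print(round(r2, 3), round(std_error, 4))
```

Running the same function once with Bill James’ expected winning percentage and once with the RVAL-based one gives the two pairs of numbers being compared.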

Since I found that my stat had pretty good correlation to team win percentage, I decided to figure out if the offensive RVAL was better than pitching PRVAL or vice versa. So I tested RVAL against runs scored and PRVAL against runs against. The results are below:

RVAL → Runs Scored

r squared = .947

standard error = 22.8

PRVAL → Runs Against

r squared = .815

standard error = 94.7

The graphs and stats show that the offensive statistic tracks runs much more closely than the pitching statistic does. Since the offensive stat was giving such good results, I decided to test it further, this time against Base Runs, a currently popular measure of offensive production. The results are pictured below:

Base Runs → Runs

r squared = .927

standard error = 21.9

My stat had a higher r squared value by .02, which indicates that my statistic is about as good as Base Runs. I realized that my stat must have some sort of significance to be this accurate in predicting runs. I then took a look at a histogram of RVAL and compared it to a histogram of Runs and saw that RVAL was too high and too spread out to mirror a distribution of runs. This is what it looked like:

I then found the mean of the runs data and scaled RVAL down so that the two distributions shared the same center. Next, I shrank RVAL’s standard deviation so that its histogram looked similar to the histogram of runs, and ended up with the following equation:

AdjustedRVAL = (RVAL/1.753685) – ( ( (RVAL/1.753685) – 745.1053 ) / 5 )

This new formula divides each value to make the mean of the data the same as the mean of the runs data and then subtracts one-fifth of the difference between the value and the mean of the data. That means that it moves each data point one fifth of the way closer to the mean, bringing the data closer together. Comparing the two histograms again, I found this:


The new standard error for AdjustedRVAL was 18.2, and the r squared value remained the same. Now that the distributions looked very similar to one another, I could see that my statistic predicts team runs scored fairly accurately. I then went back to the pitching statistic, PRVAL, since its results weren’t terrible, to see what I could do with it. I looked at the histogram:

I had already adjusted the mean of the data, which is why it appears centered in the mid-700s. I then did the same thing I did with RVAL, bringing the data closer together, and ended up with the equation below:

AdjustedPRVAL = (PRVAL/2.2173) – ( ( (PRVAL/2.2173) – 744.31 ) / 1.6 )

This equation gave me the following comparison:


The new standard error for AdjustedPRVAL was 35.5, and the r squared value remained the same. The adjusted pitching statistic shows a pretty good relationship with runs against for teams dating back to 1998. With an r squared of .815, the relationship is not as strong as on the offensive side, but it is not too shabby.
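Both adjustments are linear rescalings, which is why the r squared values do not move. The sketch below is a hedged illustration of that point: it reads each adjustment the way the text describes it, dividing by the scale factor first and then moving the rescaled value part of the way toward the target mean. That reading is an assumption about intent. The helper names and the sample numbers are hypothetical.

```python
def shrink_toward_mean(value, scale, target_mean, divisor):
    """Rescale value, then move it 1/divisor of the way toward target_mean."""
    scaled = value / scale
    return scaled - (scaled - target_mean) / divisor


def adjusted_rval(rval):
    return shrink_toward_mean(rval, 1.753685, 745.1053, 5)


def adjusted_prval(prval):
    return shrink_toward_mean(prval, 2.2173, 744.31, 1.6)


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical raw RVAL values and runs scored for five teams.
raw_rval = [1150.0, 1240.0, 1310.0, 1390.0, 1460.0]
runs = [650, 710, 745, 790, 840]

adjusted = [adjusted_rval(v) for v in raw_rval]
r2_before = r_squared(raw_rval, runs)
r2_after = r_squared(adjusted, runs)
print(round(r2_before, 6), round(r2_after, 6))
```

Under this reading, the spread of the rescaled values shrinks by a factor of 1 – 1/5 = 0.8 for RVAL and 1 – 1/1.6 = 0.375 for PRVAL, which lines up with the reported drops from 22.8 to 18.2 and from 94.7 to 35.5.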

After creating this statistic and testing it out, I have found that it doesn’t take a complicated metric to accurately make predictions, as even simple stats created by trial and error can have interesting outcomes. I plan to continue to use this statistic and test its accuracy, and I also am developing a version that can be applied specifically to position players as well.





Sophomore Computer Science and Statistics double major at Villanova. I am a Red Sox Diehard. I have used R to analyze baseball stats and am also proficient in Java and familiar with Python.

6 Comments
London Yank
5 years ago

Joshua,

You’re doing really good work. I have a few comments:

1. You want to keep in mind what your question is. For example, Pythagorean record asks “What should this team’s record be based on how many runs they have scored and conceded?” Base Runs asks “How many runs should this team have scored and conceded based on their batting and pitching outcomes? And, based on the run differential this team should have had, how many games should they have won?” It isn’t clear what question your statistic is designed to answer because, for example, you have mixed batting outcomes (e.g. total bases) and measures of actual runs in your formulation (e.g. RBI). Do some reading about predictive model building in R. You don’t want to just randomly hunt for the things that give the best R squared. You want to think about what your question is and what factors are important and why. Including runs scored and runs against in your statistic will give you a high R squared, but it will not tell you much about baseball that we don’t already know if your question is how to build a good team going forward.

2. The components of your RVal statistic are not independent of each other. For example, every HR is also an RBI.

3. Following on from comment 1, by including RBI you are basically using a measure of runs, since every RBI is an actual run scored. For the model to be of any real use in a predictive sense you want to just use outcomes in building the metric (e.g. walks, doubles, etc.).

4. Just plot a single line of best fit for each comparison. Breaking these down by team doesn’t make much sense since you are not asking questions about between team variation.

5. Your comparison of your RVAL statistic to actual runs scored doesn’t tell us much. You are including runs scored in building your statistic (in the form of RBI and HR), so a priori we know they aren’t independent measures. This violates the assumptions of the test.

6. Make your plots prettier and more informative. Get rid of the stock ggplot gray background, and label your axes in human readable ways: e.g. + theme_bw() + labs(x="my X label", y="my Y label")

Finally, this is extremely impressive work for a high school student. You are ahead of many university students. Well done!

D4Pmember
5 years ago

A few comments:

1. By including both RBI and HR in the RVAL formula, it seems like you are double-counting the runs scored by the hitter of a HR, in that said runs count both in HR and RBI. Is that a problem?

2. Using RBI to “predict runs” seems a bit redundant, in the sense that RBI is already a measure of (at least some of the) runs that have been scored. It seems like you’re essentially saying “Runs scored increases with the number of runs that were scored by being batted in”, which doesn’t seem like two different variables as much as two variables that measure much of the very same thing…

London Yank
5 years ago
Reply to  Joshua Mould

At the team level, RBI and Runs Scored are basically the same thing. Every time a team gets one RBI, it also gets one run. The small discrepancy between team runs and team RBI is due to rare run scoring events such as a run that scores on a double play. So, predicting runs from RBI is a bit like predicting people’s heights by measuring them from their feet to their nose. It will be highly accurate but it doesn’t tell you much since the length from foot to nose is just directly measuring 95% of the thing you want to know about.

If you want to incorporate situational hitting you will want to use something that captures batter outcomes by situation. I think if you do this you’ll find that you end up with an overfit model.

Lanidrac
5 years ago

OK, nice work, but is it really necessary to create a new, more complicated metric that is only a little better than what we already have with the other two metrics?