Created Statistic: Run Value

by Joshua Mould

February 5, 2019

With so many complex statistics out there, I wondered if there was an easier way to project winning percentage or runs: something that stays simple but is a step more detailed than Bill James’ classic Pythagorean Win Expectancy. To do that, I needed one comprehensive stat for offense and one for pitching. I came up with the following and named them “Run Value” (RVAL) and “Pitching Run Value” (PRVAL), respectively.

RVAL = ( ( TB + BB – SO )/4) + RBI + HR

PRVAL = ( ( ( H + BB – SO )/4 ) + HR ) × FIP

Both metrics are computed for teams. For the batting stat, RVAL, higher is better. I tried to get down to the pure number of runs that a player or team produces by using a very relaxed definition of a run as four bases. For the pitching stat, PRVAL, lower is better. It takes a very similar approach to the batting stat to estimate a pure run total allowed. I then put the two stats into the win expectancy formula:

RVALWinExp = RVAL^1.83 / ( RVAL^1.83 + PRVAL^1.83 )
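As a minimal sketch, the three formulas above can be written as Python functions. The function names and the season totals in the example are hypothetical and chosen only for illustration; they do not describe a real team.

```python
def rval(tb, bb, so, rbi, hr):
    """Batting Run Value: (TB + BB - SO) / 4 + RBI + HR."""
    return (tb + bb - so) / 4 + rbi + hr


def prval(h, bb, so, hr, fip):
    """Pitching Run Value: ((H + BB - SO) / 4 + HR) * FIP."""
    return ((h + bb - so) / 4 + hr) * fip


def rval_win_exp(rval_value, prval_value, exponent=1.83):
    """Pythagorean-style win expectancy built from RVAL and PRVAL."""
    r = rval_value ** exponent
    p = prval_value ** exponent
    return r / (r + p)


# Hypothetical season totals, for illustration only (not a real team).
team_rval = rval(tb=2400, bb=550, so=1300, rbi=750, hr=200)
team_prval = prval(h=1400, bb=500, so=1350, hr=180, fip=4.00)
team_win_exp = rval_win_exp(team_rval, team_prval)
```

When RVAL and PRVAL are equal, the expectancy is exactly .500, and it rises above .500 as RVAL grows relative to PRVAL.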

I then ran a program in R to see how closely this stat correlates to actual team win percentage for all teams from the 1998 season through the 2018 season. In addition, I tested to see how Bill James’ win expectancy formula correlates to team win percentage over the same period of time. The results are below.

Bill James’ expected W.L.% → Actual W.L.%

r squared = .88

standard error = .0247

RVal W.L.% → Actual W.L.%

r squared = .853

standard error = .0273
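The two numbers reported for each comparison can be reproduced from a simple linear regression. The sketch below is in Python rather than R and assumes that "standard error" means the residual standard error with n − 2 degrees of freedom, as printed by R's summary of a linear model; that reading is an assumption. The function name and the toy data are hypothetical.

```python
import math


def linear_fit_stats(x, y):
    """Fit y = a + b*x by least squares.

    Returns (r_squared, residual_standard_error).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / syy
    # Assumed definition: residual standard error on n - 2 degrees of freedom.
    rse = math.sqrt(sse / (n - 2))
    return r_squared, rse


# Toy data, for illustration only: expected vs. actual winning percentage.
expected = [0.40, 0.45, 0.50, 0.55, 0.60]
actual = [0.42, 0.44, 0.51, 0.54, 0.61]
r2, rse = linear_fit_stats(expected, actual)
```

A higher r squared and a lower standard error both mean the expected winning percentage tracks the actual one more closely.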

Since I found that my stat had pretty good correlation to team win percentage, I decided to figure out if the offensive RVAL was better than pitching PRVAL or vice versa. So I tested RVAL against runs scored and PRVAL against runs against. The results are below:

RVAL → Runs Scored

r squared = .947

standard error = 22.8

PRVAL → Runs Against

r squared = .815

standard error = 94.7

The graphs and stats show that the offensive statistic is much more accurate than the pitching stat. Since I was getting such good results with the offensive stat, I decided to test it further, this time against a currently popular measure of offensive production, Base Runs. The results are below:

Base Runs → Runs

r squared = .927

standard error = 21.9

My stat had a higher r squared value by .02 (.947 versus .927), with a slightly higher standard error (22.8 versus 21.9), which indicates that my statistic is about as good as Base Runs. I realized that my stat must have some sort of significance to be this accurate in predicting runs. I then took a look at a histogram of RVAL and compared it to a histogram of runs, and saw that RVAL was too high and too spread out to mirror the distribution of runs. This is what it looked like:

I then found the mean of the runs data and scaled RVAL down so that the centers of each distribution were the same. I then adjusted the data by shrinking the standard deviation to look similar to the histogram of runs and ended up with the following equation:

AdjustedRVAL = (RVAL/1.753685) – ( ( RVAL-745.1053 ) / 5)
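As a sketch of what this adjustment does, the Python below splits it into two steps: rescale RVAL to the runs scale, then shrink toward the mean of runs. It assumes the one-fifth shrink is applied to the rescaled value rather than to raw RVAL, so that a team with average RVAL lands exactly on the runs mean. That reading is an assumption, and the function and constant names are illustrative.

```python
RUNS_MEAN = 745.1053  # mean of team runs scored, taken from the formula
SCALE = 1.753685      # divisor that moves the RVAL mean onto RUNS_MEAN


def adjusted_rval(rval, shrink=1 / 5):
    """Rescale RVAL, then pull it `shrink` of the way toward RUNS_MEAN."""
    # Step 1: put RVAL on the same scale as runs.
    scaled = rval / SCALE
    # Step 2 (assumed): shrink the rescaled value toward the runs mean.
    return scaled - (scaled - RUNS_MEAN) * shrink


# A team with exactly average raw RVAL maps onto the runs mean.
average_team = adjusted_rval(SCALE * RUNS_MEAN)
```

Because the transformation is linear, it cannot change the r squared of a straight-line fit; it only moves the center and narrows the spread. Deviations from the mean shrink by a factor of 0.8, and 22.8 × 0.8 is about 18.2.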

This new formula divides each value to make the mean of the data the same as the mean of the runs data and then subtracts one-fifth of the difference between the value and the mean of the data. That means that it moves each data point one fifth of the way closer to the mean, bringing the data closer together. Comparing the two histograms again, I found this:

The new standard error for AdjustedRVAL was 18.2 and the r squared value remained the same. Now that these distributions look very similar to one another, I could see that my statistic predicts team runs scored fairly accurately. I then went back to the pitching statistic, PRVAL. Its results weren’t terrible, so I tried the same kind of adjustment on it. I looked at the histogram:

I already adjusted the mean of the data, which is why it seems centered in the mid-700s. I then did the same thing as I did to RVAL and brought the data closer together and ended up with the equation below:

AdjustedPRVAL = (PRVAL/2.2173) – ((PRVAL – 744.31)/1.6)

This equation gave me the following comparison:

The new standard error for AdjustedPRVAL was 35.5 and the r squared value remained the same. This depiction of the adjusted pitching statistic shows a pretty good relationship between it and runs against for teams dating back to 1998. The relationship is not as strong as the offensive one, given the r squared of .815, but it is not too shabby.

After creating this statistic and testing it out, I have found that it doesn’t take a complicated metric to make accurate predictions, as even simple stats created by trial and error can have interesting outcomes. I plan to continue to use this statistic and test its accuracy, and I am also developing a version that can be applied specifically to position players.

Sophomore Computer Science and Statistics double major at Villanova. I am a Red Sox Diehard. I have used R to analyze baseball stats and am also proficient in Java and familiar with Python.

6 Comments

London Yank

6 years ago

Joshua,

You’re doing really good work. I have a few comments:

1. You want to keep in mind what your question is. For example, Pythagorean record asks “What should this team’s record be based on how many runs they have scored and conceded?” Base Runs asks “How many runs should this team have scored and conceded based on their batting and pitching outcomes? And, based on the run differential this team should have had, how many games should they have won?” It isn’t clear what question your statistic is designed to answer because, for example, you have mixed batting outcomes (e.g. total bases) and measures of actual runs in your formulation (e.g. RBI). Do some reading about predictive model building in R. You don’t want to just randomly hunt for the things that give the best R squared. You want to think about what your question is and what factors are important and why. Including runs scored and runs against in your statistic will give you a high R squared, but it will not tell you much about baseball that we don’t already know if your question is how to build a good team going forward.

2. The components of your RVal statistic are not independent of each other. For example, every HR is also an RBI.

3. Following on from comment 1, by including RBI you are basically using a measure of runs, since every RBI is an actual run scored. For the model to be of any real use in a predictive sense you want to just use outcomes in building the metric (e.g. walks, doubles, etc.).

4. Just plot a single line of best fit for each comparison. Breaking these down by team doesn’t make much sense since you are not asking questions about between team variation.

5. Your comparison of your RVAL statistic to actual runs scored doesn’t tell us much. You are including runs scored in building your statistic (in the form of RBI and HR), so a priori we know they aren’t independent measures. This violates the assumptions of the test.

6. Make your plots prettier and more informative. Get rid of the stock ggplot gray background, and label your axes in human readable ways: e.g. + theme_bw() + labs(x="my X label", y="my Y label")

Finally, this is extremely impressive work for a high school student. You are ahead of many university students. Well done!

Joshua Mould

6 years ago

Reply to London Yank

Thank you for your comments. I was hoping to get some good feedback because I am trying to learn the language on my own and also get a better understanding of what quantitative analysts are thinking about when they analyze baseball stats.

D4P

6 years ago

A few comments:

1. By including both RBI and HR in the RVAL formula, it seems like you are double-counting the runs scored by the hitter of a HR, in that said runs count both in HR and RBI. Is that a problem?

2. Using RBI to “predict runs” seems a bit redundant, in the sense that RBI is already a measure of (at least some of the) runs that have been scored. It seems like you’re essentially saying “Runs scored increases with the number of runs that were scored by being batted in”, which doesn’t seem like two different variables as much as two variables that measure much of the very same thing…

Joshua Mould

6 years ago

Reply to D4P

I was purposefully trying to count both RBI and HR in the formula because both are important in offensive production. Home runs often mean a larger number of runs were scored, and RBI account for players being able to hit well when their teammates are in scoring position: situational hitting. I believe this is important in a team statistic. Is there a generic problem with that, or does it depend on the goal of the stat?

I’m actually not sure if I’m trying to predict runs exactly or not, but most likely not. I’m mostly experimenting and trying to see if I can create a stat of offensive efficiency and production combined, which isn’t the best idea usually, but I definitely understand what you’re saying.

London Yank

6 years ago

Reply to Joshua Mould

At the team level, RBI and Runs Scored are basically the same thing. Every time a team gets one RBI, it also gets one run. The small discrepancy between team runs and team RBI is due to rare run scoring events such as a run that scores on a double play. So, predicting runs from RBI is a bit like predicting people’s heights by measuring them from their feet to their nose. It will be highly accurate but it doesn’t tell you much since the length from foot to nose is just directly measuring 95% of the thing you want to know about.

If you want to incorporate situational hitting you will want to use something that captures batter outcomes by situation. I think if you do this you’ll find that you end up with an overfit model.

Lanidrac

6 years ago

OK, nice work, but is it really necessary to create a new, more complicated metric that is only a little better than what we already have with the other two metrics?
