A New Predictive Model for Determining Arbitration Salaries
My name is Rich Rieders and I am a 2015 graduate of Rutgers School of Law. Over the winter, I participated in Tulane University’s 9th Annual Baseball Arbitration Competition and we finished in 2nd place overall out of 40 teams. The arbitration cases used in the competition were Jenrry Mejia v. New York Mets, Lorenzo Cain v. Kansas City Royals, and Mark Trumbo v. Arizona Diamondbacks. My team represented the Royals, Mets and Mark Trumbo in those cases. It was a great experience and I learned a tremendous amount. Those of you who are in law school should absolutely participate. Being in New Orleans is an amazing bonus as well! You can read more about the competition from Tulane’s website and Jerry Crasnick’s ESPN article.
Instead of explaining how arbitration works, I highly recommend reading this article as it will give you an excellent basis for understanding the arbitration process. Just ignore the part about free agency since that’s been done away with now.
To prepare for the competition, I created a database (going back to 2008) consisting of all arbitration awards and all players who signed 1-year contracts avoiding arbitration, along with their respective statistics (note that multi-year contracts are not allowable as player comps for arbitration purposes). Using regression analysis, I was able to determine which statistics correlate most with salary.
Here on FanGraphs we pride ourselves on the use of metrics and the abandonment of traditional stats. That all goes out the window for the arbitration process. The arbitrators, jointly selected by the league and the union, have a background in labor law, not baseball. Those who are baseball fans probably aren’t avid FanGraphs readers, and their exposure is likely to be limited to Wins, Losses, ERA, H, HR, BB, SO, etc. Each side gets 30 minutes to present its case, plus another 15 minutes of rebuttal. You simply don’t have time to teach the panel sabermetrics and argue your case at the same time. And as I will discuss later, the use of predictive stats largely falls outside the scope of an arbitration hearing anyway. However, by using regression analysis we can pinpoint exactly which stats correlate most with eventual salary and which ones don’t.
These stats correlate the most with future salary (R-squared in parentheses):
- SP: W (.6099), IP (.5401), SO (.5368), RA9-WAR (.5166), GS (.4598)
- RP: SV (.7302), SD (.4980), SV% (.3237), SO (.2716), WPA (.2491)
- Hitter: XBH (.7318), RBI (.7188), R (.6382), HR (.6031), PA (.5934)
These stats correlate among the least with future salary:
- SP: ERA (.1018), FIP (.0592), xFIP (.0765), BB% (.0202), HR/FB (.0046)
- RP: ERA (.0202), FIP (.0846), xFIP (.0962), BB% (.0218), LOB% (.0406)
- Hitter: BB% (.0175), BABIP (.0346), Z-Contact% (.0113), UBR (.0035), Def (.0202)
Now that’s not to say only the stats with the highest RSQ matter. Traditional rate stats like K/9 and ERA are still important. Try arguing to a casual fan that a pitcher with an ERA of 2.50 was not as productive as a pitcher with an ERA of 4.00 and see how that goes.
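As a rough sketch of how R-squared values like the ones listed above could be computed, the snippet below ranks a few stats by their squared correlation with salary. It is not the actual database or code behind the numbers in this article: the function names (`r_squared`, `rank_stats`) and the sample rows are invented for illustration, and it assumes each R-squared comes from a simple one-variable fit of salary against a single stat.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation (R^2 of a one-variable linear fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column explains nothing
    return cov ** 2 / (var_x * var_y)


def rank_stats(rows, stat_names, target="salary"):
    """Return (stat, R^2) pairs sorted from strongest to weakest."""
    salaries = [row[target] for row in rows]
    scores = {
        name: r_squared([row[name] for row in rows], salaries)
        for name in stat_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented rows for illustration only; salary is in millions of dollars.
sample_rows = [
    {"W": 5, "ERA": 3.50, "salary": 1.0},
    {"W": 8, "ERA": 4.20, "salary": 1.6},
    {"W": 10, "ERA": 3.10, "salary": 2.1},
    {"W": 12, "ERA": 4.00, "salary": 2.4},
    {"W": 15, "ERA": 3.60, "salary": 3.2},
]

for stat, score in rank_stats(sample_rows, ["ERA", "W"]):
    print(f"{stat}: {score:.4f}")
```

With real data, each row would be one player-season and the stat list would cover every column in the database.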
What we can take away from this is that:
- Traditional stats have a strong correlation, metrics do not.
- Counting stats have a strong correlation, rate stats do not.
- Offense, particularly power, has a strong correlation; defense and baserunning do not.
- The more playing time you receive (PA, IP, G), the more money you are likely to make.
In essence, the overarching principle behind baseball arbitration is that salary is almost wholly dependent on the accumulation of traditional counting stats. Traditional rate stats are used to highlight the differences between comparable players and, in my formula, help prevent outliers.
Individual awards also matter a great deal. In my hearing, it was extremely difficult to argue against Lorenzo Cain when he had won the ALCS MVP and his breakout postseason was fresh in everyone’s mind. Those types of factors are extremely difficult to overcome. For a real-life example, I heard a story from one of our judges that the Giants were planning on going to arbitration with Tim Lincecum in 2010. Lincecum showed up with a Cy Young Award under each arm and within a few hours, a two-year contract was agreed upon.
Keep in mind that for players going through arbitration for the first time, we also consider their career numbers. The correlations are fairly similar for career stats, but with slight improvement for career rate stats. For players going through the process for a second, third or fourth time, we pretty much ignore career statistics.
Before I introduce the model, I want to stress the importance of understanding the purpose of the baseball arbitration process. During the final round at Tulane, we represented the Kansas City Royals against Lorenzo Cain. One of our principal arguments was that Lorenzo Cain had an unsustainable .380 BABIP (highest in MLB, mind you), which is why he batted .300, and that his BA (and the rest of his offensive numbers) would likely regress towards his career averages. The expected regression, along with his low walk rate, would limit his value to the club going forward. It is an argument most of us on FanGraphs would surely have made at the time, but Lorenzo Cain’s awesomeness is a topic for another day.
While this type of logic works perfectly well for free-agent signings or for deciding whether to acquire a player via trade, it does not work for arbitration purposes. The underlying purpose of the arbitration process is to compensate the player for his performance in the previous season, NOT to compensate him based on what we expect he will do the following season. This is absolutely critical. Hence, for arbitration purposes, the argument that a player was lucky, that his performance was unsustainable, or anything along the lines of “he won’t be as good as he was last season” is not permissible. The same holds for underachievers: teams get the benefit at arbitration when a player was “unlucky.”
Keeping all this in mind, I have determined which statistics (and other factors) matter the most when it comes to arbitration salaries and have created a formula that can accurately predict the salaries of future players by plugging in certain statistics. You may have seen similar work featured on MLBTradeRumors.com; however, the raw numbers produced by my formula are more accurate and show less variance than that model’s adjusted projections. The 2015 arbitration projections on MLBTradeRumors featured an average error of $303,061 with a standard deviation of $334,102. My unadjusted projections yield an average error of $283,094 with a standard deviation of $255,174. Not to mention that my formula does not have any built-in constraints or adjustments, and adding them would likely increase its accuracy even more.
You can see a side-by-side comparison of the results here.
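For readers who want to run this kind of comparison on their own projections, here is a minimal sketch of the two summary numbers quoted above. It assumes that "average error" means the mean of the absolute dollar misses and that the standard deviation is the sample standard deviation of those same misses; the article does not spell out either definition, and the function name `error_summary` and the dollar figures below are invented.

```python
from statistics import mean, stdev


def error_summary(projected, actual):
    """Return (mean absolute error, sample std dev of the absolute errors)."""
    if len(projected) != len(actual):
        raise ValueError("projected and actual must have the same length")
    abs_errors = [abs(p - a) for p, a in zip(projected, actual)]
    return mean(abs_errors), stdev(abs_errors)


# Invented projections and settlements, in dollars, for illustration only.
projected = [2_400_000, 5_100_000, 975_000, 3_300_000]
actual = [2_650_000, 4_900_000, 1_000_000, 3_725_000]

avg_error, spread = error_summary(projected, actual)
print(f"average error: ${avg_error:,.0f}")
print(f"standard deviation: ${spread:,.0f}")
```

Running two sets of projections through the same function against the same actual salaries gives a like-for-like comparison of both accuracy and consistency.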
While these projections aren’t perfect, we can get a pretty good idea of what arbitration-eligible players will receive. Using these projections we should be able to not only predict a player’s salary for the upcoming season, but with good long-range statistical modeling, we can reasonably project a player’s subsequent arbitration salaries as well.
- How much will Matt Harvey earn before he reaches free agency? How many millions will TJS wind up costing him?
- Should Kris Bryant sign an extension this winter or should he try to reach free agency as early as possible? What should each side do? What about someone coming to arbitration for the first time like Nolan Arenado?
- How much money does a team stand to save by avoiding Super-2 or delaying free agency by a year? Should the type of hitter/pitcher influence the decision?
- Were the Reds or Todd Frazier better off by agreeing to a 2-year, $12-million deal this winter instead of going through arbitration twice? What about a defense-first player like Juan Lagares?
- How much money is a rebuilding team like the Phillies costing themselves over the next few years by using Ken Giles as a closer instead of as a “high-leverage reliever?” Should the Marlins not make Carter Capps their closer in 2016?
- Which teams do the best when it comes to arbitration? Which ones do the worst? (More on that next time). What about the agencies?
Using my formula, these are the questions we can begin to answer now.
J.D. 2015 from Rutgers School of Law
Runner Up - Tulane's 8th Annual Baseball Arbitration Competition
Adjunct Professor - Rutgers School of Law
Contact me here: rrieders@rutgers.edu
One year of data doesn’t seem like very much. It would be helpful to see a comparison from 2012 to 2015.
Before you can say that your system is the best, you have to explain to us if what you are testing against was part of your training set.
That is, did you best fit your formula against the data that is in-sample or out-of-sample?
My thoughts too. What kind of testing was performed on the model? But interesting read.
Thanks for the comment Tom.
My arbitration projections (for 2015) were based on a sample of 859 players (I may have missed a few here and there, but I believe it’s all of them) who went through the arbitration process from 2009-2014 and received 1-year contracts. The 182 players from 2015 were not part of the sample used to create the 2015 projections (for obvious reasons), but will be included for 2016 and beyond.
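To make the in-sample versus out-of-sample distinction concrete, the sketch below fits on earlier seasons only and then scores the held-out season. It is a simplified stand-in, not the actual projection formulas: it uses a one-variable least-squares line, and the names `fit_line` and `holdout_mae`, the `year`, `stat` and `salary` fields, and the rows are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_mae(rows, stat, holdout_year):
    """Fit on seasons before holdout_year, then score that season only."""
    train = [r for r in rows if r["year"] < holdout_year]
    test = [r for r in rows if r["year"] == holdout_year]
    if len(train) < 2 or not test:
        raise ValueError("need at least two training rows and one test row")
    slope, intercept = fit_line(
        [r[stat] for r in train], [r["salary"] for r in train]
    )
    errors = [abs(slope * r[stat] + intercept - r["salary"]) for r in test]
    return sum(errors) / len(errors)


# Invented rows: 'stat' is any counting stat, salary is in millions.
rows = [
    {"year": 2012, "stat": 1, "salary": 3.0},
    {"year": 2013, "stat": 2, "salary": 5.0},
    {"year": 2014, "stat": 3, "salary": 7.0},
    {"year": 2015, "stat": 5, "salary": 12.0},  # held out, never fitted
]

print(holdout_mae(rows, "stat", 2015))
```

Because the 2015 row never enters the fit, the error it produces is an out-of-sample error, which is the setup described above for the 2015 projections.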
Great, thanks.
One suggestion is to change your coefficients to two decimal places. Why is that? Because now that you are going to add 2015 data, and redo your work, you’re going to get all brand-new coefficients. Which means that now you’ll always only be able to test the next year.
But, if you keep things to two decimal places, or even one, you will have a more stable model, and you can then say that you can test the same formula each year.
Your problem is going to be that every year you get new data, you will get yet another new formula, so how can we trust it? Especially if you are going to get wildly different coefficients when you add the 2015 data.
Indeed, I’d like to see the whole formula with 2009-14 data and with 2009-15 data.
The coefficients are already rounded to two decimal places. As to the coefficients changing each year, I understand the concern, but is there another way? The nature of the arbitration system requires us to add data each year so the formula is going to have to change every time new data is added. If you want to see how my projections would look if we incorporated the 2015 data into them, see the link below:
https://docs.google.com/spreadsheets/d/1L6Dqwsi2eMWjZntnX4rZv9Ku0cK2z1j_Ge2r_fG8M1o/edit#gid=0.
As for the formula itself, my projections actually use 6 different formulas (Arb 1 SP, Arb 2+ SP, Arb 1 RP, Arb 2 RP, Arb 1 Hitters, Arb 2+ Hitters).
If you look at the following article (http://www.mlbtraderumors.com/2011/10/mlb-trade-rumors-arbitration-projections.html) you’ll gain some insight (unfortunately his underlying data and formula are not publicly available). From what I can gather, my projections are compiled in a very similar way, except I believe the stats I’ve chosen to incorporate work better and I do far less tinkering. I use the same stats for RP as I do for SP; they’re just weighted differently based on the correlations.
For testing purposes, the only thing we can really do is manually test each projection by going into the database, looking at the comps and seeing whether the projection “makes sense.” For example, there are a few arbitration-eligible players who are done for the season whom we can already project for 2016.
Josh Edgin of the Mets is a first-year eligible reliever. My projection has him at $577,855. Looking back at the database, we have Scott Elbert, a fellow lefty specialist who missed his 2013 platform season due to TJS and has nearly identical career statistics. Both are 3-3. Games pitched 120 to 115, ERA 3.20 to 3.61, 92 SO to 78 SO. fWAR, RA9-WAR and WPA are all extremely close as well. Elbert received $575,000 in 2014, so Edgin’s 2016 projection of $577,855 “makes sense.”
I know it’s not ideal, but it’s probably the best we can do, no?
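The Edgin/Elbert check above can be written down as a small helper, shown here only as a sketch. The dollar figures come from the comment above; the function names `pct_gap` and `makes_sense` and the 5% tolerance are assumptions chosen for illustration, not part of the projection system.

```python
def pct_gap(projection, comp_salary):
    """Signed gap between a projection and a comp's salary, in percent."""
    return (projection - comp_salary) / comp_salary * 100.0


def makes_sense(projection, comp_salary, tolerance_pct=5.0):
    """True when the projection lands within tolerance_pct of the comp."""
    # tolerance_pct is an arbitrary illustrative threshold, not a rule.
    return abs(pct_gap(projection, comp_salary)) <= tolerance_pct


edgin_projection = 577_855  # 2016 projection quoted above
elbert_salary = 575_000     # Elbert's 2014 salary quoted above

gap = pct_gap(edgin_projection, elbert_salary)
print(f"gap vs. comp: {gap:.2f}%")
print("makes sense:", makes_sense(edgin_projection, elbert_salary))
```

Here the projection sits about half a percent above the comp. A fuller version would loop over several comps per player rather than a single one.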