A New Predictive Model for Determining Arbitration Salaries

My name is Rich Rieders and I am a 2015 graduate of Rutgers School of Law. Over the winter, I participated in Tulane University’s 9th Annual Baseball Arbitration Competition and we finished in 2nd place overall out of 40 teams. The arbitration cases used in the competition were Jenrry Mejia v. New York Mets, Lorenzo Cain v. Kansas City Royals, and Mark Trumbo v. Arizona Diamondbacks. My team represented the Royals, Mets and Mark Trumbo in those cases. It was a great experience and I learned a tremendous amount. Those of you who are in law school should absolutely participate. Being in New Orleans is an amazing bonus as well! You can read more about the competition from Tulane’s website and Jerry Crasnick’s ESPN article.

Instead of explaining how arbitration works, I highly recommend reading this article as it will give you an excellent basis for understanding the arbitration process. Just ignore the part about free agency since that’s been done away with now.

To prepare for the competition, I created a database (going back to 2008) of all arbitration awards and of players who signed one-year contracts to avoid arbitration, along with their respective statistics. (Note that multi-year contracts are not allowable as player comps for arbitration purposes.) Using regression analysis, I was able to determine which statistics correlate most with salary.

Here on FanGraphs we pride ourselves on the use of advanced metrics and the abandonment of traditional stats. That all goes out the window in the arbitration process. The arbitrators, jointly selected by the league and the union, have a background in labor law, not baseball. Those who are baseball fans probably aren’t avid FanGraphs readers, and their exposure is likely limited to W, L, ERA, H, HR, BB, SO, etc. Each side gets 30 minutes to present its case, plus another 15 minutes of rebuttal. You simply don’t have time to teach the panel sabermetrics and argue your case at the same time. And as I will discuss later, the use of predictive stats largely falls outside the scope of an arbitration hearing anyway. However, by using regression analysis we can pinpoint exactly which stats correlate most with eventual salary and which ones don’t.

These stats correlate the most with future salary:

  • SP: W (.6099), IP (.5401), SO (.5368), RA9-WAR (.5166), GS (.4598)
  • RP: SV (.7302), SD (.4980), SV% (.3237), SO (.2716), WPA (.2491)
  • Hitter: XBH (.7318), RBI (.7188), R (.6382), HR (.6031), PA (.5934)

These stats correlate among the least with future salary:

  • SP: ERA (.1018), FIP (.0592), xFIP (.0765), BB% (.0202), HR/FB (.0046)
  • RP: ERA (.0202), FIP (.0846), xFIP (.0962), BB% (.0218), LOB% (.0406)
  • Hitter: BB% (.0175), BABIP (.0346), Z-Contact% (.0113), UBR (.0035), Def (.0202)
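For readers who want to reproduce this kind of ranking, here is a minimal sketch. It assumes the values in parentheses are R-squared figures from fitting salary against one stat at a time, which for a single predictor equals the squared Pearson correlation. The function names and the sample rows are hypothetical and made up for illustration; they are not taken from my database.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation; equals R^2 for a one-variable linear fit."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


def rank_stats(table, target="salary"):
    """Return (stat, R^2) pairs sorted from strongest to weakest."""
    ys = table[target]
    scores = [
        (name, r_squared(column, ys))
        for name, column in table.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up illustrative rows, NOT real arbitration data (salary in $ millions).
players = {
    "W": [15, 8, 12, 5, 10],
    "ERA": [3.10, 3.60, 3.00, 3.50, 4.80],
    "salary": [6.0, 3.1, 5.0, 2.0, 4.2],
}

for stat, score in rank_stats(players):
    print(f"{stat}: {score:.4f}")
```

Ranking each stat on its own keeps the comparison easy to read, but it says nothing about how the stats overlap with each other, so it is a screening step rather than a full model.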

Now that’s not to say only the stats with the highest R-squared (RSQ) matter. Traditional rate stats like K/9 and ERA are still important. Try arguing to a casual fan that a pitcher with an ERA of 2.50 was not as productive as a pitcher with an ERA of 4.00 and see how that goes.

What we can take away from this is that:

  1. Traditional stats have a strong correlation; advanced metrics do not.
  2. Counting stats have a strong correlation; rate stats do not.
  3. Offense, particularly power, has a strong correlation; defense and baserunning do not.
  4. The more playing time you receive (PA, IP, G), the more money you are likely to make.

In essence, the overarching principle behind baseball arbitration is that salary depends almost wholly on the accumulation of traditional counting stats. Traditional rate stats are used to highlight the differences between comparable players, and in my formula they help guard against outliers.

Individual awards also matter a great deal. In my hearing, it was extremely difficult to argue against Lorenzo Cain after he had won the ALCS MVP and his breakout postseason was fresh in everyone’s mind. Those types of factors are extremely difficult to overcome. For a real-life example, I heard a story from one of our judges that the Giants were planning on going to arbitration with Tim Lincecum in 2010. Lincecum showed up with a Cy Young Award under each arm and within a few hours, a two-year contract was agreed upon.

Also keep in mind that for players going through arbitration for the first time, career numbers are considered as well. The correlations are fairly similar for career stats, with a slight improvement for career rate stats. For players going through the process for a second, third or fourth time, career statistics are pretty much ignored.

Before I introduce the model, I want to stress the importance of understanding the purpose of the baseball arbitration process. During the final round at Tulane, we represented the Kansas City Royals against Lorenzo Cain. One of our principal arguments was that Cain had an unsustainable .380 BABIP (the highest in MLB, mind you), which is why he batted .300, and that his BA (and the rest of his offensive numbers) would likely regress toward his career averages. The expected regression, along with his low walk rate, would limit his value to the club going forward. It is an argument most of us on FanGraphs would surely have made at the time, but Lorenzo Cain’s awesomeness is a topic for another day.

While this type of logic works perfectly well for free-agent signings or whether to acquire the player via trade, it does not work for arbitration purposes. The underlying purpose of the arbitration process is to compensate the player for his performance in the previous season, NOT to compensate him based on what we expect he will do the following season. This is absolutely critical. Hence, for arbitration purposes, the fact that a player was lucky, his performance was unsustainable or anything along the lines of “he won’t be as good as he was last season” is not permissible. This works the same for underachievers too as teams will get the benefit at arbitration when a player was “unlucky.”

Keeping all this in mind, I have determined which statistics (and other factors) matter the most when it comes to arbitration salaries and have created a formula that can accurately predict the salaries of future players by plugging in certain statistics. You may have seen similar work featured on MLBTradeRumors.com; however, the raw numbers produced by my formula are more accurate and contain less variance than their model’s adjusted projections. The 2015 arbitration projections on MLBTradeRumors featured an average error of $303,061 with a standard deviation of $334,102. My unadjusted projections yield an average error of $283,094 with a standard deviation of $255,174. Note also that my formula does not yet have any built-in constraints or adjustments, which would likely increase its accuracy even more.
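To make the accuracy comparison concrete, here is a minimal sketch of how figures like these could be computed. It assumes "average error" means the mean absolute difference between projected and actual salary, and that the standard deviation is the sample standard deviation of those absolute errors; both are assumptions about the definitions, and the function name and dollar amounts below are hypothetical.

```python
from statistics import mean, stdev


def error_summary(projected, actual):
    """Return (mean absolute error, sample std dev of the absolute errors)."""
    if len(projected) != len(actual) or len(projected) < 2:
        raise ValueError("need two equal-length lists with at least 2 salaries")
    abs_errors = [abs(p - a) for p, a in zip(projected, actual)]
    return mean(abs_errors), stdev(abs_errors)


# Made-up salaries in dollars, NOT real projections or awards.
projected = [1_000_000, 2_500_000, 4_000_000]
actual = [1_200_000, 2_500_000, 3_600_000]

avg_error, spread = error_summary(projected, actual)
print(f"average error: ${avg_error:,.0f}, standard deviation: ${spread:,.0f}")
```

Running the same function on two sets of projections for the same group of players gives a like-for-like comparison, provided neither model was fitted on the salaries being scored.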

You can see a side-by-side comparison of the results here.

While these projections aren’t perfect, we can get a pretty good idea of what arbitration-eligible players will receive. Using these projections we should be able to not only predict a player’s salary for the upcoming season, but with good long-range statistical modeling, we can reasonably project a player’s subsequent arbitration salaries as well.

  1. How much will Matt Harvey earn before he reaches free agency? How many millions will TJS wind up costing him?
  2. Should Kris Bryant sign an extension this winter or should he try to reach free agency as early as possible? What should each side do? What about someone coming to arbitration for the first time like Nolan Arenado?
  3. How much money does a team stand to save by avoiding Super-2 or delaying free agency by a year? Should the type of hitter/pitcher influence the decision?
  4. Were the Reds or Todd Frazier better off by agreeing to a 2-year, $12-million deal this winter instead of going through arbitration twice? What about a defense-first player like Juan Lagares?
  5. How much money is a rebuilding team like the Phillies costing themselves over the next few years by using Ken Giles as a closer instead of as a “high-leverage reliever?” Should the Marlins not make Carter Capps their closer in 2016?
  6. Which teams do the best when it comes to arbitration? Which ones do the worst? (More on that next time). What about the agencies?

Using my formula, these are the questions we can begin to answer now.





J.D. 2015 from Rutgers School of Law
Runner Up - Tulane's 8th Annual Baseball Arbitration Competition
Adjunct Professor - Rutgers School of Law
Contact me here: rrieders@rutgers.edu

6 Comments
Matt P
8 years ago

One year of data doesn’t seem like very much. It would be helpful to see a comparison from 2012 to 2015.

Tangotiger
8 years ago

Before you can say that your system is the best, you have to explain to us if what you are testing against was part of your training set.

That is, did you best fit your formula against the data that is in-sample or out-of-sample?

My Thoughts Also
8 years ago
Reply to  Tangotiger

My thoughts too. What kind of testing was performed on the model? But interesting read.

tangotiger
8 years ago
Reply to  Rich Rieders

Great, thanks.

One suggestion is to change your coefficients to two decimal places. Why is that? Because now that you are going to add 2015 data, and redo your work, you’re going to get all brand-new coefficients. Which means that now you’ll always only be able to test the next year.

But, if you keep things to two decimal places, or even one, you will have a more stable model, and you can then say that you can test the same formula each year.

That’s going to be your problem: every year you get new data, you will get yet another new formula, so how can we trust it? Especially if you are going to get wildly different coefficients when you add the 2015 data.

Indeed, I’d like to see the whole formula with 2009-14 data and with 2009-15 data.