Alden Carrithers is twenty-nine years old, has zero major league at-bats, and has never hit more than three home runs in any minor league season.
Which is exactly why he’s interesting: he just might be relevant. Maybe not relevant to the all-star game, and maybe never relevant to major league baseball at all. That doesn’t mean he shouldn’t be, though. Carrithers is never going to play in the midsummer classic (that’s an assumption that can be made with a high degree of certainty), but there is a good case to be made that he belongs in Major League Baseball.
Carrithers has only one really plus tool, but that tool is pretty fantastic. This is Carrithers’ seventh season in the minor leagues, and in every single one of them he has finished with more walks than strikeouts. That likely won’t hold up in the major leagues, but it will continue to at least some extent. Reaching base often for free while rarely taking the automatic trip back to the dugout is the kind of thing that generates value, and it’s the kind of thing that Alden Carrithers does. This isn’t a novel concept. But it’s easy to forget, and hard to properly factor into the evaluation of a player who would otherwise be labeled Quad-A at best.
As much as it might seem otherwise, this really isn’t an article about Alden Carrithers. It could be about Mike O’Neill, who had a ridiculous 91 walks against only 37 strikeouts last season, or it could be about Jamie Johnson of the Tigers, who has had many more walks than strikeouts over his past two seasons. But those guys have flaws in their games that are pretty obvious, and they’re all further away from proving themselves ready for the show than Alden. This is really an article about something-for-nothing guys, and how available they are.
A naïve argument against an article pointing out the possible worth of a player of Alden Carrithers’ type is that these guys exist in bulk. That argument, however, is really an argument against the premise of WAR itself, since replacement level is very intentionally set to represent just that: replacement-level players. There isn’t an abundance of one-WAR guys hanging around. If there were, WAR would simply be wrong. There is, however, an abundance of zero-WAR guys hanging around, which is kind of the idea.
Back to Alden, and why he’s probably not just another one of the zero-WAR guys. The bane, of course, is his power. He’s never hit more than three dingers, as mentioned earlier, and his ISO has hovered around .066 without ever climbing higher than his rookie-ball .114. The literally outstanding plate discipline makes up for a lot of that, but in and of itself the plate discipline isn’t enough to make him a major league player.
His defense projects to be solid without being fantastic. There’s a lack of data on defense in the minor leagues, but the majority of scouting reports on him have been generally agreeable without being glowing. He can play second base, which gives him a little immediate boost in value, while also being able to pitch in at third base as well as left field. He’s only really projectable defensively at second base for any long-term stint, but that’s not a terrible thing. Second base is currently the third-weakest position in baseball from a hitting perspective; only catcher and shortstop are worse. And while catchers and shortstops are often able to add value simply by being capable of handling catcher and/or shortstop defensively, the same often isn’t true of second basemen. Factor that in, and second base is probably the weakest position in baseball overall. It’s really hard to be good enough to play major league baseball, but if one is inclined to do so, then second base is probably a good place to try.
His speed is of the same ilk as his defense, in the sense that it is solidly useful as well as unspectacular. He’s stolen bases at a 77% clip throughout his minor league career, but he’s only stolen 89 bases total, suggesting that while his baserunning isn’t prolific, it is solid. This is further backed up by his Bill James speed score, which has hovered just north of 5 throughout his career (suggesting just north of average). Oliver projects Carrithers to have a WAR/600 of 1.8, which is a reasonable projection, although it probably slightly over-rates his defense (6.3 runs above average) while under-rating his running (-1.6 runs). The wOBA projection is at .297, which is roughly in line with Steamer and ZiPS.
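For the curious, the stolen-base success component of the Bill James speed score can be sketched in a few lines. It’s just one of several components (the full Spd also folds in things like steal-attempt frequency and triples), but it shows how a raw rate gets mapped onto that 5-is-average scale. The 27 caught stealing below is inferred from the 77% success rate, not a number pulled from his actual record:

```python
def sb_pct_factor(sb, cs):
    """Stolen-base percentage component of the Bill James speed score.

    The +3/+7 terms regress small samples toward a ~43% baseline, and the
    result is scaled so that roughly average base-stealing lands near 5.
    """
    return ((sb + 3) / (sb + cs + 7) - 0.4) * 20

# Carrithers: 89 career steals at a 77% clip implies roughly 27 caught stealing.
print(round(sb_pct_factor(89, 27), 2))
```

By this one component alone he grades out well above 5; his overall score sits just north of average because Spd averages this with several less flattering components.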
All that said, if a starter goes down and Carrithers has to play for a quarter of the season or so, he will likely contribute about half a win. A second baseman will probably go down this season, and he will probably be replaced with a zero-WAR-level player. Ryan Goins is currently the starting second baseman for the Blue Jays, and he’s a zero-WAR player. Last year Dan Uggla became terrible and the Braves replaced him with Elliot Johnson, who ended the season as–wait for it–a zero-WAR player. Meanwhile Carrithers was available via a phone call, forty miles down the road at Triple-A Gwinnett.
This past winter, Carrithers was basically available for anyone who wanted him as a minor league free agent. The Oakland Athletics picked him up, and he’s currently at their Triple-A affiliate in Sacramento in the PCL. The A’s already have Callaspo, Sogard, and Punto playing second base at the major league level, all of whom are probably a little better than Carrithers. But Carrithers also isn’t that much worse than those guys, from a total value perspective. If any or all of those guys go down, the A’s won’t be hurt that much, and it cost them nothing. It’s this sort of thing that (along with a lot of other sorts of things) gets a team ahead in a game where everything can feel random. The A’s, after all, are still moneyballing. That hasn’t stopped yet.
Carrithers will probably never wear a uniform in a major league stadium. But if he does, it will probably be the kind of uniform-wearing that will help the team he’s playing for win baseball games. Something for nothing, is a pretty good way to win.
Whenever one draws a conclusion from anything, a bunch of underlying assumptions get shepherded into the high-level conclusion that comes out. Now that’s a didactic opening sentence, but it has a point–because statistics are full of underlying assumptions. Statistics are also, perhaps not coincidentally, full of high-level conclusions. These conclusions can be pretty wrong, though. By about five hundred runs each and every season, in this case.
Relative player value is likely the most important area of sports analysis, but it’s not always easy. For example, it’s pretty easy to get a decent idea of value in baseball, while it’s pretty hard to do the same for football. No one really knows the value of a pro-bowl linebacker compared to a pro-bowl left guard, for one. People have rough ideas, but these ideas are based more on tradition and ego than advanced analysis. Which is why football is still kind of in the dark ages, and baseball isn’t. But just because baseball is out of the dark ages doesn’t mean that it’s figured out. It doesn’t even mean that it’s close to figured out.
Because this question right here still exists: What’s the value of a starting pitcher compared to a relief pitcher? At first glance this is a question we have a pretty good grasp on. We have WAR, which isn’t perfect, yeah, but a lot of the imperfections get filtered out when talking about a position as a whole. You can just compare your average WAR for starters with your average WAR for relievers and get a decent answer. If you want to compare the top guys, then just take the top quartile and compare them, etc. Except, well, no, because underlying assumptions are nasty.
FanGraphs uses FIP-WAR as its primary value measure for pitchers, and it’s based on the basic theory that pitchers only really control walks, strikeouts, and home runs–and that everything else is largely randomness and isn’t easily measurable skill. RA9-WAR isn’t a good measure of individual player skill because a lot of it depends upon factors like defense and the randomness of where the ball ends up. This is correct, of course. But when comparing the relative value of entire positions against each other, RA9-WAR is the way to go. Because when you add up all the players on all of the teams and average them, factors like defense and batted balls get averaged together too. We get inherently perfect league-average defense and luck, and so RA9-WAR loses its bias. It becomes (almost) as exact as possible.
Is this really a big deal, though? If all of the confounding factors of RA9-WAR get averaged together, wouldn’t the confounding factors of FIP-WAR get averaged together too? What’s so bad about using FIP-WAR to judge value? Well, there’s this: From 1995 onward, starting pitchers have never outperformed their peripherals. Relievers? They’ve outperformed them each and every year. And it’s not like the opposite happened in 1994–I just had to pick some date to start my analysis. Here’s a table comparing FIP-WAR to RA9-WAR for starters over the last 18 years, followed by the same table for relievers.
Ok, so that’s a lot of numbers. The gist, though, is that FIP thinks that starters are better than they actually are, and that relievers are worse. And this is true year after year, by margins that rise well above negligible. Starters allow roughly 250 more runs than they should according to FIP every season, while relievers allow about 250 fewer than they should by FIP’s methodologies–in far fewer innings. In more reduced terms this means that starters are over-valued by about 10% as a whole, while relievers are consistently under-valued by about 25% according to FIP-WAR. Now, this isn’t a completely new idea. We’ve known that relievers tend to outperform their peripherals for a while, but the truth is this: relievers really outperform their peripherals, pretty much all the time, always.
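To make the over/under-performance concrete, here’s a rough sketch of how a pool’s runs-allowed gap versus FIP can be computed. The pool totals in the example are made up, and the sketch glosses over the earned-run/total-run distinction that RA9-WAR and FIP-WAR handle more carefully:

```python
def fip(hr, bb, hbp, k, ip, c=3.10):
    # Standard FIP: (13*HR + 3*(BB+HBP) - 2*K) / IP, plus a league constant
    # (c is set so league FIP matches league ERA; ~3.10 is a typical value).
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + c

def runs_vs_fip(runs, ip, hr, bb, hbp, k, c=3.10):
    """Actual runs allowed minus FIP-implied runs over the same innings.

    Positive means the pool allowed more runs than FIP implies, i.e. FIP
    flatters it -- which is what the tables show for starters every year.
    """
    return runs - fip(hr, bb, hbp, k, ip, c) * ip / 9

# One hypothetical pitching pool (made-up totals, not real league data):
print(round(runs_vs_fip(700, 1450, 150, 440, 50, 1000), 1))
```

Summing that quantity over every starter and every reliever in a season is what produces the ~250-run gaps in each direction.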
Relievers almost get to play a different game than starters. They don’t have to face lineups twice, they don’t have to throw their third- or fourth-best pitches, they don’t have to conserve any energy, etc. There are probably a lot more reasons that relievers have it easier than starters, too, and these reasons can’t be thrown out as randomness, because they pretty much always happen. Not necessarily on an individual-by-individual basis, but when trying to find the relative value between positions, the advantages of being a reliever are too big to be ignored.
How much better are relievers than starters at getting “lucky”? Well, two stats that have been widely considered luck stats (especially for pitchers) for a while are BABIP and LOB%. FIP assumes that starters and relievers are on even ground as far as these two numbers are concerned. But are they? Here are a few tables for comparison, using the same range of years as before.
With the exception of BABIP in ’96, relievers always had better luck than starters. Batters simply don’t get on base as often–upon contacting the ball fairly between two white lines–when they’re facing guys that didn’t throw out the first pitch of the game. And when batters do get on, they don’t get home as often. Relievers mean bad news, if good news means scoring more runs.
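For reference, both “luck” stats in those tables can be computed from ordinary counting stats. A quick sketch using the standard FanGraphs definitions, with a made-up pitcher line:

```python
def babip(h, hr, ab, k, sf):
    # Batting average on balls in play: strikeouts and home runs excluded,
    # since neither involves the defense.
    return (h - hr) / (ab - k - hr + sf)

def lob_pct(h, bb, hbp, r, hr):
    # LOB%: the share of baserunners a pitcher strands, with home runs
    # discounted because those runners were never strandable.
    return (h + bb + hbp - r) / (h + bb + hbp - 1.4 * hr)

# A made-up pitcher line, just to show the shape of the calculation:
print(round(babip(150, 20, 500, 100, 5), 3))
print(round(lob_pct(150, 50, 5, 70, 20), 3))
```

The tables above are just these two formulas applied to the aggregate starter and reliever pools, year by year.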
Which is why we have to be careful when we issue exemptions to the assumptions of our favorite tools. There are a lot of solid methodologies that go into the formulation of FIP, but FIP is handicapped by the forced assumption that everyone is the same at the things that they supposedly can’t control. Value is the big idea–the biggest idea, probably–and it’s entirely influenced by how one chooses to look at something. In this case it’s pitching, and what it means to be a guy that only pitches roughly one inning at a time. Or perhaps it’s about this: What it means to be a guy who looks at a guy that pitches roughly one inning at a time, and then decides the worth of the guy who pitches said innings, assuming that one wishes to win baseball games.
The A’s and Rays just spent a bunch of money on relievers, after all. And we’re pretty sure they’re not dumb, probably.
There seems to be quite a bit of disagreement in FanGraphs-land over what skills make for a good pinch-hitter. Some will argue that power is more important, while others might say that on-base skills are more important. And while I know that it’s fashionable for the author to take a stance at the start of his article, I’m not going to comply. I’m just going to unsexily dive face-first into Retrosheet.
How can we solve this problem? How do we know what skills are best for pinch-hitters? Well, we can examine the base-out states that pinch-hitters confront and then derive from those base-out states specific pinch-hitter linear weights. We will then compare pinch-hitter linear weights to league-average linear weights to see which skills retain value. Simple.
We’re also going to split the data by league, since pinch-hitting tendencies in the National League are likely going to be different than American League tendencies. I’m going to use the last five years of data, because whim. The table below, then, includes league-average linear weights followed by NL and AL pinch-hitter linear weights (aside: the run values of linear weights are from 1999-2002, per Tango’s work. This won’t make a real difference in the results, however, since we’re examining relative value of different base-out states and not overall run-value of different events).
In the National League we can see that the value of home runs has increased slightly while walks have seen a corresponding decrease. This is because pinch-hitters often come to the plate when there are more outs than average, which sensibly decreases the value of walks and increases the importance of hurrying up and sending everyone around the bases already. This note comes with a caveat, however: the differences in linear weights are pretty small. It seems that managers in the National League are often forced to use the pinch-hitter to replace the pitcher, and therefore pinch-hitters are used in a lot of sub-optimal spots.
The American League does not condone making everyone hit, however, and the impact upon pinch-hitting situations is pretty clear. The run value of home runs increases by .04 in pinch-hitting situations in the American League, compared to the paltry .01 increase in the National League. In fact, the run values of nearly all events increase — managers in the American League simply have more flexibility in when to use pinch-hitters, and so they are able to deploy their pinch-hitters in base/out situations that are strategically favorable.
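The derivation behind these tables can be sketched in miniature: an event’s linear weight in a given base-out state is the change in run expectancy plus any runs that score, and averaging those values over the states pinch-hitters actually face yields the pinch-hitter-specific weights. Here’s the home-run case, with made-up run-expectancy values and made-up state frequencies standing in for the real Retrosheet-derived numbers:

```python
# Hypothetical run-expectancy values, (runners_on, outs) -> expected runs.
# Real values come from play-by-play data (e.g. Tango's 1999-2002 matrix).
RE = {(0, 0): 0.55, (0, 1): 0.30, (0, 2): 0.11,
      (1, 0): 0.95, (1, 1): 0.57, (1, 2): 0.24}

def hr_value(runners, outs):
    # A home run scores everyone and leaves the bases empty, outs unchanged.
    return RE[(0, outs)] - RE[(runners, outs)] + runners + 1

def ph_hr_weight(state_freqs):
    """Linear weight of a HR averaged over a base-out state distribution."""
    total = sum(state_freqs.values())
    return sum(f * hr_value(r, o) for (r, o), f in state_freqs.items()) / total

# Pinch-hitters skew toward two-out situations (made-up frequencies):
ph_states = {(0, 0): 10, (0, 1): 20, (0, 2): 40, (1, 2): 30}
print(round(ph_hr_weight(ph_states), 3))
```

The same averaging, run over every event type with real run-expectancy values and real pinch-hitter state frequencies, is what produces the tables above.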
What does this all mean? Like everything, this simultaneously means quite a bit and not much at all. Home run value increases while walk value decreases during average pinch-hitter situations, but the change isn’t huge. If you’re a general manager looking for a bench bat and there’s a home-run guy available with a 90 wRC+ and a plate-discipline guy with a 95 wRC+, take the plate-discipline guy. What if they both have a 90 wRC+? Then take the home-run guy. The pinch-hitter linear weights here are more of a tie-breaker than a game-changer. Power is more important than walks when it comes to being a pinch-hitter, but being a good hitter is more important than power.
Roster construction is never that simple, though. Ideally a team will have both power and plate-discipline guys available on the bench and then the manager will be able to leverage both of their abilities based upon the base/out state (and also the score/inning situation, which is outside the scope of this article). Managers tend to be kind of strategic dunces, though, so I’m not sure if I see this happening. If I were in charge of anything I would supply my manager with a chart of base/out states that list the team’s best pinch-hitters in each situation. I’m not in charge, though, and even if I were I would probably be ignored.
I am in charge of this article, however, which means that I can bring it to a close. I’ll note that another valid way to do this study would be to create WPA-based weights rather than run-expectancy weights. There’s a lot more noise in WPA, but it could still create some interesting conclusions. I reckon the conclusion would be pretty much the same though — what makes a good pinch-hitter? Well, a good hitter makes for a good pinch-hitter. And a little power doesn’t hurt.
There are so many problems with ERA that it’s unbelievable. I’m not going to sit here and tell you what’s wrong with ERA, though, because you’re probably smart. But there’s a problem with ERA, and it’s a problem that transcends ERA. It’s a problem that trickles down through FIP, xFIP, SIERA, TIPS, etc. etc. name your favorite stat, etc., and it’s something I don’t see talked about much.
All of our advanced pitcher metrics are trying to predict or estimate ERA. They’re trying to figure out what a pitcher’s ERA should be, and herein lies the problem: they could be exactly right, and they’d still be a little incorrect, due to one little assumption.
This assumption–that pitchers have no control over whether or not the fielders behind them make errors–seems easy to make. Like most assumptions, however, this one is subtly incorrect. Thankfully, the reason is pretty simple. Ground balls are pretty hard to field without making an error, and fly balls aren’t. And the difficulty gap is pretty huge.
How big? Well, in 2013 there were precisely 58,388 ground balls, 1,344 of which resulted in errors. On the other hand, a mere 98 out of 39,328 fly balls resulted in errors. That means that 2.3% of ground balls resulted in errors while a tiny 0.25% of fly balls did. It’s time to stop pretending that this gap doesn’t exist, because it does.
So now that we know this, what does it mean? Well, it means this: ground-ball pitchers will have an ERA that suggests they are better than their actual value, while fly-ball pitchers will look worse than they actually are. Pitchers who allow contact, additionally, are worse off, because every time they allow contact they put pressure on their defense. They’re giving themselves a chance to stockpile unearned runs, which nobody will count against them if they’re only looking at ERA derivatives. When it comes to winning baseball games, however, earned runs don’t matter. Runs matter.
I am going to call this the “pressure on the defense” effect, and it will cause some pitchers to be more prone to unearned runs than others. How big is this effect? Well, not huge. The gap between the best pitcher and worst pitcher in the league is roughly three runs over the course of the season. But keep in mind that three runs is about a third of a win, and a third of a win is worth about $2 million. We’re not discussing mere minutiae here.
In order to better quantify this effect I have developed the xUR/180 metric, which estimates how many unearned runs should have taken place behind each pitcher with an average defense. Below is a table of all qualified starting pitchers from 2013, ranked according to this metric. I have also included how many unearned runs they actually allowed in 2013, scaled to 180 innings for comparative purposes.
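For readers who want to tinker, here’s one plausible sketch of how an xUR-style number could be built from the error rates above. The runs-per-error conversion is a placeholder assumption, not the actual xUR/180 formula:

```python
# 2013 league error rates, taken from the batted-ball counts above.
GB_ERR = 1344 / 58388
FB_ERR = 98 / 39328

def x_unearned_runs_per_180(gb, fb, ip, runs_per_error=0.6):
    """Expected unearned runs behind an average defense, per 180 innings.

    `runs_per_error` is a placeholder conversion from errors to unearned
    runs, not the value used by the actual xUR/180 metric.
    """
    expected_errors = gb * GB_ERR + fb * FB_ERR
    return expected_errors * runs_per_error * 180 / ip

# An extreme ground-baller vs. an extreme fly-baller over 180 innings,
# with made-up batted-ball counts:
print(round(x_unearned_runs_per_180(350, 150, 180), 2))
print(round(x_unearned_runs_per_180(150, 350, 180), 2))
```

The roughly 2.5-run gap between those two made-up profiles is in the same ballpark as the three-run best-to-worst spread mentioned above.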
I’m sure there is more to be gleaned, but the point is this: we need to stop trying to predict ERA, because ERA is not a pure value stat. We should be trying to figure out how many runs a pitcher should give up (or should have given up), because that’s what matters. Runs matter, and who cares if they’re unearned? They’re kind of the pitcher’s fault, anyways.
A recent article by ncarrington brought up an interesting point, and it’s one that merits further investigation. The thrust of the article is that even though two teams may have similar average on-base percentages, a lack of consistency within one team will cause it to under-perform its collective numbers when it comes to run production. A balanced team, on the other hand, will score more runs. That’s our hypothesis.
How does the scientific method work again? Er, nevermind, let’s just look at the data.
In order to gain an initial understanding, we’re going to start by looking at how teams fared in 2013. We’ll calculate a league-average runs/OBP number that will work as a proxy for how many runs a team should be expected to score based on its OBP. Then we’ll calculate the standard deviation of each team’s OBP (weighted by plate appearances) and compare that to the league-average standard deviation. If our hypothesis is true, teams with relatively low OBP deviations will outperform their expected runs-scored number.
Of course, there’s a lot more to team production than OBP. We’re going to conquer that later. Bear with me–here’s 2013.
A few things to keep in mind while dissecting this chart: 668.5 is the baseline number for Runs/(OBP/LeagueOBP). Any team number above this means that the team is outperforming, while any number below represents underperformance. The league-average team OBP standard deviation is .162.
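For anyone who wants to reproduce the chart, the two quantities can be computed like so (a sketch; the 668.5 baseline is the 2013 figure from above, and the function names are mine):

```python
import math

def weighted_obp_sd(obps, pas):
    """PA-weighted standard deviation of the individual hitters' OBPs."""
    total = sum(pas)
    mean = sum(o * pa for o, pa in zip(obps, pas)) / total
    var = sum(pa * (o - mean) ** 2 for o, pa in zip(obps, pas)) / total
    return math.sqrt(var)

def expected_runs(team_obp, league_obp, baseline=668.5):
    # baseline = league-average Runs / (OBP / LeagueOBP), 668.5 in 2013.
    return baseline * (team_obp / league_obp)

# A team ten points of OBP above a hypothetical .318 league:
print(round(expected_runs(0.328, 0.318), 1))
```

A team is "outperforming" when its actual runs exceed `expected_runs`, and it lands in the high- or low-variance bucket depending on whether its `weighted_obp_sd` is above or below the league figure.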
That chart’s kind of a bear, so I’m going to break it up into buckets. In 2013 there were 16 teams that exhibited above-average variances. Of those, 11 outperformed expectations while only 5 underperformed expectations. Now for the flipside–of the 14 teams that exhibited below-average variances, only 2 outperformed expectations while a shocking 12(!) teams underperformed.
That absolutely flies in the face of our hypothesis. A startling 23 out of 30 teams suggest that a high variance will actually help a team score more runs while a low variance will cause a team to score less.
Before we get all comfy with our conclusions, however, we’re going to acknowledge how complicated baseball is. It’s so complicated that we have to worry about this thing called sample size, since we have no idea what’s going on until we’ve seen a lot of things go on. So I’m going to open up the floodgates on this particular study, and we’re going to use every team’s season since 1920. League average OBP standard deviation and runs/OBP numbers will be calculated for each year, and we’ll use the aforementioned bucket approach to examine the results.
Small sample size strikes again. Will there ever be a sabermetric article that doesn’t talk about sample size? Maybe, but it probably won’t be written by me. Anyways, the point is that variance in team OBP has little to no effect on actual results when you up your sample size to 2000+ team-seasons. As a side note of some interest, I wondered if teams with high variances would tend to have bigger power numbers than their low-variance counterparts. High-variance teams have averaged an ISO of .132 since 1920. Low-variance teams? .131. So, uh, not really.
If you want to examine the ISO numbers a little more, here’s this: outperforming teams had an ISO of .144 while underperforming teams had an ISO of .120. These numbers remain the same for both high- and low-variance teams. It appears that overachieving/underachieving OBP expectations can be almost entirely explained by ISO.
I’m not satisfied with that answer, though. Was 2013 really just an aberration? What if we limit our samples to only teams that significantly outperformed or underperformed expectations (by 50 runs) while also having a significantly large or small team OBP standard deviation?
The numbers here do point a little bit more towards high variance leading to outperformance. High-variance teams are more likely to strongly outperform their expectations to the tune of about 20%, and the same is true for low-variance teams regarding underperforming. Bear in mind, however, that that is not a huge number, and that is not a huge sample size. If you’re trying to predict whether a team should outperform or underperform their collective means then variance is something to consider, but it isn’t the first place you should look.
Being balanced is nice. Being consistent is nice. It’s something we have a natural inclination towards as humans–it’s why we invented farming, civilization, the light bulb, etc. But when you’re building a baseball team, it’s not something that’s going to help you win games. You win games with good players.
It’s something we hear all the time: “He’s a scrappy player” or “He’s always trying hard out there, I love his scrappiness.” Maybe chicks don’t dig the long ball anymore; maybe they’re into scrappiness. I’m not really in a position to accurately comment on what chicks dig though, so I don’t know.
Even from a guy’s perspective, scrappiness is great. It’s hard to hate guys that overcome their slim frames by just out-efforting everyone else and getting to the big leagues. It’s not easy to quantify scrappiness, though. Through the years it’s always been a quality that you know when you see, but there’s never been a number to back it up. Until now.
Scrap is a metric built on a scale similar to Spd’s, where 5 is average, anything above that is above average, and anything below 5 is below average. Here are the components that make it up (each component is converted onto a Spd-like scale, assigned a weight, and then combined with all of the other components to give a final number).
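Mechanically, the combination step looks something like the sketch below. The components shown (infield-hit rate, HBP rate, sacrifice-bunt rate) and their weights are hypothetical stand-ins, not the actual Scrap recipe:

```python
def to_spd_scale(value, league_mean, league_sd):
    """Map a raw stat onto a Spd-like scale: 5 is average, roughly one
    point per standard deviation, clipped to the 0-10 range."""
    return max(0.0, min(10.0, 5 + (value - league_mean) / league_sd))

def scrap(components):
    """Combine (value, league_mean, league_sd, weight) tuples into one
    Scrap number. Components are assumed oriented so higher = scrappier;
    this list and these weights are illustrative, not the real formula."""
    total_w = sum(w for *_, w in components)
    return sum(w * to_spd_scale(v, m, sd)
               for v, m, sd, w in components) / total_w

# Made-up player: infield-hit rate, HBP rate, sac-bunt rate (all fake numbers).
print(scrap([(0.08, 0.06, 0.02, 0.4),
             (0.012, 0.009, 0.004, 0.3),
             (0.05, 0.03, 0.02, 0.3)]))
```

Whatever the actual component list is, the final number is just this kind of weighted average of rescaled components.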
Without further ado, here are the Scrap rankings of all qualified batters in 2013.
That’s quite a bit to look at. Here are a few of my takeaways:
This isn’t a stat that’s going to forever change how we view baseball. But this does give us a way of quantifying, however imperfectly, a skillset that we haven’t been able to before. Now we not only know that Jose Altuve is scrappy, we know just how scrappy he is. I’ll let you decide how important that is.
If you have any suggestions regarding different ways to calculate Scrap, let me know in the comments. It’s a metric that requires a good number of arbitrary decisions since, well, what does it even mean to be scrappy? We’ve always had an idea, and now we have a number.
The idea for this metric was spurred on by Dan Szymborski on this episode of the CACast podcast, somewhere around the 75-minute mark.
My article on weighting a hitter’s past results was supposed to be a one-off study, but after reading a recent article by Dave Cameron I decided to expand the study to cover starting pitchers. The relevant inspirational section of Dave’s article is copied below:
“The truth of nearly every pitcher’s performance lies somewhere in between his FIP-based WAR and his RA9-based WAR. The trick is that it’s not so easy to know exactly where on the spectrum that point lies, and it’s not the same point for every pitcher.”
Dave’s work is consistently great. This, however, is a rather hand-wavy explanation of things. Is there a way that we can figure out where pitchers have typically fallen on this scale in the past, so that we can make more educated guesses about what a pitcher’s true skill level is? We have the data–so we can try.
So, how much weight should be placed on ERA and FIP respectively? Like Dave said, the answer will be different in every case, but we can establish some solid starting points. Also since we’re trying to predict pitching results and not just historical value we’re going to factor in the very helpful xFIP and SIERA metrics.
Now for the methodology paragraph: In order to test this I’m going to use every pitcher season since 2002 (when FanGraphs starts recording xFIP/SIERA data) in which a pitcher threw at least 100 innings, and then weight all of the relevant metrics for that season in order to create an ERA prediction for the following season. I’ll then look at the difference between the following season’s predicted and actual ERA, and calculate the average miss. The smaller the average miss, the better the weights. Simple. As an added note, I have weighted each pitcher’s contribution to the average miss by his innings pitched in the predicted season, so that a pitcher who threw 160 innings in that season carries more weight than a pitcher who threw only 40.
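That loop can be sketched as follows: a blended prediction, an IP-weighted average miss, and a grid search over weight combinations. Here `rows` holds hypothetical (ERA, FIP, SIERA, next-season ERA, next-season IP) tuples for each year-pair; the real study runs this over every qualifying season since 2002:

```python
def avg_miss(rows, w_era, w_fip, w_siera):
    """IP-weighted mean absolute error of a weighted ERA prediction.

    rows: (era, fip, siera, next_era, next_ip) tuples, one per year-pair.
    """
    num = den = 0.0
    for era, fip, siera, next_era, next_ip in rows:
        pred = w_era * era + w_fip * fip + w_siera * siera
        num += next_ip * abs(pred - next_era)
        den += next_ip
    return num / den

def best_weights(rows, n=20):
    """Grid-search weight triples (summing to 1, in steps of 1/n) for the
    combination that minimizes the average miss."""
    grid = [i / n for i in range(n + 1)]
    candidates = [(a, b, 1 - a - b)
                  for a in grid for b in grid if a + b <= 1]
    return min(candidates, key=lambda w: avg_miss(rows, *w))
```

With real data, `best_weights` is what produces the optimal combinations reported in the tables that follow.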
How predictive are each of the relevant stats without weights? I am nothing without my tables, so here we go (There are going to be a lot of tables along the way to our answers. If you’re just interested in the final results, go ahead and skip on down towards the bottom).
This doesn’t really tell us anything we don’t already know: SIERA and xFIP are similar, and FIP is a better predictor than ERA. Let’s start applying some weights to see if we can increase accuracy, starting with ERA/SIERA combos.
We can already see that factoring in ERA just a slight amount improves our results substantially. When you’re predicting a pitcher’s future, therefore, you can’t just fully rely on xFIP or SIERA to be your fortune teller. You can’t lean on ERA too hard either, though, since once its weight gets much above 25% your projections begin to go awry. Ok, so we know how SIERA and ERA combine, but what if we use xFIP instead?
Using xFIP didn’t really improve our results at all. SIERA consistently outperforms xFIP (or is at worst only marginally beaten by it) throughout pretty much all weighting combinations, and so from this point forward we’re just going to use SIERA. Just know that SIERA is basically xFIP, and that there are only slight differences between them because SIERA makes some (intelligent) assumptions about pitching. Now that we’ve established that, let’s try throwing out ERA and use FIP instead.
It’s interesting that ERA/SIERA combos are more predictive than FIP/SIERA combos, even though FIP is more predictive in and of itself. This is likely due to the fact that a lot of pitchers have consistent team factors that show up in ERA but are cancelled out by FIP. We’ll explore that more later, but for now we’re going to try to see if we can use any ERA/FIP/SIERA combos that will give us better results.
There are three values here that are all pretty good. The important thing to note is that ERA/FIP/SIERA combos offer more consistently good results than any two stats alone. SIERA should be your main consideration, but ERA and FIP should not be discarded, since the combo improves the average miss by roughly .05 runs of ERA compared to SIERA alone. It’s a small difference, but it’s there.
Now I’m going to go back to something that I mentioned previously–should a player be evaluated differently if he isn’t coming back to the same team? The answer to this is a pretty obvious yes, since a pitcher’s defense/park/source of coffee in the morning will change. Let’s narrow down our sample to only pitchers that changed teams, to see if different numbers work better. These numbers will be useful when evaluating free agents, for example.
As suspected, ERA loses a lot of its usefulness when a player is switching teams, while FIP retains its marginal usefulness and SIERA carries more weight. Another thing to note is that it’s just straight-up harder to predict pitcher performance when a pitcher is changing teams, no matter what metric you use. SIERA itself drops in accuracy to a .793 average miss when only dealing with pitchers that change teams, a noticeable difference from the .760 value above for all pitchers.
For those of you who have made it this far, it’s time to join back up with those who skipped down toward the bottom. Here’s a handy little chart that shows the optimal weights found above for evaluating pitchers:
Of course, any reasonable projection should take more than just one year of data into account. The point of this article was not to show a complete projection system, but more to explore how much weight to give to each of the different metrics we have available to us when evaluating pitchers. Regardless, I’m going to expand the study a little bit to give us a better idea of weighting years by establishing weights over a two-year period. I’m not going to show my work here mostly out of an honest effort to spare you from having to dissect more tables, so here are the optimal two year weights:
As expected, using multiple years increases our accuracy (by roughly .15 ERA per pitcher). Also note that these numbers are for evaluating all pitchers, so if you’re dealing with a pitcher who is changing teams you should tweak ERA down while tweaking FIP and SIERA up. And, again, as Dave stated, each pitcher is a case study; each warrants his own more specific analysis. But be careful when you’re changing weights. Make sure you have a really solid reason for your tweaks, and make sure you’re not tweaking the numbers too much, because once you start thinking you’re significantly smarter than historical tendencies you can get into trouble. So these are your starting values; carefully tweak from here. Go forth, smart readers.
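The blending described above can be sketched in a few lines. The weights below are placeholders, not the article’s actual optimal values (those live in the charts), so treat them strictly as assumptions:

```python
# Sketch of a weighted next-year ERA estimate from ERA/FIP/SIERA.
# The default weights are invented for illustration -- the article's
# optimal weights come from its charts, not from this code.

def predict_era(era, fip, siera, weights=(0.2, 0.3, 0.5)):
    """Blend a pitcher's ERA, FIP, and SIERA into a next-year ERA estimate."""
    w_era, w_fip, w_siera = weights
    assert abs(w_era + w_fip + w_siera - 1.0) < 1e-9, "weights must sum to 1"
    return w_era * era + w_fip * fip + w_siera * siera

# For a pitcher changing teams, shift weight away from ERA and toward
# the defense-independent metrics:
print(predict_era(3.80, 3.50, 3.40))
print(predict_era(3.80, 3.50, 3.40, weights=(0.1, 0.3, 0.6)))
```

The only design choice here is keeping the weights as a parameter, so the same function covers both the all-pitchers case and the team-switch case.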
As a parting gift, here’s a list of the top 20 predictions for pitchers using the two-year model described above. Note that this will inherently exclude one-year pitchers such as Jose Fernandez and pitchers who failed to meet the 100 IP requirement as a starter in either of the past two years. Also note that these numbers do not include any aging curves (aging curves are well outside the scope of this article), which will obviously need to be factored into any finalized projection system.
It’s a pretty well documented sabermetric notion that pitching your closer when you have a three-run lead in the ninth is probably wasting him. You’re likely going to win the game anyway, since the vast majority of pretty much everyone allowed to throw baseballs in the major leagues is going to be able to keep the other team from scoring three runs.
But we still see it all the time. Teams keep holding on to their closer and waiting until they have a lead in the ninth to trot him out there. One of the reasons for this is that blowing a lead in the ninth is devastating—it’ll hurt team morale more to blow a lead in the ninth than to slip behind in the seventh. And then this decrease in morale will cause the players to play more poorly in the future, which will result in more losses.
Or will it?
We’re going to look at how teams play following games that they devastatingly lose to see if there’s any noticeable drop in performance. The “devastating blown save” stat can be defined as any game in which a team blows the lead in the ninth and then goes on to lose. Our methodology is going to look at team records in both the following game as well as the following three games to see if there’s any worsening of play. If the traditional thought is right (hey, it’s a possibility!), it will show up in the numbers. Let’s take a look.
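The bookkeeping behind this methodology can be sketched roughly as follows, assuming each team’s season is a chronologically ordered list of games. The field names and the sample season here are invented for illustration; the real inputs would come from Retrosheet game logs.

```python
# Sketch: win percentages in the game (and three games) following a
# "devastating blown save" -- a blown ninth-inning lead that became a loss.

def post_devastation_records(games):
    """Return (next-game W%, next-three-game W%) after devastating losses.

    Each game is a dict with 'won' (bool) and
    'blew_ninth_lead_and_lost' (bool); both field names are made up here.
    """
    next_one, next_three = [], []
    for i, g in enumerate(games):
        if not g['blew_ninth_lead_and_lost']:
            continue
        if i + 1 < len(games):
            next_one.append(games[i + 1]['won'])
        next_three.extend(x['won'] for x in games[i + 1:i + 4])

    def pct(xs):
        return sum(xs) / len(xs) if xs else None

    return pct(next_one), pct(next_three)

# Tiny invented season: one devastating loss in game 2.
season = [
    {'won': True,  'blew_ninth_lead_and_lost': False},
    {'won': False, 'blew_ninth_lead_and_lost': True},
    {'won': True,  'blew_ninth_lead_and_lost': False},
    {'won': False, 'blew_ninth_lead_and_lost': False},
    {'won': True,  'blew_ninth_lead_and_lost': False},
]
print(post_devastation_records(season))
```

Aggregating these tuples across every team-season from 2000–2012 yields the table below.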
[Table: All Games (2000-2012) — columns: 9+ Inning Games, Following Game W%, Three Game W%]
In the following game, the team win percentage was very, very close to 50%. Over a sample size of 1,333, that’s completely insignificant. But what about the following three games, where the win percentage drops to roughly 48.4%? Well, that’s a pretty small deviation from the 50% baseline, and it’s of questionable statistical significance. And wouldn’t it make sense that if the devastating blown save effect existed at all, it would occur in the directly following game, rather than waiting until later to manifest itself? It seems safe to say that the “morale drop” of devastatingly losing is likely nonexistent—or at most incredibly small. We’re dealing with grown men, after all. They can take it.
Another thing you might want to consider when looking at these numbers is that teams with lots of blown saves are probably more likely to be subpar. Not so fast. The win% of teams, weighted by their number of blown ninth innings over the years, is .505. This is probably because better teams are more likely to be ahead in the first place, so they are on the bubble to blow saves more often even if they blow them a smaller percentage of the time. Just for the fun of seeing how devastation-prone your team has been over the past 13 years, however, here’s a table of individual team results.
[Table: Devastating Blown Saves By Team (2000-2012) — columns: Devastating Blown Saves, Next Game W%; one row per team (Chicago White Sox, New York Yankees, etc.)]
Congratulations, Pittsburgh: you’ve been the least devastated full-time team over the past 13 years! Now if there’s a more fun argument against the effects of devastating losses than that previous sentence, I want to hear it. Meanwhile the Braves have lived up to their nickname, winning an outstanding 74.3% of games following devastating losses (it looks like we’ve finally found our algorithm for calculating grit, ladies and gentlemen), while the hapless Expos rebounded in just 20% of their games. Milwaukee leads the league in single-game heartbreak, etc. etc. Just read the table. These numbers are fun. Mostly meaningless, but fun.
Back to the point: team records following devastating losses tend to hover very, very close to .500. Managers shouldn’t worry about how their teams lose games—they should worry about whether their teams lose games. Because, in the end, that’s all that matters.
Raw data courtesy of Retrosheet.
We all know by now that we should look at more than one year of player data when we evaluate players. Looking at the past three years is the most common way to do this, and it makes sense why: three years is a reasonable time frame to try and increase your sample size while not reaching back so far that you’re evaluating an essentially different player.
The advice for looking at previous years of player data, however, usually comes with a caveat. “Weight them,” they’ll say. And then you’ll hear some semi-arbitrary numbers such as “20%, 30%, 50%”, or something in that range. Well, buckle up, because we’re about to get a little less arbitrary.
Some limitations: The point of this study isn’t to replace projection systems—we’re not trying to project declines/improvements here. We’re simply trying to understand how past data tends to translate into future data.
The methodology is pretty simple. We’re going to take three years of player data (I’m going to use wRC+ since it’s league-adjusted etc., and I’m only trying to measure offensive production), and then weight the years so that we can get an expected 4th year wRC+. We’re then going to compare our expected wRC+ against the actual wRC+*. The closer the expected to our actual, the better the weights.
*Note: I am using four-year spans of player data from 2008-2013, and limiting to players that had at least 400 PA in four consecutive years. This should help throw out outliers and to give more consistent results. Our initial sample size is 244, which is good enough to give meaningful results.
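The core of the comparison can be sketched like this. The player tuples below are invented data, not the actual 244-player sample; each tuple is four consecutive wRC+ seasons, oldest first.

```python
# Sketch of the study's evaluation loop: weight three seasons of wRC+,
# compare the expected fourth season against the actual one, and report
# the mean absolute miss. Sample data is invented for illustration.

def expected_wrc(seasons, weights):
    """Weighted projection of year-4 wRC+ from the three prior seasons."""
    assert len(seasons) == len(weights) == 3
    return sum(w * s for w, s in zip(weights, seasons))

def mean_abs_error(players, weights):
    """Average |expected - actual| wRC+ across the sample."""
    errors = [abs(expected_wrc(p[:3], weights) - p[3]) for p in players]
    return sum(errors) / len(errors)

# Invented (year1, year2, year3, actual year4) wRC+ tuples:
sample = [(110, 95, 105, 100), (120, 130, 125, 118), (90, 85, 100, 96)]

unweighted = mean_abs_error(sample, (1/3, 1/3, 1/3))   # the "dumb" case
recency    = mean_abs_error(sample, (0.2, 0.3, 0.5))   # recency-weighted
print(unweighted, recency)
```

Lower mean absolute error means better weights; the rest of the study is just running this comparison over many weight combinations and subsamples.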
I’ll start with the “dumb” case. Let’s just weight all of the years equally, so that each year counts for 33.3% of our expected outcome.
Expected vs. Actual wRC+, unweighted
Okay, so we’re missing the actual wRC+ by roughly 16.5 on average. Since wRC+ is indexed so that 100 is league average, that means roughly 16.5% inaccuracy when extrapolating the past into the future with no weights. Now let’s try being a little smarter about it and try some different weights out.
Expected vs. Actual wRC+, various weights
Huh! It seems that no matter what we do, “intelligently weighting” each year never actually increases our accuracy. If you’re just generally trying to extrapolate several past years of wRC+ data to predict a fourth year of wRC+, your best bet is to take a plain unweighted average of the past data. Now, the differences are small (for example, our weights of [.3, .3, .4] were only .03 different in accuracy than the unweighted total, which is statistically insignificant), but the point remains: weighting data from past years simply does not increase your accuracy. Pretty counter-intuitive.
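The kind of sweep summarized above can be reproduced with a brute-force grid search over weight triples. Again, the player data here is invented, and the 0.1 grid step is an arbitrary choice:

```python
# Brute-force search over weight triples (0.1 steps, summing to 1) for
# the combination that minimizes mean absolute wRC+ error.
from itertools import product

def best_weights(players, step=0.1):
    """Try every (w1, w2, w3) grid point summing to 1; return the best."""
    best, best_err = None, float('inf')
    steps = int(round(1 / step)) + 1
    for i, j in product(range(steps), repeat=2):
        w1, w2 = i * step, j * step
        w3 = 1 - w1 - w2
        if w3 < -1e-9:           # skip combinations that overshoot 1
            continue
        err = sum(
            abs(sum(w * s for w, s in zip((w1, w2, w3), p[:3])) - p[3])
            for p in players
        ) / len(players)
        if err < best_err:
            best, best_err = (w1, w2, w3), err
    return best, best_err

# Invented (year1, year2, year3, actual year4) wRC+ tuples:
players = [(110, 95, 105, 100), (120, 130, 125, 118), (90, 85, 100, 96)]
print(best_weights(players))
```

On a real sample, the study’s finding is that the winning grid point lands at (or statistically indistinguishable from) the equal-weights corner.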
Let’s dive a little deeper now—is there any situation in which weighting a player’s past does help? We’ll test this by limiting our ages. For example: are players that are younger than 30 better served by weighing their most previous years heavily? This would make sense, since younger players are most likely to experience a true-talent change. (Sample size: 106)
Expected vs. Actual wRC+, players younger than 30
Okay, so that didn’t work either. Even with young players, using unweighted totals is the best way to go. What about old players? Surely with aging players the recent years would best represent a player’s decline. Let’s find out (sample size: 63).
Expected vs. Actual wRC+, players older than 32
Hey, we found something! With aging players you should weight a player’s last two seasons equally, and you should not even worry about three seasons ago! Again, notice that the difference is small (you’ll be about 0.8% more correct by doing this than using unweighted totals). And as with any stat, you should always think about why you’re coming to the conclusion that you’re coming to. You might want to weight some players more aggressively than others, especially if they’re older.
In the end, it just really doesn’t matter that much. You should, however, generally use unweighted averages, since differences in wRC+ are pretty much always the result of random fluctuation and very rarely the result of actual talent change. That’s what the data shows. So next time you hear someone say “weight their past three years 3/4/5” (or similar), you can snicker a little. Because you know better.