Big Data and Baseball Efficiency: the Traveling Salesman had Nothing on a Baseball Scout

The MLB draft is coming up and with any luck I’ll get this posted by Thursday and take advantage of web traffic. I can hope! (ed. note: nope) Anyway, Tuesday on FanGraphs I read a fascinating portrayal of the draft process, laying out the nuts and bolts of how organizations scout for the draft. The piece, written by Tony Blengino (whose essays are rapidly becoming one of my favorite parts of this overall terrific baseball site), describes all the behind the scenes work that happens to prepare a major league organization for the Rule 4 draft. Blengino described the dedication scouts show in following up on all kinds of prospects at the college and high school levels, what they do, how much they need to travel, and especially how much ground they often need to cover to try and lay eyes on every kid in their area.

One neat insight for me was Blengino’s one-word description of most scouts as entrepreneurs. You could think of them almost as founders of a startup, with the kids they scout as the product the scouts are trying to sell to upper layers of management in the organization. As such, everything they can do to get a better handle on a kid’s potential can feed into the pitch to the scouting director.

I respect and envy scouts’ drive to keep looking for the next big thing, the next Jason Heyward or Mike Trout. As Blengino puts it, scouts play “one of the most vital, underrated, and underpaid roles in the game.” One might argue that in MLB, unlike the NFL or NBA, draft picks are typically years away from making a contribution, and so ask how important they can really be. Yet numerous studies have shown that the draft presents an incredible opportunity for teams to build and sustain success. In fact, given that so much of an organization’s success hinges on figuring out which raw kids will be able to translate tools and potential into talent, one could argue (and others have) that scouting is a huge potential market inefficiency for teams to exploit, although I’ll have a caveat later. In any case, every team wants to optimize the quality of the talent coming into its minor league system because, as we say in genomic data analysis, “garbage in, garbage out.”

As I was reading this piece, I started thinking about ways to try and create more efficiencies. And I started thinking about Big Data.


Foundations of Batting Analysis – Part 3: Run Creation

I’ve decided to break this final section in half and address the early development of run estimation statistics first, and then examine new ways to make these estimations next week. In Part 1, we examined the early development of batting statistics. In Part 2, we broke down the weaknesses of these statistics and introduced new averages based on “real and indisputable facts.” In Part 3, we will examine methods used to estimate the value of batting events in terms of their fundamental purpose: run creation.

The two main objectives of batters are to not cause an out and to advance as many bases as possible. These objectives exist as a way for batters to accomplish the most fundamental purpose of all players on offense: to create runs. The basic effective averages presented in Part 2 provide a simple way to observe the rate at which batters succeed at their main objectives, but they do not inform us on how those successes lead to the creation of runs. To gather this information, we’ll apply a method of estimating the run values of events that can trace its roots back nearly a century.

The earliest attempt to estimate the run value of batting events came in the March 1916 issue of Baseball Magazine. F.C. Lane, editor of the magazine, discussed the weakness of batting average as a measure of batting effectiveness in an article titled “Why the System of Batting Averages Should be Changed”:

“The system of keeping batting averages…gives the comparative number of times a player makes a hit without paying any attention to the importance of that hit. Home runs and scratch singles are all bulged together on the same footing, when everybody knows that one is vastly more important than the other.”

To address this issue, Lane considered the fundamental purpose of making hits.

“Hits are not made as mere spectacular displays of batting ability; they are made for a purpose, namely, to assist in the all-important labor of scoring runs. Their entire value lies in their value as run producers.”

In order to measure the “comparative ability” of batters, Lane suggests a general rule for evaluating hits:

“It would be grossly inaccurate to claim that a hit should be rated in value solely upon its direct and immediate effect in producing runs. The only rule to be applied is the average value of a hit in terms of runs produced under average conditions throughout a season.”

He then proposed a method to estimate the value of each type of hit based on the number of bases that the batter and all baserunners advanced on average during each type of hit. Lane’s premise was that each base was worth one-fourth of a run, as it takes the advancement through four bases for a player to secure a run. By accounting for all of the bases advanced by a batter and the baserunners due to a hit, he could determine the number of runs that the hit created. However, as the data necessary to actually implement this method did not exist in March 1916, the work done in this article was little more than a back-of-the-envelope calculation built on assumptions concerning how often baserunners were on base during hits and how far they tended to advance because of those hits.

As he wanted to conduct a rigorous analysis with this method, Lane spent the summer of 1916 compiling data on 1,000 hits from “a little over sixty-two games”[i] to aid him in this work. During these games, he would note “how far the man making the hit advanced, whether or not he scored, and also how far he advanced other runners, if any, who were occupying the bases at the time.” Additionally, in any instance when a batter who had made a hit was removed from the base paths due to a subsequent fielder’s choice, he would note how far the replacement baserunner advanced.

Lane presented this data in the January 1917 issue of Baseball Magazine in an article titled similarly to his earlier work: “Why the System of Batting Averages Should be Reformed.” Using the collected data, Lane developed two methods for estimating the run value that each type of hit provided for a team on average. The first method, the one he initially presented in March 1916, which I’ll call the “advancement” method,[ii] counted the total number of bases that the batter and the baserunners advanced during a hit, and any bases that were advanced to by batters on a fielder’s choice following a hit (an addition not included in the first article). For example, of the 1,000 hits Lane observed, 789 were singles. Those singles resulted in the batter advancing 789 bases, runners on base at the time of the singles advancing 603 bases, and batters on fielder’s choice plays following the singles advancing to 154 bases – a total of 1,546 bases. With each base estimated as being worth one-fourth of a run, these 1,546 bases yielded 386.5 runs – an average value of .490 runs per single. Lane repeated this process for doubles (.772 runs), triples (1.150 runs), and home runs (1.258 runs).

This was the method Lane first developed in his March 1916 article, but at some point during his research he decided that a second method, which I’ll call the “instrumentality” method, was preferable.[iii] In this method, Lane considered the number of runs that were scored because of each hit (RBI), the runs scored by the batters that made each hit, and the runs scored by baserunners that reached on a fielder’s choice following a hit. For instance, of the 789 singles that Lane observed, there were 163 runs batted in, 182 runs scored by the batters that hit the singles, and 16 runs scored by runners that reached on a fielder’s choice following a single. The 361 runs “created” by the 789 singles yielded an average value of .457 runs per single. This method was repeated for doubles (.786 runs), triples (1.150 runs), and home runs (1.551 runs).
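The arithmetic behind both of Lane's methods can be sketched in a few lines of Python, using only the singles counts quoted above. The function and parameter names are my own labels, not Lane's. Note that 361 / 789 comes to roughly .4575, which Lane reports as .457.

```python
# Counts for singles from Lane's 1,000-hit sample, as quoted in the text.
SINGLES = 789


def advancement_value(batter_bases, runner_bases, fc_bases, hits, runs_per_base=0.25):
    """Average runs per hit when each base advanced is worth a quarter of a run."""
    total_bases = batter_bases + runner_bases + fc_bases
    return total_bases * runs_per_base / hits


def instrumentality_value(rbi, batter_runs, fc_runs, hits):
    """Average runs per hit, counting only runs that actually scored."""
    return (rbi + batter_runs + fc_runs) / hits


adv_single = advancement_value(789, 603, 154, SINGLES)   # 1,546 bases -> 386.5 runs
ins_single = instrumentality_value(163, 182, 16, SINGLES)  # 361 runs

print(f"advancement:     {adv_single:.4f} runs per single")
print(f"instrumentality: {ins_single:.4f} runs per single")
```

The same two functions would cover doubles, triples, and home runs, but the article does not give the underlying counts for those hit types, so they are not reproduced here.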

In March 1917, Lane went one step further. In an article titled “The Base on Balls,” Lane decried the treatment of walks by the official statisticians and aimed to estimate their value. In 1887, the National League had counted walks as hits in an effort to reward batters for safely reaching base, but the sudden rise in batting averages was so off-putting that the method was quickly abandoned following the season. As Lane put it:

“…the same potent intellects who had been responsible for this wild orgy of batting reversed their august decision and declared that a base on balls was of no account, generally worthless and henceforth even forever should not redound to the credit of the batter who was responsible for such free transportation to first base.

The magnates of that far distant date evidently had never heard of such a thing as a happy medium…‘Whole hog or none’ was the noble slogan of the magnates of ’87. Having tried the ‘whole’ they decreed the ‘none’ and ‘none’ it has been ever since…

‘The easiest way’ might be adopted as a motto in baseball. It was simpler to say a base on balls was valueless than to find out what its value was.”

Lane attempted to correct this disservice by applying his instrumentality method to walks. Over the same sample of 63 games in which he collected information on the 1,000 hits, he observed 283 walks. Those walks yielded six runs batted in, 64 runs scored by the batter, and two runs scored by runners that replaced the initial batter due to a fielder’s choice. Through this method, Lane calculated the average value of a walk as .254 runs.[iv]

Each method Lane used was certainly affected by his limited sample of data. The proportions of each type of hit that he observed were similar to the annual rates in 1916, but the examination of only 1,000 hits made it easy for randomness to affect the calculation, particularly for the low-frequency events. Had five fewer runners been on first base at the time of the 29 home runs observed by Lane, the average value of a home run would have dropped from 1.258 runs to 1.129 runs using the advancement method and from 1.551 runs to 1.379 runs using the instrumentality method. It’s hard to trust values that are so easily affected by a slight change in circumstances.
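To see where those adjusted home run values come from, here is a rough sketch. The totals of 146 bases and 45 runs are not stated in the article; they are my back-calculation from the reported averages over 29 home runs, so treat them as assumptions. A runner on first advances three bases and scores one run on a home run, so removing five such runners removes 15 bases and 5 runs.

```python
HOME_RUNS = 29
TOTAL_BASES = 146  # assumption: inferred from 1.258 runs * 4 bases per run * 29 home runs
TOTAL_RUNS = 45    # assumption: inferred from 1.551 runs * 29 home runs
FEWER_RUNNERS_ON_FIRST = 5


def advancement(total_bases, hits):
    """Advancement method: a quarter of a run per base advanced."""
    return total_bases / 4 / hits


def instrumentality(total_runs, hits):
    """Instrumentality method: runs that actually scored, per hit."""
    return total_runs / hits


adv_before = advancement(TOTAL_BASES, HOME_RUNS)
adv_after = advancement(TOTAL_BASES - 3 * FEWER_RUNNERS_ON_FIRST, HOME_RUNS)
ins_before = instrumentality(TOTAL_RUNS, HOME_RUNS)
ins_after = instrumentality(TOTAL_RUNS - FEWER_RUNNERS_ON_FIRST, HOME_RUNS)

print(f"advancement:     {adv_before:.4f} -> {adv_after:.4f}")
print(f"instrumentality: {ins_before:.4f} -> {ins_after:.4f}")
```

Under these inferred totals, the "before" figures match Lane's reported values only if he truncated rather than rounded the third decimal.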

Lane was well aware of these limitations, but treated the work more as an exercise to prove the merit of his rationale, rather than an official calculation of the run values. In an article in the February 1917 issue of Baseball Magazine titled, “A Brand New System of Batting Averages,” he notes:

“Our sample home runs, which numbered but 29, were of course less accurate. But we did not even suggest that the values which were derived from the 1,000 hits should be incorporated as they stand in the batting averages. Our labors were undertaken merely to show what might be done by keeping a sufficiently comprehensive record of the various hits…our data on home runs, though less complete than we could wish, probably wouldn’t vary a great deal from the general averages.”

In the same article, Lane applied the values calculated with the instrumentality method to the batting statistics of players from the 1916 season, creating a statistic he called Batting Effectiveness, which measured the number of runs per at-bat that a player created through hits. The leaderboard he included is the first example of batters being ranked with a run average since runs per game in the 1870s.

Lane didn’t have a wide audience ready to appreciate a run estimation of this kind, and it gained little notoriety going forward. In his March 1916 article, Lane referenced an exchange he had with the Secretary of the National League, John Heydler, concerning how batting average treats all hits equally. Heydler responded:

“…the system of giving as much credit to singles as to home runs is inaccurate…But it has never seemed practicable to use any other system. How, for instance, are you going to give the comparative values of home runs and singles?”

Seven years later, by which point Heydler had become President of the National League, the method to address this issue was chosen. In 1923, the National League adopted the slugging average—total bases on hits per at-bat—as its second official average.

While Lane’s work on run estimation faded away, another method to estimate the run value of individual batting events was introduced nearly five decades later in the July/August 1963 issue of Operations Research. A Canadian military strategist, with a passion for baseball, named George R. Lindsey wrote an article for the journal titled, “An Investigation of Strategies in Baseball.” In this article, Lindsey proposed a novel approach to measure the value of any event in baseball, including batting events.

The construction of Lindsey’s method began by observing all or parts of 373 games from 1959 through 1960 by radio, television, or personal attendance, compiling 6,399 half-innings of play-by-play data. With this information, he calculated P(r|T,B), “the probability that, between the time that a batter comes to the plate with T men out and the bases in state B,[v] and the end of the half-inning, the team will score exactly r runs.” For example, P(0|0,0), that is, the probability of exactly zero runs being scored from the time a batter comes to the plate with zero outs and the bases empty through the end of the half-inning, was found to be 74.7 percent; P(1|0,0) was 13.6 percent, P(2|0,0) was 6.8 percent, etc.

Lindsey used these probabilities to calculate the average number of runs a team could expect to score following the start of a plate appearance in each of the 24 out/base states: E(T,B).[vi] The table that Lindsey produced including these expected run averages reflects the earliest example of what we now call a run expectancy matrix.

With this tool in hand, Lindsey began tackling assorted questions in his paper, culminating with a section on “A Measure of Batting Effectiveness.” He suggested an approach to assessing batting effectiveness based on three assumptions:

“(a) that the ultimate purpose of the batter is to cause runs to be scored

(b) that the measure of the batting effectiveness of an individual should not depend on the situations that faced him when he came to the plate (since they were not brought about by his own actions), and

(c) that the probability of the batter making different kinds of hits is independent of the situation on the bases.”

Lindsey focused his measurement of batting effectiveness on hits. To estimate the run values of each type of hit, Lindsey observed that “a hit which converts situation {T,B} into {T',B'} increases the expected number of runs by E(T',B') – E(T,B).” For example, a single hit in out/base state {0,0} will yield out/base state {0,1}. If you consult the table that I linked above, you’ll note that this creates a change in run expectancy, as calculated by Lindsey, of .352 runs (.813 – .461). By repeating this process for each of the 24 out/base states, and weighting the values based on the relative frequency in which each out/base state occurred, the average value of a single was found to be 0.41 runs.[vii] This was repeated for doubles (0.82 runs), triples (1.06 runs), and home runs (1.42 runs). By applying these weights to a player’s seasonal statistics, Lindsey created a measurement of batting effectiveness in terms of “equivalent runs” per time at bat.
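As a minimal sketch of this weighting step, the Python below computes the change in run expectancy for one transition and then a frequency-weighted average over several. Only the {0,0} to {0,1} values (.461 and .813) come from Lindsey's table as quoted above; the second transition, its frequency, and the `runs_scored` term are hypothetical. Adding runs that score on the play itself is my assumption about how the quoted formula is applied once runners are on base.

```python
def run_value(re_before, re_after, runs_scored=0.0):
    """Change in expected runs caused by one event, plus runs scored on the play."""
    return re_after - re_before + runs_scored


def average_run_value(transitions):
    """Frequency-weighted mean run value.

    transitions: list of (frequency, re_before, re_after, runs_scored) tuples.
    """
    total_frequency = sum(t[0] for t in transitions)
    weighted = sum(f * run_value(b, a, r) for f, b, a, r in transitions)
    return weighted / total_frequency


# The one transition spelled out in the text: a single with none out, bases empty.
single_bases_empty = run_value(0.461, 0.813)

# Two-state toy example; the second row is invented purely for illustration.
toy_transitions = [
    (0.6, 0.461, 0.813, 0.0),  # {0,0} -> {0,1}, values from Lindsey's table
    (0.4, 0.300, 0.550, 0.5),  # hypothetical state with half a run scoring on average
]
toy_average = average_run_value(toy_transitions)

print(f"single, bases empty, none out: {single_bases_empty:.3f}")
print(f"toy weighted average:          {toy_average:.4f}")
```

A full reproduction would need all 24 entries of Lindsey's table and the observed frequency of each out/base state, neither of which is listed in this article.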

As with Lane’s methods, the work done by Lindsey was not widely appreciated at first. However, 21 years after his article was published in Operations Research, his system was repurposed and presented in The Hidden Game of Baseball by John Thorn and Pete Palmer, the man who helped make on base average an official statistic just a few years earlier. Using play-by-play accounts of 34 World Series games from 1956 through 1960,[viii] and simulations of games based on data from 1901 through 1977, Palmer rebuilt the run expectancy matrix that Lindsey introduced two decades earlier.

In addition to measuring the average value of singles (.46 runs), doubles (.80 runs), triples (1.02 runs), and home runs (1.40 runs) as Lindsey had done, Palmer also measured the value of walks and times hit by the pitcher (0.33 runs), as well as at-bats that ended with a batting “failure,” i.e. outs and reaches on an error (-0.25 runs). While I’ve already addressed issues with counting times reached on an error as a failure in Part 2, the principle of acknowledging the value produced when the batter failed was an important step forward from Lindsey’s work, and Lane’s before him. When an out occurs in a batter’s plate appearance, the batting team’s expected run total for the remainder of the half-inning decreases. When the batter fails to reach base safely, he not only doesn’t produce runs for his team, he takes away potential run production that was expected to occur. In this way, we can say that the batter created negative value—a decrease in expected runs—for the batting team.

Palmer applied these weights to a player’s seasonal totals, as Lindsey had done, and formed a statistic called Batter Runs reflecting the number of runs above average that a player produced in a season. Palmer’s work came during a significant period for the advancement of baseball statistics. Bill James had gained a wide audience with his annual Baseball Abstract by the early-1980s and The Hidden Game of Baseball was published in the midst of this new appreciation for complex analysis of baseball systems. While Lindsey and Lane’s work had been cast aside, there was finally an audience ready to acknowledge the value of run estimation.

Perhaps the most important effect of this new era of baseball analysis was the massive collection of data that began to occur in the background. Beginning in the 1980s, play-by-play accounts were being constructed to cover entire seasons of games. Lane had tracked 1,000 hits, Lindsey had observed 6,399 half-innings, and Palmer had used just 34 games (along with computer simulations) to estimate the run values of batting events. By the 2000s, play-by-play accounts of tens of thousands of games were publicly available online.

Gone were the days of estimations weakened by small sample sizes. With complete play-by-play data available for every game over a given time period, the construction of a run expectancy matrix was effectively no longer an estimation. Rather, it could now reflect, over that period of games, the average number of runs that scored between a given out/base state and the end of the half-inning, with near absolute accuracy.[ix] Similarly, assumptions about how baserunners moved around the bases during batting events were no longer necessary. Information concerning the specific effects on the out/base state caused by every event in every baseball game over many seasons could be found with relative ease.

In 2007, Tom M. Tango,[x] Mitchel G. Lichtman, and Andrew E. Dolphin took advantage of this glut of information and reconstructed Lindsey’s “linear weights” method (as named by Palmer) in The Book: Playing the Percentages in Baseball. Tango et al. used data from every game from 1999 through 2002 to build an updated run expectancy matrix. Using it, along with the play-by-play data from the same period, they calculated the average value of a variety of events, most notably eight batting events: singles (.475 runs), doubles (.776 runs), triples (1.070 runs), home runs (1.397 runs), non-intentional walks (.323 runs), times hit by the pitcher (.352 runs), times reached on an error (.508 runs), and outs (-.299 runs). These events were isolated to form an estimate of a player’s general batting effectiveness called weighted On Base Average (wOBA).
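To make "applying the weights" concrete, here is a small sketch that multiplies the eight run values quoted above by a batter's event counts. The season line is hypothetical, and the result is the raw linear-weights run total only; it is not the published wOBA formula, which rescales these values and is not described in this article.

```python
# Run values for the eight batting events, as quoted from The Book in the text.
BOOK_RUN_VALUES = {
    "1B": 0.475,
    "2B": 0.776,
    "3B": 1.070,
    "HR": 1.397,
    "NIBB": 0.323,
    "HBP": 0.352,
    "ROE": 0.508,
    "OUT": -0.299,
}


def runs_from_events(event_counts, run_values=BOOK_RUN_VALUES):
    """Sum of (run value * count) over every event in the batter's line."""
    return sum(run_values[event] * count for event, count in event_counts.items())


# Hypothetical season totals, invented for illustration only.
season = {"1B": 100, "2B": 30, "3B": 3, "HR": 20, "NIBB": 50, "HBP": 5, "ROE": 5, "OUT": 400}

total_runs = runs_from_events(season)
plate_appearances = sum(season.values())

print(f"runs relative to average: {total_runs:.2f}")
print(f"per plate appearance:     {total_runs / plate_appearances:.4f}")
```

Because the out value is negative, a batter's total lands near zero when his production is close to the league average of the period used to build the weights.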

Across 90 years, here were five different attempts to estimate the number of runs that batters created, with varying amounts of data, using varying methods of analysis, in varying run scoring environments, and yet the estimations all end up looking quite similar.

Method / Event        Advancement  Instrumentality  Equivalent Runs  Batter Runs  wOBA
Single                .490         .457             .41              .46          .475
Double                .772         .786             .82              .80          .776
Triple                1.150        1.150            1.06             1.02         1.070
Home Run              1.258        1.551            1.42             1.40         1.397
Non-Intentional Walk  --           .254             --               .33          .323
Intentional Walk      --           .254             --               .33          .179
Hit by Pitch          --           --               --               .33          .352
Reach on Error        --           --               --               -.25         .508
Out                   --           --               --               -.25         -.299

Beyond the general goal of measuring the run value of certain batting events, each of these methods had another thing in common: each method was designed to measure the effectiveness of batters. Lane and Lindsey focused exclusively on hits,  the traditional measures of batting effectiveness.[xi] Palmer added in the “on base” statistics of walks and times hit by the pitcher, while also accounting for the value of those times the batter showed ineffectiveness. Tango et al. threw away intentional walks as irrelevant events when it came to testing a batter’s skill, while crediting the positive value created by batters when reaching on an error.

The same inconsistencies present in the traditional averages for deciding when to reward batters for succeeding and when to punish them for failing are present in these run estimators. In the same way we created the basic effective averages in Part 2, we should establish a baseline for the total production in terms of runs caused by a batter’s plate appearances, independent of whether that production occurred due to batting effectiveness. We can later judge how much of that value we believe was caused by outside forces, but we should begin with this foundation. This will be the goal of the final part of this paper.


[i] In his article the next month, Lane says explicitly that he observed 63 games, but I prefer his unnecessarily roundabout description in the January 1917 article.

[ii] I’ve named these methods because Lane didn’t, and it can get confusing to keep going back and forth between the two methods without using distinguishing names.

[iii] Lane never explains why exactly he prefers this method, and just states that it “may be safely employed as the more exact value of the two.” He continues, “the better method of determining the value of a hit is…in the number of runs which score through its instrumentality than through the number of bases piled-up for the team which made it.” This may be true, but he never proves it explicitly. Nevertheless, the “instrumentality” method was the only one he used going forward.

[iv] This value has often been misrepresented as .164 runs in past research due to a separate table from Lane’s article. That table reflected the value of each hit, and walks, with respect to the value of a home run. Walks were worth 16.4 percent of the value of a home run (.254 / 1.551), but this is obviously not the same as the run value of a base on balls.

[v] The base states, B, are the various arrangements of runners on the bases: bases empty (0), man-on-first (1), man-on-second (2), man-on-third (3), men-on-first-and-second (12), men-on-first-and-third (13), men-on-second-and-third (23), and the bases loaded (123).

[vi] The calculation of these expected run averages involved an infinite summation of each possible number of runs that could score (0, 1, 2, 3,…) with respect to the probability that that number of runs would score. For instance,  here are some of the terms for E(0,0):

E(0,0) = (0 runs * P(0|0,0)) + (1 run * P(1|0,0)) + (2 runs * P(2|0,0)) + … + (∞ runs * P(∞|0,0))

E(0,0) = (0 runs * .747) + (1 run * .136) + (2 runs* .068) + … + (∞ runs * .000)

E(0,0) = .461 runs

Lindsey could have just as easily found E(T,B) by finding the total number of runs that scored following the beginning of all plate appearances in a given out/base state through the end of the inning, R(T,B), and dividing that by the number of plate appearances to occur in that out/base state, N(T,B), as follows:

E(T,B) = Total Runs (T,B) / Plate Appearances (T,B) = R(T,B) / N(T,B)

This is the method generally used today to construct run expectancy matrices, but Lindsey’s approach works just as well.
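A short sketch can show that the two routes give the same answer. The list of run totals below is a hypothetical sample of half-innings that all began in the same out/base state; it is not Lindsey's data, and the function names are mine.

```python
from collections import Counter


def expected_runs_from_distribution(run_totals):
    """Lindsey's route: estimate P(r | T,B), then sum r * P(r | T,B) over r."""
    n = len(run_totals)
    counts = Counter(run_totals)
    return sum(runs * (count / n) for runs, count in counts.items())


def expected_runs_from_totals(run_totals):
    """Modern route: R(T,B) / N(T,B)."""
    return sum(run_totals) / len(run_totals)


# Hypothetical: runs scored from the start of a plate appearance in one
# out/base state through the end of the half-inning, one entry per appearance.
sample = [0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 3, 0, 0, 0, 1]

via_distribution = expected_runs_from_distribution(sample)
via_totals = expected_runs_from_totals(sample)

print(f"sum of r * P(r): {via_distribution:.3f}")
print(f"R / N:           {via_totals:.3f}")
```

In practice the "infinite" summation stops at the largest run total actually observed, since every larger total has an estimated probability of zero.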

[vii] To simplify his estimations, Lindsey made certain assumptions about how baserunners tend to move during hits, similar to the assumptions Lane made in his initial March 1916 article. Specifically, he assumed that “runners always score from second or third base on any safe hit, score from first on a triple, go from first to third on 50 per cent of doubles, and score from first on the other 50 per cent of doubles.” While he did not track the movement of players in the same detail that Lane eventually employed, the total error caused by these assumptions did not have a significant effect on his results.

[viii] In The Hidden Game of Baseball, Thorn wrote that Palmer used data from “over 100 World Series contests,” but in the foreword to The Book: Playing the Percentages in Baseball, Palmer wrote that “the data I used which ended up in The Hidden Game of Baseball in the 1980s was obtained from the play-by-play accounts of thirty-five World Series games from 1956 to 1960 in the annual Sporting News Baseball Guides.” I’ll lean towards Palmer’s own words, though I’ve adjusted “thirty-five” down to 34 since there were only 34 World Series games over the period Palmer referenced.

[ix] The only limiting factor in the accuracy of a run expectancy matrix in the modern “big data” era is in the accuracy of those who record the play-by-play information and in the quality of the programs written to interpret the data. Additionally, the standard practice when building these matrices is to exclude all data from the home halves of the ninth inning or later, and any other partial innings. These innings do not follow the standard rules observed in every other half-inning, namely that they must end with three outs, and thus introduce bias into the data if included.

[x] The only nom de plume I’ve included in this history, as far as I’m aware.

[xi] Lane didn’t include walks in his Batting Effectiveness statistic, despite eventually calculating their value.


Pitch Win Values for Starting Pitchers – May 2014

Introduction

A few weeks back, I introduced a new method of calculating pitch values using a FIP-based WAR methodology.  That post details the basic framework of these calculations and  can be found here.  This post is simply the May 2014 update of the same data.  What follows is predominantly data-heavy but should still provide useful talking points for discussion.  Let’s dive in and see what we can find.  Please note that the same caveats apply as last month.  We’re at the mercy of pitch classification.  I’m sure your favorite pitcher doesn’t throw that pitch that has been rated as incredibly below average, but we have to go off of the data that is available.  Also, Baseball Prospectus’s PitchF/x leaderboards list only nine pitches (Four-Seam Fastball, Sinker, Cutter, Splitter, Curveball, Slider, Changeup, Screwball, and Knuckleball).  Anything that may be classified outside of these categories is not included.  Also, anything classified as a “slow curve” (here’s looking at you, Yu) is not included in Baseball Prospectus’s curveball data.

Constants

Before we begin, we must first update the constants used in calculation for May.  As a refresher, we need three different constants for calculation: strikes per strikeout, balls per walk, and a FIP constant to bring the values onto the right scale.  We will tackle them each individually.

First, let’s discuss the strikeout constant.  In May, there were 52,100 strikes thrown by starting pitchers.  Of these 52,100 strikes, 5,005 were turned into hits and 15,110 outs were recorded.  Of these 15,110 outs, 4,058 were converted via the strikeout, leaving us with 11,052 ball-in-play outs.  11,052 ball-in-play outs and 5,005 hits sum to 16,057 balls-in-play.  Subtracting 16,057 balls-in-play from our original 52,100 strikes leaves us with 36,043 strikes to distribute over our 4,058 strikeouts.  That’s a ratio of 8.88 strikes per strikeout.  This is up from 8.47 strikes per strikeout in March and April.  Hitters were slightly harder to strike out in May than in the previous two months.

The next two constants are much easier to ascertain.  In May, there were 29,567 balls thrown by starters and 1,575 walked batters.  That’s a ratio of 18.77 balls per walk, up from 18.50 balls per walk in March and April.  This data would suggest that hitters were slightly less likely to walk in May than previously.  The FIP subtotal for all pitches in May was 0.75.  The MLB Run Average for May was 4.32, meaning our FIP constant for May is 3.58.

Constant Value
Strikes/K 8.88
Balls/BB 18.77
cFIP 3.58
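For anyone who wants to check the constants, the sketch below reruns the arithmetic from the May totals quoted above. The function names are mine, and treating the FIP constant as league run average minus the FIP subtotal is my reading of the paragraph rather than a stated formula. With the rounded inputs of 4.32 and 0.75 the difference is 3.57, so the 3.58 in the table presumably reflects unrounded inputs.

```python
def strikes_per_strikeout(strikes, hits, outs, strikeouts):
    """Strikes not put in play, spread across the strikeouts recorded."""
    ball_in_play_outs = outs - strikeouts
    balls_in_play = ball_in_play_outs + hits
    return (strikes - balls_in_play) / strikeouts


def balls_per_walk(balls, walks):
    """Balls thrown per walk issued."""
    return balls / walks


def fip_constant(league_run_average, fip_subtotal):
    """Assumed form: the gap between league run average and the raw FIP subtotal."""
    return league_run_average - fip_subtotal


# May 2014 totals for starting pitchers, as quoted in the text.
k_constant = strikes_per_strikeout(strikes=52_100, hits=5_005, outs=15_110, strikeouts=4_058)
bb_constant = balls_per_walk(balls=29_567, walks=1_575)
c_fip = fip_constant(league_run_average=4.32, fip_subtotal=0.75)

print(f"Strikes/K: {k_constant:.2f}")
print(f"Balls/BB:  {bb_constant:.2f}")
print(f"cFIP (from rounded inputs): {c_fip:.2f}")
```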

 

Pitch Values – May 2014

For reference, the following table details the FIP for each pitch type in the month of May.

Pitch FIP
Four-Seam 4.43
Sinker 4.29
Cutter 4.13
Splitter 4.03
Curveball 4.01
Slider 4.13
Changeup 4.80
Screwball 2.56
Knuckleball 3.38
MLB RA 4.32

As we can see, only two pitches would be classified as below average for the month of May: four-seam fastballs and changeups.  Sinkers also came in right around league average.  Pitchers that were able to stand out in these categories tended to have better overall months than pitchers who excelled at the other pitches.  Now, let’s proceed to the data for the month of May.

Four-Seam Fastball

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Phil Hughes 0.7 185 Vidal Nuno -0.3
2 Ian Kennedy 0.6 186 Doug Fister -0.3
3 Jose Quintana 0.6 187 Wei-Yin Chen -0.3
4 Tom Koehler 0.5 188 John Danks -0.3
5 Lance Lynn 0.5 189 Mike Minor -0.4

Sinker

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Mike Leake 0.5 171 Brandon Maurer -0.2
2 Dallas Keuchel 0.4 172 Wandy Rodriguez -0.2
3 Tyson Ross 0.4 173 Tom Koehler -0.2
4 Charlie Morton 0.4 174 Kyle Lohse -0.3
5 Chris Archer 0.4 175 Edinson Volquez -0.6

Cutter

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Corey Kluber 0.5 74 Shelby Miller -0.1
2 Josh Collmenter 0.4 75 Kevin Correia -0.1
3 Adam Wainwright 0.4 76 Hector Santiago -0.1
4 Jarred Cosart 0.4 77 Brandon McCarthy -0.2
5 Madison Bumgarner 0.3 78 Cliff Lee -0.2

Splitter

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Masahiro Tanaka 0.3 27 Alfredo Simon -0.1
2 Hisashi Iwakuma 0.2 28 Franklin Morales -0.1
3 Hiroki Kuroda 0.2 29 Clay Buchholz -0.1
4 Jake Odorizzi 0.2 30 Jorge De La Rosa -0.1
5 Ubaldo Jimenez 0.2 31 Danny Salazar -0.2

Curveball

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Sonny Gray 0.3 160 Clay Buchholz -0.1
2 Brandon McCarthy 0.2 161 Tyler Lyons -0.1
3 Ryan Vogelsong 0.2 162 Dan Straily -0.1
4 Tyler Skaggs 0.2 163 Yordano Ventura -0.1
5 Collin McHugh 0.2 164 Franklin Morales -0.2

Slider

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Jason Hammel 0.4 120 Robbie Erlin -0.1
2 Ricky Nolasco 0.3 121 Kyle Gibson -0.2
3 Garrett Richards 0.3 122 Julio Teheran -0.2
4 Bud Norris 0.3 123 Johnny Cueto -0.2
5 Edwin Jackson 0.3 124 Yovani Gallardo -0.3

Changeup

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Felix Hernandez 0.3 165 Josh Collmenter -0.3
2 Stephen Strasburg 0.3 166 Jake Peavy -0.3
3 Francisco Liriano 0.2 167 Danny Duffy -0.3
4 Henderson Alvarez 0.2 168 Drew Smyly -0.3
5 Eric Stults 0.2 169 Marco Estrada -0.7

Screwball

Rank Pitcher Pitch Value
1 Alfredo Simon 0.0
2 Trevor Bauer 0.0
3 Hector Santiago 0.0

Knuckleball

Rank Pitcher Pitch Value
1 R.A. Dickey 0.6
2 C.J. Wilson 0.0

Overall

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Felix Hernandez 1.2 192 Edinson Volquez -0.3
2 Mike Leake 1.1 193 Alfredo Simon -0.3
3 Jason Hammel 1.0 194 CC Sabathia -0.3
4 Dallas Keuchel 1.0 195 Franklin Morales -0.4
5 Masahiro Tanaka 0.9 196 Marco Estrada -0.7

Pitch Ratings – May 2014

Four-Seam Fastball

Rank Pitcher Pitch Rating Rank Pitcher Pitch Rating
1 Jason Hammel 60 86 Brandon Maurer 38
2 Aaron Harang 60 87 John Danks 36
3 Phil Hughes 59 88 Trevor Bauer 35
4 Yordano Ventura 59 89 Rafael Montero 35
5 Jose Quintana 59 90 Mike Minor 28

Sinker

Rank Pitcher Pitch Rating Rank Pitcher Pitch Rating
1 Jeff Samardzija 58 71 Alfredo Simon 41
2 Jake Arrieta 58 72 Kyle Lohse 39
3 Aaron Harang 58 73 Ricky Nolasco 37
4 Blake Treinen 58 74 James Shields 37
5 Matt Shoemaker 57 75 Edinson Volquez 22

Cutter

Rank Pitcher Pitch Rating Rank Pitcher Pitch Rating
1 Josh Tomlin 60 26 Ryan Vogelsong 46
2 Corey Kluber 60 27 Josh Beckett 45
3 Franklin Morales 59 28 Dan Haren 44
4 David Price 58 29 Kevin Correia 41
5 Jorge De La Rosa 58 30 Jesse Chavez 40

Splitter

Rank Pitcher Pitch Rating Rank Pitcher Pitch Rating
1 Jake Odorizzi 60 10 Ricky Nolasco 54
2 Masahiro Tanaka 59 11 Tim Lincecum 53
3 Wei-Yin Chen 58 12 Kyle Kendrick 46
4 Ubaldo Jimenez 57 13 Dan Haren 43
5 Alex Cobb 57 14 Jorge De La Rosa 40

Curveball

Rank Pitcher Pitch Rating Rank Pitcher Pitch Rating
1 Felix Hernandez 60 61 Roenis Elias 42
2 John Lackey 59 62 Tommy Milone 41
3 Collin McHugh 58 63 Wei-Yin Chen 40
4 Jose Fernandez 58 64 Yordano Ventura 36
5 Mike Minor 58 65 Scott Carroll 35

Slider

Rank Pitcher Pitch Rating Rank Pitcher Pitch Rating
1 Yu Darvish 61 46 Jeremy Guthrie 40
2 Jhoulys Chacin 61 47 Homer Bailey 38
3 Corey Kluber 60 48 Julio Teheran 35
4 Edwin Jackson 60 49 Yovani Gallardo 31
5 Gavin Floyd 59 50 Kyle Gibson 30

Changeup

Rank Pitcher Pitch Rating Rank Pitcher Pitch Rating
1 Stephen Strasburg 59 59 Hector Noesi 33
2 Wade Miley 58 60 Cesar Ramos 30
3 Justin Verlander 58 61 Josh Collmenter 26
4 Francisco Liriano 57 62 Ian Kennedy 23
5 Anibal Sanchez 57 63 Marco Estrada 20

Screwball

Rank Pitcher Pitch Rating
1 Alfredo Simon 57
2 Hector Santiago 56
3 Trevor Bauer 56

Knuckleball

Rank Pitcher Pitch Rating
1 R.A. Dickey 55

Monthly Discussion

As we can see, Felix Hernandez ascended to the throne for this month riding the overall quality of his entire repertoire.  Hernandez was classified as throwing five different pitches in May (Four-Seam, Sinker, Curveball, Slider, and Changeup) and managed to earn at least 0.1 WAR in each category.  His best two pitches were his Sinker (0.4 WAR) and Changeup (0.3 WAR).  The most valuable pitch overall in May was the Four-Seam Fastball thrown by Phil Hughes.  The least valuable was Marco Estrada’s changeup.  Among offspeed pitches, R.A. Dickey’s 0.6 WAR from his knuckleball led the way.  Excluding Dickey’s knuckleball due to the sheer number of times it was thrown, the most valuable offspeed pitch was Jason Hammel’s slider.  The least valuable fastball was Edinson Volquez’s sinker.

On our 20-80 scale pitch ratings, the highest rated qualifying pitch was Yu Darvish’s slider.  Unsurprisingly, the lowest rated was Marco Estrada’s changeup.  It’s difficult to generate -0.7 WAR with a single pitch unless it was just awful.  The highest rated fastball was Jake Odorizzi’s splitter, and the lowest rated fastball was Edinson Volquez’s sinker.

Pitch Values – 2014 Season

Four-Seam Fastball

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Ian Kennedy 1.0 210 Doug Fister -0.3
2 Phil Hughes 1.0 211 Marco Estrada -0.3
3 Michael Wacha 0.9 212 Eric Stults -0.3
4 Jose Quintana 0.9 213 Dan Straily -0.4
5 Lance Lynn 0.7 214 Mike Minor -0.4

Sinker

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Cliff Lee 1.0 195 Mike Pelfrey -0.3
2 Charlie Morton 0.9 196 Edinson Volquez -0.3
3 Felix Hernandez 0.8 197 Erasmo Ramirez -0.3
4 Dallas Keuchel 0.8 198 Dan Straily -0.3
5 Justin Masterson 0.7 199 Wandy Rodriguez -0.3

Cutter

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Madison Bumgarner 0.7 88 Shelby Miller -0.2
2 Adam Wainwright 0.7 89 Brandon McCarthy -0.2
3 Corey Kluber 0.7 90 Felipe Paulino -0.2
4 Clay Buchholz 0.5 91 Johnny Cueto -0.3
5 Josh Collmenter 0.4 92 C.J. Wilson -0.3

Splitter

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Masahiro Tanaka 0.5 27 Jorge De La Rosa -0.1
2 Tim Hudson 0.3 28 Alfredo Simon -0.2
3 Hisashi Iwakuma 0.2 29 Franklin Morales -0.2
4 Hiroki Kuroda 0.2 30 Clay Buchholz -0.2
5 Wei-Yin Chen 0.2 31 Danny Salazar -0.3

Curveball

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Jose Fernandez 0.6 182 Ivan Nova -0.1
2 Sonny Gray 0.6 183 Bronson Arroyo -0.2
3 A.J. Burnett 0.5 184 Clay Buchholz -0.2
4 Brandon McCarthy 0.5 185 Franklin Morales -0.2
5 Stephen Strasburg 0.4 186 Felipe Paulino -0.3

Slider

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Edwin Jackson 0.5 139 Yovani Gallardo -0.2
2 Bud Norris 0.5 140 Tim Lincecum -0.2
3 Jason Hammel 0.4 141 Jeremy Guthrie -0.2
4 Aaron Harang 0.4 142 Erasmo Ramirez -0.2
5 Garrett Richards 0.4 143 Danny Salazar -0.4

Changeup

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Stephen Strasburg 0.5 191 Matt Cain -0.2
2 Francisco Liriano 0.5 192 Danny Duffy -0.3
3 Felix Hernandez 0.4 193 Drew Smyly -0.3
4 Eric Stults 0.4 194 Wandy Rodriguez -0.4
5 John Danks 0.4 195 Marco Estrada -0.6

Screwball

Rank Pitcher Pitch Value
1 Alfredo Simon 0.0
2 Trevor Bauer 0.0
3 Hector Santiago 0.0

Knuckleball

Rank Pitcher Pitch Value
1 R.A. Dickey 1.1
2 C.J. Wilson 0.0

Overall

Rank Pitcher Pitch Value Rank Pitcher Pitch Value
1 Felix Hernandez 1.8 216 Franklin Morales -0.4
2 Adam Wainwright 1.7 217 Dan Straily -0.4
3 Corey Kluber 1.6 218 Felipe Paulino -0.5
4 Aaron Harang 1.5 219 Marco Estrada -0.7
5 Jeff Samardzija 1.5 220 Wandy Rodriguez -0.8

Year-to-Date Discussion

If we look at the year-to-date numbers, Felix Hernandez still sits in the top spot.  Current AL and NL FIP leaders Corey Kluber and Aaron Harang rank third and fourth respectively.  The least valuable starter has been Wandy Rodriguez.  On a per-pitch basis, the most valuable pitch has been R.A. Dickey’s knuckleball, which should be the case for much of the season due to the heavy pitch totals.  Other than Dickey, the most valuable pitch has been Ian Kennedy’s four-seam fastball.  I guess there’s something to the idea of throwing a lot of fastballs in an extreme pitcher’s park after all.  The most valuable offspeed pitch has been Jose Fernandez’s curveball.  The fact that he still tops this list even after being injured and missing starts is simply astounding.  Get healthy, Jose; we all miss your brilliance.  The least valuable pitch has been Marco Estrada’s changeup.  The least valuable fastball has been Mike Minor’s four-seam.  Qualitatively, I feel fairly encouraged by the year-to-date results so far.  The leaderboard is topped by two no-doubt aces, with the current FIP leaders coming in right behind them.  For reference, the top five in the year-to-date overall rankings are currently 1st, 6th, 2nd, 14th, and 22nd on the FanGraphs WAR leaderboards respectively.  Please feel free to provide feedback in the comments section.


Peter O’Brien’s Raw Power: Estimating Batted-Ball Velocities in the Minor Leagues

On May 20th Peter O’Brien hit a massive home run to straightaway center, clearing the 32-foot-tall batter’s eye at Arm & Hammer Park more than 400 feet from home plate.  O’Brien is currently 1 home run behind Joey Gallo, in what looks to be an exciting competition for the minor league home run title.  O’Brien isn’t as highly touted a prospect as Gallo, but he still has some of the most impressive power in the minor leagues.  Reggie Jackson saw O’Brien’s home run and said it was one of the hardest-hit balls in the minor leagues that he had ever seen (and Reggie knows a thing or two about tape measure home runs).

How hard was that ball actually hit?  It is impossible to figure out exactly how hard and how far the ball was hit from the available information.  You can, however, use basic physics to make a reasonable estimate.

Below I explain the assumptions and thought process I used to get to an estimate of how hard the ball was hit.  If that does not interest you, then just skip to the end to find out what it takes to impress Reggie Jackson. But if you’re curious or skeptical, stick around.

OBSERVATIONS

I started off by watching the video to see what information I could gather (O’Brien’s at bat starts at the 37 second mark in the video).

TIME OF FLIGHT From the crack of the bat, to the ball leaving the park – it appears to take 5 seconds. If you watched the video, you can tell this is not a perfect measurement since the camera doesn’t track the ball very closely. If you think you have a better estimation, let me know and I’ll rework the numbers.  

LOCATION LEAVING THE PARK  The ball was hit to straightaway center. From the park dimensions we know that when it left the park it was 407 feet from home plate and at least 32 feet in the air to clear the batter’s eye.

ASSUMPTIONS

COEFFICIENT OF DRAG (Cd) – The Cd determines how much a ball will slow down as it moves through the air. I chose 0.35 for the Cd because it is right in the middle of the most frequently inferred Cd values for the home runs that Alan Nathan was looking at in this paper. In looking at the Cds of baseballs, Nathan showed there is reason to believe that there is some significant (meaning greater than what can be explained by random measurement error) variation in Cd from one baseball to another.

ORIGIN OF BALL I assume the ball was 3.5 feet off the ground and 2 feet in front of home plate when it was hit.  These are the standard parameters in Dr. Nathan’s trajectory calculator. But what if the location is off by a foot? The effects of the origin on the trajectory are translational. One foot up, one foot higher. One foot down, one foot lower. The other observations and assumptions are more significant in determining the trajectory of the home run.

Using these assumptions and the trajectory calculator, I was able to determine the minimum speed and backspin a ball would need in order to clear the 32 foot batter’s eye 5 seconds after being hit at different launch angles.  The table below shows the vertical launch angle (in degrees), the back spin (in RPM) and the speed of the batted ball (in MPH).

Vertical launch angle (degrees) Back spin (RPM) Speed off bat (MPH)
19 14121 101
21 6817 101.9
23 4155 102.75
25 2779 103.69
27 1940 104.7
29 1375 105.89
30 1156 106.5
32 805 107.88
34 536 109.4
36 322 111.1
38 149 112.99
40 4 115.1

The graph shows a more visual representation of the trajectories in the table above (with the batter’s eye added in for reference).

http://i1025.photobucket.com/albums/y314/GWR87/OBrienhomerun_zpsb1507cf4.png

Looking at the graph you will notice that all of these balls would be scraping the top of the batter’s eye.  This makes sense because the table shows the minimum velocities and back spins needed for the ball to exactly clear the batter’s eye.
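
For readers who want to experiment with the idea, here is a minimal sketch of a trajectory calculation in Python. It is not Dr. Nathan's calculator: it steps a ball through the air with the 0.35 drag coefficient and the 3.5 foot / 2 foot origin used above, and it replaces the spin-dependent lift of a real model with a single constant lift coefficient `cl`, which is my simplifying assumption. The air density, ball mass, ball circumference, time step, and the function name `height_at_distance` are likewise assumptions for illustration, so the output should not be expected to reproduce the table.

```python
import math

G = 9.81                  # gravitational acceleration, m/s^2
AIR_DENSITY = 1.2         # kg/m^3, roughly sea level (assumption)
BALL_MASS = 0.145         # kg, about 5.1 oz (assumption)
BALL_RADIUS = 9.125 * 0.0254 / (2 * math.pi)  # m, from a 9.125 in circumference
FT_PER_M = 3.28084
MS_PER_MPH = 0.44704


def height_at_distance(speed_mph, angle_deg, target_ft=407.0,
                       cd=0.35, cl=0.0, x0_ft=2.0, y0_ft=3.5, dt=0.001):
    """Step a batted ball forward until it is target_ft from home plate.

    Returns (height_ft, time_s) at that point, or None if the ball lands first.
    cl is a constant lift coefficient standing in for backspin; a fuller
    model would tie lift to spin rate and ball speed.
    """
    area = math.pi * BALL_RADIUS ** 2
    k = 0.5 * AIR_DENSITY * area / BALL_MASS  # aerodynamic factor per unit mass
    v0 = speed_mph * MS_PER_MPH
    theta = math.radians(angle_deg)
    vx, vy = v0 * math.cos(theta), v0 * math.sin(theta)
    x, y = x0_ft / FT_PER_M, y0_ft / FT_PER_M
    target = target_ft / FT_PER_M
    t = 0.0
    while y >= 0.0:
        if x >= target:
            return y * FT_PER_M, t
        speed = math.hypot(vx, vy)
        # Drag opposes the velocity; backspin lift acts perpendicular to it.
        ax = -k * speed * (cd * vx + cl * vy)
        ay = -G + k * speed * (cl * vx - cd * vy)
        vx += ax * dt
        vy += ay * dt
        x += vx * dt
        y += vy * dt
        t += dt
    return None


# Example: one launch angle and speed from the table, with an assumed lift value.
print(height_at_distance(106.5, 30, cl=0.2))
```

Setting `cd=0` and `cl=0` reduces this to the textbook vacuum parabola, which is a handy sanity check before trusting any numbers with air resistance turned on.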

What is the slowest O’Brien could have hit the ball?

If you were in a rush, looking at the table you would think the slowest O’Brien could have hit the ball would be 101 MPH at 19°. But, not so fast! The amount of backspin required for the ball to travel at that trajectory is humanly impossible.

What is a reasonable backspin?

I am highly skeptical of backspin values greater than 4,000 rpm based on the Baseball Prospectus article by Alan Nathan, “How Far Did That Fly Ball Travel?” The backspin on home runs Nathan examined ranged from 500 to 3,500 rpm, with most falling in around 2,000. The first 3 entries in the table have backspins of over 4,000 and can be eliminated as possibilities. If the ball with the 19° launch angle only had 3,500 rpm of back spin it would have hit the batter’s eye less than 11 feet off the ground instead of clearing it.  Maybe you’re skeptical that I eliminated the 3rd entry because it’s close to the 4,000 rpm cutoff.  Think about it this way: if a player were able to hit a ball with over 4,000 rpm of back spin, he would have to be hitting at a much higher launch angle than 23° (higher launch angles generate greater spin, while lower launch angles generate less).

The high launch angle trajectories with very little back spin (like the bottom three in the table) are also not very likely.  A ball hit with a 40° launch angle would almost certainly have more than 4 rpm of back spin.  If the ball hit with the 40° launch angle had 1,000 rpm of back spin (instead of 4) it would have been 70 feet off the ground, easily clearing the 32 foot batter’s eye.

Accounting for reasonable back spin, the slowest O’Brien could have hit the ball is 103.69 MPH at 25° with 2,779 rpm of backspin.

So what do all these observations and assumptions get us?

We can say that the ball was likely hit 103.69 MPH or harder, with a launch angle of 25° or greater.  A 103.69 MPH launch velocity is not that impressive; it is essentially the league average launch velocity for a home run.  Distance-wise, how impressive a home run was it? Unobstructed, the ball would have landed at least 440 feet from home plate (assuming the 25° scenario).  The ball probably went farther than 440 feet because it did not scrape the batter’s eye. So, how rare is a 440+ foot home run? Last year during the regular season there were 160 home runs that went 440 feet or farther out of a total of 4,661 home runs, meaning only 3.4% of all home runs were hit at least that far.
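
The rarity figure is simple division, shown here as a quick check using only the two counts quoted above (the variable names are mine):

```python
long_home_runs = 160     # home runs of 440 feet or more, from the text
total_home_runs = 4661   # all regular-season home runs, from the text

share = long_home_runs / total_home_runs
print(f"{share:.1%} of home runs traveled 440 feet or more")
```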

For those of you who wanted to just skip to the end: my educated guess is that the ball went at least 440 feet and left the bat at 103.69 MPH or faster.

If you like this, you can read other articles on my blog GWRamblings, or follow me on twitter  @GWRambling

None of this would have been possible without Alan Nathan’s great work on the physics of baseball.  I used his trajectory calculator to do this, and I referenced his articles frequently to make sure I wasn’t making stupid assumptions. The information on major league home run distance is based on hittrackeronline.com


Old Player Premium

One of Dave Cameron’s articles a while back showed payroll allocations by age groups, and it shows that over the last five years or so more money is going to players in their prime years while less is being spent on players over 30.  That seems to be a logical thing for teams to do, but that trend can only continue for so long.  Eventually a point will be reached where older players are undervalued, and it might be possible that we are already there.

There are several things to keep in mind when comparing these age groups, and one of the biggest is survivorship bias.  There is a natural attrition over time for players in general.  Let’s look at an example, and for all the following I will be using 2012 versus 2013 as a way to see what happens from year to year.  To look at survivorship, I looked at all position players in 2012 and then their contribution in 2013 to see how many disappeared the next year.  Players missing from 2013 could be gone due to retirement, demotion, injury, etc.  I also took out a small group that played in both seasons but were basically non-factors in 2013. For example, Wilson Betemit played in both seasons, but in 2013 he only had 10 plate appearances.  The attrition rate for the age groups looks like this:

Age Group % of 2012 Players That Did Not Contribute in 2013
18-25 22.2%
26-30 25%
31-35 29.3%
36+ 38.9%
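
As a rough sketch of how a table like this could be tallied, the Python below counts, for each age group, the share of 2012 players who fell below a plate-appearance cutoff in 2013. The cutoff of 50 PA, the function names, and the sample players are all hypothetical; the article does not state the exact threshold used to call someone a non-factor.

```python
from collections import defaultdict

MIN_PA_2013 = 50  # hypothetical cutoff for "contributed in 2013"


def age_group(age):
    """Map a 2012 age to the article's age buckets."""
    if age <= 25:
        return "18-25"
    if age <= 30:
        return "26-30"
    if age <= 35:
        return "31-35"
    return "36+"


def attrition_by_group(ages_2012, pa_2013):
    """ages_2012: name -> age in 2012; pa_2013: name -> plate appearances in 2013."""
    totals = defaultdict(int)
    gone = defaultdict(int)
    for name, age in ages_2012.items():
        group = age_group(age)
        totals[group] += 1
        # A player missing from the 2013 data counts as zero plate appearances.
        if pa_2013.get(name, 0) < MIN_PA_2013:
            gone[group] += 1
    return {group: 100.0 * gone[group] / totals[group] for group in totals}


# Made-up players, purely to show the mechanics.
ages_2012 = {"A": 23, "B": 24, "C": 28, "D": 29, "E": 33, "F": 37}
pa_2013 = {"A": 410, "C": 520, "D": 10, "E": 300}
rates = attrition_by_group(ages_2012, pa_2013)
print(rates)
```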

As you would expect, the attrition rate increases over time.  Players in their late teens and early 20s who make it to the majors are likely to be given opportunities in the near future, but as the age increases the probability of teams giving up on the player, major injury, or retirement goes up.  Players who make it from one group to the next have survived, and that is where the bias comes in.  By the time you get to the 36+ group a significant number of the players are really good because if they weren’t they would not have made it so far.  This ability to survive is also a reason why they should be getting a good chunk of the payroll.  As I will show you, it leads to steady play which teams should pay a premium for.

The next step is looking at performance risk among the groups.  To look at this I took each group’s performance in 2012 and compared it to the group’s performance in 2013, again only with survivors from year to year.  I looked at both wRC+ and WAR just to see if only the hitting component or overall performance behaved differently.

Further, to calculate a risk level I looked at the standard deviations of the differences (2013 minus 2012) for each player, but those are not directly comparable.  Standard deviation tends to be higher for distributions with higher averages due to scaling issues.  For instance, the average 36+ player had a 95 wRC+ in 2012, which is more than 10 wRC+ above the average 18 to 25 year old in the same year.  A 10% drop or increase in production is therefore a larger absolute change for the 36+ player, so they naturally end up with a higher standard deviation.  To take care of this I calculated the standard deviation of the difference as a % of 2012 average production as the overall riskiness measure.

Age Group wRC+ Risk WAR Risk
18-25 56.5% 167.7%
26-30 48.3% 118.9%
31-35 46.4% 140.7%
36+ 35.2% 92.8%
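
The risk measure described above can be sketched in a few lines of Python. This is an illustration under assumptions: it uses the sample standard deviation (`statistics.stdev`), since the text does not say whether the sample or population version was used, and the three-player data set is made up.

```python
from statistics import mean, stdev


def relative_risk(values_2012, values_2013):
    """Std. dev. of year-over-year differences as a percent of the 2012 average."""
    diffs = [after - before for before, after in zip(values_2012, values_2013)]
    return 100.0 * stdev(diffs) / mean(values_2012)


# Hypothetical wRC+ values for three surviving players in one age group.
wrc_2012 = [80, 100, 120]
wrc_2013 = [90, 95, 100]
risk = relative_risk(wrc_2012, wrc_2013)
print(f"wRC+ risk: {risk:.1f}%")
```

Because the denominator is the group's average, a stat whose group average sits close to zero can produce a risk figure above 100%.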

Don’t compare the wRC+ to WAR figures as there are again scaling issues, but look at the age groups.  A one standard deviation change is largest for the youngest age group, so the younger players are the most uncertain or most risky.  That is what we would expect as we have all seen prospects flame out.  The middle two groups are similarly volatile, with the 31 to 35 group having a slightly lower risk level in hitting for this sample and a slightly higher one in overall play according to the WAR risk.  More years might need to be compared to see how consistent those groups are relative to each other.  The 36+ players are significantly less risky than the other ages.  If they decline by 1 standard deviation it will mean a smaller reduction in performance: less volatile and less risky.

The only thing that really hurts the older players is the aging curve.  They are more likely to see a decline in performance.  From the youngest group to oldest the percent of players who were worse in 2013 than they were in 2012 by wRC+ was 52.3%, 54.5%, 64.4%, 63.6%, and for WAR 52.9%, 48.7%, 56.7%, and 81.8%.  So it is more likely that the older players will see performance worse than the previous year, but again a drop for them will likely be smaller due to lower volatility and it is on average from a higher level of performance to begin with.

Older players are like bonds in your investment portfolio: you have a pretty good idea of what they’re going to pay in the next period, with occasional defaults.  Younger players are more like growth stocks: you aren’t sure when or if they are going to pay dividends, but when they do you can make huge returns.  Investors pay a premium for bonds (accept a lower rate of return) due to their stability, and teams pay more for older players than maybe their production seems to warrant for the same reason.

[Image: Survivor_zpsee696878.jpg]

If you go back to the payroll allocation, part of the shift is in the number of players in each group.  The 31-35 year-olds no longer get the largest chunk of payroll in part because there are more 26 to 30 year-old players.  Baseball is getting younger overall, so a larger portion of the money going to younger players is inevitable.  The 18 to 25 group isn’t getting a large change in payroll allocation because they are generally under team control, but the teams are extending the players at that age with the money showing up as they get into the next couple age groups.  Take Chris Sale, who is making $3.5 million this year on the extension he signed (he’s 25); when he is 26, 27, and 28 he will make $6, $9.15, and $12 million respectively.

So the 36+ group, which as you can see is only 4.7% of the players, used to make about 20% of the total salaries paid, but now they make 15 or 16% (I don’t have Dave’s exact numbers).  Is that premium fair, a payroll share three to four times their share of the overall player pool?  That is a tough question, and one I am working on.  If anyone can give me tips on how to dump lots of player game logs, that is probably what I am going to do next, but I haven’t figured out how to do it without eating up my entire life.  Being more certain on this sort of thing, and having a relative risk measure for players, could make contracts a lot easier to understand and predict.
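
To put a number on that premium, here is the ratio of payroll share to roster share, using only the approximate percentages quoted above (the variable names are mine, and the payroll figures are the rough ones from the text, not Dave's exact numbers):

```python
player_share = 4.7                   # percent of players aged 36+, from the text
payroll_shares = [20.0, 15.0, 16.0]  # percent of payroll: the older figure, then the two recent estimates

ratios = [payroll / player_share for payroll in payroll_shares]
for payroll, ratio in zip(payroll_shares, ratios):
    print(f"{payroll:.0f}% of payroll / {player_share}% of players = {ratio:.1f}x")
```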


The Tim Hudson Renaissance

As a general rule, giving multi-year contracts to 38-year-old pitchers coming off major ankle injuries is not a good idea. Yet Brian Sabean and the San Francisco Giants did just that, inking Tim Hudson to a two-year, $23M contract this off-season, and thus far have come out smelling like roses.

While Hudson has been a reliable and at times masterful starter during his long career, he is en route to his best overall year since 2003. The data further suggests that he is pitching better now than he has at any other point.

Examining Hudson’s career statistics suggests that his current pace, while not completely sustainable, is not a mirage by any means. The one stat that jumps off the page is his BB/9, which is a paltry 0.77. Of course that rate is bound to rise, but it’s certainly reasonable to expect it to stay in the low 2s. Hudson’s career low BB/9 is 2.10, and he hasn’t had a rate above 2.91 since 2006.

This season, Hudson’s strikeout rate—5.63—is actually lower than his career rate of 6.05. But he has never been a strikeout pitcher; his highest K/9 (8.71) came in 1999, his rookie season, when he also walked 4.09 batters per nine. He hasn’t had a strikeout rate above 6.51 since 2001.

What Hudson is now doing better than he has at any time in his career is limiting baserunners and stranding those that do manage to reach. His minuscule 0.88 WHIP is far off from his career mark of 1.22, but it’s by no means a complete anomaly. As recently as 2011, Hudson has had a WHIP as low as 1.14; in 2003 he posted a career best of 1.08. While his current rate is likely to regress closer to the mean, he has proven capable of keeping batters from reaching base at an impressive rate.

When the WHIP does rise, it will likely be a result of an increased BB/9 and BABIP. Against Hudson in 2014, hitters have a BABIP of .243, a number well below his career mark of .278. But Hudson has posted similar rates in the past. In 2010, a year in which he pitched 228.2 innings, he held opposing hitters to a .249 BABIP. He hasn’t allowed a BABIP above .300 in a full season since 1999, though he threw just 136.1 innings that year.

Further, Hudson has stranded 80.8% of his baserunners thus far in 2014, his highest rate since 2010 (81.2%). His groundball rate (60.7%) is a big reason why, as is his refusal to allow home runs. His HR/9 is a measly 0.51, a number he’s only bettered twice in his career (0.38 in 2004, 0.40 in 2007). While pitching in the friendly confines of AT&T Park has helped, his FIP- of 83 is relatively close to his career mark of 88. In 2007, pitching half his games at Turner Field, Hudson posted a FIP- of 77.
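
For anyone who wants to recompute the rate stats quoted above from a raw stat line, here is a small Python sketch of the standard definitions of the per-nine rates and WHIP, plus a helper for box-score innings notation (where 228.2 means 228 and two-thirds innings). The function names and the sample stat line are hypothetical and are not Hudson's actual totals.

```python
def innings(ip_notation):
    """Convert box-score innings such as '228.2' (228 and 2/3) to a float."""
    whole, _, outs = str(ip_notation).partition(".")
    return int(whole) + (int(outs) if outs else 0) / 3.0


def per_nine(count, ip):
    """Rate per nine innings; works for BB/9, K/9, and HR/9."""
    return 9.0 * count / ip


def whip(walks, hits, ip):
    """Walks plus hits per inning pitched."""
    return (walks + hits) / ip


# Made-up stat line, purely to show the mechanics.
ip = innings("100.1")
print(f"BB/9 {per_nine(20, ip):.2f}, K/9 {per_nine(70, ip):.2f}, WHIP {whip(20, 90, ip):.2f}")
```

Passing the innings as a string keeps the ".1" and ".2" suffixes from being mistaken for decimal fractions.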

So how is Hudson doing it? Besides the absurdly low walk rate, what has made him so effective this year?

Thus far, he is throwing his split/changeup and cutter with more frequency than his career rates from 1999-2013. His split/change, which he throws 14.60% of the time, has been especially effective this season, garnering a whiff/swing rate of 36.84%. Before this season, the pitch amassed a whiff/swing rate of 27.94%. His cutter, while getting slightly fewer whiffs this season (16.67%) than in years past (17.12% from 1999-2013), is forcing more ground balls (11.26% compared to 9.05%).

Hudson’s curveball has also been a more valuable weapon this season than it has been in the past. While he’s throwing it at a rate that is almost identical to his career line, it gets him more whiffs (17.19%) than any of his other pitches besides the split/change (20.14%). Before, batters whiffed at Hudson’s curve just 11.74% of the time.

When batters do put the ball in play, they aren’t hitting it very hard. Hudson’s LD% of 15.9 is the second lowest number he’s posted in his career, and a decent chunk below his career mark of 18.0. In 2010, he had a career best 13.6%. He is also throwing strikes at a higher rate than he ever has in his career. In 2014, 68.2% of the pitches he has thrown have been strikes, compared to a career rate of 63.7%.

As amazing as Hudson has been through 10 starts this season, the data suggests that, for the most part, his rates are legitimate and sustainable. Besides the infinitesimal walk rate, which translates to a low WHIP, and improved whiff rates on two of his pitches, Hudson isn’t doing anything that he hasn’t proven able to do in the past.


Nick Markakis, What Happened?

Nick Markakis has carved himself out a nice major league career. He now has the 8th most hits in Orioles history and by season’s end he’ll likely be in sole possession of 6th place. Markakis, now with nearly 1,500 hits, at 30 years old has a shot at gathering 2,500 hits in his career. While hits are a compilation statistic, that would still place him in the top 100 of all time. However, Markakis still strikes me as a player of unfulfilled potential. In his last four seasons, Markakis has not compiled a WAR higher than his rookie season (2.1 in 2006). His two highest WAR seasons, far and away, were at ages 23 and 24. In 2008, a season in which he compiled 6.1 WAR, he had the 11th highest total in all of baseball. To peak so young is a very odd career trajectory. While Markakis was on the path to being one of the best all around players in baseball, he cratered early. This loss in value is due to two things, both readily apparent so far this season: a reduction in power and a reduction in defense.

Markakis early on posted decent advanced defensive numbers. But since 2009 he has been bad according to the metrics. To follow that up with some regular scouting: he has simply lost a step. He lost his range at a young age and has never been able to get it back. His arm keeps him respectable but he has lost some of that strength as well. He remains a below average right fielder and it is not getting any better.

While his defense has hindered his overall value, the most critical aspect of his game to leave him at a young age was his power. Markakis never hit many home runs, with a career high of 23, but the doubles were critical to his value. He had four straight seasons of 43, 48, 45, and 45 doubles. All fantastic numbers. In fact, after the 2010 season, he had a decent shot of reaching the top 10-20 for the all time doubles record if he kept up that pace. However, his homers and doubles fell following 2010. If he had maintained a 40 double, 15-20 homer pace over the course of his career, alongside his .300 batting average and decent walk rate, Markakis could have been one of the most valuable outfielders in the game. The graph below tells the story best of when he lost his power. Those are his season by season ISO and SLG numbers.

[Graph: Nick Markakis season-by-season ISO and SLG]

Looking at the graph above, one can see that Markakis was average to above average in power production for his first handful of seasons. His power began to fall below average starting in 2010. His numbers spiked in 2012; however, that is his shortest season to date, so the sample size is smaller than in the years around it. Also, 2012 was still lower in both ISO and SLG than 2007 and 2008. Since 2009, Nick Markakis has been a below average power hitter. And his most recent season, 2013, was his worst ever, producing a paltry .085 ISO (.145 is considered average and .080 is considered awful) and a -0.1 WAR. But the question remains: why did he lose his power?
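
For readers less familiar with ISO, it is slugging percentage minus batting average, which leaves only the extra bases per at-bat. The Python sketch below shows both routes to the same number; the function names and the stat line are made up for illustration and are not Markakis's actual totals.

```python
def batting_average(hits, at_bats):
    return hits / at_bats


def slugging(singles, doubles, triples, home_runs, at_bats):
    """Total bases per at-bat."""
    return (singles + 2 * doubles + 3 * triples + 4 * home_runs) / at_bats


def isolated_power(doubles, triples, home_runs, at_bats):
    """Extra bases per at-bat, equivalent to SLG minus AVG."""
    return (doubles + 2 * triples + 3 * home_runs) / at_bats


# Hypothetical season: 600 AB, 120 singles, 30 doubles, 2 triples, 10 home runs.
at_bats, singles, doubles, triples, home_runs = 600, 120, 30, 2, 10
hits = singles + doubles + triples + home_runs
avg = batting_average(hits, at_bats)
slg = slugging(singles, doubles, triples, home_runs, at_bats)
iso = isolated_power(doubles, triples, home_runs, at_bats)
print(f"AVG {avg:.3f}, SLG {slg:.3f}, ISO {iso:.3f}")
```

Since singles add equally to AVG and SLG, a hitter who loses doubles and homers but keeps slapping singles will see ISO fall even if the batting average holds up.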

After watching Markakis for years and staring at hours of tape it is hard to tell if this power reduction is due to mechanical issues. Markakis has been known to change his stance and approach at the plate nearly every week. He will lower or raise his hands, stay open or close up, he is a constant tinkerer at the plate with his mechanics. I do not believe mechanics has anything to do with the steady power decline. Nor is it necessarily how pitchers are pitching to Markakis. Looking at the numbers, he is seeing a similar amount of pitches in the zone, a little less than the early years but nothing unexpected and in fact his rate has rebounded recently. Furthermore, the mix of pitches he is seeing is similar to his early years. It has not been an adjustment from pitchers. Rather, much like his defense, he simply lost a step earlier than most other position players do.

Look at the two heat maps below. The first shows his power peak years (2007 to 2010) and the second shows the last two seasons (2013 to 2014). They are ISO heat maps showing which pitches in which locations Markakis has been able to drive for extra bases.

[Heat map: Markakis ISO, 2007 to 2010]
[Heat map: Markakis ISO, 2013 to 2014]

Clearly, over the past two seasons Nick Markakis has not been able to drive the pitches for extra bases that he once could. In particular, he has shown a clear inability to drive pitches on the outside middle of the plate, which, if you remember those great Markakis years, he could artfully fade right in between the center fielder and the left fielder for a double like clockwork. The only power left in Markakis’ game comes from pitches down and in, and even then it’s limited power at best. Basically, he can still run into a meatball, but his double-hitting days are over. And as someone who cannot and has never been able to hit the ball out of the park readily, Markakis is basically a slap-hitting right fielder who can post some decent value at the plate, but nothing special.

The career arc is strange and unfortunate but obvious. Markakis simply could not and cannot maintain the production of his early seasons. His skills broke down sooner than most. He is a nice piece, and if he had kept up his early pace, he would have been a steal on his current contract. However, unless he is brought back at a reduced price, or Peter Angelos decides that loyalty is worth $17.5 million, Orioles fans had better get used to having a new right fielder in 2015.

Article originally posted at www.Orioles-Nation.com


Satchel Paige: Baseball’s Believable Myth

One of the biggest drawbacks of statistics is how they can get in the way of our imagination. I’ve heard stories of how Pete Rose could will his team to victory on any given day of his career that spanned 23 years. Our stats claim that, actually, you can value his contributions at 80 wins. Rickey Henderson’s speed was electric and unfathomable, and no one can put a number on that, we’ve heard. FanGraphs says, really, his baserunning was worth 142 runs. Aroldis Chapman throws so hard, his fastball isn’t comparable to anyone else’s in baseball. Our data suggest that last year it was 7 runs above average.

While statistics have contributed significantly more than they’ve taken from us, it is occasionally fun to ignore them and just pretend the stories we want to believe are true. However, for a pitcher that is the focus of some of the most incredible tales in baseball history, a few stats from the end of his career are all the more reason to trust the absurd stories we have about him.

Satchel Paige pitched almost all of his professional baseball career in the Negro Leagues and barnstorming. He estimated that he played for 250 teams, though his “facts” about himself were often far from reality (for instance, he claimed that he never hit under .300, but he actually hit .097 in the majors). Baseball wasn’t integrated until Paige was 41 years old. Up until that point, he had built a legendary career that earned him the first Hall of Fame induction for any Negro Leagues player. Unfortunately, record keeping from these leagues was nearly non-existent, and almost no statistical evidence remains of his elite performances.

Stories of Paige paint a picture of arguably the most talented and entertaining pitcher to ever throw a baseball. As a teenager playing semi-pro baseball in Alabama, he supposedly got so mad at a poorly performing defense that he ordered his outfielders to sit down in the infield, where they watched him strike out the game’s last batter to complete his shutout with the bases loaded.

The greatest Negro Leagues hitter, Josh Gibson, once told Paige that he was going to hit a grand slam off of him in an upcoming game. With Gibson in the hole and one player on base, Paige intentionally walked the next two hitters, so Gibson would have an opportunity to hit a grand slam. Paige struck him out.

Joe DiMaggio called Paige the best pitcher and hardest thrower he had ever seen. Teammates claimed he could consistently throw his fastball over a gum wrapper. In his six exhibition matchups against Dizzy Dean (during two seasons in which Dean achieved a total WAR over 13), Paige won 4 games, and Dean said Paige’s fastball made his own look like a changeup.

Witnesses of Paige’s pitching would go on to tell countless other stories of his heroics, and a good number of them can’t be true. But what is possibly most remarkable is how historically effective he was when he was finally allowed to play in the majors, long after his prime.

Satchel Paige’s pitching demands were enormous, because through almost his entire career, people only paid to watch him pitch. He would frequently throw over 100 pitches on consecutive days. While his estimate of 2,500 games started is almost certainly exaggerated, he may very well have thrown more professional innings than anyone ever has. He pitched professionally for 22 years before Major League teams would allow him to join a roster, and he did so with more financial incentive to pitch frequently than any reasonable person could expect.

Considering the wear and tear on his arm, expectations even for such a legendary pitcher would need to be very tempered for his performance in his 40’s. After all, only 67 pitchers have ever even thrown 100 innings after they turned 40.

Of those 67, Paige ranks 8th in ERA- (81). Of the seven in front of him, three were knuckleball pitchers, one pitched before World War I, and one has been held out of the Hall of Fame due to steroid allegations (whether fair or not).

Over the course of his first four seasons in the majors, 128 pitchers threw at least 300 innings. Of those 128, Paige’s strikeout rate ranked 2nd. At the end of that four-year stretch, he was 46. Forty-six-year-olds don’t strike players out. You have to go down 20 spots to find a pitcher who was less than 10 years younger than Paige.

After Paige had been out of the majors for over a decade, the Kansas City A’s had him throw for them when he was 59 years old. He threw three scoreless innings, allowing only one runner.

It’s easy to wish we had better stats of Satchel Paige’s early career. It could help us establish if he really had, as he said, over 20 no-hitters. We could definitively say whether or not he had 250 shutouts, 2000 wins, 21 straight wins, or over 60 consecutive scoreless innings, all of which he claimed to be true. It’s quite likely all those numbers are fabricated. It’s possible that many of the stories about his pitching are exaggerated.

But when Satchel Paige was finally given a chance to prove himself, he blew away any realistic expectations anyone could have set for him. No one will ever know which stories about Satchel Paige really happened, or how trustworthy people’s observations of him were. But 25 years into his career, at an age when few are still pitching professionally, he gave us a reason to believe them.


Performance With and Without Runners On, and Hitter Valuation

The increased prevalence of defensive shifts, as well as recent stories touting certain players as “shift-proof,” got me thinking: Is it a good thing to be shift-proof?  Is it inherently better to be a player against whom defensive shifting is less effective, or is there room for different players with different make-ups?  A downstream effect of defensive shifts is that, because teams shift less often (and shifts are less exaggerated) with runners on base, we start to see differences in a hitter’s performance with runners on versus with the bases empty.  We also notice other effects of players performing differently based on the number of baserunners.  In this post we’ll look at a few sample players whose offensive performance changes significantly (often fueled by changes in BABIP) when there are runners on base, versus with the bases empty.

Let’s take 3 players with very high similarity scores to each other: David Ortiz, Jason Giambi, and Carlos Delgado.  First, a look at their career stats:

Player    G     PA    HR   ISO    BABIP  AVG    OBP    SLG    wOBA   wRC+  WAR
Delgado   2035  8657  473  0.266  0.303  0.280  0.383  0.546  0.391  135   43.5
Ortiz     2020  8467  443  0.261  0.304  0.286  0.381  0.548  0.392  138   41.7
Giambi    2242  8864  440  0.241  0.294  0.277  0.400  0.518  0.395  140   49.3

Pretty comparable overall.  Giambi has accumulated more WAR, primarily through having a few more plate appearances, but also from having a better walk rate, which drives up his OBP, wOBA, and wRC+ significantly as well.

Now let’s look at their splits with runners on vs. bases empty:

Player   Split        G     PA    HR   HR/PA  BB%    SO%    AVG    OBP    ISO    OPS    BABIP
Delgado  Bases Empty  1932  4430  255  5.8%   11.7%  21.4%  0.275  0.374  0.273  0.922  0.303
Delgado  Men On       1895  4227  218  5.2%   14.0%  18.9%  0.286  0.393  0.258  0.936  0.304
Ortiz    Bases Empty  1862  4193  262  6.2%   11.2%  19.1%  0.271  0.356  0.282  0.908  0.281
Ortiz    Men On       1851  4274  181  4.2%   15.2%  16.6%  0.302  0.406  0.240  0.948  0.327
Giambi   Bases Empty  1999  4513  224  5.0%   13.0%  18.1%  0.256  0.367  0.228  0.851  0.271
Giambi   Men On       2020  4351  216  5.0%   17.8%  17.1%  0.302  0.434  0.255  0.991  0.320

Here we start to see a lot of divergence.  With Ortiz and Giambi, we see a large increase in BABIP when there are runners on base (and corresponding increases to AVG and OPS).  With Delgado, there is only a trivial increase in BABIP, and a much smaller increase in OPS.

Here’s the difference in BABIP and OPS each player shows in the split between {bases empty} and {runners on}:

Player   BABIP(runners on) – BABIP(empty)  OPS(runners on) – OPS(empty)
Delgado  0.001                             0.014
Ortiz    0.046                             0.040
Giambi   0.049                             0.140

Note that, to some extent, all hitters tend to put up better numbers with runners on due to sampling bias: in an average “runners on” situation, a batter is more likely to be facing an inferior pitcher than in an average bases-empty situation.  Delgado’s splits are in line with the league-average splits for {bases empty} vs. {runners on}.  In a given season, the league-wide runners-on-vs.-bases-empty split in BABIP tends to range from 0.000 to 0.005; for OPS, the increase ranges from 0.010 to 0.030.  Ortiz and Giambi, on the other hand, show splits well outside these ranges, which indicates that other factors are at play.
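As a quick check on those differences, here is a minimal Python sketch that recomputes each player’s split from the career numbers in the table and compares it with the league-wide ranges quoted in the text. The data layout and function names are illustrative only, not part of any FanGraphs tool.

```python
# Career splits copied from the table above: (BABIP, OPS) by situation.
SPLITS = {
    "Delgado": {"empty": (0.303, 0.922), "on": (0.304, 0.936)},
    "Ortiz":   {"empty": (0.281, 0.908), "on": (0.327, 0.948)},
    "Giambi":  {"empty": (0.271, 0.851), "on": (0.320, 0.991)},
}

# League-wide ranges quoted in the text (runners on minus bases empty).
LEAGUE_BABIP_RANGE = (0.000, 0.005)
LEAGUE_OPS_RANGE = (0.010, 0.030)


def split_diff(player):
    """Return (BABIP diff, OPS diff), runners on minus bases empty, rounded to 3 places."""
    babip_empty, ops_empty = SPLITS[player]["empty"]
    babip_on, ops_on = SPLITS[player]["on"]
    return round(babip_on - babip_empty, 3), round(ops_on - ops_empty, 3)


def within(value, bounds):
    """True if value lies inside the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


for name in SPLITS:
    d_babip, d_ops = split_diff(name)
    typical = within(d_babip, LEAGUE_BABIP_RANGE) and within(d_ops, LEAGUE_OPS_RANGE)
    print(f"{name}: BABIP {d_babip:+.3f}, OPS {d_ops:+.3f}, league-typical: {typical}")
```

Only Delgado falls inside both ranges; Ortiz and Giambi sit outside them on BABIP and on OPS.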

Does this mean Ortiz and Giambi are tapping into some part of their psyche that allows them to suddenly transform into better players when runners are aboard?  Unlikely.  Ortiz and Giambi are pretty heavy pull hitters, especially looking at their ground ball spray charts, against whom defenses have often employed dramatic shifts to great effect.  However, with runners on base, these shifts tend to be less dramatic and less effective.  This is likely the primary reason for the large increases in BABIP with runners on (a 0.046 increase for Ortiz, 0.049 with Giambi).

Beyond this, although Ortiz and Giambi both show similar BABIP splits, they still differ greatly from each other in terms of their production with runners on.  Giambi’s OPS increases a whopping 140 points, while Ortiz’s only increases by 40 points.  This is largely due to Ortiz’s dramatic decrease in home run rate with runners on.  While Ortiz’s HR/PA drops by nearly a third (from 6.2% to 4.2%), Giambi has managed to continue hitting homers at the same rate when runners are aboard.  Do pitchers change their approach when facing Ortiz with runners on to “minimize the damage” and try to prevent him from hitting home runs?  Likewise, Ortiz (based on the knowledge that pitchers will approach him differently) may change his approach at the plate as well.  The splits for other stats seem to bear this out, as Ortiz increases his walk rate and decreases his strikeout rate; this isn’t particularly revelatory, and in fact these trends are present for Giambi and even Delgado as well.
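The drop in Ortiz’s home run rate is a relative change (measured against his bases-empty rate), not a difference in percentage points. A small Python sketch, using only the HR and PA counts from the splits table, makes the arithmetic explicit. The function name is illustrative.

```python
# HR and PA by situation, copied from the splits table above.
HR_PA = {
    "Delgado": {"empty": (255, 4430), "on": (218, 4227)},
    "Ortiz":   {"empty": (262, 4193), "on": (181, 4274)},
    "Giambi":  {"empty": (224, 4513), "on": (216, 4351)},
}


def relative_hr_rate_change(player):
    """Relative change in HR/PA going from bases empty to runners on."""
    hr_empty, pa_empty = HR_PA[player]["empty"]
    hr_on, pa_on = HR_PA[player]["on"]
    rate_empty = hr_empty / pa_empty
    rate_on = hr_on / pa_on
    return (rate_on - rate_empty) / rate_empty


for name in HR_PA:
    print(f"{name}: HR/PA changes by {relative_hr_rate_change(name):+.1%} with runners on")
```

Ortiz’s rate falls by roughly a third, Giambi’s is essentially unchanged, and Delgado lands in between.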

This has profound implications for player valuation.  Given 3 players who put up similar aggregate numbers over the course of the season, would you rather have the player who is going to produce at roughly the same level (similar AVG / BABIP / OPS) regardless of whether there are runners on base, or the player who is going to overproduce with runners on and underproduce with bases empty?  I’d go with the latter.  I’d prefer Ortiz to Delgado.  And then, since the decrease in Ortiz’s HR% with runners on is curious (and warrants further investigation), I’d prefer Giambi to Ortiz, Giambi being the even more extreme example of increased production with runners on.

As we start to see more and more defensive shifts (and if the assumption holds that shifts cannot be employed as effectively with runners on base), there will be more and more players who demonstrate these splits in performance.  WAR, for example, does not take this into account at all.  If a player is dramatically more productive (e.g. a 140-point increase in OPS!) with runners on, you would project his team to score more runs and win more games than if that player were replaced by a player who puts up equivalent full-season numbers (and hence has the same WAR) but does not have the same splits.

It would be interesting to run some simulations (probably using Markov models) to more precisely quantify the impact a given player’s splits have on team run production.  Said impact would likely vary based on the team too (e.g. overall team OBP).  This could be similar to the analysis comparing how 2 players with similar wRC+ but different makeup (an OBP guy versus an ISO guy) can impact expected run totals for different teams in different ways.
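One way to start on the simulation idea is a small base-out Markov chain. The Python sketch below is only an illustration under loud assumptions: every batter in the lineup is the same hypothetical hitter, his event probabilities depend only on whether the bases are empty, runner advancement is simplified (no steals, double plays, errors, or extra bases taken), and all probabilities are made-up numbers rather than any real player’s rates. The function names are hypothetical.

```python
from itertools import product

EVENTS = ("out", "walk", "single", "double", "triple", "hr")
BASE_STATES = list(product((False, True), repeat=3))  # (first, second, third)


def advance(bases, event):
    """Return (new_bases, runs_scored) for a non-out event (simplified rules)."""
    first, second, third = bases
    if event == "walk":
        # Only forced runners move up.
        runs = 1 if (first and second and third) else 0
        return (True, first or second, third or (first and second)), runs
    if event == "single":
        # Assumption: runners on second and third score, runner on first stops at second.
        return (True, first, False), int(second) + int(third)
    if event == "double":
        # Assumption: runner on first stops at third, the others score.
        return (False, True, first), int(second) + int(third)
    if event == "triple":
        return (False, False, True), int(first) + int(second) + int(third)
    if event == "hr":
        return (False, False, False), 1 + int(first) + int(second) + int(third)
    raise ValueError(f"unknown event: {event}")


def expected_runs_per_inning(probs_empty, probs_on, sweeps=2000):
    """Expected runs from (0 outs, bases empty) until the third out.

    probs_empty / probs_on map event name -> probability. Each should sum to 1
    and include a positive out probability so that the inning ends.
    Solved by repeated sweeps over the expected-value equations.
    """
    value = {(outs, bases): 0.0 for outs in range(3) for bases in BASE_STATES}
    for _ in range(sweeps):
        updated = {}
        for (outs, bases) in value:
            probs = probs_on if any(bases) else probs_empty
            total = 0.0
            for event in EVENTS:
                p = probs.get(event, 0.0)
                if p == 0.0:
                    continue
                if event == "out":
                    # The third out ends the inning, so nothing more is added.
                    if outs < 2:
                        total += p * value[(outs + 1, bases)]
                else:
                    new_bases, runs = advance(bases, event)
                    total += p * (runs + value[(outs, new_bases)])
            updated[(outs, bases)] = total
        value = updated
    return value[(0, (False, False, False))]


# Made-up profiles: "steady" hits the same in both situations,
# "split" is worse with the bases empty and better with runners on.
steady = {"out": 0.66, "walk": 0.10, "single": 0.14, "double": 0.05, "hr": 0.05}
split_empty = {"out": 0.69, "walk": 0.09, "single": 0.12, "double": 0.05, "hr": 0.05}
split_on = {"out": 0.62, "walk": 0.12, "single": 0.16, "double": 0.05, "hr": 0.05}

print("steady:", round(expected_runs_per_inning(steady, steady), 3))
print("split: ", round(expected_runs_per_inning(split_empty, split_on), 3))
```

The two profiles here are not calibrated to produce identical full-season lines, so a real comparison would first need to match their aggregate numbers and then vary the surrounding lineup.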


Home-Run Environment And Win-Homer Correlation

Home runs are good, I think we can all agree on that, and in the presumably post-steroid environment they have been in decline.  Does that make the home run more or less important?  It is hard to say.  In some ways it means that they are more scarce, and you might expect that home run hitting teams might be at a larger advantage than previously.  On the other hand, teams that don’t hit a lot of balls out of the park will not be as far behind their peers if said peers are not taking the ball yard quite so frequently.  So which is it?

FanGraphs, of course, can give the answer.  I took every team in the expansion era (1961 and on) and tracked two things year over year.  The first was the correlation between home runs and wins: for each season, I took each team’s home run total minus the average across all MLB teams, then calculated the correlation of those differences with the wins each team accumulated that year.  The second was the overall home run environment, which I plotted against that correlation.  To put both on the same scale, I expressed the home run environment as a percent of the highest average home runs per team in any season.  That peak came in 2000, so 2000 is 100%, and every other year is its per-team average divided by the 2000 average.
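The procedure described above can be sketched in a few lines of Python. This is not the code used for the post; the function names and the toy numbers are made up purely to show the two calculations (per-season correlation, and environment as a percent of the peak).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def homer_win_correlation(season):
    """season: list of (team_hr, team_wins) pairs for one year."""
    avg_hr = sum(hr for hr, _ in season) / len(season)
    hr_vs_avg = [hr - avg_hr for hr, _ in season]  # distance from league average
    wins = [w for _, w in season]
    return pearson(hr_vs_avg, wins)


def environment_pct(avg_hr_by_year):
    """Scale each year's average HR per team to a percent of the peak year."""
    peak = max(avg_hr_by_year.values())
    return {year: 100.0 * avg / peak for year, avg in avg_hr_by_year.items()}


# Toy data only (hypothetical teams and seasons).
toy_season = [(210, 96), (185, 90), (160, 79), (150, 84), (130, 68)]
print(round(homer_win_correlation(toy_season), 3))
print(environment_pct({1999: 150.0, 2000: 200.0}))
```

Within a single season, subtracting the league average shifts every team by the same amount, so the correlation is the same as the one between raw home run totals and wins. The subtraction matters mainly when comparing teams across seasons.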

I did omit 1994 and 1981 because the strikes shortened those seasons so much that including them made the overall graph harder to read.  The results look like this:

 

[Figure: home run environment (percent of the 2000 peak) and the correlation between home runs and wins, by season]

 

And the answer is…it doesn’t matter!  Home runs are always positively correlated with wins, meaning it is never advantageous for a team to be below average when it comes to hitting home runs.  That correlation over time has a best fit line with a near zero slope.  Home runs are equally valuable with respect to winning in lower home run environments and in the more recent high ones.  You can also see that the correlation is rather volatile, ranging from barely positive to about .65, which is a fairly strong positive relationship.  Volatile, but never negative, so there are no years where a bunch of below average home run hitting teams took the league by storm.

The home run environment last year was back to 81.9% of the peak in 2000, and this year’s pace is a little slower than last with home runs in 2.38% of plate appearances rather than 2013’s 2.52%, which could reduce the total home runs hit by more than 8 per team for the year, though the heat of summer will probably close that gap up some.  It is likely though that the overall home run environment will be down to the levels we saw in 2011 and 2012, and maybe the drop off from 2000 has flattened out.
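The “more than 8 per team” figure follows from simple arithmetic if a team gets roughly 6,150 plate appearances in a season. That plate appearance count is an assumption for illustration; the post does not state the figure it used.

```python
# Assumption: roughly 6,150 plate appearances per team over a full season.
PA_PER_TEAM = 6150


def hr_gap_per_team(rate_before, rate_after, pa_per_team=PA_PER_TEAM):
    """Home runs lost per team when HR/PA falls from rate_before to rate_after."""
    return (rate_before - rate_after) * pa_per_team


# 2013 rate (2.52% of PA) versus this year's pace so far (2.38% of PA).
gap = hr_gap_per_team(0.0252, 0.0238)
print(f"about {gap:.1f} fewer home runs per team")
```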

Anyway, I know everyone hates a non-result (there are even published papers about the bias against them), but this is still interesting, at least to me.  You always want to hit home runs; we already knew that.  But the value of home runs should not be marked up in times when they are scarce, and they don’t become even more necessary during a homer boom.  This means that teams shouldn’t, for instance, overpay for a guy like Giancarlo Stanton right now on the theory that his power bat is more valuable in the current home run environment.  It means they should overpay so that their fans can enjoy the majestic blasts and feel content knowing they will be just as valuable as ever.