Archive for Research

The Truth About Power, Contact, and Hitting in General

The overarching purpose of this study was to identify the core skills that underlie hitting performance and to investigate the extent to which hitters must choose between those skills. The article unfolds in two parts. In Part 1, I explore the ostensible trade-off between power and contact in search of the optimal approach. Then in Part 2, I show that 66% of the variance in wRC+ can be explained by four skill indicators: power, contact, speed, and discipline. It will be shown that increasing hard contact should be of paramount importance to hitting coaches, while contact and discipline are complementary assets.

PART ONE: IS THERE A POWER-CONTACT TRADE-OFF?

Eli Ben-Porat recently published a terrific study on the trade-off between contact ability and power, and I will be building on his findings. As such, I will be using the same sample as his study: all players since 2008 who have swung at 1,000 pitches or more. First, I want to explain why a trade-off between power and contact is assumed. Not only is it intuitive that a hitter chooses between swinging for the fence and putting the ball in play; there is also a clear inverse relationship between the two abilities among MLB hitters. Here is a plot of the relationship between SLG on Contact and Contact%.

Figure 1. Contact Rate and SLG on Contact.

There is a strong inverse relationship between power and contact, explaining 42% of total variance. However, Ben-Porat cited evidence that power hitters tend to face tougher pitches than light hitters, a factor that is likely to affect their contact rate. When Ben-Porat controlled for the effect of pitch location on contact rate, the relationship between contact and power dropped to an R² of 33%. Figure 2 plots the relationship between Ben-Porat’s new True Contact, a location-independent measure of contact skill, and SLG on Contact.

Figure 2. True Contact and SLG on Contact.

While controlling for location loosened the relationship between power and contact, there still appears to be a significant inverse correlation between the skills.  Is this lingering relationship due to a necessary trade-off between hitting for power and making contact? I propose not.  Instead, consider the relationship between Fastball% and SLG on Contact.

The graph in Figure 3 plots the relationship between percentage of fastballs faced and SLG on Contact.

Figure 3.  Percentage of Fastballs Faced and SLG on Contact.

Predictably, pitchers tend to throw fewer fastballs to more powerful hitters. To partial out the effect of pitch type, I examined the relationship between regular Contact% and SLG on Contact while controlling for Fastball%. This strategy is similar to Ben-Porat’s approach but controls for pitch type rather than location. The results of a simultaneous multiple regression analysis indicate that when holding Fastball% constant, Contact% explains just 12% of the variance in SLG on Contact. In other words, most of the relationship between Contact% and SLG on Contact was due to differences in the number of fastballs faced.
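The "holding Fastball% constant" step can be sketched as an incremental-R² comparison: fit the outcome on the control alone, then on the control plus the predictor, and take the difference. The numbers below are synthetic stand-ins, not the article's sample, so only the mechanics carry over:

```python
import numpy as np

def r_squared(X, y):
    """R² of an OLS fit of y on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

# Synthetic stand-ins: Fastball% drives both Contact% and SLG on Contact.
rng = np.random.default_rng(0)
n = 500
fastball = rng.normal(0.55, 0.05, n)
contact = 0.60 + 0.50 * fastball + rng.normal(0, 0.03, n)
slg_con = 0.90 - 0.80 * fastball - 0.20 * contact + rng.normal(0, 0.05, n)

r2_fb = r_squared(fastball, slg_con)                               # Fastball% alone
r2_both = r_squared(np.column_stack([fastball, contact]), slg_con) # both predictors
unique_contact = r2_both - r2_fb   # Contact%'s share beyond Fastball%
```

The difference `unique_contact` plays the role of the 12% figure in the text: variance in SLG on Contact that Contact% explains over and above Fastball%.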

To do a little better, I examined the relationship between Fastball% and True Contact.  Figure 4 shows that Fastball% accounts for about a quarter of the variance in True Contact.  Understandably, as Fastball% increases so does True Contact.

Figure 4.  Relationship between True Contact and Fastball%.

While True Contact controls for the location of pitches faced, it does not account for the proportion of fastballs faced.  When the effect of Fastball% is held constant, True Contact accounts for just 9% of the variance in SLG on Contact.  I computed a new Fastball%-independent version of True Contact, called Real Contact, and plotted it against SLG on Contact in Figure 5.
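A Fastball%-independent measure like Real Contact is presumably built by regressing the contact measure on Fastball% and keeping the residuals. That residualization step, sketched with toy numbers standing in for the real measures:

```python
import numpy as np

def residualize(x, control):
    """Return x with the linear effect of `control` regressed out (OLS residuals)."""
    X = np.column_stack([np.ones(len(control)), control])
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    return x - X @ beta

# Toy stand-ins for True Contact and Fastball% (not the article's data):
rng = np.random.default_rng(1)
fastball = rng.normal(0.55, 0.05, 300)
true_contact = 0.4 + 0.7 * fastball + rng.normal(0, 0.04, 300)

real_contact = residualize(true_contact, fastball)
# By construction, the new measure is uncorrelated with Fastball%:
print(abs(np.corrcoef(real_contact, fastball)[0, 1]) < 1e-8)   # True
```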

Figure 5. Relationship between Real Contact and SLG on Contact.

The plot resembles a shotgun distribution, with only a slight relationship between power and contact remaining. It is possible this remaining relationship is due to what’s left of the “trade-off hypothesis.” If so, there should be evidence that an approach that maximizes slugging, such as hitting fly balls and pulling the ball, is associated with lower Real Contact scores. Instead, FB% explained only 2.6% and Pull% only 2.4% of the total variance in Real Contact. If there is a real trade-off between contact and power, I still can’t isolate it.

Dr. Alan Nathan has demonstrated that home runs and base hits are optimized by different swing strategies.  The implication is that there is a trade-off between base hits and power. Perhaps a contact swing is a base-hit swing. I tested this notion, and Figure 6 plots the relationship.


Figure 6.  BABIP and Real Contact.

Surprisingly, contact and BABIP are unrelated.  This is a counter-intuitive null finding, like the non-association between LD% and Hard%. In this case, I think base-hit skill requires more than not-missing.

I can’t test my final explanation, but I think selective sampling could explain the remaining small association between contact and power.  Since hitters need to achieve a minimum level of success to stay in the league, it seems unlikely for hitters to lack both power and contact skills.  Further, a hitter deficient in one skill would need to make it up with the other to avoid being released.  Since I could not find evidence to support an adjustment-based trade-off between power and contact, I assume the skills are independent moving forward.

PART TWO: POWER, CONTACT, SPEED, AND DISCIPLINE

If power and contact are separate skills, how much does each contribute to a hitter’s overall production? What about speed and discipline?  To answer these questions, I conducted a multiple regression analysis with wRC+ as the dependent variable and Hard%, Real Contact, Spd, and O-Swing% included as predictors.  The predictors were chosen to reflect power, contact, speed, and discipline because they measure each construct without including outcome data that make up wRC+. A multiple regression allows us to measure the unique contribution of each predictor on wRC+ as well as the overall variance accounted for by all the predictors.
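For readers who want to reproduce this kind of model, a multiple regression with an adjusted R² needs nothing more than a least-squares fit. The data below are simulated stand-ins for Hard%, Real Contact, Spd, O-Swing%, and wRC+ (the coefficients are invented for illustration, not estimates from the article):

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; returns coefficients, R², adjusted R²."""
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sst = (y - y.mean()) @ (y - y.mean())
    r2 = 1 - resid @ resid / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra predictors
    return beta, r2, adj_r2

# Simulated stand-ins for Hard%, Real Contact, Spd, O-Swing% and wRC+:
rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 4))
wrc_plus = (100 + 25 * X[:, 0] + 8 * X[:, 1] + 4 * X[:, 2] - 6 * X[:, 3]
            + rng.normal(0, 12, n))

beta, r2, adj_r2 = fit_ols(X, wrc_plus)   # beta[0] is the intercept
```

The adjusted R² is the statistic quoted in Figure 8; it shrinks toward zero when predictors add noise rather than signal.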

The correlation matrix for the four predictors and one dependent variable is presented in Figure 7. Only Spd and Hard% have a zero-order correlation over .20, with an R² of 11.6%. The four skills are mostly unique, which means the model avoids the statistical problems of multicollinearity and singularity.

Figure 7. Correlation matrix indicating zero-order correlations in the top row, 1-tailed p-values in the second row, and sample size in the third row.

The results of the multiple regression are presented in Figure 8.  Note the adjusted R2 of .66 indicating that the four predictors explained 66% of total variance in wRC+.

Figure 8. Results of multiple regression.  Hard%, Real Contact, Spd, and O-Swing% predicted 66% of variance in wRC+.

The specific contribution of each measure is indicated in Figure 9.  The Part Correlation statistic describes the unique contribution of each predictor to explaining wRC+. When considering all predictors together, Hard% accounts for 60% of the variance in wRC+. The remaining three skills provide only incremental value compared to hitting the ball hard.

Figure 9.  Coefficients and Correlations from multiple regression.

The Partial Correlation statistic indicates the proportion of the remaining variance explained by each predictor while controlling for the effects of the others.  In other words, when controlling for Hard%, Spd, and O-Swing%, Real Contact explains 24% of the remaining variance in wRC+.
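The distinction between part (semipartial) and partial correlations can be made concrete by residualizing: the part correlation residualizes only the predictor, while the partial residualizes both the predictor and the outcome. A sketch with simulated data (the variables are stand-ins, not the article's dataset):

```python
import numpy as np

def resid(y, X):
    """OLS residuals of y after regressing on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta

def part_and_partial(y, x, controls):
    """Semipartial (part) and partial correlation of x with y, given controls."""
    x_res = resid(x, controls)
    part = np.corrcoef(y, x_res)[0, 1]                      # raw y vs residualized x
    partial = np.corrcoef(resid(y, controls), x_res)[0, 1]  # both residualized
    return part, partial

# Simulated stand-ins (controls play the role of Hard%, Spd, O-Swing%):
rng = np.random.default_rng(3)
n = 300
controls = rng.normal(size=(n, 3))
x = rng.normal(size=n) + controls[:, 0]
y = 2 * x + controls @ np.array([5.0, 1.0, -1.0]) + rng.normal(size=n)

part, partial = part_and_partial(y, x, controls)
# |partial| >= |part| always: the partial also removes control-explained
# variance from y, shrinking the denominator.
```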

The strength of the multiple regression approach is clear when comparing the zero-order correlations to the partial and part correlations.  In every case, the part and partial correlations are larger, suggesting that each predictor benefits from the inclusion of the others in the model. Further, the relationship between each skill and wRC+ seems more intuitive when the contribution of the other skills is accounted for.  For example, Spd has a slight negative association with wRC+ on its own, but a positive relationship accounting for 11% of the remaining variance when included with the other predictors. It makes sense that speed is helpful, all else being equal.  Similarly, Real Contact and O-swing% have larger, more intuitive relationships to wRC+ when controlling for all predictors.

CONCLUSION

I conducted this research from a coach’s and player’s perspective, with the goal of identifying the ideal composition of hitting skill. Previous research has already reported a strong association between Hard% and wRC+, and this study reaffirms the contribution of Hard% to overall production.  Given the same amount of speed, discipline, and contact skill, hard-hit percentage accounts for over two-thirds of the remaining variance in a hitter’s wRC+.

A novel finding of this study is that there is little to no trade-off between power and contact ability.  Almost all of the apparent effect was due to differences in how power hitters and light hitters are pitched.  Given the same pitches, power hitters can make as much contact as light hitters. For example, Albert Pujols ranks 10th in the sample in Hard% and 15th in Real Contact.

The truth about hitting is that every hitter is already swinging the bat just about as fast as he can. Hitters are racing 95+ mph fastballs, so they don’t really have a choice.  That doesn’t leave a lot of room for a hitter to consciously swing easier.  A hitter can choose to take a “shorter” swing, but should only do so if it results in more hard contact (or the same amount plus more overall contact). Hitting the ball hard is the name of the game. Making contact, running well, and being disciplined complete the package.


xHR%: Questing for a Formula (Part 3)

Part 3 of a series of posts regarding a new statistic, xHR%, and its obvious resultant, xHR. This article will examine formulas 2 and 3. 

As a reminder, I have attempted to create a new statistic, xHR%, from which xHR (expected home runs) can be derived. xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season. In searching for the best formula possible, I came up with three different variations, pictured below.

Today, I’m going to examine formulas 2 and 3 to measure their viability as formulas for xHR%. Hopefully the analysis will shine some light on a murky matter. My expectation is that formula 2 will end up being the best one, because it balances in-season performance with prior performance better than formula 3. Formula 3 relies more heavily on in-season performance, so it will likely correlate too well with what actually happened (the same outcome is possible for formula 2).

Methodology

Luckily for myself and the readers, the process was a simple one. Pulling data from FanGraphs player pages, ESPN’s Home Run Tracker, and various Google searches, I compiled a data set from which to proceed. From FanGraphs, I collected all information for Part Two of the formula, including plate appearances and home runs. Unfortunately, because a few of the players from the sample were rookies or had fewer than three years of major league experience, I had to use regressed minor league numbers. In some cases, where that data wasn’t applicable, I dug through old scouting reports to find translatable game power numbers based on scouting grades (and used a denominator of 600 plate appearances).

Then, from ESPN’s Home Run Tracker website, I obtained all relevant data for player home-run distance, average home-run distance for the player at home, and league average home-run distance. Due to my limited time, I only used players that qualified for the batting title during the 2015 season, yielding a potentially weak sample of only 130 players. Additionally, before anyone complains, please realize that the purpose of my research at this point is to obtain the most viable formula and refine it from there so that it can be applied across a wider population.

Results for Formula 2

Using Microsoft Excel, I calculated the resultant xHR% and xHR. Some key data points:

League Average HR% (actual):  3.03%

Average xHR%:  2.89%

Average Home Runs: 18.7

Expected Home Runs: 17.8

Please note that there is a significant amount of survivorship bias in this data. That is, because all of these players played enough to qualify for the batting title, they are likely significantly better than replacement level, which is why the percentages and home runs seem so high.

Correlation between xHR% and HR%: 0.974418884

R² for above: 0.949492162

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.4265261

Correlation between xHR and HR: 0.977796283

R² for above: 0.956085571

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.474596069
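As a sanity check, the R² lines above are simply the squared correlations:

```python
# The reported R² values are just the squared correlations:
r_pct, r2_pct = 0.974418884, 0.949492162   # xHR% vs HR%
r_hr,  r2_hr  = 0.977796283, 0.956085571   # xHR vs HR

assert abs(r_pct ** 2 - r2_pct) < 1e-7
assert abs(r_hr ** 2 - r2_hr) < 1e-7
```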

Results for Formula 3

League Average HR% (actual):  3.03%

Average xHR%:  2.92%

Average Home Runs: 18.7

Expected Home Runs: 18.1

Again, note the survivorship bias that comes with having a slightly skewed sample.

Correlation between xHR% and HR%: 0.986440621

R² for above: 0.973065099

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.4615323

Correlation between xHR and HR: 0.988287804

R² for above: 0.976712783

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.698203408

Mostly Boring Analysis

I have opted to condense the analysis into one section instead of two because it would have otherwise been repetitive and boring.

I understand that that’s a lot to process, but the data really isn’t all that dissimilar. The expected home-run percentage is slightly lower than the actual home-run percentage for both of them, but it isn’t a massive difference by any means. When prorated to a 600 plate appearance season, xHR% for formula 2 predicts that the average player in the sample would have hit 17.3 home runs, while formula 3’s xHR% expects that the average home-run total would have been 17.5. In reality the average player hit 18.2 home runs per 600 plate appearances, so both were fairly close (maybe too close).
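The 600-PA proration in the paragraph above is straightforward arithmetic on the sample averages reported earlier:

```python
# Prorating the sample-average rates to a 600-PA season:
PA = 600
print(round(0.0289 * PA, 1))   # formula 2 xHR%: 17.3 expected HR
print(round(0.0292 * PA, 1))   # formula 3 xHR%: 17.5 expected HR
print(round(0.0303 * PA, 1))   # actual HR%:     18.2 HR
```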

Both formulas had incredibly high correlations, with formula 3 correlating only marginally better. More importantly, formula 2 explains about 95% of the variance, while formula 3 accounts for 97%. The difference between them is relatively unimportant because both explain a very high share of what occurred. Furthermore, p < .001 (in fact far lower), so the correlations are statistically significant.

Both formulas resulted in slightly lower standard deviations than what actually occurred, which is a recurring theme. In these formulas, the numbers have been clumped a little bit closer together and tend to underestimate rather than overestimate.

Players of Interest

Mr. Kole Calhoun – Last season he hit 26 home runs, but by both formulas he should have hit 3-4 fewer. Likely, this is because his only previous full season of home runs was in 2014, when he had only 17, in addition to the fact that I was forced to use scout grades for his third season. The scout grades were particularly off for Calhoun because he wasn’t even expected to be good enough for the majors, let alone be an above-average, high-value outfielder. Even though his overall offensive prowess declined slightly this past season (by 20 points of wRC+), he didn’t appear to be selling out for power, as his power profile numbers (FB%, Pull%, etc.) remained the same. Personally, I would expect him to regress next season, and I think the formula agrees with me.

Mr. Nolan Arenado – Arenado arguably had the most unexpected offensive breakout of the season, increasing his home-run totals from 10 in 2013, to 18 in 2014, and finally to an astonishing 42 in 2015. While his totals were probably slightly Coors-inflated, they were real for the most part: his average home-run distance was excellent, and 22 of his dingers came on the road. Arenado is young and likely to regress somewhat in the power department, but he is probably around to stay as a significant home-run threat. The formula was likely wrong on this one due to its weighting of prior seasons, so go ahead and make the lazy Todd Helton comparison.

Mr. Carlos Gonzalez – Though Arenado’s teammate had the highest home-run total (40) of his career in 2015, it isn’t clear that he was anywhere near his statistical peak. His wRC+ was six points below his career average, and he was a net below-average player overall. All of this leads to the conclusion that he was selling out for power, which makes sense given that he lost over fifty points of batting average and on-base percentage from his 2010-13 peak years. While a viable argument could be made that his “subpar” performance was due to injuries, a better one could be made that his home runs were in part a result of playing half his games at Coors Field, where he hit 60% of his round-trippers. The formula says he should have hit about seven fewer home runs, which may be a best-case scenario for next season given his penchant for injury. Additionally, while the Rockies are by no means full of talent, if Gonzalez continues his overall downward trend, he could get traded and lose the Coors advantage, or he could lose playing time.

Keep watch for a concluding piece in the next week. Criticism would be highly appreciated, but keep in mind that I’m still in high school and have yet to actually study statistics.


xHR%: Questing for a Formula (Part 2)

This is Part 2 of a series of posts regarding a new statistic, xHR%, and its obvious resultant, xHR; this article will examine formula 1. The primer, Part 1, was published March 4.

As a reminder, I have conceptualized a new statistic, xHR%, from which xHR (expected home runs) can be derived. Furthermore, xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season rather than what will happen or what actually happened. In searching for the best formula possible, I came up with three different variations, all pictured below with explanations.

HRD – Average Home Run Distance. The given player’s HRD is calculated with ESPN’s home run tracker.

AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium.

AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.

Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea. In cases where there isn’t available major league data, regressed minor league numbers will be used. If that data doesn’t exist either, then I will be very irritated and proceed to use translated scouting grades.

PA – Plate appearances
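Since the formulas themselves appear only as images, here is a purely hypothetical sketch of the general shape such a blend might take. The 2/1 prior-year weights, the 50/50 split, and the distance adjustment below are my own illustrative assumptions, not the article's actual formula:

```python
# Hypothetical sketch only: the article's real formulas are in the images
# above and are NOT reproduced here. This shows one plausible shape: prior
# seasons weighted toward recency, blended with a current-season rate scaled
# by home-run distance relative to league average.
def xhr_pct(y1_hr, y2_hr, y3_hr, pa1, pa2, pa3, hrd, ahrdl):
    prior = (2 * (y2_hr / pa2) + (y3_hr / pa3)) / 3   # illustrative 2/1 weights
    current = (y1_hr / pa1) * (hrd / ahrdl)           # HRD adjustment (assumed form)
    return 0.5 * prior + 0.5 * current                # illustrative 50/50 blend

# e.g. a hitter with 25/20/15 HR in three 600-PA seasons, HRD a bit above league:
rate = xhr_pct(25, 20, 15, 600, 600, 600, hrd=404, ahrdl=400)
xhr = rate * 600   # expected home runs over the same playing time
```

The point of the sketch is only that such a blend pulls the expected total (here about 22) back toward the prior seasons, which is why breakout years get discounted.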

(Apologies for my rather long-winded reminder, but if you really forgot everything from Part 1, then you should really invest in some Vitamin E supplements and/or reread the first post.)

The focus formula of this post is the first one, which also happens to be the one I think will work the least well because it relies too heavily on prior seasons to provide an accurate and precise estimate of what should have happened in a given season.

Because the second piece of the formula gives only fifty percent weight to the results from the season being studied, it likely fails to capture the fact that breakouts occur with regularity. As a result, it probably predicts stagnation rather than progress.

Methodology

Luckily for myself and the readers, the process was an incredibly simple one. Pulling data from FanGraphs player pages, ESPN’s Home Run Tracker, and various Google searches, I compiled a data set from which to proceed. From FanGraphs, I collected all information for Part Two of the formula, including plate appearances and home runs. Unfortunately, because a few of the players from the sample were rookies or had fewer than three years of major league experience, I had to use regressed minor league numbers. In some cases, where that data wasn’t applicable, I dug through old scouting reports to find translatable game power numbers based on scouting grades (and used a denominator of 600 plate appearances).

Then, from ESPN’s amazingly in-depth Home Run Tracker website, I obtained all relevant data for player home run distance, average home run distance for the player at home, and league average home run distance. Due to my limited time, I only used players that qualified for the batting title during the 2015 season, yielding an iffy sample of only 130 players. Additionally, before anyone complains, please realize that the purpose of my research at this point is only to obtain the most viable formula and refine it from there.

Results

Using Microsoft Excel, I calculated the resultant xHR% and xHR. Some key data points:

League Average HR% (actual):  3.03%

Average xHR%:  2.85%

Average Home Runs: 18.7

Expected Home Runs: 17.7

Please note that there is a significant amount of survivorship bias in this data. That is, because all of these players played enough to qualify for the batting title, they are likely significantly better than replacement level, which is why the percentages and home runs seem so high.

Clearly, the numbers match up fairly well, with this version of the formula expecting that the league should have hit home runs at a 0.18% lower clip, or about one fewer per player, a meaningful difference. Still, over the course of a 600 plate appearance season, the gap amounts to only a little more than one home run, an acceptable margin.

Correlation between xHR% and HR%: 0.960506092

R² for above: 0.922571953

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.3883746

Correlation between xHR and HR: 0.966224253

R² for above: 0.933589307

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.201355342

While xHR% using this formula apparently explains about 92% of the variance, correlation may not be the best method of determining whether or not the formula works adequately. This holds at least for xHR% and HR%: because the differences between their values are minuscule (though meaningful), a high correlation is nearly guaranteed, so it says little about the descriptive power I’m looking for. Nevertheless, it is important to note that the correlation is not a product of random sampling, as p < .005. Unsurprisingly, the standard deviation for xHR% is smaller than that of HR% (nearly insignificantly so), indicating that the formula clumps the data closer to the mean, a potentially good thing (in terms of regression).

A better indicator of the success of the formula is the correlation between xHR and HR, a relatively high value of ≈.97. Here, presumably because the separation between home runs and expected home runs is greater, the formula ostensibly explains approximately 93% of the variance. However, the standard deviation for actual home runs is about 10.4, while for xHR it’s about 9.2, suggesting that, after being multiplied out by plate appearances, xHR is spaced nearly as evenly as HR. Ergo, it likely serves as a decent predictor of actual home runs.

Players of Interest

Mr. Bryce Harper – It’s likely there isn’t a better candidate for regression according to this formula than Bryce Harper, who the formula says should have hit only 32 home runs as opposed to his actual total of 42. While he did lead his league in “Just Enough” home runs with 15, he’s also always been known for having prodigious power (or at least the potential for it). Furthermore, Mr. Harper dramatically changed his peripherals last season to ones more conducive to power: he increased his pull percentage from 38.9% to 45.4%, his hard-hit percentage from 32% to 40%, and his fly-ball percentage from 34.6% to 39.3%. On its own, each of those statistics lends credence to the idea that Harper changed his profile to a more home-run-driven one; taken together, they strongly suggest it. His season was no fluke, and the formula certainly failed him here because it weighted prior seasons far too heavily.

Mr. Brian Dozier – No surprises here. Mr. Dozier has certainly been trending upward for a long time, and in a model that heavily weights prior performance such as this one, upticks in performance are punished. Nevertheless, the data vaguely supports the idea that Dozier should have hit 24 home runs instead of 28. While he did significantly increase his pull percentage from 53% to an incredibly high 60%, he plays in a stadium of average difficulty for pull home runs as a right-handed hitter. Moreover, 10 of his 28 home runs were rated as “Just Enough” home runs, and his average home-run distance was 12 feet below average (admittedly not a huge number, nor a perfect way of measuring power). If I were a betting man, I’d expect him to hit 4-6 fewer home runs this coming season.

Keep watch for Part 3 in the coming days, which will detail the results of the other formulas. Something to watch for in this series is the issue that the results of the formula correspond too closely to what actually happened, which would render it useless as a formula.

Note that because I have never formally taken a statistics course, I am prone to errors in my conclusions. Please point out any such errors and make suggestions as you see fit.


Hardball Retrospective – The “Original” 1905 New York Giants

In “Hardball Retrospective: Evaluating Scouting and Development Outcomes for the Modern-Era Franchises”, I placed every ballplayer in the modern era (from 1901-present) on their original team. Accordingly, Vada Pinson is listed on the Reds roster for the duration of his career while the Red Sox declare Amos Otis and the Rockies claim Chone Figgins. I calculated revised standings for every season based entirely on the performance of each team’s “original” players. I discuss every team’s “original” players and seasons at length along with organizational performance with respect to the Amateur Draft (or First-Year Player Draft), amateur free agent signings and other methods of player acquisition.  Season standings, WAR and Win Shares totals for the “original” teams are compared against the “actual” team results to assess each franchise’s scouting, development and general management skills.

Expanding on my research for the book, the following series of articles will reveal the finest single-season rosters for every Major League organization based on overall rankings in OWAR and OWS along with the general managers and scouting directors that constructed the teams. “Hardball Retrospective” is available in digital format on Amazon, Barnes and Noble, GooglePlay, iTunes and KoboBooks. The paperback edition is available on Amazon, Barnes and Noble and CreateSpace. Supplemental Statistics, Charts and Graphs along with a discussion forum are offered at TuataraSoftware.com.

Don Daglow (Intellivision World Series Major League Baseball, Earl Weaver Baseball, Tony LaRussa Baseball) contributed the foreword for Hardball Retrospective. The foreword and preview of my book are accessible here.

Terminology

OWAR – Wins Above Replacement for players on “original” teams

OWS – Win Shares for players on “original” teams

OPW% – Pythagorean Won-Loss record for the “original” teams

Assessment

The 1905 New York Giants          OWAR: 69.9     OWS: 348     OPW%: .634

Based on the revised standings the “Original” 1905 Giants edged the Phillies, seizing the pennant by three games. New York led the National League in OWS and posted the highest all-time OWAR.

Cy Seymour’s tremendous offensive outburst transformed the Giants’ attack. Seymour paced the circuit in seven major categories including batting average (.377), hits (219), doubles (40), triples (21), RBI (121), SLG (.559) and total bases (325). A .303 lifetime batter, Seymour never led the League in any categories during his other 15 MLB seasons. Harry H. Davis (.285/8/83) topped the home run charts in four consecutive campaigns. Danny F. Murphy ripped 34 two-base knocks and swiped 23 bags. Art Devlin pilfered a League-high 59 bases in his sophomore season. “Wee” Willie Keeler contributed 42 sacrifice hits along with a .302 BA – the twelfth of thirteen straight seasons with a batting average above the .300 mark. Keeler posted a career BA of .341 and collected at least 200 base knocks per year from 1894-1901.

Christy Mathewson leads the All-Time Second Basemen rankings according to Bill James in “The New Bill James Historical Baseball Abstract.” Teammates listed in the “NBJHBA” top 100 rankings include Seymour (30th-CF), Keeler (35th-RF), Murphy (51st-2B), Devlin (58th-3B) and Davis (60th-1B).

LINEUP POS WAR WS
Willie Keeler RF 2.22 19.56
Danny F. Murphy 2B 4.04 25.62
Cy Seymour CF 10.32 40.54
Harry H. Davis 1B 4.1 26.45
Art Devlin 3B 3.74 21.67
Dave Zearfoss C -0.35 0.5
Charlie Babb SS -1.07 3.32
Ike Van Zandt LF/RF -1.73 3.69
BENCH POS WAR WS
Moonlight Graham RF -0.01 0
Offa Neal 3B -0.17 0.15

Christy Mathewson (31-9, 1.28) dominated opposition batsmen as he topped the charts in victories, ERA, shutouts (8), strikeouts (206) and WHIP (0.933). Excluding 1902, “Big Six” tallied at least 20 wins per season from 1901-1914. The Hall of Fame hurler registered a lifetime won-loss record of 373-188 with an ERA of 2.13. Red Ames whiffed 198 batters and furnished a 22-8 mark with a 2.74 ERA. Dummy Taylor fashioned a 2.66 ERA and compiled 16 victories. Hooks Wiltse contributed a 15-6 mark with a 2.47 ERA in 32 games (19 starts).

ROTATION POS WAR WS
Christy Mathewson SP 10.56 39.05
Hooks Wiltse SP 3.56 18.38
Dummy Taylor SP 2.04 14.76
Red Ames SP 1.75 17.71
BULLPEN POS WAR WS
Red Donahue SP -1.32 4.41


The “Original” 1905 New York Giants roster

NAME POS WAR WS General Manager Scouting Director
Christy Mathewson SP 10.56 39.05 John Brush
Cy Seymour CF 10.32 40.54 John Brush
Harry Davis 1B 4.1 26.45 John Brush
Danny Murphy 2B 4.04 25.62 John Brush
Art Devlin 3B 3.74 21.67 John Brush
Hooks Wiltse SP 3.56 18.38 John Brush
Willie Keeler RF 2.22 19.56 John Brush
Dummy Taylor SP 2.04 14.76 John Brush
Red Ames SP 1.75 17.71 John Brush
Moonlight Graham RF -0.01 0 John Brush
Offa Neal 3B -0.17 0.15 John Brush
Dave Zearfoss C -0.35 0.5 John Brush
Charlie Babb SS -1.07 3.32 John Brush
Red Donahue SP -1.32 4.41 John Brush
Ike Van Zandt RF -1.73 3.69 John Brush

Honorable Mention

The “Original” 1962 Giants    OWAR: 52.6     OWS: 355     OPW%: .589

The Giants engaged in fierce late-season combat with the Braves and the Reds. “The Say Hey Kid” and his San Francisco teammates emerged with a hard-fought victory. Willie Mays (.304/49/141) supplied career-bests in runs (130) and RBI yet finished runner-up in the 1962 NL MVP balloting. The twelve-time Gold Glove Award winner retired in 1973 with 660 home runs, 2062 runs scored and 3283 base hits. Orlando “Baby Bull” Cepeda mashed 35 long balls, amassed 114 ribbies and registered 105 tallies. Felipe Alou (.316/25/98) and Leon “Daddy Wags” Wagner (.260/37/107) merited their first All-Star invitations. Seven-time Gold Glove Award winner Bill D. White swatted 20 big-flies, drove in 102 baserunners and produced a career-best .324 BA. Eddie Bressoud drilled 40 doubles while third-sacker Jim Davenport (.297/14/58) earned an All-Star nod along with the Gold Glove Award. Juan Marichal began a string of 8 consecutive All-Star appearances in ’62. The “Dominican Dandy” amassed 18 victories, completed 18 of 36 starts and compiled a 3.36 ERA.

On Deck

What Might Have Been – The “Original” 1904 Phillies

References and Resources

Baseball America – Executive Database

Baseball-Reference

James, Bill. The New Bill James Historical Baseball Abstract. New York, NY.: The Free Press, 2001. Print.

James, Bill, with Jim Henzler. Win Shares. Morton Grove, Ill.: STATS, 2002. Print.

Retrosheet – Transactions Database

Seamheads – Baseball Gauge

Sean Lahman Baseball Archive


xHR%: Questing for a Formula (Part 1)

One of the most important developments in statistics — and its subordinate field, sabermetrics — is the use of multiyear data to produce an expected outcome in a given year. It’s an old concept, one that’s been around for centuries, but within sabermetrics it likely originated with Bill James. In Win Shares (arguably the birth of WAR, and the sabermetric response to Principia Mathematica), he details a procedure for finding park factors wherein the calculator uses a weighted average of several years of data, in conjunction with league averages, to find park factors for a certain ballpark.

Methods such as Mr. James’s allow the amateur sabermetrician (and even the mighty professional statistician) to determine what ought to have happened over a specific time period: essentially, a descriptive statistic. The best example of a descriptive statistic for the unlearned reader is xFIP, which basically describes what a pitcher’s fielding-independent average runs allowed would have been if the pitcher had a league-average home runs per fly ball rate.

Several statistics fluctuate greatly from year to year and are thus considered unstable. Examples include BABIP, HR/FB% for pitchers, and line-drive percentage. HR/FB% in particular is very fluid because all sorts of variables go into whether a ball leaves the park or not. For instance, on a particularly windy day, an otherwise certain dinger might end up in the glove of an expectant center fielder on the warning track instead of in the beer glass of your paunchy friend in the cheap seats. Rendered down, xFIP takes the uncontrollable out of a pitcher’s runs-allowed average.

With this, and an excellent article about xLOB% from The Hardball Times, in mind, I started developing my own statistic a few days ago. xHR%, as I dubbed it, attempts to find an expected home-run percentage, and from there one can easily find expected home runs (xHR) by multiplying xHR% by plate appearances, a more understandable idea to the casual baseball fan. In order to calculate this, I wrote several different (albeit very similar) formulas:

More likely than not, your eyes glazed over in that section, so I will explain.

HRD – Average Home Run Distance. The given player’s HRD is calculated with ESPN’s Home Run Tracker.

AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium.

AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.

Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea. In cases where major-league data isn’t available, regressed minor-league numbers will be used. If that data doesn’t exist either, then I will be very irritated and proceed to use translated scouting grades.

PA – Plate appearances

(For the uninitiated, HR% is HR/PA)

Essentially, what I have created is a formula that describes home-run percentage. First off, I used (.5)(AHRDH) + (.5)(AHRDL) in the denominator of the first part because a player spends half his time at home and half on the road. If I were so inclined, I could factor in every single stadium that gets visited, weight the average of them, and make that the denominator, but that’s just doing way too much work for a negligible (but likely more accurate) effect. Besides, writing that out in a formula would be a disaster because then there essentially couldn’t be a formula. Furthermore, having half of the denominator come from the player’s home stadium factors in whether or not the stadium is a home-run suppressor or inducer, which helps paint a more accurate picture of the player.

Dividing the player’s average HRD by (.5)(AHRDH) + (.5)(AHRDL) allows the calculator to get a good idea of whether or not the player was “lucky” in his home runs. If his average home-run distance is less than the average of the league and his home stadium, then it follows that he is a below-average home-run hitter and his home-run totals ought to be lower.

Since the values in the numerator and the denominator will invariably end up close in value to each other, I decided that this part of the formula could be used as the coefficient (as opposed to just throwing it out) because it will change the end number only slightly. Moreover, the xCo (as I call it) acts as a rough substitute for batted-ball distance and park dimensions in order to factor those into the formula.

The second part, the meat of the formula, uses a weighted average of multiple years of home-run-percentage data to help determine what should have been the home-run percentage in year one (the year being studied). Basically, it helps to throw out any extreme outlier seasons and regress them back a little bit to prior performance without stripping out everything that happened in that season (notice that in every formula the biggest weight is given to the season studied).

At this juncture, I cannot say for certain how much weight ought to be given to prior seasons. Obviously, a player can have a meaningful and lasting breakout season, with continued success for the rest of his career, making it inaccurate to heavily weight irrelevant data from a season two years ago. On the other hand, a player can have a false breakout, making it better to include more data from previous seasons. Undoubtedly that will be the subject of future posts. At present, the formula is a developmental one that will no doubt experience heavy changes in the future.
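
To make the description above concrete, here is a sketch of the formula in code. The year weights (0.60/0.25/0.15) are placeholder assumptions of mine, since, as noted, the proper weighting is still an open question; the variable names follow the definitions above.

```python
# Sketch of the xHR% formula as described in the text. The year
# weights are assumed (0.60/0.25/0.15); the exact weighting is open.

def x_hr_pct(hrd, ahrd_home, ahrd_league,
             y1_hr, y2_hr, y3_hr,
             y1_pa, y2_pa, y3_pa,
             weights=(0.60, 0.25, 0.15)):
    """Expected home-run percentage (xHR%).

    hrd         -- the player's average home-run distance (HRD)
    ahrd_home   -- average HR distance at his home park, Y1 (AHRDH)
    ahrd_league -- average HR distance in both leagues, Y1 (AHRDL)
    """
    # The coefficient ("xCo"): player HRD over a 50/50 blend of home-park
    # and league-average HR distance, since half his games are at home.
    x_co = hrd / (0.5 * ahrd_home + 0.5 * ahrd_league)
    # Weighted average of HR% (HR/PA), heaviest weight on Y1.
    w1, w2, w3 = weights
    hr_pct = w1 * y1_hr / y1_pa + w2 * y2_hr / y2_pa + w3 * y3_hr / y3_pa
    return x_co * hr_pct

def x_hr(xhr_pct, pa):
    """Expected home runs: xHR% times plate appearances."""
    return xhr_pct * pa
```

Note that a player whose average home-run distance exactly matches the blended park/league average gets an xCo of 1, so his xHR% is simply the weighted average of his past HR% figures.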

For the interested reader, some prior iterations of the formula are below:

As a reminder, with some small addenda, here is the explanation for each variable:

HRDY3 – Average Home Run Distance Year Three (year three being the oldest of the three years in the sample). HRD is calculated with ESPN’s home run tracker. HRDY2 and HRDY1 follow the same idea.

AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium by any player.

AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.

Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea. In cases where major-league data isn’t available, regressed minor-league numbers will be used. If that data doesn’t exist either, then I will be very irritated and proceed to use translated scouting grades.

PA – Plate appearances

(You should be initiated at this point, so figure out HR% for yourself.)

The reason these formulas were thrown out was that the xCo relied too heavily on seasons past to provide an accurate estimate. When I briefly tested this one on a few players, it delivered incredibly scattered results. Furthermore, there wouldn’t be any data available for rookies to use these iterations on because there’s no such thing as a minor-league or high-school home-run tracker (and if there were I probably wouldn’t trust it). The first formulas described are overall more elegant and more accurate.

Stay tuned for Part 2, when results will be delivered instead of postulations.


Using Recent History to Analyze Dee Gordon’s Defensive Improvement

Dee Gordon is a polarizing player. His all-speed, no-power approach on offense has both fans and projection systems divided on what to make of his bat. Is he an elite offensive second baseman? Is he a one-hit wonder that won’t be able to repeat his numbers from 2015? Reasonable people can really disagree on Gordon’s bat.

Reasonable people can also really disagree on Dee Gordon’s defense, and that’s where I intend to focus my analysis today. Dee Gordon led all second basemen with a 6.4 Ultimate Zone Rating (UZR), which means he was worth roughly six runs on defense compared to an average second baseman. That doesn’t sound too unreasonable, right? Here’s where things get interesting. Gordon, despite his obvious athleticism, had previously been considered a below-average defender, coming in with a -3.4 UZR last year at second base. He had been a massively below-average defender at shortstop (where he played a few years ago before moving to second base full-time in 2014), so there are years of data painting him as a minus defender relative to other middle infielders.

In 2015, Gordon’s advanced defensive metrics took a massive jump forward. Dee Gordon improved by nearly 10 runs according to UZR, which is roughly an entire win of difference thanks to his defense. Which defender is the real Dee — the one that flailed around in 2014, or the elite defender from 2015?

Let’s find some historical comparisons, and see what they can teach us about the repeatability of Dee Gordon’s defensive statistics.

We know Dee Gordon improved 10 runs defensively at second base to become one of the best defenders in the league at the position. Let’s take a look at the past 10 years and find all second basemen who improved by at least 10 runs in UZR from year to year and had a UZR of at least 5 in the improved year. There are 16 player seasons that meet these criteria. Excluding those who didn’t play enough innings to qualify at second leaves 11 player seasons. The numbers are presented below, along with the UZR each player recorded the season following his improved year.

Table of Dee Gordon Comparisons
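
The season-selection step can be sketched as follows, over hypothetical (player, season, UZR) records standing in for the actual second-base data:

```python
# Find "leap" seasons: UZR improved by at least 10 runs year over year
# and reached at least 5 in the improved year. The thresholds mirror
# the criteria in the text; the example records are hypothetical.

def find_leaps(seasons, min_jump=10.0, min_uzr=5.0):
    """Return (player, year, prior_uzr, leap_uzr, next_uzr) tuples."""
    by_player = {}
    for player, year, uzr in seasons:
        by_player.setdefault(player, {})[year] = uzr
    leaps = []
    for player, years in by_player.items():
        for year, uzr in sorted(years.items()):
            prior = years.get(year - 1)
            if prior is not None and uzr - prior >= min_jump and uzr >= min_uzr:
                # next_uzr is None when the following season isn't in the data
                leaps.append((player, year, prior, uzr, years.get(year + 1)))
    return leaps
```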

Among the second basemen of the last 10 years who made a big jump into the defensive elite, the players on average lost almost nine runs of UZR in the season following the leap. The group gave back about 60% of the improvement they had made, indicating that a big jump in UZR for a second baseman is unlikely to signal a new level of performance. Not a single member of the qualifying group improved his UZR again the following year, and only one, Placido Polanco in 2009, regressed by less than four runs.

However, there is a slight bright side. Only one member of the group had a UZR the year after “the leap” that was lower than before the improvement, indicating that a jump of over 10 runs of UZR almost certainly means a player has improved as a defender. It’s just not by nearly as much as the leap-year UZR would suggest; the players kept only about 40% of the improvement they made.

What does this mean for the Marlins’ speedy second baseman? While Dee Gordon’s huge jump in UZR this year means he’s almost certainly a better defender than he was two years ago, the improvement to his talent is likely only modest and not nearly what you would hope for after his great 2015 defensively. To those who pointed to Dee Gordon’s greatly improved UZR this season as a reason to believe he’s made big strides as a defender, I’ll sadly have to point out that we can expect Dee Gordon to return much closer to the mediocre defender he was in 2014 than the star he was in 2015.


The Best Bets for Over/Under Team Win Totals

Typically, projections and conjecture about the upcoming baseball season serve the general purpose of piquing your interest. However, sometimes they are good for making money. In this instance, here are some gambles you can make based on the Atlantis Race and Sports Book. 

This article was written on February 28, 2016 and the initial lines from this Fox Sports article were published on February 12, 2016.

The team win projections referenced are some basic (keyword, “basic”) projections I made for this season. 

  1. Colorado Rockies — Over 68 1/2 Wins, -110

The projection for the Rockies is shockingly bullish at first glance. But take a step back and put it in context. The Rockies gave up 844 runs last year, the most in MLB. This year they are projected to surrender 757, or 87 fewer runs, an improvement of more than half a run per game.

This is not ridiculous considering what you can expect from their pitching staff. They will have a full season from a maturing Jon Gray, and they bolstered their bullpen with Jason Motte, Chad Qualls, and Jake McGee. These highlights may not be awe-inspiring, but they don’t need to be. The 757 projected runs against is still the worst in the NL. The projection doesn’t signify that the Rockies are good; it signifies they are not as bad as last year.

The Rockies offense is projected to keep chugging along, with 761 runs scored, which would be the ninth-lowest total for a Rockies team from 1995–2015 and only 24 runs more than last year’s club. It’s not all that extreme.

You don’t need to buy into the projections to view this as a good bet. You just need to buy into the idea that the Rockies are better than they were last year (when they won 68 games). The Rockies are the best bet at the dawn of spring training.

  2. Chicago Cubs — Over 89, -110

A pessimist may ask some of the following questions of the Cubs: (1) It’s the Cubs. Will they find some way to blow it? (2) Will Jake Arrieta be able to carry over his performance of the past season and a half? (3) Will Kris Bryant and Kyle Schwarber suffer a decline in performance now that the league has had an off-season to study their strengths and weaknesses?

A pessimist would probably have more questions along these lines, but a pessimist would have more of these types of questions about other teams. So, don’t be a pessimist; play the odds, particularly if you’re betting. The odds say the Cubs are the best team in the league.

You may not want to bet on the Cubs’ projected win figure of 100, but it seems foolish to not bet on 90+ wins. Teams can be ravaged by injuries (see 2015 Washington Nationals) and teams can be ravaged by bad luck, but don’t let the world of possibilities cloud the virtue of probabilities. The probability that the Cubs win over 90 games for the second year in a row is greater than the pessimistic possibilities that may be (but probably aren’t) dancing through your head.

  3. Los Angeles Dodgers — Over 87, -115

How much can one man be vilified? Snark surrounded Andrew Friedman and the Dodgers’ offseason, beginning with the departure of Zack Greinke. It continued as the Dodgers added more starting pitchers to their pitching staff than they did former general managers to their front office staff. But that’s okay. You know better, don’t you?

This writer is hard-pressed to think of a team so well-equipped to survive the maladies and booby traps that a major-league team may encounter in a trek through a 162-game season (well, all but Clayton Kershaw’s arm falling off). They have a cadre of infielders (Kendrick, Turner, Utley, Seager, Guerrero) and outfielders (Puig, Pederson, Ethier, Crawford, Van Slyke, Thompson), plus Enrique Hernandez, essentially baseball’s equivalent of a utility knife. As suggested in the first paragraph, the Dodgers’ positional depth may only blush when it encounters the depth of their own pitching staff.

If you doubt the Dodgers, you may be the kind of person who’d choose a wallet with a $100 bill over another with ten $20 bills. But don’t fear: if you did, you can turn that $100 into $187 by betting on the Dodgers to win more than 87 games this year.

If you’re still unsure, you should have chosen the wallet with ten $20 bills. You wouldn’t need to gamble at all if you did that.

  4. Washington Nationals — Over 87, -115

I will not blame you if you begin to feel a greater degree of uncertainty at this point. The luster may have come off the Nationals last year, but don’t you believe they could be re-polished? It’s feasible that the Mets and Nationals (and maybe the Marlins) battle in the mid-80s for the NL East crown, but it’s more likely that the division winner walks away with more than 90 wins, and that the Nationals are the team to clear that bar.

You may not want to bet on the health of Stephen Strasburg, Anthony Rendon, and Jayson Werth. Or, you may just want to bet. If the latter is the case, the Nationals are a good bet; not a sure bet. But what is a sure bet? The Nationals’ biggest offseason splash was Daniel Murphy, but their most effective offseason acquisitions likely went under the radar. They bolstered their bullpen with the additions of Shawn Kelley, Oliver Perez, Yusmeiro Petit, and Trevor Gott. They also have a farm system that can (1) patch holes this year (Lucas Giolito) and (2) be used to acquire talent to fill any other holes through trade.

Oh, and Dusty Baker is their manager. You can feel how you want about that, but that means Matt Williams isn’t their manager this year and there’s only one way to feel about that.

  5. Kansas City Royals — Under 87, -115

Let’s establish two things: (1) the projected wins are low, and (2) the universe may haunt you for making this bet.

Disregard the universe for the moment. The Royals should be the favorites to win the AL Central. I don’t state that in a hypothetical way. There is no team in the AL Central that is so good that you should expect them to overcome the Royals’ Black Magic. But, for purposes of this exercise, ask the important question: Is the Royals’ Black Magic so good that it will propel them to win more than 87 games? I think not.

As with the Nationals, I wouldn’t take my last $115 and make this bet, but if you want to bet on, say, five over/under win totals for MLB teams, I would make this your fifth bet. But realize: you’re not making a bet on the performance of a baseball team; you’re making a bet on the rhythms of the universe.

If you’re hesitant to bet on the universe, here are some other reasonable (but not as reliable) choices:

6. Boston Red Sox — Over 85 1/2 Wins, -105

7. Toronto Blue Jays — Under 87 Wins, -110

8. Texas Rangers — Under 86 Wins, -110

9. Detroit Tigers — Under 85 Wins, -115

10. Baltimore Orioles — Under 80 1/2 Wins, -110


Using WAR to Project Wins by Team and by Team Position

When I think of WAR, I tend to think of it truly in terms of wins.  So when I see that a player is rated an 8 WAR player, to me I’m literally thinking this guy will get my team approximately eight additional wins.  Otherwise we should really just rename this “best player metric.”  Not that anything is wrong with a best player metric, but let’s not try to “connect” it to wins if it’s not really connecting to wins, right?  I wanted to see how accurate this really is, so I downloaded the team WAR data from FanGraphs from 1985–2013, both hitting and pitching, summed up the hitting and pitching WAR, and plotted it against the teams’ wins each year, hoping for a strong correlation.

You can see from the chart above that a correlation of 0.7525 was recorded. Great! This also shows a replacement-level team is about a 46.5-win team.  Not unreasonable. Things make sense.

So then I figured, maybe we could try the same drill, but instead of using complete team calculations, what if we used individual position components?  Would that be more accurate?  It’s possible, since the sum of a team’s individual player WAR values is not necessarily the same as the team-level WAR calculation.  To find out, I went to FanGraphs again and downloaded the same dataset, except by position this time instead of by team.  For example, I’ve linked the catcher data below.

I went through and built a comprehensive list, tagging each player’s position.  For pitchers the FanGraphs link was comprehensive, so I tagged anybody whose starts made up more than 75% of his games as an SP, and everyone else as an RP.  In some cases players showed up in multiple categories (e.g., Mike Napoli was listed as a C and 1B in 2011).  In those events, I simply split their total seasonal WAR evenly across however many positions they appeared at.  So if a 6-WAR player showed up as a C, 1B, and DH in a single season, each position was credited with 2 WAR. This prevented double- or triple-counting of players.  So how did this work out?

This actually projected slightly better. I do mean slightly — a 0.7559 R2 versus the 0.7525 R2 when viewed as just team hitting and pitching.  It also predicted basically the same replacement-level team, a 46-win one.  So you could probably make the argument that it’s slightly more accurate to use the sum of the individual player WARs on a team instead of a single team calculation.  But it is so close that it’s probably not worth the extra effort for most exercises.
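
The position-tagging and WAR-splitting rules described above can be sketched as:

```python
# SP/RP tagging and even WAR splitting, per the rules in the text.

def pitcher_role(games, games_started):
    """Tag a pitcher as SP when starts exceed 75% of his games."""
    return "SP" if games_started / games > 0.75 else "RP"

def split_war(war, positions):
    """Split a season's WAR evenly across a player's listed positions,
    so nobody is double- or triple-counted."""
    share = war / len(positions)
    return {pos: share for pos in positions}
```

The Napoli-style case from the text (a 6-WAR season listed at C, 1B, and DH) credits each position with 2 WAR.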

This then led me to think: why not tie wins in with a multi-variable regression using all the positions individually, instead of a linear fit connecting wins to a single WAR total?

Since I already had the data, I gave it a shot.

You can see here that we actually arrive at an R2 of a bit above 76%.  So this is ever so slightly more predictive again.  Again you also see that the intercept ends up very close to other methods, at 45.4 Wins for a replacement-level team.  But bottom line, it’s basically as accurate as the other approaches.  However, what I do find interesting in this approach is that it actually appears to value RP highest and the SS position the lowest.  And those values are substantial. Very substantial.
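
For readers who want to reproduce the approach, here is a minimal sketch of the multi-variable fit: ordinary least squares of team wins on per-position WAR totals. The numbers in the example are made up; the actual analysis used FanGraphs team data from 1985–2013.

```python
import numpy as np

def fit_wins(position_war, wins):
    """OLS fit of wins on per-position WAR.

    position_war -- (n_teams, n_positions) array of WAR by position
    wins         -- (n_teams,) array of team wins
    Returns (intercept, per-position coefficients).
    """
    # Prepend a column of ones so the fit includes an intercept
    # (the implied replacement-level win total).
    X = np.column_stack([np.ones(len(wins)), position_war])
    beta, *_ = np.linalg.lstsq(X, wins, rcond=None)
    return beta[0], beta[1:]
```

In this framing, the intercept is the replacement-level team, and a position whose coefficient lands well below 1.0 (as the SS coefficient does here) is one where a marginal WAR buys fewer actual wins.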

You could probably make the argument, then, that shortstops are being overvalued by the present system. This could mean the defensive position adjustment for SS is too high.  Reasons aside, this seems like a legitimate finding, as the WAR metric appears to overstate SS value by 26.7% (1/0.789).  For example, a typical FanGraphs contract analysis uses a standard $/WAR value for projections into the future; from this perspective, spending that $/WAR on a SS will have you significantly overweighting the benefit you’ll get from that SS.  To a lesser extent, that would also apply to 2B, CF and RF.

Conversely, RP, SP and catcher figures are actually quite undervalued.  This would certainly lend some credence to the approaches of “smaller” and “rebuilding” teams to date (think Royals and Astros, even last year’s Yankees), who have focused, among other things, on RP groups.

Based on this data, it would seem that focusing on pitching, specifically RP, and getting an excellent catcher would be the best ways to turn a team around.  At least in the context of a singular $/WAR metric.

While this wasn’t what I went into this analysis looking for, it was a fairly surprising result. Yet it seems to be in line with the approach many teams are currently taking.

NOTE: I do understand this could be refined further by weighting each player’s WAR by his actual number of games at each position, instead of distributing it equally as I did.  Given the size of that sample and the magnitude of the change involved, I find it unlikely that would move the needle substantially. But I think it’s an interesting finding.

The Sea Breeze Might Be Suppressing Homers at Petco Park

Land and water respond differently to heat: land heats up and gives its heat back quickly, while water does so slowly. By the afternoon, the heat rising off the land creates an area of low pressure, which draws in the cooler air sitting on the surface of the ocean. Cool air rushes into the coasts by mid-to-late afternoon.

Petco Park is less than one mile from the Pacific Ocean, making it susceptible to these afternoon sea-breeze gusts, which tend to pick up in the springtime and fade in the summer. Fortunately, the ballpark is situated east of Coronado Island [1], which helps to buffer the would-be stronger sea breezes that might affect fly balls. The springtime gusts, the Coronado Island buffer, and the “effect” on fly balls are all hearsay. We’ll look closer at each of these, starting with the sea breezes at the ballpark.

The Wind Matters

Let’s take a closer look at how the wind affects fly balls at Petco Park. Not that the common word of the good people of San Diego can’t be trusted; it’s just a matter of science. Below is a graph of every home run hit at Petco Park over the last two years against the approximate wind speed when each was hit. It seems like there’s no correlation between wind speed and home-run distance.

http://i.imgur.com/VM9UQ87.png

However, not all wind is created equal, so the directional changes of the wind might have some influence on the flight of the ball. In the 2014 and 2015 seasons, the directional path of the wind for 261 home runs was registered (the wind was either “calm”, “variable”, or “NNE” which registered in only one case).

http://i.imgur.com/2MKKEgK.png

Most home runs were hit while the wind was blowing in the west-northwesterly (WNW) direction. Given that center field is due north of home plate, that would mean a majority of the wind is probably blowing over the Western Metal Supply Co. brick building. My guess (I’m not a meteorologist) is that the wind is drawn in from the ocean, over the top of Coronado Island. Here’s a bird’s-eye view of Petco; the arrow indicates where the wind is coming from – the WNW direction from home plate.

http://i.imgur.com/VwKTKCr.png

So, this raises the question: How does WNW wind affect the distance of home runs? If we look only at the 101 home runs hit while the wind was blowing from the WNW, we begin to see something going on (r = -.21, p = .04). For every 1.53 mph faster the wind blows from the WNW direction, one foot is lost from every home run hit (R2 = .04, p = .04, n = 101).

http://i.imgur.com/BbTGQp4.png

No other individual wind direction registered a significant influence on the distance of home runs hit, nor did the combination of every other wind direction have any effect. So much for the Coronado Island buffer.

It’s reasonable to speculate that home runs hit in some directions (left, right, center) might be more or less affected by the WNW wind. However, the direction a home run was hit in had no effect on the relationship between its distance and the wind speed. Exit velocity (the speed of the ball off the hitter’s bat) is an obvious predictor of home-run distance, and it showed its weakest correlation with distance for home runs hit in the WNW wind as compared to every other direction [2]. It’s likely that a lower exit velocity means the ball spent more time in flight, and was thus more susceptible to the WNW winds that suppressed its total distance, regardless of the direction it was hit.
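
The single-variable fit behind those numbers is an ordinary least-squares regression of home-run distance on wind speed. The four data points below are made up purely to illustrate the mechanics at the reported slope (one foot lost per 1.53 mph, or roughly -0.65 ft/mph); the actual fit used the 101 WNW home runs.

```python
# Plain least-squares fit: slope, intercept, and Pearson r.

def linreg(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5
    return slope, my - slope * mx, r
```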

Addressing the hearsay

Wind direction and wind speed were recorded ten minutes before every hour of every home game for the last two seasons [3,4]. No surprise, WNW winds dominate during the course of every home game.

http://i.imgur.com/XHT7nn6.png

Wind speed does seem to be higher in the afternoon as compared to the evening, peaking in the late afternoon.

http://i.imgur.com/1Ao9NQe.png

Additionally, May tends to have the strongest winds, but July and August have produced stronger winds than April. The theory that the spring is windier than the summer isn’t entirely true, but the spring does contain the windiest month of the regular season (May).

http://i.imgur.com/DXduBr2.png

Why does this research matter?

Obviously, the pitcher and the batter are going to matter most. But the WNW wind explains about 4%–5% of the variance in how far a home run travels (R2 = .044). If you’re the Padres and you play 81 home games a year, 4%–5% might mean something to you [5].

Here’s a crazy idea: let’s say you’re the Padres, you’re playing an afternoon (3pm – 5pm) game, and the winds are blowing in from the WNW (at least 22 home games in the 2016 season will be played between 3pm and 5pm). If it’s early in the game, start Carlos Villanueva, who has a career 40.4% FB%, and if it’s later in the game, use Jon Edwards, who had a 67.6% FB% in 52 innings between AAA and the majors last season. Meanwhile, give Matt Kemp (career 36% FB%) a break and platoon rookie Travis Jankowski, who showed a 27% FB% in 34 games last year with the Padres.

Caveats

Why did I only choose the last two years? Wind patterns and sea breezes can change over time [6]. If we rewind the years, we may or may not see similar results. I felt the last two years gave a decent idea of what we can expect in 2016; any further back and I might have run into a different profile. Don’t agree with these results? Add a few years, and let’s see if the trend holds — I’m all for more objectivity.

Yes, sea breezes can entail the “marine layer,” which brings a body of cool, moist air into the ballpark, and I might take a look at that in my next article. However, it’s not the moisture that will suppress home runs — it’s the cool air. Warm air expands and lowers the air density, which results in less resistance on the baseball; the cooler the air, the higher the density. Water vapor (H2O) is less dense than atmospheric O2 and N2, so if there’s more moisture in the air, we’d see less resistance on the baseball [1]. Temperature, dew point, humidity, and pressure had no effect on the distance of home runs between 2014 and 2015.

[1] http://www.sandiegouniontribune.com/news/2011/jun/01/marine-layer-formidable–faraway-fences/

[2] Of the 4 directions that reported significant effects: North Northwest (r = .674, p < .01, n = 16), Northwest (r = .473, p < .01, n = 45), West Northwest (r = .393, p < .01, n = 101), West (r = .591, p < .01, n = 36)

[3] http://www.weatherforyou.com/reports/index.php?forecast=pass&pass=archive&zipcode=&pands=petco+park%2Ccalifornia&place=petco+park&state=ca&icao=KSAN&country=us&month=04&day=28&year=2015&dosubmit=Go

[4] https://www.wunderground.com/history/airport/KSAN/2016/02/23/DailyHistory.html?req_city=San%20Diego&req_state=CA&reqdb.zip=92101&reqdb.magic=1&reqdb.wmo=99999

[5] Quality of batter and/or pitcher was not tested in a multiple regression model, nor were any other predictor variables beyond wind speed. 

[6] See Coors Field effect: http://m.mlb.com/news/article/45755012/with-subtle-changes-to-dimensions-padres-hope-petco-park-plays-fair


2015: A Season of Unprecedented Parity In the American League

Background: When the 2015 season ended, I remarked to myself that there seemed to be a great amount of parity in the American League this year. So I decided to see whether that was just my faulty impression, or if it was indeed a closer race than in years past.

Methodology: I decided to use variance in win percentages among teams in each season to define parity, with a lower variance equating to more parity.

Variance is a measure of the spread of a dataset. It is calculated as follows:

variance = (1/N) * Σ (x_i − μ)²

where N = population size, μ = population mean, and x_i = the i-th data entry.
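
In code, the parity measure is just the population variance of a season’s win percentages; a quick sketch:

```python
# Population variance of a list of team win percentages; lower
# variance means more parity.

def parity_variance(win_pcts):
    n = len(win_pcts)
    mu = sum(win_pcts) / n
    return sum((x - mu) ** 2 for x in win_pcts) / n
```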

I took my dataset from baseball-reference.com and used Python scripts to modify the raw data into a cleaner .csv format, so that I could run analysis in R.

The 2015 season had the lowest variance (.001836222) in win percentage of any season in the history of the American League (1901-2015).

Here is a time plot of the variances across seasons:
timeplotvariance
On the left, 0 is 2015 and it increases by one season as the graph goes to the right.

Conclusion: 2015 was in fact the season with the most parity all-time in the American League.

The American League season with the worst parity? Go back to 1932, when the Babe Ruth- and Lou Gehrig-led Yankees won 107 games and the Boston Red Sox lost 111. (Variance = 0.01710932)