Archive for Research

Using Statcast Data to Measure Team Defense

As I’m sure you all know, Statcast allows us to measure the launch angle and velocity for each batted ball. These measurements afford us the ability to estimate precisely the expected wOBA value of every batted ball. Due to the skills of the opposing defense (as well as, admittedly, factors like luck, weather, and ballpark quirks), these estimated wOBA values are often drastically different from their actual values. That is the idea behind Expected Runs Saved (xRS), a metric that I have created to measure team defense. What follows is a discussion of the xRS methodology and some results.

The methodology: The calculation of xRS is actually quite simple. I started by downloading Statcast data from Opening Day through August 29th using Python’s pybaseball module. I then created a dataset consisting of all fair batted balls (excluding home runs) during that time frame. Conveniently, the downloaded data already has the expected wOBA value (based on exit velocity and launch angle), and the actual wOBA value (based on the outcome of the play) for each batted ball. Since we want to penalize teams for making errors, I changed the actual wOBA values for errors from 0 to 0.9 (the value of a single). Then all we have to do is take the average of each metric by team, find the difference, convert that to run values, and we have Expected Runs Saved.
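The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record keys (`fielding_team`, `event`, `xwoba`, `woba`), the event labels, and the sample rows are hypothetical stand-ins for the downloaded Statcast columns, and the wOBA-to-runs divisor of 1.2 is an assumed placeholder, since the post does not say which conversion was used.

```python
from collections import defaultdict

SINGLE_WOBA = 0.9  # value assigned to errors, per the methodology above
WOBA_SCALE = 1.2   # ASSUMPTION: divisor for converting wOBA to runs (not given in the post)


def expected_runs_saved(batted_balls, woba_scale=WOBA_SCALE):
    """Return {team: xRS} from an iterable of batted-ball records.

    Each record is a dict with hypothetical keys:
    fielding_team, event, xwoba (expected), woba (actual).
    """
    totals = defaultdict(lambda: {"n": 0, "xwoba": 0.0, "woba": 0.0})
    for ball in batted_balls:
        if ball["event"] == "home_run":
            continue  # home runs are excluded from the dataset
        # Penalize errors by crediting the batter with a single's value.
        actual = SINGLE_WOBA if ball["event"] == "field_error" else ball["woba"]
        team = totals[ball["fielding_team"]]
        team["n"] += 1
        team["xwoba"] += ball["xwoba"]
        team["woba"] += actual

    results = {}
    for name, t in totals.items():
        # Average expected minus average actual wOBA allowed on contact.
        avg_gap = t["xwoba"] / t["n"] - t["woba"] / t["n"]
        # ASSUMPTION: scale by batted balls and divide by the wOBA scale to get runs.
        results[name] = avg_gap * t["n"] / woba_scale
    return results


# Made-up sample rows, not real Statcast data.
sample = [
    {"fielding_team": "A", "event": "field_out", "xwoba": 0.3, "woba": 0.0},
    {"fielding_team": "A", "event": "single", "xwoba": 0.5, "woba": 0.9},
    {"fielding_team": "A", "event": "field_error", "xwoba": 0.4, "woba": 0.0},
    {"fielding_team": "A", "event": "home_run", "xwoba": 1.9, "woba": 2.0},
    {"fielding_team": "B", "event": "field_out", "xwoba": 0.6, "woba": 0.0},
    {"fielding_team": "B", "event": "field_out", "xwoba": 0.2, "woba": 0.0},
]
xrs = expected_runs_saved(sample)
```

A positive value means the defense allowed less than the contact quality predicted; a negative value means it allowed more.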

Note that xRS is quite a bit more simplistic than UZR or DRS, as it doesn’t include any of the defensive value derived from keeping baserunners from taking the extra base, preventing steals, turning double plays, etc. While these surely play a role in run prevention, they are less important than converting batted balls into outs, and since I have a full-time job I decided to keep it simple and ignore them.

The results: Let’s start with the most obvious question: which team has the best defense?

It’s the Angels, and it’s not particularly close. While their pitchers have allowed a lot of hard contact (.323 batted-ball xwOBA, 28th in baseball), their actual wOBA on contact is 2nd in baseball at .291, trailing only the Dodgers (.284), who, as Jeff Sullivan recently noted, excel at inducing weak contact.

On the opposite end of the spectrum are the Blue Jays, who have been generally good at generating weak contact (.305 batted-ball xwOBA, 5th in baseball) but terrible at converting those weakly hit balls into outs (.322 batted ball wOBA, 28th in baseball).

In both cases UZR tends to agree, ranking the Angels and Blue Jays 1st and 27th, respectively. Due to (I think) the simplicity of the model, the run values for xRS are quite a bit more extreme than those of either UZR or DRS, but it ranks the teams in generally the same order. At the very least, xRS doesn’t disagree with UZR and DRS much more than the latter two disagree with each other.

Two teams that xRS likes a lot more than UZR and DRS are the Mariners (2nd in xRS, 11th in UZR, 15th in DRS) and Yankees (4th in xRS, 13th in both UZR and DRS). Meanwhile, it dislikes the Dodgers (12th in xRS, 3rd in UZR, 1st in DRS) relative to the other metrics, as well as the Reds (28th in xRS, 5th in UZR, 4th in DRS). Why is this happening? I really don’t know. Could be some defensive components I have left out of xRS, could be ballpark effects, or it could just be that defensive metrics are weird. It remains a mystery. Such is baseball, and such is life.


Predicting the Playoffs

By Dr. Gregory Wood and David Marmor

Among the sabermetric community, the baseball postseason has the reputation of being random. In the past 20 years from 1996-2015, the predicted winner — i.e. the team with the best season record — won the World Series only four times. This raises the question as to what specific skills and performances of a team during a season have a meaningful, if any, correlation with postseason success. This study analyzed data from every playoff team from 1996-2015 to search for significant relationships that could be used to predict postseason wins.

The first method that I used was looking for linear correlations between regular-season statistics and various measures of postseason success. If some statistics were more correlated to playoff success, they could be used to predict a team’s playoff performance.

The most obvious place to start was regular-season wins. As I had expected, there was very little correlation between regular-season wins and postseason wins.

In the graph below, every playoff team’s regular-season wins are plotted against their playoff wins. The data has an extremely low correlation coefficient and is not a good fit with the trend line. The correlation coefficient was 0.007, far below the 0.6 or higher usually taken to indicate a meaningful relationship. It appears that regular-season record is not a significant factor in postseason success, which is consistent with the view that postseason success is random.
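For readers who want to run this kind of check themselves, here is a minimal sketch of the linear (Pearson) correlation coefficient and its square. The win totals below are invented for illustration and are not the study's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (regular-season wins, playoff wins) pairs.
season_wins = [88, 90, 92, 95, 97, 100]
playoff_wins = [3, 11, 0, 5, 1, 7]

r = pearson_r(season_wins, playoff_wins)
r_squared = r ** 2  # share of variance explained by a straight-line fit
```

An R-squared near zero means a straight line through the points explains almost none of the variation in playoff wins.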

[Figure: regular-season wins vs. playoff wins]

The goal was to find another statistic that had a significantly stronger correlation to playoff success. I studied many other statistics including runs, runs allowed, ERA, hits and hits allowed, home runs and home runs allowed, walks and walks allowed, strikeouts and strikeouts allowed, slugging percentage, and on-base percentage.

For each one I plotted the correlation chart and found the coefficient of correlation assuming a linear relationship. However, the R-squared term was always very small no matter what I tried. This was true even with statistics that are vital to regular-season success, like ERA, OBP, runs and runs allowed.


I looked at both the actual totals as well as the totals adjusted for that year’s league average. That way I could account for the fact that the total runs scored has varied quite a bit over the 20 years.

I also tried defining playoff success in three different ways: playoff wins, playoff series won, and playoff winning percentage. However, I got similar results no matter which method I used. None of them had correlations that were significant either way. The statistic that correlated best to playoff wins was run differential, but even it was too weak a correlation to be meaningful.

[Figure: run differential vs. playoff wins]

The R-squared is still very small, so run differential is not a good predictor of post-season success. This method seems to suggest that the playoffs are in fact random. However, while each statistic individually was not strongly tied to playoff success, maybe combinations of them were.

To find combinations that might be meaningful, I tried using linear modeling. I used a computer program to find the best-fit line between playoff success and the regular-season statistics I was using. The model adjusted the weight given to the different factors to try and find results that were closest to what actually happened by minimizing its chi-squared term. The advantage of this method was that it could combine several factors at once. That way it could determine if there were certain factors that were important in playoff play.

The program was designed to run thousands of simulations at a time to try and improve on its previous best result by minimizing its error compared to the actual results. For each run I selected which statistics would be used. I could give the simulation different starting assumptions and set ranges for how much weight each category could be given. When the initial conditions were changed, the simulation would return different results. However, it was never able to find a result that was statistically significant. The best coefficient of correlation I found was 0.063, far below the level that implies correlation.
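The post does not name the program or its exact search procedure, so the following is only a rough sketch of one way such a search could work: a seeded random search over weights that keeps the best result so far. Plain squared error stands in for the chi-squared term (they coincide when every observation has unit variance), and the function names, weight range, and toy data are all hypothetical.

```python
import random


def squared_error(weights, rows, targets):
    """Sum of squared differences between weighted-stat predictions and targets."""
    total = 0.0
    for features, target in zip(rows, targets):
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - target) ** 2
    return total


def random_search(rows, targets, weight_range=(-1.0, 1.0), trials=5000, seed=0):
    """Try random weight vectors inside weight_range and keep the best one."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    best_weights = [0.0] * n_features  # starting assumption: no stat matters
    best_error = squared_error(best_weights, rows, targets)
    for _ in range(trials):
        candidate = [rng.uniform(*weight_range) for _ in range(n_features)]
        error = squared_error(candidate, rows, targets)
        if error < best_error:  # improve on the previous best result
            best_weights, best_error = candidate, error
    return best_weights, best_error


# Toy data: two made-up, already-normalized stats per team and a made-up target.
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
targets = [0.5, -0.25, 0.25]

baseline_error = squared_error([0.0, 0.0], rows, targets)
best_weights, best_error = random_search(rows, targets)
```

Changing `weight_range`, `trials`, or `seed` plays the role of the different starting assumptions described above.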

It seems that the sabermetric community is correct. Playoff performance is random and not predictable by regular-season performance. Therefore, teams should attempt to build the best regular-season team they can and hope to then get lucky in the playoffs, as opposed to trying to plan specifically for the playoffs.

Appendix

[Figure: runs vs. playoff wins]

[Figure: runs allowed vs. playoff wins]

[Figure: home runs vs. playoff wins]

[Figure: batting average]


dScore: End of August SP Evaluations

I covered the starters version of dScore in an earlier post, so I’m not going to revisit it here. I’ll just jump right in with the list!

Top Performing SP by Arsenal, 2017
Rank Name Team dScore +/-
1 Corey Kluber Indians 69.41 +2
2 Max Scherzer Nationals 62.97 -1
3 Chris Sale Red Sox 56.82 -1
4 Clayton Kershaw Dodgers 55.26 +1
5 Noah Syndergaard Mets 47.39 +2
6 Stephen Strasburg Nationals 47.24 +5
7 Danny Salazar Indians 43.46 +16
8 Randall Delgado Diamondbacks 42.00 +1
9 Luis Castillo Reds 37.99 +5
10 Alex Wood Dodgers 40.72 -8
11 Zack Godley Diamondbacks 39.55 -1
12 Luis Severino Yankees 39.24 +1
13 Jacob deGrom Mets 36.69 -1
14 Dallas Keuchel Astros 37.37 -8
15 James Paxton Mariners 35.81 +1
16 Carlos Carrasco Indians 34.23 +4
17 Sonny Gray Yankees 30.59 UR
18 Brad Peacock Astros 29.98 +6
19 Lance McCullers Astros 32.18 -11
20 Buck Farmer Tigers 31.31 UR
21 Nate Karns Royals 30.21 -2
22 Zack Greinke Diamondbacks 29.45 -4
23 Charlie Morton Astros 28.55 UR
24 Kenta Maeda Dodgers 27.40 -7
25 Masahiro Tanaka Yankees 26.83 -3

 

Risers/Fallers

Danny Salazar (+16) – dScore never gave up on him, despite him being absolute trash early on this year. He came back and dominated, launching him up the ranks even farther in the process. Current status: injured. Again.

Sonny Gray (newly ranked) – If there were any doubts about the Gray the Yankees dealt for, he’s actually surpassed his dScore from his fantastic 2015 season. He’s legit (again).

Alex Wood (-8) – Looks like the shoulder issues took a bit of a toll on his stuff, but dScore certainly isn’t out on him.

Dallas Keuchel (-8) – Keuchel’s stuff isn’t the issue. He’s still a buy for me.

Lance McCullers (-11) – Poor Astros. Maybe not too poor though; their aces have gotten hammered but haven’t fallen far at all. McCullers is going to bounce back.

 

The Studs

Some light flip-flopping at the top, with Kluber taking over at #1 from Scherzer. The Klubot’s been SO unconscious. Everyone else is pretty much the usual suspects.

 

The Young Breakouts (re-visited)

Zack Godley (11) – He’s keeping on keeping on. He barely moved since last month’s update, and I’m all-in on him being a stud going forward.

Luis Castillo (9) – He’s certainly done nothing to minimize the hype. In fact, he’s added a purely disgusting sinker to his arsenal and it’s raising the value of everything he throws. Also, from a quick glance at the Pitchf/x leaderboards, two things stand out to me. He seems to have two pitches that line up pretty closely to two top-end pitches: his four-seamer has a near clone in Luis Severino’s, and his changeup is incredibly similar to Danny Salazar’s. That’s a nasty combo.

James Paxton (15) 

 

The Test Case

Buck Farmer (20) – Okay, so to be honest when he showed up on this list, I absolutely thought it was a total whiff. By ERA he’s been a waste, but he’s really living on truly elite in-zone contact management, swinging strikes, K/BB, and hard-hit minimization. His pitch profile is middling (not bad, but not great either), so I really don’t think he’s going to stay this high much longer. He’s certainly doing enough to earn this spot right now, and I’d expect him to not run a 6+ ERA for much longer.

 

The Loaded Teams

Yankees – Luis Severino (12), Sonny Gray (17), Masahiro Tanaka (25) / Some teams have guys higher up, but the Yankees are loaded up and down.

Astros – Dallas Keuchel (14), Lance McCullers (19), Brad Peacock (18), Charlie Morton (23) / Similar to the Yankees. Morton and Peacock are having simply phenomenal years.

 

The Dropouts

Rich Hill (39)

Trevor Cahill (35)

Marcus Stroman (28)

Poor Rich Hill. Lost his perfect game, then lost the game, then lost his spot in the top 25. Cahill’s regressed to #DumpsterFireTrevor since his trade to the Royals. Stroman really didn’t fall that far…and his slider is still a work of art.

 

The Just Missed

Jordan Montgomery (26) – Too bad the Yankees couldn’t send down Sabathia instead. This kid is good.

Aaron Nola (27) – #Ace

Carlos Martinez (29) – Martinez simply teases ace upside, but frankly I think you can pretty much lump him and Chris Archer (30) in the same group — high strikeouts, too many baserunners and sub-ace starts to move into the top tier.

Dinelson Lamet (32) – He’s absolutely got the stuff. He could stand to work on his batted-ball control though.

Jimmy Nelson (34) – dScore buys his changes. He finished at #148 last year. I’ll call him a #2/3 going forward.

 

Notes from Farther Down

Jose Berrios is all the way down to 47. His last month cost him 19 spots, but frankly it could be much worse: Sean Manaea lost 39 spots, down to 87. Manaea really looks lost out there. I don’t want to point at the shoulder injury he had earlier this year since his performance really didn’t drop off after that…but I’m wondering if he’s suffering from some fatigue that’s not helped by that. He’s pretty much stopped throwing his toxic backfoot slider to righties, and that’s cost him his strikeouts. Michael Wacha is another Gray-like phoenix: he’s up to 52 on the list, once again outperforming his 2015 year. I’m cautiously buying him as a #3 with upside. And finally, buzz round: Mike Clevinger (33), Alex Meyer (36), Robbie Ray (38), Rafael Montero (41), and Jacob Faria (43) are already ranked quite highly, and outside of Montero and maybe Meyer I could see all of them bumping up even higher. Clevinger’s really only consistency away from being a legitimate stud.

 

My next update will be the end-of-season update, so I think I’m going to do a larger ranking than just the top 25; maybe all the way down to 100. Enjoy the last month-plus!


The Correlation Between BABIP Rate and Three True Outcomes

First things first, I would like to credit my friend Elling Hofland for coming up with the main idea of this piece. He’s the one who provided me with his thoughts and theories that allowed me to expand on this topic in the first place. Give him a follow on Twitter for sports and stats-related banter; his handle is @ellinghofland.

BABIP, or batting average on balls in play, is an incredibly useful stat. It does a fantastic job of capturing both luck and quality of contact to give a better grasp of how a player actually performs during batted-ball events. These batted-ball events only take up a certain percentage of a player’s plate appearances. BABIP rate measures how many batted-ball events a player has relative to his plate appearances. To calculate BABIP rate, you take at bats minus strikeouts and home runs, plus sacrifice flies, and divide that by plate appearances. For example, if a player has 600 PA during a single season along with 300 batted-ball events, he has a BABIP rate of .500.

Now, if you look at the three variables taken out of that equation, you’re left with walks, strikeouts, and home runs, otherwise known as the “three true outcomes.” These are called true outcomes due to the fact that none of them (for the most part) involve defense on the field. A shortstop can’t screw up a strikeout, walk, or a home run. You can take these three true outcomes and turn them into a rate as well. If you add up a player’s strikeouts, walks, and home runs and then divide them by plate appearances, you get TTO rate.
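Both rates are simple enough to compute by hand, but here is a short sketch of the two formulas as described above. The stat line is hypothetical, chosen so the arithmetic is easy to follow.

```python
def babip_rate(ab, so, hr, sf, pa):
    """Share of plate appearances that end with a ball in play."""
    return (ab - so - hr + sf) / pa


def tto_rate(so, bb, hr, pa):
    """Share of plate appearances ending in a strikeout, walk, or home run."""
    return (so + bb + hr) / pa


# Hypothetical season: no hit-by-pitches or sacrifice bunts, so PA = AB + BB + SF.
ab, bb, so, hr, sf = 530, 60, 200, 40, 10
pa = ab + bb + sf  # 600 plate appearances

player_babip_rate = babip_rate(ab, so, hr, sf, pa)  # 300 balls in play / 600 PA
player_tto_rate = tto_rate(so, bb, hr, pa)          # 300 true outcomes / 600 PA
```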

Let’s look at Mike Trout. In 2017, Trout’s BABIP currently sits at .369. However, he has a BABIP rate of .550 along with a TTO rate of .435, meaning that 55% of his plate appearances end with a ball in play, while 43.5% of his plate appearances result in a strikeout, walk, or home run. Both BABIP rate and TTO rate are useful stats, as they essentially show how well and how often a player makes contact. While BABIP itself is useful, it can be hard to tell how much luck is involved in a batted-ball event when it isn’t hit over a fence for a homer. BABIP rate attempts to bridge the gap between BABIP and the three true outcomes.

Miguel Sano is a well-known slugger. In his three seasons in the majors, he’s smashed the ball when he’s hit it, boasting exit velocities of 94.0 in 2015, 92.3 in 2016, and 93.1 in 2017. Despite these consistent EVs, his BABIP has fluctuated from 2015 to 2017, with marks of .396, .329, and .385, respectively. His BABIP rates from 2015-2017 look like this: .429, .478, and .473. Despite the difference in his BABIP from 2016 to 2017, his BABIP rate has stayed nearly the same, meaning that he’s still making the same amount of contact with the ball despite fewer balls falling for hits in 2016. Looking solely at BABIP, it could be argued that 2016 was his “regression” to where he should be after sporting an incredibly high BABIP in 2015. In 2017, one could say his high BABIP is a cause for concern, as he may just be getting lucky. However, his BABIP rate shows that isn’t the case.

Let’s look at another player, Brandon Phillips. Phillips’ BABIP has been incredibly consistent during his past three years, sitting at .315 in 2015, .312 in 2016, and .305 in 2017. Additionally, his BABIP rates have been .820, .816 and .802. Phillips puts the ball in play a little over 80% of the time on a regular basis.

So, as you can imagine, there is a real link between BABIP rate and TTO rate. The more contact a player makes, the less he tends to walk or strike out. Thus, a high BABIP rate goes with a low TTO rate. This is exactly what we see if we attempt to correlate these two stats. Below is a snapshot of a graph that shows TTO rate vs. BABIP rate.

[Figure: TTO rate vs. BABIP rate]

Player names aren’t included because A) they clutter the graph, and B) they aren’t necessary at this point. Accompanying this graph is a trend line with an R-squared value, otherwise known as the coefficient of determination (the square of the correlation coefficient). Essentially, an R-squared value measures how well your model fits your data, or in this case, how closely TTO rate and BABIP rate track each other. It turns out that the R-squared value is .991, which means that the relationship between BABIP rate and TTO rate is very nearly linear: in fact, you’ll find that TTO rate and BABIP rate are almost exact complements of each other. The players with the 10 lowest BABIP rates in MLB all have TTO rates of .437 or higher, meaning that at least 43.7% of their plate appearances end in a walk, home run, or strikeout. Conversely, the players with the highest BABIP rates all have TTO rates of .225 or lower.
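Part of why the fit is so tight is bookkeeping. Assuming the usual breakdown PA = AB + BB + HBP + SF + SH (catcher's interference ignored), balls in play plus the three true outcomes add up to every plate appearance except hit-by-pitches and sacrifice bunts, so the two rates must sum to just under 1. A small sketch of that identity, using Trout's rates quoted earlier and one made-up stat line:

```python
def leftover_share(babip_rate, tto_rate):
    """Share of plate appearances that are neither a ball in play nor a true outcome."""
    return 1.0 - babip_rate - tto_rate


def rates_from_counts(ab, bb, so, hr, sf, hbp, sh):
    """Return (PA, BABIP rate, TTO rate) from raw counts, ignoring catcher's interference."""
    pa = ab + bb + hbp + sf + sh
    babip_rate = (ab - so - hr + sf) / pa
    tto_rate = (so + bb + hr) / pa
    return pa, babip_rate, tto_rate


# Trout's 2017 rates as quoted in this post.
trout_leftover = leftover_share(0.550, 0.435)

# Hypothetical stat line to check the identity: the rates sum to 1 - (HBP + SH) / PA.
pa, b_rate, t_rate = rates_from_counts(ab=500, bb=70, so=120, hr=30, sf=5, hbp=8, sh=2)
identity_gap = (b_rate + t_rate) - (1 - (8 + 2) / pa)
```

In other words, most of the near-perfect R-squared comes from how the two stats are defined; whatever scatter remains reflects hit-by-pitches and sacrifice bunts.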

We can also derive more information from these numbers using this correlation. Players who have a low BABIP rate have a very high OPS. Remember, these players also have high TTO rates. The top 10 players, Judge, Sano, K. Davis, Souza Jr., Reynolds, Morrison, J. Upton, C. Santana, Lamb, and Stanton all have an OPS of .841 or higher. The players with the highest BABIP rates (or lowest TTO rates) have an OPS of .798 or lower.

BABIP rate can tell us a lot about a player. Just by glancing at a player’s BABIP rate, you can have an instant idea of how often the player walks, strikes out, or hits dingers. Not only that, but it can tell you a lot about his offensive production. High TTO rates usually mean high hard-hit rates along with high exit velocities. BABIP rate also helps us understand BABIP itself better and teaches that you can’t judge a player by BABIP all the time. In most cases, players with an over-inflated BABIP (relative to past performances) just tend to mash the absolute heck out of the ball, as told by their low BABIP rates and high TTO rates. On the opposite end, players with a steady BABIP will have very high BABIP rates and tend to be contact hitters who put the ball in play and don’t hit for power. BABIP rate, along with its correlation to TTO rate, has the potential to be a powerful, tell-all offensive stat.


Why the Mets Should Call Up Tim Tebow in September

As of August 21st, 2017, Tim Tebow was slashing .220/.304/.343 between the New York Mets’ Low-A team, the Columbia Fireflies (South Atlantic League), and their Advanced-A squad, the St. Lucie Mets (Florida State League). In 442 minor-league plate appearances, he is the owner of a .304 wOBA, and is striking out at a 26% clip while walking in 8% of his plate appearances. For every one ball that Tebow elevates, he is hitting the ball on the ground three times over. Right off the bat (pun intended), it is evident that Tebow’s offensive game leaves something to be desired.

Let’s take a quick look at how Tebow stacks up with the average hitter, in each A-ball league, that has had a minimum of 200 plate appearances and has primarily played the same position(s) as Mr. Tebow (outfield & designated hitter):

*Data as of 8/21/2017
Player Age BB% K% AVG OBP SLG OPS wOBA wRC+
Tim Tebow 30 8.8% 26.5% 0.220 0.304 0.343 0.647 0.304 90
Avg. SAL OF/DH 21.5 7.7% 21.9% 0.253 0.322 0.378 0.700 0.322 104
Avg. FSL OF/DH 23 8.2% 21.4% 0.255 0.324 0.370 0.694 0.324 103

Only his walk rate appears to be on par with each respective league’s average. Additionally, Tebow has logged a .913 fielding percentage while playing (primarily) left field this year. It is widely understood that fielding percentage is a “far-from-perfect” measurement when quantifying defensive ability, but it can provide a high-level perspective on one’s aptitude as it relates to fielding the baseball. To put Tebow’s number into context, the lowest fielding percentage in the major leagues this year by an outfielder (minimum 100 innings played) belongs to Mark Canha of the Oakland A’s, at .922.

Many words come to mind when attempting to summarize the 30-year-old’s all-around quality of play while in A-ball; ‘excellent’, ‘incredible’, or ‘promising’ would not be any of those words. However, despite the subpar statistical measuring points, the Mets should seriously consider calling up Tim Tebow to the big leagues come September.

No, that is not a typo. Yes, you read the last sentence of the above paragraph correctly. When rosters expand to include anyone on the 40-man roster on September 1st, the New York Mets should give sincere thought to adding Tim Tebow to their roster/big-league club. Now, why would the New York Mets, a team that owns a 55 – 71 win-loss record and trails the NL Wild Card race by 13.5 games and NL East Division title by 21 games, bother calling up a poorly-performing 30-year-old high-A-ball player? The answer, as it is with many things in life, is money.

Baseball clubs generate revenue in many ways: merchandise sales, concessions sales, corporate sponsorships, media deals, etc. One of the largest and most obvious ways in which income at the major-league club level is generated is through home-park ticket sales. Tim Tebow excels at putting fans in the stands:

YoY Average Home Game Attendance Figures

Year Columbia Fireflies St. Lucie Mets
2016 3,768 1,405
2017 4,783 1,996
YoY % Change 21% 30%

As you can see, both teams that Tebow has played for this year have experienced huge jumps in home attendance figures. This has occurred despite the fact that in 2016 the Columbia Fireflies were celebrating their inaugural season at a brand new stadium, and the St. Lucie Mets were 11 games over .500 in the thick of a playoff race (compared to 11 games under .500 in 2017 at the time of this publication).

As I alluded to above, a lot of circumstances can impact attendance figures: new stadium, weather, promotions, team quality, opponent, etc. However, I think that it’s pretty evident that Tim Tebow’s arrival on the Mets’ minor-league scene has driven a majority of the jump. To confirm this, let’s look at attendance figures from a different angle – specifically, 2017 home attendance numbers and how they vary for each team from when Tebow was actively rostered vs. when he was not:

*Data as of 8/19/2017
Team Tebow Rostered # of Home Games Avg. Home Game Attendance % Change
Columbia Fireflies No 20 3,757
Columbia Fireflies Yes 41 5,308 29%
St. Lucie Mets No 37 1,745
St. Lucie Mets Yes 24 2,419 28%

Again, it’s evident that Tim Tebow’s roster presence has enticed people to come to the home team’s ballpark at a clip nearly 30% greater than if he were not on the team.
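For anyone reproducing these tables: the percentages shown appear to line up with the attendance gain divided by the larger (2017 or with-Tebow) figure. Measured against the smaller baseline figure, the same gains come out higher. Here is a quick sketch of both conventions, using only the averages from the tables above; the function names are mine.

```python
def pct_change(baseline, new):
    """Conventional percent change: gain measured against the baseline figure."""
    return (new - baseline) / baseline * 100


def gain_as_share_of_new(baseline, new):
    """Gain measured against the new (larger) figure instead."""
    return (new - baseline) / new * 100


# (baseline average attendance, new average attendance) from the tables above.
comparisons = {
    "Columbia 2016 vs 2017": (3768, 4783),
    "St. Lucie 2016 vs 2017": (1405, 1996),
    "Columbia without vs with Tebow": (3757, 5308),
    "St. Lucie without vs with Tebow": (1745, 2419),
}

summary = {
    label: (round(pct_change(old, new)), round(gain_as_share_of_new(old, new)))
    for label, (old, new) in comparisons.items()
}
```

Either way the jump is large; the choice of denominator only changes how it is stated.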

So how do we translate these attendance figures into dollars and cents? Since I do not have access to either team’s ticketing database, this is where some assumptions about average per-cap and ticket value will have to come into play. Baseball America’s JJ Cooper & Josh Norris have recently written articles that similarly examine Tebow’s impact at the box office. However, their stories concentrate heavily on road attendance and overall league attendance impacts, rather than the home ballpark’s ticket sales (which are critical to driving an organization’s recognized revenue). In his article, Norris notes that most minor-league operators use a $21 per-cap estimate for fan spending. This figure is an estimate of what each fan who enters the ballpark will have paid in tickets, concessions, merchandise, and parking.

For the first 39 home game dates (41 games due to two doubleheaders) of their 2017 season, the Columbia Fireflies were able to showcase Tim Tebow in uniform. They attracted 207,031 fans. In the first 39 home game dates of their inaugural 2016 season, the Fireflies drew 155,132 fans. The difference between 2017 and 2016 for these first 39 home game dates is 51,899 fans. If we apply the $21 per-cap estimate referenced above, we are looking at about $1.1 million in additional revenue that can be largely attributed to Tebow being in uniform. Tebow’s last game for the Fireflies was on June 25th; his first game for the St. Lucie Mets was on June 28th. Through August 18th, Tebow has been a member of St. Lucie’s roster for 22 home game dates (24 games due to two doubleheaders) and has helped attract 53,207 fans. In 2016, the St. Lucie Mets were able to draw 21,097 fans during the same stretch. If we apply the $21 per-cap estimate to the difference of 32,110 fans, it amounts to $674,310 in additional revenue over the course of the 22 home game dates at this point in the season. Additionally, Tebow has undoubtedly drawn an abundance of new consumers to each team’s ballparks and databases. This is information that can be leveraged for future sales and marketing initiatives. It would not be ludicrous to state that, combined, the Mets’ A-ball affiliates have increased home-park revenues by roughly $2 million due to Tim Tebow.
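The per-cap arithmetic in this paragraph can be checked with a short sketch. The attendance totals and the $21 estimate come from the text; the function name is just for illustration.

```python
PER_CAP = 21  # dollars per fan, the estimate cited from Norris


def added_revenue(attendance_2017, attendance_2016, per_cap=PER_CAP):
    """Extra fans over the same stretch of home dates, times the per-cap estimate."""
    return (attendance_2017 - attendance_2016) * per_cap


columbia = added_revenue(207_031, 155_132)  # first 39 home game dates
st_lucie = added_revenue(53_207, 21_097)    # 22 home game dates with Tebow rostered
combined = columbia + st_lucie
```

The ticket-and-concession math alone comes to a bit under $1.8 million, so the "roughly $2 million" figure leans on the harder-to-measure value of new customers in each team's database.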

Let’s take a hypothetical look at these trends from the 2017 New York Mets point of view. Their current 40-man roster sits at 36 occupants – so there is no risk of having to DFA a player in order to bring on a newcomer. They are far removed from the playoffs, and already have their sights set on next year. Even by adding Tebow to the 40-man roster, they would have three additional spots to work with should they want to expose some of their MLB-ready prospects to low(er)-leverage big-league games in September. The Mets would have to pay Tebow a pro-rated MLB minimum salary, which would come to be about $65K for the final four weeks of the season, pennies compared to what he would bring back in return.

Here is a table of the historical attendance at Citi Field for the month of September since 2010:

Year Citi Field Sept. Attendance # of Games
2010 382,306 14
2011 433,251 16
2012 385,292 16
2013 340,799 15
2014 337,343 13
2015 353,005 11
2016 468,283 14
2017 ? 14

I’ve highlighted 2014 because it most closely resembles the environment that the 2017 Mets will be embarking upon, as you can see below:

*Through 122 games
Year Winning % GB – Division GB – Wild Card Weekday Home Games Weekend Home Games
2014 0.467 10.5 7.5 7 6
2017 0.443 20 12 8 6

You will notice, the 2014 and 2017 Mets were/are both clearly out of the playoff picture and had/have a similar distribution of home games throughout the month of September. Despite one more overall September game in 2017, the 2014 season should prove to be a good starting point for us; because of the extra game, let’s estimate that the Mets will bring in around 339,000 people to Citi Field in September of 2017.

Now, the fun part. How does that audience, and consequently revenue, project to increase if Tim Tebow were added to the roster? It would be rather difficult to forecast how a marketplace like New York City would react to a move of that nature. Countless variables could be considered: chilly September temperatures and weather volatility, the inability to purchase season packages so late in the year, the comparison of the NYC marketplace to that of Columbia, SC and St. Lucie, FL, the matter of the media, the beginning of football season, and so on. For simplicity’s sake, let’s assume that New York’s market would react in a similar manner to Columbia’s and St. Lucie’s, with home attendance gains of near 30%. That would push an additional 102,000 customers through the Citi Field turnstiles during the last four weeks of the season.

The average MLB ticket price in 2016 was $31.00, a 7% increase from the previous year. A 7% increase from the 2016 ticket price would put us just over $33.00 for 2017. This gives us a place to start with regards to estimating revenue impact. I don’t have access to the Mets’ ticketing database, so this barometer will do for the time being. My gut tells me that the $33.00 price point is low; typically season-ticket prices are used when calculating the league-wide annual average ticket price, and season tickets are sold at a discount compared to single-game ticket prices. Being that it is September, most fans that would turn out to see Tebow would be purchasing at the single-game ticket price point (or group-ticket price point, but that complicates things further) since season packages are likely no longer being sold for 2017.

Regardless, at this point the math becomes clear: 102,000 additional fans at $33.00/ticket would generate an estimated $3.4 million in surplus revenue. This doesn’t even include the additional revenue that would accrue via a multitude of other outlets. Concessions, merchandise, and parking, all revenue streams that the Mets split with their respective vendors, would experience huge jumps. Strategies to boost season-ticket-holder retention for 2018 (Tim Tebow meet and greet, anyone?) would likely yield positive results. As stated before, entirely new ticket buyers would flood into the Mets’ ticketing database, which should boost returns in some form or fashion in future years.
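Here is the Citi Field projection as a small sketch, so the inputs are easy to swap out. Every number comes from the assumptions stated above (339,000 baseline fans, a 30% lift, a $31.00 average ticket in 2016 growing 7%); the function and variable names are mine.

```python
def project_ticket_revenue(baseline_attendance, lift, ticket_price):
    """Return (extra fans, extra ticket revenue) for an assumed attendance lift."""
    extra_fans = baseline_attendance * lift
    return extra_fans, extra_fans * ticket_price


# Estimated 2017 ticket price: the 2016 average grown by another 7%.
price_2017 = 31.00 * 1.07  # just over $33

# Unrounded projection from the stated assumptions.
extra_fans, extra_revenue = project_ticket_revenue(339_000, 0.30, 33.00)

# The figure used in the text rounds extra fans up to 102,000 first.
rounded_revenue = 102_000 * 33
```

Halving the assumed lift to 15% would halve the revenue estimate, which shows how sensitive the conclusion is to that one assumption.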

Tim Tebow is not going to play baseball forever. He may choose to call it quits on his “pro-ball quest” after this year. Who’s to say he even wants to go through another year toiling away in the low minor leagues? A promising and young (albeit injury-prone) starting pitching staff should have the Mets within shouting distance of playoff contention for the next couple of years. If that is the case, they will not want to waste an NL roster spot on a subpar, 31-year-old designated hitter. Roughly $3.5 million should allow the Mets to chase around 0.5 WAR on the open market. It could provide them additional wiggle room to take on extra salary in a deadline trade next year. It would allow the acquisition of players along the lines of Trevor Cahill, Logan Morrison, or Drew Storen…all of whom signed for under $3 million this past offseason. It could be put toward additional infrastructure, baseball analytics, or scouting staff.

Sure, there are certainly more deserving players in the Mets’ minor-league system that have ‘paid their dues’ to a greater extent than Tim Tebow — all in the hopes of getting a call-up to the Show. But baseball is a business, and at the end of the day, no one in the Mets’ system will be able to have an impact on fans the same way that Tim Tebow does/can. The Mets need to capitalize on their current situation before the former Heisman trophy winner tires of the long and uncomfortable bus rides, motel stops, and food spreads that dot the minor-league landscape. The Mets need to cash in on their investment before Tebow bids baseball adieu.


A Surprising Benefit of Throwing a Good Sinker

*Note: all stats are as of August 1, 2017

I originally intended to write a post about the aspects of a four-seam fastball that are most important in generating whiffs. The correlation between fastball velocity and whiff rate on the fastball is only about 0.25, so I was interested to find out whether other factors, such as vertical movement, location, or pitch usage, are better indicators of a fastball’s swing-and-miss tendencies. While some of the names at the top of the fastball whiff list were not surprising at all (Chris Sale, Jacob deGrom, James Paxton), there were several others whom I was surprised to see, including Brandon McCarthy, Rick Porcello, J.A. Happ, and Clayton Richard. There was one glaring similarity between these seemingly overachieving pitchers: they throw a high percentage of sinkers.

So I looked at the correlation between whiff rate on the four-seam fastball and sinker usage, only to find that it was not only small, but also negative. However, looking at the correlation between these two variables is somewhat like a chicken-and-egg problem: does sinker usage affect a pitcher’s four-seam fastball whiff rate, or does his four-seam fastball whiff rate affect his sinker usage? The latter option certainly seems reasonable: a pitcher who is ineffective with his four-seamer is more likely to develop a sinker than a pitcher with a dominant four-seamer. For this reason, we have to dig deeper to determine if sinker usage has any effect on four-seam whiff rate.

I looked instead at only the 48 qualifying pitchers who throw a sinker at least 10% of the time (and a four-seam fastball at least 5% of the time). I found the correlation between several variables — some relating to the sinker and some unrelated to it — and four-seam whiff rate. If the variables related to the sinker have a significant correlation with four-seam whiff rate, then that implies that a pitcher’s sinker can have an effect on his four-seam fastball. The variables I looked at were the four-seam fastball’s velocity and vertical movement, the sinker’s velocity and vertical movement, and the difference between a pitcher’s four-seam fastball and sinker in both velocity and vertical movement. Here are their correlations with four-seam fastball whiff rate:

              4-Seam Fastball    Sinker    Difference
Velocity           0.3022        0.2249      0.3011
V-Movement        -0.0544       -0.2875      0.4348
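
For readers who want to reproduce this kind of table, here is a minimal sketch of the calculation, assuming a plain Pearson correlation (the post does not say which correlation measure was used). The field names (`ff_velo`, `si_velo`, `ff_vmov`, `si_vmov`, `ff_whiff`) and the four sample rows are hypothetical and are not the 48-pitcher sample described above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def whiff_correlations(pitchers):
    """Correlate each four-seam/sinker feature with four-seam whiff rate."""
    whiff = [p["ff_whiff"] for p in pitchers]
    features = {
        "ff_velo": [p["ff_velo"] for p in pitchers],
        "si_velo": [p["si_velo"] for p in pitchers],
        "velo_diff": [p["ff_velo"] - p["si_velo"] for p in pitchers],
        "ff_vmov": [p["ff_vmov"] for p in pitchers],
        "si_vmov": [p["si_vmov"] for p in pitchers],
        "vmov_diff": [p["ff_vmov"] - p["si_vmov"] for p in pitchers],
    }
    return {name: pearson_r(values, whiff) for name, values in features.items()}

# Made-up rows for illustration only (NOT the pitchers analyzed in the post).
sample = [
    {"ff_velo": 94.1, "si_velo": 93.2, "ff_vmov": 9.8, "si_vmov": 5.0, "ff_whiff": 0.11},
    {"ff_velo": 91.5, "si_velo": 91.0, "ff_vmov": 8.9, "si_vmov": 6.5, "ff_whiff": 0.07},
    {"ff_velo": 95.3, "si_velo": 93.0, "ff_vmov": 10.4, "si_vmov": 4.1, "ff_whiff": 0.13},
    {"ff_velo": 92.8, "si_velo": 92.1, "ff_vmov": 9.1, "si_vmov": 6.0, "ff_whiff": 0.08},
]

for name, r in whiff_correlations(sample).items():
    print(f"{name:10s} {r:+.4f}")
```

The "Difference" column corresponds to `velo_diff` and `vmov_diff`, computed here as four-seam minus sinker; that sign convention is an assumption.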

 

There are a few interesting things to note here. First, the four-seam fastball’s velocity seems to be just as important as the difference in velocity between the four-seamer and the sinker. While velocity is often the first thing most people look for to determine if a pitcher has a swing-and-miss fastball, relative velocity is just as important as absolute velocity, at least when it comes to pitchers who also throw a sinker. This is consistent with the notion that changing speeds can upset the hitter’s timing and make a fastball seem faster than it is.

Relativity is even more important when it comes to vertical movement. While there is no correlation between four-seam whiff rate and four-seam vertical movement, there is a significant correlation between four-seam whiff rate and the difference in vertical movement between the four-seamer and the sinker (I’ll call this “v-movement difference”). This seems to show that the downward movement of the sinker makes hitters more likely to swing under the four-seam fastball; they keep the sinker in mind, so the four-seamer appears to have more vertical movement than it actually does. If this is true, then we should expect v-movement difference to have a greater effect on pitchers who throw a higher percentage of sinkers. To test whether this is true, I increased the requirement of minimum percentage of sinkers thrown in intervals of 5%, from 10% to 35%. I then found the correlation between four-seam whiff rate and v-movement difference at these different thresholds. Here are the results:

Threshold    Correlation
10%          0.4348
15%          0.4125
20%          0.4752
25%          0.4752
30%          0.5121
35%          0.5025
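
The threshold sweep can be sketched as follows. This is only an illustration under assumptions: it uses a Pearson correlation, the field names (`sinker_pct`, `vmov_diff`, `ff_whiff`) are hypothetical, and the sample rows are invented rather than taken from the post's data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r, or None when undefined (fewer than 3 points or zero variance)."""
    n = len(xs)
    if n < 3:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def correlation_by_threshold(pitchers, thresholds):
    """For each minimum sinker usage (in %), return (sample size, r)."""
    results = {}
    for t in thresholds:
        kept = [p for p in pitchers if p["sinker_pct"] >= t]
        r = pearson_r([p["vmov_diff"] for p in kept],
                      [p["ff_whiff"] for p in kept])
        results[t] = (len(kept), r)
    return results

# Invented rows for illustration only.
sample = [
    {"sinker_pct": 12, "vmov_diff": 3.1, "ff_whiff": 0.08},
    {"sinker_pct": 18, "vmov_diff": 4.0, "ff_whiff": 0.07},
    {"sinker_pct": 27, "vmov_diff": 4.8, "ff_whiff": 0.10},
    {"sinker_pct": 33, "vmov_diff": 5.5, "ff_whiff": 0.12},
    {"sinker_pct": 41, "vmov_diff": 6.3, "ff_whiff": 0.13},
]

for t, (n, r) in correlation_by_threshold(sample, range(10, 40, 5)).items():
    print(t, n, "n/a" if r is None else round(r, 4))
```

Reporting the sample size next to each correlation is a deliberate choice: the pool of pitchers shrinks as the threshold rises, so the higher-threshold correlations rest on fewer data points.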

 

Just as we expected, the correlation between v-movement difference and four-seam whiff rate is generally higher for pitchers who throw more sinkers. If the relatively high correlation we observed at the 10% threshold were pure luck, then the correlations at higher thresholds would be scattered randomly. The fact that there is an upward trend in correlations as the threshold increases is good evidence that v-movement difference does, in fact, have an effect on four-seam whiff rate. While this does not necessarily mean that adding a sinker will help a pitcher get more whiffs on his fastball, it does suggest that the quality of a pitcher’s sinker can affect the effectiveness of his fastball. More specifically, we also learn that a good sinker, in terms of generating whiffs on the four-seamer, is one that has little vertical movement (or a lot of sink) in relation to the four-seamer.


A Baseball World Without Intentional Walks

There are at-bats. And the possible positive outcomes of those come down to three: hits, walks and hit-by-pitches. Hits can be separated into singles, doubles, triples and home runs. Hit-by-pitches are pretty much what they sound like. Walks, on the other hand, are bases on balls awarded by the pitcher to the batter either unintentionally, due to lack of control, or intentionally, supposedly to prevent the hitter from inflicting more than one base of damage by giving him first base for free.

Intentional bases on balls (IBB) have always been part of baseball. They have only been tracked, though, since 1955. From that point to 2016 (the last complete season with data available), a total of 73,272 IBB have been awarded to batters, for an average of around 1,182 per season. If we look at the full picture, though, there have been more than 11 times as many BB as IBB over the same period. Obviously, hitters are not awarded a free base unless they have gained a status in which pitchers “fear” being punished by a bomb to the outfield that could turn into runs for the opposing team.

Even so, IBB rates are at their lowest since 1955, thanks to strategic improvements and study of the game, which have led to the conclusion that awarding free bases to hitters is probably not the best approach. But with more than a thousand instances per season on average, we have a big enough sample to have some fun with the numbers and imagine a baseball world in which the IBB had been banned by MLB and therefore never awarded to hitters from 1955 on. What could this have meant for batters during this span? How much could it have changed the hitting totals of some of the already-great hitters of baseball history? Let’s take a look at the data.

Counting from 1955, only five players have had careers in which they posted an IBB/PA larger than 2% in at least 10,000 PA: Barry Bonds, Hank Aaron, Ken Griffey Jr., Albert Pujols and David Ortiz. Those are some scary names to have at home plate staring at you when you are the pitcher. If we lower the threshold to 1% IBB/PA, we end up with a group of 39 players, more than enough for some interesting testing. The first thing that jumps out, as we could expect, is that only one of the 39 players fell short of the 100 HR mark (Rod Carew, with 92) and that all of them surpassed 2,223 hits during their careers (for that matter, only 110 MLB players since 1955 have reached that mark, so players from our group make up 34% of them).

So, back to our group: the correlation between IBB and HR yields an R-value of 0.256, a modest but positive relationship. This suggests that power hitters have historically tended to be awarded more intentional bases on balls than other types of batter. If no IBB had been allowed in baseball, we would only have hits, unintentional walks and hit-by-pitches left as our possible positive plate-appearance outcomes. With a simple set of calculations we can estimate how many extra hits, home runs, etc. each of our players could have ended their careers with had they not been walked on purpose. It is just a matter of finding the rates at which they hit singles, doubles, triples and homers per PA (subtracting IBB from the total number of PA) and then multiplying those rates by the number of IBB each of them was awarded in his career. This gives us a simple look at how much better those hitters’ numbers could have been based on their pure hitting ability.
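
As a worked example of that calculation, here is a minimal sketch using Barry Bonds. The career totals in the code (12,606 PA, 688 IBB, 2,935 hits, 601 doubles, 77 triples, 762 home runs) are his publicly listed totals rather than numbers quoted in this post, and the function and key names are hypothetical.

```python
def expected_extra_hits(career):
    """Per-PA hit rates (with IBB removed from PA) multiplied by the IBB count."""
    pa_without_ibb = career["PA"] - career["IBB"]
    return {
        kind: career[kind] / pa_without_ibb * career["IBB"]
        for kind in ("1B", "2B", "3B", "HR")
    }

# Bonds' career line, assumed from public records (not from the post itself).
bonds = {"PA": 12606, "IBB": 688, "2B": 601, "3B": 77, "HR": 762}
bonds["1B"] = 2935 - bonds["2B"] - bonds["3B"] - bonds["HR"]  # singles = hits - extra-base hits

extra = expected_extra_hits(bonds)
total_extra_hits = sum(extra.values())

print({kind: round(value, 1) for kind, value in extra.items()})
print("extra hits:", round(total_extra_hits), "-> career hits:", 2935 + round(total_extra_hits))
print("extra HR:", round(extra["HR"]), "-> career HR:", 762 + round(extra["HR"]))
```

Under these assumptions the sketch gives about 169 extra hits and 44 extra home runs, which lines up with the 3,104 hits and 806 home runs cited for Bonds in this article.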

The case of Barry Bonds is truly unique. The all-time home run leader not only leads the IBB leaderboard with 688, but the gap between him and the second-ranked player (Albert Pujols, 302 IBB) is a staggering 386 IBB; Bonds has more than double Pujols’ total. The difference between Pujols and third-ranked Hank Aaron is just 9 IBB, for comparison’s sake. In order to get a comprehensive list of the most improved players in this alternative world, we can sort them by the number of extra hits (no matter the type) they would have had if they had not received a single intentional base on balls. The next table includes the 20 players with the most expected extra hits to gain in this scenario.

Unsurprisingly, Bonds comes out first, and by a mile. Again, Barry doubles the expected extra hits (EEH) of second-ranked Pujols and would have finished his career with over 3,000 hits, at 3,104. That would make him eighth in hits among the players analyzed, while Pete Rose (not in the table above) would have gained 45 hits to pass the 4,300-hit mark and reach exactly 4,301.

Breaking the hits down by category, the outcome at the top is as expected, with Barry Bonds topping every simulation. Setting him aside, Hank Aaron would have hit the most extra singles with 49, followed by Pujols and Tony Gwynn with 48. As for doubles, Pujols would have hit an extra 18, and three players would have 13 more than they actually reached in their careers. Triples are much less frequent, and only two players, Roberto Clemente and George Brett, would have hit three extra triples. Finally, in the home-run category, Bonds would have hit an extra 44 homers, followed by Pujols and Aaron (13-plus each) and Ken Griffey Jr.

Had all these numbers been real and the IBB wiped from the face of the Earth, historical career leaderboards would not have changed a lot, at least at the highest positions, but some records would be seen as even more unbreakable than they are now. Someone would have to break the 4,300-hit barrier to surpass Pete Rose. Bonds’ new mark of 806 HR would be unimaginable for anyone to reach nowadays (Pujols, still active, would be almost 200 HR away entering his age-38 season next year).

It may not have been a critical change, but baseball would have been (and would be) way more fun to watch. Just looking at our starting 39 guys, we would have seen 1,928 more hits (out of 7,423 IBB, or 26% of them), witnessed 300 more home runs and written a couple of unthinkable numbers into MLB’s history books. Now just imagine how much baseball fun we’ve lost if I remind you that there have been 73,272 intentional walks awarded during the past 62 seasons (yes, your calculation is correct: around 19,000 extra hits by our group’s measures).


Giving Players the Bonds Treatment

There is no higher compliment that can be given to a ballplayer than to be given “The Bonds Treatment” — being intentionally walked with the bases empty, or even better, with the bases loaded. It’s called “The Bonds Treatment” because Barry Bonds recorded an astounding 41 IBBs with the bases empty, and is one of only two players to ever record a bases-loaded intentional walk. In other words, 28% of IBBs ever issued with the bases empty were given to Bonds — and 50% of IBBs with the bases loaded. Bonds was great, no denying that — but is there anyone out there today who is worthy of such treatment?

We can find out using a Run Expectancy matrix. An RE matrix is based on historical data, and it can tell you how many runs, on average, a team could expect to score in a given situation. A sample RE matrix, from Tom Tango’s site tangotiger.net, is shown below.

RE Matrix

The chart works as follows — given a base situation (runners on the corners, bases empty, etc.) move down to the corresponding row, then move to the corresponding column and year to find out how many runs a team could expect to score from that situation. In 2015, with a runner on 3rd and 1 out, teams could expect to score .950 runs on average (or, RE is .950). If the batter at the plate struck out, the new RE would be .353.

We can take this a step further. Sean Dolinar created a fantastic tool that allows us to (roughly) examine RE in terms of a batter’s skill. Having Mike Trout at the plate vastly improves your odds of scoring more than having Alcides Escobar, and the tool takes this into account. We can use this tool to look at who deserves the Bonds treatment in 2017 (or, to see if anyone deserves the Bonds treatment): defined as being walked with the bases empty, or the bases loaded.

First, we can look at a given player and their RE scores for having the bases empty or full. In this instance, we will use Michael Conforto, who batted leadoff for the Mets against the Texas Rangers on August 9. Conforto’s wOBA entering the game was .404, and the run environment for the league is 4.65 runs per game, so Conforto’s relevant run expectancy matrix looks like this:

Michael Conforto RE Matrix

Batting behind him was Jose Reyes, who, entering the game, had a wOBA of .283. Let’s assume that Conforto receives the Bonds Treatment, and is IBB’d in a given PA with bases empty or loaded. What would the run expectancy look like with Reyes up? In other words, what is Reyes’ run expectancy with a runner on first, or with the bases loaded after a run has been IBB’d in?

To do this, we can look at Reyes’ RE with a runner on first and with the bases loaded. Reyes’ RE with a man at 1B is indicative of what the RE would be like if Conforto had been given an intentional free pass. For a bases-loaded walk, we look at Reyes’ RE with the bases loaded, and then add a run onto it (to account for Conforto walking in a run).

Jose Reyes RE Matrix

Then, we can compare the corresponding cells of the matrices to see if the Texas Rangers would benefit at all from walking Conforto. If RE with Conforto up and the bases empty is higher than RE with a runner on first and Reyes up, or RE with the bases loaded and Conforto up is higher than RE with Reyes up and a run already scored, then we can conclude that it makes sense to give Conforto that free pass.

In this instance, we can see that if the Rangers were to face Conforto with the bases empty and two out, it would make more sense for them to IBB Conforto and pitch to Reyes than it would for them to pitch to Conforto, because RE with Conforto up (.172) is higher than RE with Reyes up and Conforto on (.145). As a result, Conforto is a candidate for the Bonds treatment in this lineup configuration, if the right situation arises.
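
The comparison rule described above can be written as two small checks. This is a sketch with hypothetical function names; the only real numbers are the .172 and .145 run expectancies quoted for the two-out, bases-empty case, and the bases-loaded inputs in the example are made up.

```python
def should_ibb_bases_empty(re_batter_bases_empty, re_next_runner_on_first):
    """Walk the batter if pitching to him carries a higher run expectancy
    than facing the next hitter with a runner on first (same number of outs)."""
    return re_batter_bases_empty > re_next_runner_on_first

def should_ibb_bases_loaded(re_batter_bases_loaded, re_next_bases_loaded):
    """Same idea with the bases loaded, except the walk forces in a run,
    so one run is added to the next hitter's run expectancy."""
    return re_batter_bases_loaded > re_next_bases_loaded + 1.0

# Two out, bases empty: Conforto up (.172) vs. Reyes up with Conforto on first (.145).
print(should_ibb_bases_empty(0.172, 0.145))   # True

# Bases loaded, made-up run expectancies for illustration only.
print(should_ibb_bases_loaded(1.90, 1.40))    # False: 1.90 is not above 1.40 + 1.0
```

The added run is what makes a bases-loaded walk so hard to justify: the on-deck hitter's run expectancy has to be more than a full run lower than the current batter's.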

Who else could be subjected to the Bonds treatment? It would take me a few months of work to run through every single individual lineup for every team to figure out who should have been pitched to and who should have gotten a free pass, so to simplify things, I looked at hitters with 400+ PA, looked at when they most frequently batted, who batted behind them most frequently, and whether or not they should have received the Bonds treatment based on who was on deck. While no lineup remains constant throughout the season, looking at these figures gave me a good idea of who regularly batted behind whom.

Three candidates emerged to be IBB’d with the bases empty every time, regardless of outs— Yasiel Puig, Jordy Mercer, and Orlando Arcia. These players usually bat in the eighth slot on NL teams, and right behind them is the pitchers’ slot — considering how historically weak pitchers are with the bat, it makes sense that RE tells us to walk them with the bases empty every single time.

The same could be said of almost anyone batting ahead of a pitcher — according to our model, given an average-hitting pitcher, any hitter with a wOBA over .243 should be IBB’d with the pitcher on deck (only one qualified hitter — Alcides Escobar — has a lower wOBA than .243). The three names above stuck out in the analysis because they were the only players with 400+ PA that had spent most of their PAs batting eighth.

So, an odd takeaway of this exercise is that in the NL, unless a pinch-hitter is looming on deck, the eighth hitter should almost always be intentionally walked with the bases empty, because it lowers the run expectancy. Weird!

The model also identified two hitters who deserved similar treatment to Michael Conforto in the above example (IBB with 2 out and no one on) — Buster Posey and Chase Headley.

Posey has batted with almost alarming regularity ahead of Brandon Crawford, who is running an abysmal .273 wOBA on the season. Headley is a little more curious — Headley is usually a weak hitter, but earlier in the season, Headley batted ahead of Austin Romine frequently, who was even worse than Crawford.

Headley technically isn’t that much of a candidate for the Bonds Treatment since Romine hasn’t batted behind him since June 30, but Crawford has backed up Posey as recently as August 3. If he bats behind Posey again, the situation could very well arise where it becomes beneficial for teams to simply IBB Posey with two out and the bases empty.

But ultimately, no one, aside from NL hitters in the eighth slot, emerges as a candidate to be IBB’d every time with the bases empty. And no one, regardless of the situation, deserves a bases-loaded intentional walk. Which raises the question — was it appropriate to give the man himself, Barry Bonds, the Bonds Treatment?

Bonds received an incredible 19 bases-empty IBBs in 2004 (more than doubling the record he set in 2002), so we’ll use 2004 Bonds and his .537 wOBA as the center of our analysis.

In 2004, Bonds batted almost exclusively fourth, and the two men who shared the bulk of playing time batting fifth behind him (Edgardo Alfonzo and Pedro Feliz) had almost identical wOBAs that season (.333 and .334, respectively), so we’ll assume that the average hitter behind Bonds in 2004 posted a wOBA of .333. This yields RE matrices that look like this:

Barry Bonds RE Matrix compared to 5th Hitter, 2004

Bonds proves himself worthy not only of a bases-empty IBB with two out, but he just barely misses with a bases-loaded IBB. While no one ended up giving Bonds a bases-loaded IBB in 2004, they did give him one in 1998.

For perspective, Bonds was running a .434 wOBA in 1998, and Brent Mayne (who was on deck) was running a .324 wOBA — so this actually wasn’t a move that moved RE or win probability in the right direction.

Win probability, Diamondbacks @ Giants, 5/28/1998
The final spike in WPA is Bonds’ IBB: it gave the Giants a better chance of winning. Ultimately, it was a bad idea that didn’t backfire in the Diamondbacks’ faces.

And of course, I would be remiss in not mentioning the other player to have ever received a bases-loaded IBB — Josh Hamilton.

With apologies to Hamilton, he wasn’t the right guy to get the Bonds treatment here, either — Hamilton ran a .384 wOBA in 2008, and Marlon Byrd, who was on deck, had a .369 wOBA, which means that an IBB in this instance was a really awful move. An awful move that, like Bonds’ IBB, was rewarded by Byrd striking out in the next AB.

Have there been other players deserving of bases-loaded IBBs? It’s possible, but the most likely candidates — Ted Williams and Babe Ruth — usually had good enough protection in the lineup. Of course, there are few hitters that could have protected Bonds from himself — hence why it’s almost a good idea to IBB him with the bases loaded.


Home Runs and Temperature: Can We Test a Simple Physical Relationship With Historical Data?

Unlike most home-run-related articles written this year, this one has nothing to do with the recent home run surge, juiced balls, or the fly-ball revolution. Instead, this one’s about the influence of temperature on home-run rates.

Now, if you’re thinking here comes another readily disproven theory about home runs and global warming (a la Tim McCarver in 2012), don’t worry – that’s not where I’m going with this. Alan Nathan nicely settled the issue by demonstrating that temperature can’t nearly account for the large changes in home-run rates throughout MLB history in his 2012 Baseball Prospectus piece.

In this article, I want to revisit Nathan’s conclusion because it presents a potentially testable hypothesis given a large enough data set. If you haven’t read his article or thought about the relationship between temperature and home runs, it comes down to simple physics. Warmer air is less dense. The drag force on a moving baseball is proportional to air density. Therefore (all else being equal), a well-hit ball headed for the stands will experience less drag in warmer air and thus have a greater chance of clearing the fence. Nathan took HitTracker and HITf/x data for all 2009 and 2010 home runs and, using a model, estimated how far they would have gone if the air temperature were 72.7°F rather than the actual game-time temperature. From the difference between estimated 72.7°F distances and actual distances, Nathan found a linear relationship between game-time temperature and distance. (No surprise, given that there’s a linear dependence of drag on air density and a linear dependence of air density on temperature.) Based on his model, he suggests that a warming of 1°F leads to a 0.6% increase in home runs.

This should in principle be a testable hypothesis based on historical data: that the sensitivity of home runs per game to game-time temperature is roughly 0.6% per °F. The issue, of course, is that the temperature dependence of home-run rates is a tiny signal drowned out by much bigger controls on home-run production [e.g. changes in batting approach, pitching approach, PED usage, juiced balls (maybe?), field dimensions, park elevation, etc.]. To try to actually find this hypothesized temperature sensitivity we’ll need to (1) look at a massive number of realizations (i.e. we need a really long record), and (2) control for as many of these variables as possible. With that in mind, here’s the best approach I could come up with.

I used data (from Retrosheet) to find game-time temperature and home runs per game for every game played from 1952 to 2016. I excluded games for which game-time temperature was unavailable (not a big issue after 1995 but there are some big gaps before) and games played in domed stadiums where the temperature was constant (e.g. every game played at the Astrodome was listed as 72°F). I was left with 72,594 games, which I hoped was a big enough sample size. I then performed two exercises with the data, one qualitatively and one quantitatively informative. Let’s start with the qualitative one.

In this exercise, I crudely controlled for park effects by converting the whole data set from raw game-time temperatures (T) and home runs per game (HR) to what I’ll call T* and HR*, differences from the long-term median T and HR values at each ballpark over the whole record. Formally, for any game, T* and HR* are defined such that T* = T – Tmed,park and HR* = HR – HRmed,park, where Tmed,park and HRmed,park are median temperature and HR/game, respectively, at a given ballpark over the whole data set. A positive value of HR* for a given game means that more home runs were hit than in a typical ball game at that ballpark. A positive value for T* means that the game was warmer than is typical at that ballpark. Next, I defined “warm” games as those for which T*>0 and “cold” games as those for which T*<0. I then generated three probability distributions of HR* for: 1) all games, 2) warm games and 3) cold games. Here’s what those look like:
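
Here is a minimal sketch of that park adjustment, using only the definitions of T* and HR* given above. The record layout (`park`, `temp_f`, `hr`) and the sample games are hypothetical; real input would come from the Retrosheet game records.

```python
from collections import defaultdict
from statistics import median

def add_park_anomalies(games):
    """Attach T* and HR* (differences from each ballpark's median) to every game."""
    temps, hrs = defaultdict(list), defaultdict(list)
    for g in games:
        temps[g["park"]].append(g["temp_f"])
        hrs[g["park"]].append(g["hr"])
    t_med = {park: median(values) for park, values in temps.items()}
    hr_med = {park: median(values) for park, values in hrs.items()}
    return [
        {**g,
         "T_star": g["temp_f"] - t_med[g["park"]],
         "HR_star": g["hr"] - hr_med[g["park"]]}
        for g in games
    ]

def split_warm_cold(games_with_anomalies):
    """Return the HR* values for warm (T* > 0) and cold (T* < 0) games."""
    warm = [g["HR_star"] for g in games_with_anomalies if g["T_star"] > 0]
    cold = [g["HR_star"] for g in games_with_anomalies if g["T_star"] < 0]
    return warm, cold

# Made-up games for illustration only.
games = [
    {"park": "A", "temp_f": 60, "hr": 1},
    {"park": "A", "temp_f": 70, "hr": 2},
    {"park": "A", "temp_f": 80, "hr": 3},
    {"park": "B", "temp_f": 55, "hr": 0},
    {"park": "B", "temp_f": 65, "hr": 4},
]

warm, cold = split_warm_cold(add_park_anomalies(games))
print("warm HR*:", warm)
print("cold HR*:", cold)
```

Games played exactly at the park median (T* = 0) fall into neither group, which follows the strict inequalities in the definitions above.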

The tiny shifts of the warm-game distribution toward more home runs and the cold-game distribution toward fewer home runs suggest that the influence of temperature on home runs is indeed detectable. It’s encouraging, but only useful in a qualitative sense. That is, we can’t test for Nathan’s 0.6% HR increase per °F based on this exercise. So, I tried a second, more quantitative approach.

The idea behind this second exercise was to look at the sensitivity of home runs per game to game-time temperature over a single season at a single ballpark, then repeat this for every season (since 1952) at every ballpark and average all the regression coefficients (sensitivities). My thinking was that by only looking at one season at a time, significant changes in the game were unlikely to unfold (i.e. it’s possible but doubtful that there could be a sudden mid-season shift in PED usage, hitting approach, etc.) but changes in temperature would be large (from cold April night games to warm July and August matinees). In other words, this seemed like the best way to isolate the signal of interest (temperature) from all other major variables affecting home run production.

Let’s call a single season of games at a single ballpark a “ballpark-season.” I included only ballpark-seasons for which there were at least 30 games with both temperature and home run data, leading to a total of 930 ballpark-seasons. Here’s what the regression coefficients for these ballpark-seasons look like, with units of % change in HR (per game) per °F:
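
A rough sketch of the ballpark-season procedure follows. Two things here are assumptions rather than details given in the post: the fit is an ordinary least-squares line of HR/game on temperature, and the slope is turned into a percentage by dividing by that ballpark-season's mean HR/game. The record layout (`park`, `season`, `temp_f`, `hr`) is hypothetical.

```python
from collections import defaultdict

def ols_slope(xs, ys):
    """Least-squares slope of y on x, or None if x has no variation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return None
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def ballpark_season_sensitivities(games, min_games=30):
    """Return (mean sensitivity, list of sensitivities) in % change in HR/game per degree F."""
    groups = defaultdict(list)
    for g in games:
        groups[(g["park"], g["season"])].append((g["temp_f"], g["hr"]))
    coefs = []
    for points in groups.values():
        if len(points) < min_games:
            continue  # skip ballpark-seasons with too few usable games
        temps, hrs = zip(*points)
        mean_hr = sum(hrs) / len(hrs)
        slope = ols_slope(temps, hrs)
        if slope is None or mean_hr == 0:
            continue
        coefs.append(100.0 * slope / mean_hr)  # assumed normalization to percent
    mean = sum(coefs) / len(coefs) if coefs else None
    return mean, coefs

# Synthetic ballpark-season: HR/game rises by 0.01 per degree F (illustration only).
demo = [{"park": "A", "season": 2016, "temp_f": t, "hr": 0.01 * t + 1.35}
        for t in range(50, 80)]
print(ballpark_season_sensitivities(demo))
```

Averaging the per-ballpark-season coefficients gives every ballpark-season equal weight regardless of how many games it contains; weighting by games played would be a reasonable alternative.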

A few things are worth noting right away. First, there’s quite a bit of scatter, but 75.1% of these 930 values are positive, suggesting that in the vast majority of ballpark-seasons, higher home-run rates were associated with warmer game-time temperatures as expected. Second, unlike a time series of HR/game over the past 65 years, there’s no trend in these regression coefficients over time. That’s reasonably good evidence that we’ve controlled for major changes in the game at least to some extent, since the (linear) temperature dependence of home-run production should not have changed over time even though temperature itself has gradually increased (in the U.S.) by 1-2 °F since the early ‘50s. (Third, and not particularly important here, I’m not sure why so few game-time temperatures were recorded in the mid ‘80s Retrosheet data.)

Now, with these 930 realizations, we can calculate the mean sensitivity of HR/game to temperature, resulting in 0.76% per °F. [Note that the scatter is large and the distribution doesn’t look very Gaussian (see below), but more Dirac-delta like (1 std dev ~ 1.66%, but middle 33% clustered within ~0.4% of mean)].

Nonetheless, the mean value is remarkably similar to Alan Nathan’s 0.6% per °F.

Although the data are pretty noisy, the fact that the mean is consistent with Nathan’s physical model-based result is somewhat satisfying. Now, just for fun, let’s crudely estimate how much of the league-wide trend in home runs can be explained by temperature. We’ll assume that the temperature change across all MLB ballparks uniformly follows the mean U.S. temperature change from 1952-2016 using NOAA data. In the top panel below, I’ve plotted total MLB-wide home runs per complete season (30 teams, 162 games) by upscaling totals from 154-game seasons (before 1961 in the AL, 1962 in the NL), strike-shortened seasons, and years with fewer than 30 teams accordingly. In blue is the expected MLB-wide HR total if the only influence on home runs is temperature and assuming the true sensitivity to be 0.6% per °F. No surprise, the temperature effect pales in comparison to everything else. Shown in the bottom plot is the estimated difference due to temperature alone in MLB-wide season home run totals from the 1952 value of 3,079 (again, after scaling to account for differences in number of games and teams). You can think of this plot as telling you how many of the total home runs hit in a season wouldn’t have made it over the fence if air temperatures had remained constant at 1952 levels.

While these anomalies comprise a tiny fraction of the thousands of home runs hit per year, one could make the case (with considerable uncertainty, admittedly) that as many as 59 of these extra temperature-driven home runs were hit in 2016 (or about two per team!).
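
The temperature-only estimate can be sketched with a one-line formula. This assumes the effect is applied linearly to the scaled 1952 baseline of 3,079 home runs at 0.6% per °F; the post does not spell out the exact calculation, and the function name is hypothetical.

```python
def temperature_driven_hr(baseline_hr, delta_t_f, sensitivity_pct_per_f=0.6):
    """Extra home runs attributable to a warming of delta_t_f degrees F,
    assuming a linear sensitivity applied to a fixed baseline total."""
    return baseline_hr * (sensitivity_pct_per_f / 100.0) * delta_t_f

# Scaled 1952 MLB-wide total from the text, with a hypothetical 1 degree F of warming.
print(temperature_driven_hr(3079, 1.0))
```

Under this linear assumption, each 1°F of warming relative to 1952 is worth roughly 18 home runs per full season across the league.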


Altuve Is Defying the Evolution of Baseball

In 1912, the body now known as the International Association of Athletics Federations recognised the first men’s world record in the 100 metres. Donald Lippincott, on July 6, 1912, became the first man to hold an official record in the discipline with a time of 10.6 seconds. He measured 5’10’’ and 159 lbs. It wasn’t until 1968, 56 years later, that a man broke the 10-second barrier in the 100 metres. James Ray Hines did it at 6’0’’ and 179 lbs. Now fast-forward to 2009 and look up a name: Usain Bolt. There is no one faster on Earth. The Jamaican set the 100 metres world record (9.58 seconds) in Berlin at 6’5’’ and 207 lbs. I don’t think it is hard to see the evolution of the athletes’ bodies here. We, as human beings, are becoming taller and stronger, physically superior each year. At least some of us.

While we can’t directly compare MLB baseball with Olympic track and field and its demands, the evolution of athletes has been parallel to some extent in both. Look at this season’s sensation, Aaron Judge. He’s huge. He’s a specimen of his own, truly unique in his size and power. Basically, he’s what we might call the evolution of the baseball player made real. Given that we have height and weight data from 1871 to 2017 provided by Baseball-Reference.com, we can plot the evolution of both the height and weight of MLB players over the past 146 years. Here are the results.

Unsurprising, if anything. As we could expect, small players populated the majors during the 19th century and the first third of the 20th, only to dwindle to the point where there have never been more than three active players of 67 inches or less in any of the past 61 years. Conversely, players taller than 78 inches started to appear prominently in the 1960s and 1970s, reaching their peak in 2011 with 72 players spread over multiple MLB rosters. A similar story can be told about the weight of ballplayers, who tended to be lighter in the early days of the game than from the 1970s on; lighter players began to be outnumbered by heavier ones around the mid-to-late 1990s.

But even with a trend as clear as this one, there are always outliers. And in this particular case of player size, Jose Altuve is defying the rules of evolution by no small margin. At 5’6’’, the Venezuelan is the shortest active MLB player, and he started his path to the majors by signing with Houston for a laughable $15,000 international bonus after earlier being rejected by the Astros for being too short. This happened in 2007, and by 2011 Jose Altuve was already playing in MLB and finishing his rookie season with a 0.7 bWAR (good for 5th-best among rookies aged 21 or younger, tied with future RoY Mike Trout). In his second season, Altuve made the All-Star Game, became a staple at second base for Houston and posted a 1.4 bWAR. From that point on he’s had seasons valued at 1.0, 6.1, 4.5, 7.6 and 6.2 bWAR. The next table includes the players 5’6’’ or shorter that MLB has seen since 1871 who accrued 20+ bWAR during their first seven seasons in the majors.

Look at the debut season of all those players. Of the eight who made the list, two are from the 19th century and five debuted between 1908 and 1941. That is, the most recent “small” player before Jose Altuve with 20+ bWAR during his first seven seasons debuted more than 75 years ago, and Altuve has yet to finish the 2017 season, which will probably add to his bWAR total.

Focusing on the 2017 season, a total of 1,105 position players and pitchers have generated offensive statistical lines and accrued bWAR values, according to Baseball-Reference.com. Here’s how they are distributed in terms of height and bWAR.

It is not hard to see that the average MLB player stands around 72 inches (6’0’’), with most falling between 69 and 76. There are way taller (Chris Young, Alex Meyer, Dellin Betances) and way shorter (Tony Kemp, Alexi Amarista) outliers, and if we add bWAR to the equation, then there is Jose Altuve. Yes, Altuve is the blue dot in the chart, at the bottom right part of it. Not only is he the shortest player in the league, but he’s also the most valuable at this point (6.2 bWAR as of Sunday, August 6), by a good margin over his closest rivals: Andrelton Simmons (5.7), Paul Goldschmidt (5.5), Aaron Judge (5.1), and Mookie Betts and Anthony Rendon (both 5.0).

Not content with that, Altuve is leading the league in hits (151, with just an 11.9 K%, 16th-best among qualified hitters), batting average (.365), OPS+ (176) and total bases (238). He has improved in virtually every statistical category during the current season, participated in his fourth consecutive All-Star Game, led the MVP race in the AL, and he’s on pace to also earn his fourth Silver Slugger award at second base. Even with all that, the likes of Judge and Trout are charging and could finish the year strongly, so there is no guarantee that Jose will become the first Venezuelan to win the MVP since Miguel Cabrera won back-to-back awards in 2012 and 2013.

All in all, looking at how his top rivals stack up in terms of size and production, their numbers could be somewhat expected. What Altuve is doing at his size, though, not so much. We have been told that we’re living in the era of the strikeout and of the home-run resurrection, but Jose is determined to turn back the clock and make us all appreciate the wonders of small ballplayers roaming the majors’ fields. Appreciate it while you can, because what he’s doing is truly unique in the history of the sport and runs against everything its evolution would lead us to expect. Then again, it doesn’t seem like anything will be stopping Jose “Gigante” Altuve any time soon.