Quantifying Rumor Mongering in the Baseball Media Ecosystem
In what feels like an interminable scroll through the internet this offseason, waiting for something to finally happen, it occurred to me to ask: does any of this rumor-mongering actually tell us anything? It is certainly strange that we as consumers of baseball, a modified game of tag with hitting and throwing a ball, care so much about the internal machinations of billion-dollar organizations and the personal decision-making calculus of people we will never meet. Regardless of this peculiarity, I still spend hours a week wondering if George Springer would be willing to play for a team that doesn’t have a guaranteed home stadium for the foreseeable future and will subsequently be located in a foreign country, Canada.
This interest is what feeds the North American baseball media ecosystem and employs thousands of people, from reporters to web designers, social media managers to news aggregators, and many more. I wouldn’t necessarily argue that this content holds no value if it is biased or inaccurate, because the time we spend consuming this offseason content really just satiates our longing for baseball when we can’t watch our favorite teams live. But the question remains, does this content hold any predictive value, or are we just fooling ourselves?
This article is based on data scraped from MLB Trade Rumors, the leading aggregator of rumors around baseball, on December 9, 2020. I pulled the last 2,000 posts that each team was tagged in and analyzed what information we’re actually getting from reading and discussing the rumors and reports inside the baseball media ecosystem. To begin, we can observe the volume of rumors for teams by seeing how many days one would have to go back to reach a cumulative 2,000 posts.
A few teams really stand out from the pack, with the Mets, Yankees, and Red Sox on the high end and the Rockies on the low end. The Mets rumors only go back as far as December 23, 2015, so the remainder of the analysis focuses on the common time period held by all teams, beginning on Opening Day 2016. From April 2016 to March 2020, the correlation between a team’s number of rumors and its number of transactions is approximately .35. That means there is certainly some relationship there, but a larger factor appears to be franchise value (.67), which in turn correlates with fan interest. The volume of rumors around a team says much more about how many people care about the team, and thus report on it, than it does about the number of transactions the team actually makes.
Below is the table showing how far back each team’s posts go to reach 2,000. The gap between the top and bottom of the list is even more arresting: in excess of three and a half years.
Team | Count | Earliest Date | Days |
---|---|---|---|
New York Mets | 2000 | 12/23/2015 | 1813 |
New York Yankees | 2000 | 11/9/2015 | 1857 |
Boston Red Sox | 2000 | 9/12/2015 | 1915 |
Los Angeles Dodgers | 2000 | 5/20/2015 | 2030 |
Baltimore Orioles | 2000 | 3/31/2015 | 2080 |
Texas Rangers | 2000 | 3/24/2015 | 2087 |
Toronto Blue Jays | 2000 | 2/14/2015 | 2125 |
Washington Nationals | 2000 | 12/13/2014 | 2188 |
Philadelphia Phillies | 2000 | 12/9/2014 | 2192 |
Atlanta Braves | 2000 | 11/25/2014 | 2206 |
San Diego Padres | 2000 | 11/21/2014 | 2210 |
Chicago Cubs | 2000 | 11/8/2014 | 2223 |
San Francisco Giants | 2001 | 7/26/2014 | 2328 |
Miami Marlins | 2000 | 7/13/2014 | 2341 |
Tampa Bay Rays | 2000 | 6/28/2014 | 2356 |
Los Angeles Angels | 2000 | 5/1/2014 | 2414 |
Houston Astros | 2001 | 1/12/2014 | 2523 |
Seattle Mariners | 2000 | 12/14/2013 | 2552 |
Minnesota Twins | 2000 | 12/11/2013 | 2555 |
Cleveland Indians | 2000 | 10/26/2013 | 2601 |
Detroit Tigers | 2000 | 10/21/2013 | 2606 |
Pittsburgh Pirates | 2001 | 9/14/2013 | 2643 |
Arizona Diamondbacks | 2000 | 7/21/2013 | 2698 |
Cincinnati Reds | 2000 | 6/10/2013 | 2739 |
Milwaukee Brewers | 2000 | 6/1/2013 | 2748 |
Chicago White Sox | 2000 | 5/25/2013 | 2755 |
Oakland Athletics | 2000 | 4/3/2013 | 2807 |
St Louis Cardinals | 2000 | 1/28/2013 | 2872 |
Kansas City Royals | 2000 | 12/17/2012 | 2914 |
Colorado Rockies | 2000 | 5/18/2012 | 3127 |
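The Days column can be reproduced from the scrape date given earlier (December 9, 2020) and each team's earliest post date. The snippet below is a minimal Python sketch of that arithmetic, not the original analysis code (which was written in R); the function name `days_back` is made up for illustration.

```python
from datetime import date

# Scrape date stated in the article.
SCRAPE_DATE = date(2020, 12, 9)


def days_back(earliest: date, scraped: date = SCRAPE_DATE) -> int:
    """Number of days between a team's 2,000th-most-recent post and the scrape."""
    return (scraped - earliest).days


# Earliest dates taken from the table above.
mets = days_back(date(2015, 12, 23))
rockies = days_back(date(2012, 5, 18))

# Gap between the most- and least-covered teams, expressed in years.
gap_years = (rockies - mets) / 365.25

print(mets, rockies, round(gap_years, 1))
```

The gap between the two extremes is the figure behind the "three and a half years" remark.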
It is also interesting to observe when these posts are being written for each team. Note again that the next table begins in April 2016, so we see which teams have more written about them in the offseason vs. in-season.
Team | In-season | Offseason | Total | Offseason % |
---|---|---|---|---|
Oakland Athletics | 339 | 682 | 1021 | 66.8 |
Los Angeles Angels | 435 | 751 | 1186 | 63.3 |
Detroit Tigers | 374 | 639 | 1013 | 63.1 |
Milwaukee Brewers | 411 | 662 | 1073 | 61.7 |
New York Mets | 649 | 1031 | 1680 | 61.4 |
New York Yankees | 636 | 1007 | 1643 | 61.3 |
Toronto Blue Jays | 513 | 804 | 1317 | 61.0 |
Atlanta Braves | 496 | 762 | 1258 | 60.6 |
Seattle Mariners | 442 | 676 | 1118 | 60.5 |
Pittsburgh Pirates | 412 | 629 | 1041 | 60.4 |
Kansas City Royals | 363 | 550 | 913 | 60.2 |
Baltimore Orioles | 550 | 828 | 1378 | 60.1 |
Los Angeles Dodgers | 558 | 841 | 1399 | 60.1 |
Texas Rangers | 604 | 881 | 1485 | 59.3 |
Boston Red Sox | 649 | 927 | 1576 | 58.8 |
San Diego Padres | 485 | 692 | 1177 | 58.8 |
Washington Nationals | 551 | 784 | 1335 | 58.7 |
Cleveland Indians | 463 | 650 | 1113 | 58.4 |
Philadelphia Phillies | 503 | 705 | 1208 | 58.4 |
Tampa Bay Rays | 511 | 716 | 1227 | 58.4 |
Arizona Diamondbacks | 423 | 590 | 1013 | 58.2 |
San Francisco Giants | 525 | 729 | 1254 | 58.1 |
Chicago Cubs | 521 | 713 | 1234 | 57.8 |
Miami Marlins | 501 | 684 | 1185 | 57.7 |
St Louis Cardinals | 440 | 589 | 1029 | 57.2 |
Cincinnati Reds | 450 | 597 | 1047 | 57.0 |
Houston Astros | 467 | 618 | 1085 | 57.0 |
Minnesota Twins | 528 | 692 | 1220 | 56.7 |
Chicago White Sox | 471 | 544 | 1015 | 53.6 |
Colorado Rockies | 376 | 431 | 807 | 53.4 |
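For readers who want to check the Offseason % column, it is simply the offseason count divided by the total. The following is a small Python sketch using three rows copied from the table; the helper name `offseason_share` is hypothetical.

```python
def offseason_share(in_season: int, offseason: int) -> float:
    """Percentage of a team's posts that were written in the offseason."""
    return 100 * offseason / (in_season + offseason)


# (in-season, offseason) counts copied from the table above.
counts = {
    "Oakland Athletics": (339, 682),
    "Chicago White Sox": (471, 544),
    "Colorado Rockies": (376, 431),
}

for team, (in_season, offseason) in counts.items():
    print(f"{team}: {offseason_share(in_season, offseason):.1f}%")

# Colorado's offseason posts relative to the next-lowest team (White Sox).
rockies_vs_next = 431 / 544
print(f"{rockies_vs_next:.0%}")
```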
Interestingly, no team has a majority of its posts written in-season, even though the season lasts longer than the offseason. I think that says a lot about this industry and why it is so popular. It exists because some number of fans want more content about their favorite team, and the desire only intensifies when there is no new baseball to consume. Looking at the list, there is an interesting mix of teams at the extremes. Colorado not only has by far the fewest posts, but also the highest percentage of its posts in-season; in fact, it doesn’t even have the fewest in-season posts. The biggest difference for the Colorado rumors is a shockingly low number of posts in the offseason, only about 80% of the next-closest team’s total. I don’t know much about the baseball consumption or culture of Colorado Rockies fans, but I’d appreciate hearing from anyone who could shed more light on why this big outlier exists.
It’s worth noting how many of the rumors about Oakland and the Los Angeles Angels occur in the offseason. The latter doesn’t surprise me as much, since so much of the past five years in Anaheim has been about trying to put a contending team around the best player in the game, Mike Trout, and also signing Shohei Ohtani. Oakland is a little more surprising. Any explanation for it must be one that doesn’t apply to the Tampa Bay Rays, a club below the median in the same statistic and one whose team-building I consider to generate a similar conversation.
Lastly on the descriptive front for team analysis, let’s examine which clubs most frequently appeared in the same rumors together to get a sense of which teams the baseball media ecosystem seems to revolve around. A quick note about the structure of MLB Trade Rumors: it often publishes multi-team posts organized by division, so teams appearing in the same post are not necessarily talking to each other or about the same player.
Unsurprisingly, right at the heart of the diagram, making the strongest connections, are the Yankees and the Red Sox. They are both tied closely with Baltimore, a division mate, but they also have strong connections with the Giants and the Phillies, which we will see later has a lot to do with one Manny Machado. A small-market team at the center of the diagram is the Miami Marlins, which is likely because of their penchant for trading away star players like J.T. Realmuto, Marcell Ozuna, Giancarlo Stanton, and Christian Yelich. Linger on that sentence for a moment longer because all those baseball players are so good.
There are many more interesting links, too numerous to name, but I’d also like to highlight the connection between San Diego and Boston, who made deals involving Craig Kimbrel and Drew Pomeranz and were both linked to Machado between 2016 and 2018.
I have mentioned Machado twice already because I believe he is so important in explaining the map of team connections on MLBTR. Below is a table of the top 40 player mentions specific to a team. The only player to appear more than twice on this list is Machado. To be fair, he also played for three different teams in this time period, but to be even more fair, two of the teams he played for are not even on the list. He does appear again with the White Sox (another team he’s never played for), the Dodgers, and the Padres with 427, 343, and 330 mentions, respectively.
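One simple way to measure how often clubs appear in the same rumors is to count every unordered pair of teams tagged on each post. The Python sketch below illustrates the idea under that assumption; the post data and the function name `team_pair_counts` are hypothetical, and the original analysis (written in R) may have been structured differently.

```python
from collections import Counter
from itertools import combinations


def team_pair_counts(posts):
    """Count how often each unordered pair of teams is tagged on the same post."""
    counts = Counter()
    for teams in posts:
        # set() drops duplicate tags; sorted() gives each pair a canonical order.
        for pair in combinations(sorted(set(teams)), 2):
            counts[pair] += 1
    return counts


# Hypothetical posts, each represented by the list of teams tagged on it.
posts = [
    ["Yankees", "Red Sox", "Orioles"],
    ["Yankees", "Red Sox"],
    ["Padres", "Red Sox"],
    ["Marlins"],  # single-team posts contribute no pairs
]

pairs = team_pair_counts(posts)
print(pairs.most_common(3))
```

Because division-wide roundup posts tag several teams at once, counts built this way will overstate ties between division mates, which is the caveat noted above.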
Team | Word | Mentions |
---|---|---|
Baltimore Orioles | Machado | 1006 |
Miami Marlins | Stanton | 973 |
Washington Nationals | Harper | 943 |
Boston Red Sox | Price | 870 |
Miami Marlins | Realmuto | 838 |
Boston Red Sox | Betts | 831 |
New York Mets | Syndergaard | 793 |
New York Mets | Céspedes | 792 |
New York Mets | Bruce | 755 |
San Francisco Giants | Bumgarner | 732 |
Boston Red Sox | Sale | 718 |
New York Yankees | Chapman | 702 |
Chicago Cubs | Bryant | 695 |
St Louis Cardinals | Martínez | 694 |
Los Angeles Angels | Ohtani | 689 |
New York Mets | deGrom | 689 |
Baltimore Orioles | Britton | 682 |
Philadelphia Phillies | Harper | 677 |
New York Mets | Wheeler | 660 |
Washington Nationals | Strasburg | 636 |
New York Yankees | Sabathia | 635 |
Toronto Blue Jays | Bautista | 623 |
Boston Red Sox | Martínez | 618 |
Oakland Athletics | Gray | 618 |
Cleveland Indians | Kluber | 615 |
Colorado Rockies | Arenado | 615 |
New York Yankees | Stanton | 615 |
Chicago White Sox | Abreu | 614 |
Los Angeles Dodgers | Hill | 613 |
Los Angeles Dodgers | Kershaw | 611 |
Philadelphia Phillies | Machado | 610 |
Toronto Blue Jays | Donaldson | 603 |
Pittsburgh Pirates | McCutchen | 601 |
Washington Nationals | Rizzo | 600 |
Pittsburgh Pirates | Cole | 582 |
Toronto Blue Jays | Stroman | 581 |
New York Mets | Harvey | 578 |
New York Yankees | Machado | 567 |
Los Angeles Angels | Fletcher | 560 |
Los Angeles Angels | Trout | 536 |
A potential point of confusion: the Cardinals’ Martínez likely refers to both Carlos and José, which artificially boosts that number, and the Nationals’ Rizzo likely refers predominantly to general manager Mike Rizzo, not first baseman Anthony Rizzo.
My last observation concerns the high placement of different Mets in the table. Most of this list is composed of high-profile players who were in talks about changing teams by trade, free agency, or both. Céspedes signed his deal before this time period began, Syndergaard has never been traded or hit free agency, and Jay Bruce has produced 2.7 bWAR, only 1.6 of that in two stints with the Mets. A lot is made of the tumult within the Mets organization, and perhaps that is fair, but if the media ecosystem around one team can talk about Jay Bruce more than any other team’s media ecosystem talked about all but six players across the league in the past five years, there is clearly a hyperfocus and scrutiny on this team that others do not face.
With this descriptive analysis, we can see a lot of the trends in the coverage around baseball: where the focus is, who it’s on, and when it is most voluminous. However, the question remains: how much predictive value does this all have? To dig deeper, I counted the number of times a word appears per post about a team, noting whether the post was written in the offseason, to see what that said about the quality of the club. Using this data from the 2016 regular season through the 2018 offseason, I built a multiple linear regression model on the keywords most relevant in determining a team’s win total. On the positive side were “luxury tax” and “game,” and on the negative side were “organization,” “seasons,” “trade,” and “surgery.”
There’s one other big one, and that is “payroll.” It matters whether this word appears during the season or in the offseason. Talk of payroll in the offseason is spread more evenly across teams and is actually not a bad sign for a club. However, if a team is talking about payroll in-season, that is the single strongest indicator of poor performance. This model isn’t particularly fancy and knows nothing about how good any single player is, and yet it is still able to explain almost 40% of the variation in wins between teams. Here is a chart outlining the expected number of wins in 2019 for each team according to the model against their actual total.
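To make "multiple linear regression" and "explains almost 40% of the variation" concrete, here is a minimal, dependency-free Python sketch. It fits wins on per-post keyword rates by ordinary least squares and reports R². It is not the author's model (which was built in R): the two feature names, the keyword rates, and the win totals are all invented for illustration.

```python
def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict(coefs, row):
    """Predicted wins for one team's keyword rates."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Hypothetical per-post rates: ("luxury tax" mentions, in-season "payroll" mentions).
X = [(0.10, 0.02), (0.02, 0.10), (0.05, 0.05), (0.08, 0.01), (0.01, 0.08)]
wins = [90, 66, 79, 86, 71]  # hypothetical win totals

coefs = fit_ols(X, wins)
r2 = r_squared(wins, [predict(coefs, row) for row in X])
print(coefs, round(r2, 2))
```

An R² of roughly 0.4 on real data would correspond to the "almost 40%" figure; the toy data here fits far more tightly than that because it was made up to be nearly linear.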
There’s definitely a bias toward overestimating a team’s win total: the model predicts 2,615 wins across the league even though there are only 2,430 games in a regular season, and thus only 2,430 wins to go around. That works out to an average of about 87 predicted wins per team rather than 81. However, the general order in which the teams are placed accurately reflects where they ended up finishing, save for some big misses on the Cubs and Red Sox. This means the coverage being aggregated on MLB Trade Rumors is certainly reflective of how a team performs on the field.
It’s also interesting to note the recurring words that point us in one direction or the other. There’s a common theme on both sides: teams that do well are talking about the luxury tax (oftentimes because they are spending into it or contemplating the matter), while teams that are struggling tend to talk more about money and payroll, which is often about cutting it rather than increasing it. While this gives a sense of the narrative around individual clubs and how it connects to the team’s performance, it does not tell us about the validity of the rumors we are seeing.
Looking at the offseasons from 2016 to 2020, I compared the number of posts league-wide in each month with the number of signings in that month and found a correlation of .4, which isn’t nothing. Here are charts of the volume of signings and rumors over time in each of the past five offseasons.
In 2016, and to a lesser degree 2017, rumors and signings line up really well except for a discordance in March, which is likely attributable to season preview pieces in spring training after most of the signings had occurred. In the past three offseasons, however, signings have been spread much more evenly across the months, while in 2018 and 2019 the rumors still came at a pace more like that of the previous years. In 2020, the monthly frequency of the rumor mill began to resemble the actual state of the signing market much more closely.
This is certainly a phenomenon many people have written about in the past five years, but I can’t help but wonder if a uniform distribution across time for free-agent signings is actually exactly what we should expect of this market. I am not an economist, but perhaps the forces that currently exist on both sides of the ledger, for coming to a deal and for holding out longer, make any given date in the offseason just as likely as any other to be the date on which an agreement is made. It may also be that the trend of players signing later and later in the offseason will keep moving in that direction, but I’d be wary of our bias toward overestimating trends and underestimating regression to the mean.
The final question is: can the information in these posts be used to predict which team a player will sign with before he does? Again using the words in posts about a player before he signs, along with how frequently the player is linked to a specific team and whether that team is the one most often linked to him, one can create a powerful logistic regression model. Notably, the higher the proportion of rumors that specifically link a team and a player, especially if the team is the one most frequently tied to that player, the higher the chance of a signing.
The words most relevant in predicting a player signing with a team were, on the positive side, “comments,” “deal,” “sign,” and “report,” while the negative words were “teams,” “pick,” “baseball,” “DH,” and by far the most impactful single word… “owner.” Notably, on the positive side, when a post actually uses words like “deal” and “sign,” the pairing is more likely to come together. On the negative side, however, talk of an owner being involved dramatically decreases the odds of a team signing a specific player; that word is 2.2 times more impactful on the outcome than any other word, positive or negative.
To get a sense of how good this model actually is, I used it to predict the 2020 offseason (on which it was not trained), and it correctly predicted the status of the signing 86% of the time. However, consider that on average a player was mentioned in an article with 9.12 unique teams that offseason, and he can sign with only one of them. This means that a model which always predicted that a linked team would not sign the player would be 89% accurate. More relevantly, then, the percentage of non-signings the model found (specificity) was 89%, and the percentage of signings it found (sensitivity) was 66%. However, when the model predicted a signing, it was correct (precision) only 48% of the time.
The model wasn’t particularly reliable when it predicted a signing, but it was good at balancing the risk of misattributing a signing against finding a higher percentage of them. This implies our independent variables are indeed providing relevant information about signings. Among the correctly predicted signings were Gerrit Cole to the Yankees, Jason Kipnis to the Cubs, and Edwin Encarnación to the White Sox.
To conclude, this analysis turned up some interesting facts and trends in the coverage of specific teams and how they interact with other teams. It also shows that the words being used are often reflective of the way a team is performing on the field and, in the offseason, can often be used to predict fairly accurately whether or not a team and a player reach an agreement in free agency. The code used to pull the data from MLB Trade Rumors, pull the transaction data from Baseball Reference, and analyze it is available on my GitHub here. I’d encourage anyone to use this data and see if they can mine more from this resource, perhaps with more advanced Natural Language Processing (NLP) techniques such as topic modelling or Word2Vec prediction models.
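The correlation quoted here is presumably the ordinary Pearson coefficient between two monthly series. The sketch below shows that calculation in plain Python. The monthly counts are hypothetical placeholders, not the scraped data, so the value it prints is not the article's .4.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical league-wide counts for one offseason, November through March.
posts_per_month = [900, 1400, 1100, 950, 800]
signings_per_month = [20, 60, 45, 30, 10]

r = pearson(posts_per_month, signings_per_month)
print(round(r, 2))
```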
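As a rough picture of what such a logistic regression looks like, the following Python sketch fits one by plain gradient descent on two of the non-word features described above: the share of a player's rumors that link him to a team, and whether that team is his most-linked team. Everything in it is an assumption made for illustration (the tiny dataset, the feature encoding, the learning rate); the original model was built in R and also used word features.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, row):
    """Estimated probability that the team signs the player; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)))


def fit_logistic(X, y, lr=0.5, steps=5000):
    """Fit logistic regression weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            err = predict_proba(w, row) - label
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


# Hypothetical (share of player's rumors linking this team, is most-linked team).
X = [
    (0.50, 1), (0.40, 1), (0.30, 1), (0.35, 1),
    (0.10, 0), (0.05, 0), (0.15, 0), (0.20, 0), (0.08, 0), (0.25, 0),
]
signed = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical outcomes

w = fit_logistic(X, signed)
p_strong = predict_proba(w, (0.50, 1))  # heavily linked, most-linked team
p_weak = predict_proba(w, (0.05, 0))    # barely linked
print(round(p_strong, 2), round(p_weak, 2))
```

A positive weight on a feature raises the predicted odds of a signing and a negative weight lowers them, which is the sense in which words like “owner” count against a pairing.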
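The four measures in this paragraph all come from the same confusion matrix, and the 89% baseline follows directly from the 9.12 teams-per-player figure. Here is a short Python sketch of both calculations. The confusion-matrix counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the model's actual results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and precision from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of actual signings found
        "specificity": tn / (tn + fp),  # share of actual non-signings found
        "precision": tp / (tp + fp),    # share of predicted signings that happened
    }


# Baseline: always predict "no signing". With 9.12 linked teams per player and
# one eventual signing, that guess is wrong only for the one signing team.
teams_per_player = 9.12  # figure from the article
baseline_accuracy = 1 - 1 / teams_per_player
print(round(baseline_accuracy, 2))

# Hypothetical counts: 3 real signings, 17 real non-signings.
metrics = classification_metrics(tp=2, fp=2, tn=15, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

This is why raw accuracy is a weak yardstick here: when roughly nine in ten team-player links never become a signing, a model can score well on accuracy while finding no signings at all.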
Best article I’ve ever read @FanGraphs/Community Research! My 1st time commenting too… I’m leading a new team & love this example of how to mine complex data to turn it into information, which allows it to be action-able. Thanks & great work!
Thank you very much, I really appreciate that!
Great job, Peter! I did read through your code (I honestly just wanted to play with the MLBTR data myself), but it was in need of some formatting. Good tip for RStudio: ctrl+shift+A will auto-format your code.
Thanks! That’s a great tip, I’ve since updated the R file on Github.