Quantifying Rumor Mongering in the Baseball Media Ecosystem

In what feels like interminable scrolling of the internet this offseason, waiting for something to finally happen, it occurred to me to ask: does any of this rumor-mongering actually tell us anything? It is certainly strange that we as consumers of baseball, a modified game of tag with hitting and throwing a ball, care so much about the internal machinations of billion-dollar organizations and the personal decision-making calculus of people we will never meet. Regardless of this peculiarity, I still spend hours a week wondering if George Springer would be willing to play for a team that has no guaranteed home stadium for the foreseeable future and that will eventually be based in a foreign country, Canada.

This interest is what feeds the North American baseball media ecosystem and employs thousands of people, from reporters to web designers, social media managers to news aggregators, and many more. I wouldn’t necessarily argue that this content holds no value if it is biased or inaccurate, because the time we spend consuming this offseason content really just satiates our longing for baseball when we can’t watch our favorite teams live. But the question remains, does this content hold any predictive value, or are we just fooling ourselves?

This article is based on data scraped from MLB Trade Rumors, the leading aggregator of rumors around baseball, on December 9, 2020. I pulled the last 2,000 posts that each team was tagged in and analyzed what information we’re actually getting from reading and discussing the rumors and reports inside the baseball media ecosystem. To begin, we can observe the volume of rumors for teams by seeing how many days one would have to go back to reach a cumulative 2,000 posts.

A few teams really stand out from the pack, with the Mets, Yankees, and Red Sox on the high end and the Rockies on the low end. The Mets rumors only go back as far as December 23, 2015, so the remainder of the analysis focuses on the period common to all teams, beginning on Opening Day 2016. From April 2016 to March 2020, the correlation between the number of rumors and the number of transactions is approximately .35. That means there is some relationship there, but a larger factor appears to be franchise value (.67), which in turn correlates with fan interest. The volume of rumors around a team says much more about how many people care about the team, and thus report on it, than it does about the volume of transactions the team actually makes.
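For readers who want to see what a figure like the .35 above measures, here is a minimal sketch of a Pearson correlation in plain Python. The article does not say which correlation measure or tooling was used, so Pearson is an assumption, the function name `pearson` is hypothetical, and the per-team numbers are made-up placeholders rather than the scraped data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder values for five hypothetical teams (NOT real data).
rumor_counts = [100, 80, 60, 40, 20]
transaction_counts = [30, 22, 27, 18, 20]

r = pearson(rumor_counts, transaction_counts)
print(f"r = {r:.2f}")
```

A value near 0 means rumor volume says little about transaction volume, while a value near 1 means the two rise together.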

The table below shows how far back one has to go for each team to reach 2,000 posts. Strikingly, the gap between the top and bottom of the list is more than three and a half years.

Going Back 2,000 Posts on MLBTR
Team Count Earliest Date Days
New York Mets 2000 12/23/2015 1813
New York Yankees 2000 11/9/2015 1857
Boston Red Sox 2000 9/12/2015 1915
Los Angeles Dodgers 2000 5/20/2015 2030
Baltimore Orioles 2000 3/31/2015 2080
Texas Rangers 2000 3/24/2015 2087
Toronto Blue Jays 2000 2/14/2015 2125
Washington Nationals 2000 12/13/2014 2188
Philadelphia Phillies 2000 12/9/2014 2192
Atlanta Braves 2000 11/25/2014 2206
San Diego Padres 2000 11/21/2014 2210
Chicago Cubs 2000 11/8/2014 2223
San Francisco Giants 2001 7/26/2014 2328
Miami Marlins 2000 7/13/2014 2341
Tampa Bay Rays 2000 6/28/2014 2356
Los Angeles Angels 2000 5/1/2014 2414
Houston Astros 2001 1/12/2014 2523
Seattle Mariners 2000 12/14/2013 2552
Minnesota Twins 2000 12/11/2013 2555
Cleveland Indians 2000 10/26/2013 2601
Detroit Tigers 2000 10/21/2013 2606
Pittsburgh Pirates 2001 9/14/2013 2643
Arizona Diamondbacks 2000 7/21/2013 2698
Cincinnati Reds 2000 6/10/2013 2739
Milwaukee Brewers 2000 6/1/2013 2748
Chicago White Sox 2000 5/25/2013 2755
Oakland Athletics 2000 4/3/2013 2807
St. Louis Cardinals 2000 1/28/2013 2872
Kansas City Royals 2000 12/17/2012 2914
Colorado Rockies 2000 5/18/2012 3127
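The Days column can be reproduced from the Earliest Date column and the scrape date of December 9, 2020 given earlier. The sketch below checks the first and last rows of the table; the function name `days_back` is a hypothetical helper, not something from the original analysis.

```python
from datetime import date

SCRAPE_DATE = date(2020, 12, 9)  # scrape date stated in the article

def days_back(earliest_post: date, scraped: date = SCRAPE_DATE) -> int:
    """Days between a team's 2,000th-most-recent post and the scrape date."""
    return (scraped - earliest_post).days

print(days_back(date(2015, 12, 23)))  # New York Mets row: 1813
print(days_back(date(2012, 5, 18)))   # Colorado Rockies row: 3127
```

The difference between those two rows is 1,314 days, or roughly 3.6 years.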

It is also interesting to observe when these posts are being written for each team. Note again that the next table begins in April 2016, so we see which teams have more written about them in the offseason vs. in-season.

When Do Clubs Get Written About Most?
Team In-season Offseason Total Offseason %
Oakland Athletics 339 682 1021 66.8
Los Angeles Angels 435 751 1186 63.3
Detroit Tigers 374 639 1013 63.1
Milwaukee Brewers 411 662 1073 61.7
New York Mets 649 1031 1680 61.4
New York Yankees 636 1007 1643 61.3
Toronto Blue Jays 513 804 1317 61.0
Atlanta Braves 496 762 1258 60.6
Seattle Mariners 442 676 1118 60.5
Pittsburgh Pirates 412 629 1041 60.4
Kansas City Royals 363 550 913 60.2
Baltimore Orioles 550 828 1378 60.1
Los Angeles Dodgers 558 841 1399 60.1
Texas Rangers 604 881 1485 59.3
Boston Red Sox 649 927 1576 58.8
San Diego Padres 485 692 1177 58.8
Washington Nationals 551 784 1335 58.7
Cleveland Indians 463 650 1113 58.4
Philadelphia Phillies 503 705 1208 58.4
Tampa Bay Rays 511 716 1227 58.4
Arizona Diamondbacks 423 590 1013 58.2
San Francisco Giants 525 729 1254 58.1
Chicago Cubs 521 713 1234 57.8
Miami Marlins 501 684 1185 57.7
St. Louis Cardinals 440 589 1029 57.2
Cincinnati Reds 450 597 1047 57.0
Houston Astros 467 618 1085 57.0
Minnesota Twins 528 692 1220 56.7
Chicago White Sox 471 544 1015 53.6
Colorado Rockies 376 431 807 53.4
SOURCE: MLB Trade Rumors

Interestingly, no team has a majority of its posts written in-season, even though the season is longer than the offseason. I think that says a lot about this industry and why it is so popular. It exists because some number of fans want more content about their favorite team, and the desire only intensifies when there is no new baseball to consume. Looking at the list, there is an interesting mix of teams at the extremes. Colorado not only has by far the fewest posts but also the highest percentage of its posts in-season; in fact, it doesn’t even have the fewest in-season posts. The biggest difference for the Colorado rumors is a shockingly low number of posts in the offseason, only about 80% of the next-closest team’s total. I don’t know much about the baseball consumption or culture of Colorado Rockies fans, but I’d appreciate hearing from anyone who could shed more light on why this big outlier exists.
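The percentages in the table, and the Colorado comparison, follow directly from the in-season and offseason counts. A small sketch using three rows copied from the table (the helper name `offseason_pct` is hypothetical):

```python
def offseason_pct(in_season: int, offseason: int) -> float:
    """Share of a team's posts written in the offseason, in percent."""
    return 100 * offseason / (in_season + offseason)

# (in-season, offseason) post counts copied from the table above.
counts = {
    "Oakland Athletics": (339, 682),
    "Chicago White Sox": (471, 544),
    "Colorado Rockies": (376, 431),
}

for team, (in_season, offseason) in counts.items():
    print(f"{team}: {offseason_pct(in_season, offseason):.1f}% offseason")

# Colorado's offseason posts relative to the next-lowest offseason total,
# which belongs to the White Sox.
ratio = counts["Colorado Rockies"][1] / counts["Chicago White Sox"][1]
print(f"Rockies / White Sox offseason posts: {ratio:.2f}")
```

This prints 66.8%, 53.6%, and 53.4% for the three teams and a ratio of 0.79, matching the table and the roughly 80% figure in the text.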

It’s worth noting that so many of the rumors about Oakland and the Los Angeles Angels occur in the offseason. The latter doesn’t surprise me as much, since so much of the past five years in Anaheim has been about trying to put a contending team around the best player in the game, Mike Trout, and also about signing Shohei Ohtani. Oakland is a little more surprising. If one would like to ascribe a reason for that, it must be one that doesn’t apply to the Tampa Bay Rays, a club that sits below the median in the same statistic and that, in my view, attracts a similar conversation about team-building.

Lastly on the descriptive front for team analysis, let’s examine which clubs most frequently appeared in the same rumors together to get a sense of which teams the baseball media ecosystem seems to revolve around. A quick note about the structure of MLB Trade Rumors: it often publishes multi-team posts organized by division, so teams sharing a post are not necessarily talking to each other or about the same player.
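Counting how often two clubs share a post is a pairwise co-occurrence tally. The sketch below shows one way to do it with the standard library. The four posts are hypothetical stand-ins for scraped posts, each reduced to its set of team tags, and nothing here is taken from the code behind the article.

```python
from collections import Counter
from itertools import combinations

# Hypothetical posts, each reduced to the set of team tags it carries.
posts = [
    {"New York Yankees", "Boston Red Sox"},
    {"New York Yankees", "Boston Red Sox", "Baltimore Orioles"},
    {"San Diego Padres", "Boston Red Sox"},
    {"Colorado Rockies"},  # single-team posts contribute no pairs
]

pair_counts = Counter()
for tags in posts:
    # sorted() gives each unordered pair a single canonical key.
    for pair in combinations(sorted(tags), 2):
        pair_counts[pair] += 1

for (team_a, team_b), n in pair_counts.most_common(3):
    print(f"{team_a} + {team_b}: {n}")
```

Because a division roundup tags every club it covers, a tally like this will overstate links between division mates, which is the caveat raised above.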

Unsurprisingly, right at the heart of the diagram, with the strongest connections, are the Yankees and the Red Sox. They are both tied closely to Baltimore, a division mate, but they also have strong connections with the Giants and the Phillies, which, as we will see later, has a lot to do with one Manny Machado. A small-market team at the center of the diagram is the Miami Marlins, likely because of their penchant for trading away star players like J.T. Realmuto, Marcell Ozuna, Giancarlo Stanton, and Christian Yelich. Linger on that sentence for a moment longer, because all of those players are so good.

There are many more interesting links, too numerous to name, but I’d also like to highlight the connection between San Diego and Boston, who made deals involving Craig Kimbrel and Drew Pomeranz and were both linked to Machado from 2016 to 2018.

I have mentioned Machado twice already because I believe he is so important in explaining the map of team connections on MLBTR. Below is a table of the top 40 mentions of players specific to a team. The only player to appear more than twice on this list is Machado. To be fair, he also played for three different teams in this time period, but to be even more fair, two of the teams he played for are not even on the list. Beyond the top 40, he appears again with the White Sox (another team he’s never played for), the Dodgers, and the Padres with 427, 343, and 330 mentions, respectively.

Most-Rumored Players
Team Word Mentions
Baltimore Orioles Machado 1006
Miami Marlins Stanton 973
Washington Nationals Harper 943
Boston Red Sox Price 870
Miami Marlins Realmuto 838
Boston Red Sox Betts 831
New York Mets Syndergaard 793
New York Mets Céspedes 792
New York Mets Bruce 755
San Francisco Giants Bumgarner 732
Boston Red Sox Sale 718
New York Yankees Chapman 702
Chicago Cubs Bryant 695
St. Louis Cardinals Martínez 694
Los Angeles Angels Ohtani 689
New York Mets deGrom 689
Baltimore Orioles Britton 682
Philadelphia Phillies Harper 677
New York Mets Wheeler 660
Washington Nationals Strasburg 636
New York Yankees Sabathia 635
Toronto Blue Jays Bautista 623
Boston Red Sox Martínez 618
Oakland Athletics Gray 618
Cleveland Indians Kluber 615
Colorado Rockies Arenado 615
New York Yankees Stanton 615
Chicago White Sox Abreu 614
Los Angeles Dodgers Hill 613
Los Angeles Dodgers Kershaw 611
Philadelphia Phillies Machado 610
Toronto Blue Jays Donaldson 603
Pittsburgh Pirates McCutchen 601
Washington Nationals Rizzo 600
Pittsburgh Pirates Cole 582
Toronto Blue Jays Stroman 581
New York Mets Harvey 578
New York Yankees Machado 567
Los Angeles Angels Fletcher 560
Los Angeles Angels Trout 536
SOURCE: MLB Trade Rumors

A potential point of confusion: the Cardinals’ Martínez likely refers to both Carlos and José, which artificially boosts that number, and the Nationals’ Rizzo likely refers predominantly to general manager Mike Rizzo, not first baseman Anthony Rizzo.

My last observation concerns the high placement of several Mets in the table. Most of this list is composed of high-profile players who were in talks about changing teams by trade, free agency, or both. Céspedes signed his deal before this time period began, Syndergaard was neither traded nor a free agent during it, and Jay Bruce has produced 2.7 bWAR, only 1.6 of it in two stints with the Mets. A lot is made of the tumult within the Mets organization, and perhaps that is fair, but if the media ecosystem around one team can talk about Jay Bruce more than any other team’s media ecosystem talks about all but six players across the league over the past five years, there is clearly a hyperfocus and scrutiny on this team that others do not face.

With this descriptive analysis, we can see many of the trends in the coverage around baseball: where the focus is, who it’s on, and when it is most voluminous. However, the question remains: how much predictive value does this all have? To dig deeper, I computed how often each word appears per post about a team, split by whether the post was written in the offseason, to see what it said about the quality of the club. Using this data from the 2016 regular season through the 2018 offseason, I built a multiple linear regression model on the keywords most relevant in determining a team’s win total. On the positive side were “luxury tax” and “game,” and on the negative side were “organization,” “seasons,” “trade,” and “surgery.”
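To make the feature construction concrete, here is a rough sketch of how mentions per post might be tallied by team and by season phase before being fed to a regression. It is an illustration under assumptions, not the code used for the article: the posts, the team names, the phase labels, and the function `keyword_rates` are all hypothetical, and matching is a simple case-insensitive substring count.

```python
from collections import defaultdict

KEYWORDS = ("luxury tax", "trade", "surgery")  # a few of the words named above

# Hypothetical posts: (team, phase, text). Phase is "in-season" or "offseason".
posts = [
    ("Team A", "in-season", "Team A explores a trade after its starter undergoes surgery."),
    ("Team A", "in-season", "Another trade rumor: Team A listens on its closer."),
    ("Team A", "offseason", "Team A could exceed the luxury tax this winter."),
    ("Team B", "offseason", "Team B completes a trade and stays under the luxury tax."),
]

def keyword_rates(posts, keywords=KEYWORDS):
    """Return {(team, phase, keyword): mentions per post}."""
    post_totals = defaultdict(int)
    mention_totals = defaultdict(int)
    for team, phase, text in posts:
        post_totals[(team, phase)] += 1
        lowered = text.lower()
        for kw in keywords:
            # Substring count, so "trades" would also count toward "trade".
            mention_totals[(team, phase, kw)] += lowered.count(kw)
    return {
        (team, phase, kw): mention_totals[(team, phase, kw)] / n_posts
        for (team, phase), n_posts in post_totals.items()
        for kw in keywords
    }

rates = keyword_rates(posts)
print(rates[("Team A", "in-season", "trade")])    # 1.0 mentions per post
print(rates[("Team A", "in-season", "surgery")])  # 0.5 mentions per post
```

Each (team, phase, keyword) rate would then become one candidate predictor of that team’s win total.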

There’s one other big one, and that is “payroll.” It matters whether this word appears during the season or in the offseason. Talk of payroll in the offseason is spread more evenly across teams and is actually not a bad indicator for a team. However, if a team’s payroll is being talked about in-season, that is the single strongest indicator of poor performance. This model isn’t particularly fancy and knows nothing about how good any single player is, yet it is still able to explain almost 40% of the variation in wins between teams. Here is a chart plotting each team’s expected number of wins in 2019 according to the model against its actual total.

There’s definitely a bias toward overestimating a team’s win total: the model predicts 2,615 wins across the league even though only 2,430 games, and therefore only 2,430 wins, are available in a regular season. However, the general order in which the teams are placed accurately reflects where they ended up finishing, save for some big misses on the Cubs and Red Sox. This suggests that the coverage aggregated on MLB Trade Rumors is reflective of how a team performs on the field.
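The size of that overestimate is easy to work out. A short sketch of the arithmetic, assuming the standard 30-team, 162-game schedule (the constant names are mine):

```python
TEAMS = 30
GAMES_PER_TEAM = 162
PREDICTED_TOTAL_WINS = 2615  # league-wide total quoted above

# Every game produces exactly one win, and each game involves two teams.
available_wins = TEAMS * GAMES_PER_TEAM // 2
surplus = PREDICTED_TOTAL_WINS - available_wins

print(available_wins)              # 2430
print(surplus)                     # 185
print(round(surplus / TEAMS, 1))   # 6.2 extra wins per team on average
```

In other words, the model is on average about six wins too optimistic per club, even where it gets the ordering right.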

It’s also interesting to note the recurring words that point us in one direction or the other. There’s a common theme on both sides: teams that do well are talking about the luxury tax (often because they are spending into it or contemplating the matter), while teams that are struggling tend to talk more about money and payroll, which is often about cutting it rather than increasing it. While this gives a sense of the narrative around individual clubs and how it connects to the team’s performance, it does not tell us about the validity of the rumors we are seeing.

I looked at the offseasons from 2016 to 2020 and compared the number of posts league-wide in each month with the number of signings in that month, which returned a correlation of .4. That isn’t nothing. Here are charts of the volume of signings and rumors over time in each of the past five offseasons.


In 2016, and to a lesser degree 2017, rumors and signings line up really well, apart from a discordance in March that is likely attributable to season preview pieces published in spring training after most of the signings had occurred. In the past three offseasons, however, signings have been spread much more evenly across the months, yet in 2018 and 2019 the rumors were still coming at a pace more like that of the earlier years. In 2020, the rumor mill appears to have begun to resemble the actual state of the signing market much more closely in its month-to-month frequency.

This is certainly a phenomenon many people have written about in the past five years, but I can’t help but wonder if a uniform distribution of free-agent signings across time is actually exactly what we should expect of this market. I am not an economist, but perhaps the forces which currently exist on both sides of the ledger, for coming to a deal and for holding out longer, make any given date in the offseason just as likely as any other to be the date on which an agreement is made. It may also be that the trend of players signing later and later in the offseason will keep moving in that direction, but I’d be wary of our bias toward overestimating trends and underestimating regression to the mean.

The final question is, can the information in these posts be used to predict which team a player will sign with before he does? Again using the words in posts about a player before he signs, as well as how frequently the player is linked to a specific team and whether that team is the one most often linked to him, one can build a useful logistic regression model. Notably, the higher the proportion of rumors that specifically link a team and player, especially if the team is the one most frequently tied to that player, the higher the chance of a signing.

The words most relevant in predicting a player signing with a team were, on the positive side, “comments,” “deal,” “sign,” and “report,” and, on the negative side, “teams,” “pick,” “baseball,” “DH,” and by far the most impactful single word… “owner.” When a post actually uses words like “deal” and “sign,” the pairing is more likely to come together. Talk of an owner being involved, however, dramatically decreases the odds of a team signing a specific player; it is 2.2 times more impactful on the outcome than any other word, positive or negative.
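To illustrate how signed word weights move a predicted probability in a logistic regression, here is a toy sketch. Every number in it is hypothetical: the intercept and coefficients are invented to mirror only the signs described above (with the “owner” weight set to 2.2 times the largest other magnitude), and the link-frequency features the model also uses are left out for brevity.

```python
from math import exp

def sigmoid(z: float) -> float:
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + exp(-z))

# Hypothetical values, NOT the fitted coefficients from the article.
INTERCEPT = -2.0
WEIGHTS = {
    "deal": 0.5,
    "sign": 0.5,
    "teams": -0.4,
    "owner": -1.1,  # 2.2x the largest other magnitude (0.5)
}

def signing_probability(word_counts: dict) -> float:
    """Probability of a signing given counts of keywords in the posts."""
    score = INTERCEPT + sum(
        WEIGHTS.get(word, 0.0) * count for word, count in word_counts.items()
    )
    return sigmoid(score)

without_owner = signing_probability({"deal": 2, "sign": 1})
with_owner = signing_probability({"deal": 2, "sign": 1, "owner": 1})
print(f"{without_owner:.3f} vs {with_owner:.3f}")
```

With these made-up weights, a single “owner” mention undoes more than the boost from two mentions of “deal,” which is the kind of asymmetry the paragraph above describes.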

To get a sense of how good this model actually is, it was used to predict the 2020 offseason (on which it was not trained) and correctly predicted the status of the signing 86% of the time. However, consider the fact that on average, a player was mentioned in an article with 9.12 unique teams that offseason, and they can sign with only one of them. This means that if the model were to just predict that a team would not sign a player they have been linked to every time, it would be 89% accurate. So more relevantly, the percentage of non-signings the model found (specificity) was 89%, and the percentage of signings it found (sensitivity) was 66%. However, the model only was correct in its prediction of a signing (precision) 48% of the time.
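The four metrics quoted here all come from the same confusion matrix, and the 89% baseline follows from the 9.12 figure. The sketch below shows both. The confusion-matrix counts are arbitrary numbers chosen for illustration, not the model’s actual results, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of actual signings found
        "specificity": tn / (tn + fp),  # share of actual non-signings found
        "precision": tp / (tp + fp),    # share of predicted signings that happened
    }

# Baseline from the article: about 9.12 linked teams per player, one signing each,
# so always predicting "no signing" is right for all but one of those links.
avg_linked_teams = 9.12
always_no_accuracy = 1 - 1 / avg_linked_teams
print(round(always_no_accuracy, 2))  # 0.89

# Arbitrary counts for illustration only (NOT the model's confusion matrix).
metrics = classification_metrics(tp=8, fp=4, tn=36, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The baseline is why raw accuracy is a weak yardstick here: a model that never predicts a signing scores 89% while finding none of them.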

The model wasn’t particularly reliable when it predicted a signing, but it did a good job of balancing the risk of falsely predicting a signing against finding a higher percentage of the signings that did occur. This implies our independent variables are indeed providing relevant information about signings. Among the correctly predicted signings were Gerrit Cole to the Yankees, Jason Kipnis to the Cubs, and Edwin Encarnación to the White Sox.

To conclude, this analysis turned up some interesting facts and trends in the coverage of specific teams and how they interact with other teams. It also shows that the words being used are often reflective of the way a team is performing on the field and, in the offseason, can often be used to predict fairly accurately whether or not a team and a player reach an agreement in free agency. The code used to pull the data from MLB Trade Rumors, pull the transaction data from Baseball Reference, and analyze it is available on my GitHub here. I’d encourage anyone to use this data and see if they can mine more from this resource, perhaps with more advanced Natural Language Processing (NLP) techniques such as topic modelling or Word2Vec prediction models.





4 Comments
Eric Wojciechowski (Member since 2020)
4 years ago

Best article I’ve ever read @FanGraphs/Community Research! My 1st time commenting too… I’m leading a new team & love this example of how to mine complex data to turn it into information, which allows it to be action-able. Thanks & great work!

Jason
4 years ago

Great job, Peter! I did read through your code (I honestly just wanted to play with the MLBTR data myself), but it was in need of some formatting. Good tip for RStudio: ctrl+shift+A will auto-format your code.