Using the Toxicological Prioritization Index To Visualize Baseball

Major League Baseball is awash in advanced statistics that more reliably describe key aspects of players’ offensive and defensive performance. It has been reported that through the use of Statcast, the MLB Advanced Media group can supply teams with 70 fields x 1.5 billion rows of data per season [i]. Yes, billion with a b. This flood of information has supercharged MLB teams’ and the sabermetric community’s development of ever-more useful statistics for describing player performance.

However, this amount of data brings significant challenges. Perhaps chief among them is that while certain individuals may be comfortable with reams of tables and ever-increasing numbers of descriptive statistics, many others prefer or require analyses and visualization tools that convert disparate metrics into informative and readily interpretable graphics.

MLB’s situation has certain similarities to the discipline of safety toxicology, where the use of high-information content assays for characterizing chemicals’ toxicological profiles has exploded [ii]. Drawing conclusions from multiple biomarkers and test systems is challenging, as it requires synthesis of large amounts of dissimilar data sets. One tool that toxicologists have found useful is the Toxicological Prioritization Index, or ToxPi for short [iii]. ToxPi is an analytical software package that was developed to combine multiple sources of evidence by transforming data into integrated, visual profiles.

This article describes the repurposing of ToxPi software to synthesize complex MLB player performance metrics into summary graphics and statistics. Figure 1 shows two prototypical profiles and is provided to familiarize readers with the basic ToxPi structure. The ToxPi interface is based on a Java-executable script that is freely available at their website. In addition to the main Java script, the single compressed download includes a user manual, all libraries, and example data files.

Besides the software’s ability to distill complex statistics into informative summary graphics, it is important to emphasize its ability to generate useful composite metrics in the form of an aggregate “ToxPi score,” as well as individual “slice scores.” Furthermore, as will become apparent throughout this article, a key advantage of the ToxPi platform is that graphics and scores are accompanied by indications of model uncertainty in the form of 95% confidence intervals.

Bryce Harper Time Series

ToxPi profiles that describe Bryce Harper’s overall performance are shown in Figure 2. The four individual-year profiles are instructive, as they show a precipitous drop-off in defense for the 2018 season. The profile to the far right concisely synthesizes these four years of data into one information-rich figure. The aggregate image provides us with the opportunity to showcase a key advantage of ToxPi, that is, its ability to convey model uncertainty in the form of 95% confidence intervals (i.e., the lighter-shaded bands at the periphery of each wedge). In the current Harper example, the wide UZR/150 confidence interval emphasizes the large variation in defensive performance over the four years under consideration. While this was apparent across the four individual per-season profiles, it is more efficiently conveyed in the single figure that includes confidence intervals.

Position Player Comparisons

ToxPi profiles from a three-factor model that describes position players’ overall performance over the years 2016-19 are shown in Figure 3. These results are for 24 players with a wide range of abilities, from below average to exceptional. A league-average profile (wRAA, BsR, and UZR/150 values of 0) is included for reference. This figure illustrates one of ToxPi software’s graphical output options, where the highest overall score appears in the upper left while successively lower scores continue in a left-to-right, top-to-bottom manner. Along with the graphical results, it is possible to include ToxPi scores with confidence intervals (as done here) as well as individual slice scores (not shown). Click to enlarge.

A second graphical output option is shown by Figure 4 as each ToxPi score is plotted against the corresponding rank order value. This illustrates the continuous nature of the scores and provides an interactive view of the player data. That is, when data points are selected on the graph, the appropriate ToxPi images are shown along with associated quantitative data.

A third graphical output option is shown in Figure 5. Here we made use of ToxPi’s unsupervised hierarchical clustering algorithm, which automatically groups similar profiles. This powerful analysis makes several distinct categories apparent: a clade of elite, average, and below average players. Note that these high-level descriptions do not capture all the information that is generated by the clustering algorithm. For instance, given his historically great performance over the years 2015-19, Mike Trout forms a unique subclade within the elite player grouping (denoted by ‘A1’ in Figure 5).

In an effort to objectively evaluate the reliability of our provisional three-factor model, we compared ToxPi scores to respective WAR values (Figure 6). The resulting R^2 value of 0.85 suggests that the ToxPi model is working, at least to a first approximation. While there is certainly room for improvement, these results suggest the current modeling is sufficiently reliable for demonstrating the ToxPi platform’s attributes and capabilities.

Mike Trout Pitch-Type Splits

Figure 7 shows ToxPi output describing Trout’s performance according to pitch type based on the 2012-19 seasons. Here as elsewhere, whenever slices summarize results over multiple years, we chose to show confidence intervals. This provides an indication of model uncertainty, which translates to variability in a player’s performance. Whether one considers the slash line statistics or the advanced wRAA metric, this modeling exercise suggests that even the game’s premier hitter has left some room for improvement — less so for fastballs, more so for breaking balls.

Pitcher Comparisons

Figure 8 shows ToxPi profiles from a six-factor model that describes pitchers’ overall performance over the years 2016-19 (and 2012-15 in the case of Clayton Kershaw). These results are for nine players with a wide range of abilities, from just average to multiple-year Cy Young winners. Among other things, this figure shows how remarkable Kershaw’s 2012-15 seasons were, with dominating performances across each of the metrics studied. As described in greater detail in the Methods section below, we multiplied each of the six statistics by innings. This approach to weighting resulted in a higher correlation to WAR values (R^2 = 0.92, Figure 9) compared to a model that included innings as a seventh-slice metric instead of a multiplier (R^2 = 0.81).

Of course, using innings as a multiplier had the effect of increasing the overall score for pitchers such as Justin Verlander, who regularly contributed high numbers of innings per season, while penalizing Blake Snell, a pitcher that was utilized quite differently by his team at the time (i.e., much briefer outings).

Hierarchical clustering results are shown by Figure 10. This provides an opportunity to discuss an important advantage of the ToxPi platform. It is easy, and indeed human nature, to interpret graphics and point-estimate scores such as those presented in Figure 8 as highly accurate and precise depictions of rank order: early Kershaw (ToxPi score 0.9650) was superior to Verlander (0.8554) was superior to Scherzer (0.8158), etc.

However, it is more appropriate to consider modeling results such as these with measures of uncertainty in mind, for instance in the context of confidence intervals. The hierarchical clustering algorithm and associated graphics are helpful in this regard. In the current example, it places Kershaw, Verlander, and Scherzer in the same clade, meaning their performances were distinctly superior to the other pitchers. On the other hand, Verlander and Scherzer are clustered very closely, meaning assessments about their relative rank should be made with caution and appropriately qualified.

Conclusions

ToxPi is an analytical software package that is useful for combining multiple types and sources of data into integrated, visual profiles. There are two key characteristics of the software that we found particularly useful when applied to MLB statistics.

First, the platform allowed us to easily synthesize disparate data into high-information content images. While WAR and other summary-type metrics are certainly valuable, they are not transparent. When WAR values are discussed, it is not obvious which aspect(s) of a player’s performance are contributing and detracting from the numerical score. On the other hand, when one studies player performance with ToxPi software, the aggregate scores are always accompanied by visuals that convey information about each component of the underlying model. For instance, our example with 24 position players clearly showed that Mookie Betts’ high ToxPi score is attributable to stellar performance across offense, defense, and baserunning. Conversely, his teammate at the time, J.D. Martinez, exhibited a high aggregate score that was strictly attributable to hitting prowess (see Figure 11). We suspect that many teams’ front office personnel, fans, and other stakeholders would benefit from this level of granularity.

A second key feature of ToxPi is that it has a mechanism for conveying model uncertainty into the graphical and numerical output. This is important as it helps guard against human nature, which tends to interpret point estimates and player rankings as accurate and precise to the many decimal points provided by the algorithms. This is an erroneous way to interpret results, and models such as ToxPi that provide confidence intervals or other indications of uncertainty help guard against these temptations to over-interpret results.

Finally, we would like to address the readers we managed to outrage because their favorite player was maligned by our maiden modeling efforts. Recall that we do not represent this work as exhaustive and definitive. Rather it is presented to showcase the ToxPi platform’s capabilities. Subject experts will have different ideas about which statistics should be included in these models and how they should be weighted. We look forward to seeing what types of interesting analyses they could accomplish with this versatile data synthesis and visualization tool.

Methods and Data

This methods and data section is provided in order to support readers that are interested in applying ToxPi analyses to their own research.

The ToxPi user manual describes data file formatting requirements, so we will only briefly describe our file structures for the work described herein. All analyses used season-level statistics and were entered into Microsoft Excel for Mac (v16.16.14). In order for ToxPi to recognize the files, they were exported as comma separated value (csv) files. For the purpose of this presentation, we will refer to one data entry format as “time series.” For time series models, a single column was used to record player and season information (e.g., Bryce Harper 2015, Bryce Harper 2016, etc.). Other columns were used for each of the statistics included in the model (e.g., wRAA, BsR, etc.). In this manner, each row captured a different season’s worth of statistics. This format allowed us to assign one statistic to a unique ToxPi slice. These time series-formatted files were used to create one ToxPi profile for each year studied.

We refer to a second data entry format as “inter-player comparison.” For this type of model, one column was used to record player (Mookie Betts, Mike Trout, etc.) and one column was dedicated to each statistic/year combination studied (e.g., wRAA 2015, BsR 2015, wRAA 2016, BsR 2016, etc.). Each row therefore captured an individual player’s statistics over multiple years. This format allowed us to assign multiple years of one statistic to a unique ToxPi slice, and thereby generate one ToxPi profile for each player studied, summarized across the specified years. Note that several position players contributed fewer than 72 games during 2016 (e.g., Aaron Judge), and in these cases that season was not studied.

ToxPi requires non-negative numbers, so some statistics such as BsR, wRAA, and UZR/150 require offsets. In Excel, before creating a csv file for uploading into ToxPi, seven was added to each BsR value and twenty was added to each wRAA and UZR/150 value.

Irrespective of the format used, once a csv file was uploaded into the ToxPi program, statistics were assigned to particular slices and then re-positioned, given different weights, and assigned a unique color. The resulting models were exported as new csv files which encoded these choices so they did not have to be re-specified when new data were entered.

We appreciate that the data-formatting descriptions may be somewhat abstract to readers that do not have experience running Java-executable scripts. That being said, we can attest that ToxPi is a very user-friendly interface, and data formatting becomes intuitive after a short period of first-hand experience. For readers interested in learning more about ToxPi, we will make the files/models used herein available upon request.

ToxPi Analyses: Position Player Comparisons

Three comprehensive statistics were used to evaluate 24 position players’ performances over the years 2016-19: BsR, wRAA, and UZR/150. After exploring several approaches for assigning weighting factors, we achieved adequate performance for the purpose of demonstrating ToxPi software’s main features by considering the relationship between players’ WAR values and each individual statistic’s per-year values. Specifically, weights were set to be proportional to the R^2 result when the 24 players’ per-year WAR values were plotted against each individual statistic in turn. In this manner, WAR vs. BsR, wRAA, and UZR/150-associated R^2 values became weighting factors of 10%, 80%, and 10%, respectively. The inter-player comparison file format described above was used, and the resulting model allowed us to generate a ToxPi profile for each player that summarized performance for the three or four years studied.

To evaluate the performance of the three-factor model, a metric was needed to represent ground truth. We chose WAR summed over the four-year period under study. We plotted each player’s WAR value against their corresponding ToxPi score, and the R^2 value was calculated with Excel. This statistic was used to assess how closely the three-factor ToxPi model correlated with WAR, and by extension, how well the model performed.

ToxPi Analyses: Bryce Harper Time Series

The same three-factor model described above (BsR, wRAA, and UZR/150 weighted at 10%, 80%, and 10%, respectively) was used to evaluate Bryce Harper’s performance over four consecutive years, 2016-2019. In this case, we used the time series-type format. Executing this model allowed us to generate a ToxPi profile for each of four years under consideration. In combination with Bryce Harper results from the inter-player comparison format described above, this example highlights the ability of ToxPi to depict results on a per season basis, as well as summarized across multiple years.

ToxPi Analyses: Mike Trout Pitch-Type Splits

A four-factor ToxPi model was built to study Mike Trout’s performance against six pitch types over eight years, 2012-19. Three of the metrics were the slash line statistics, AVG, OBP, and SLG. The fourth metric was the comprehensive offensive statistic wRAA. We weighted each of the slash line statistics at 16.67% and wRAA at 50%. Rather than generating a ToxPi profile for each year and each pitch type (six pitches x eight years = 48 profiles), we included each of the seasons’ statistics into the four ToxPi slices. This is analogous to the inter-player comparison format described above, but with a different pitch type appearing in each row instead of a different player. Given this format, ToxPi distilled the number of profiles down to six, one per pitch type, summarized across eight years.

ToxPi Analyses: Pitcher Comparisons

A seven-factor ToxPi model was built to study nine different pitchers’ performances over the years 2016-19. One pitcher was represented a second time over an earlier four-year span (Kershaw’s historically great performance from 2012-15 was added as an interesting frame of reference). Six of the statistics included in the model were WHIP, ERA-, FIP-, BABIP, K%, and K/BB, which were all weighted equally. A seventh statistic, innings, was incorporated in an alternate manner. Specifically, each of the six aforementioned metrics were multiplied by the corresponding year’s innings value.

To conduct these analyses, data were entered according to the previously described inter-player comparison format. Thus this model was not constructed to produce a ToxPi profile for each year and each pitcher (four years x 10 pitcher comparisons = 40 profiles). Rather, ToxPi distilled the number of profiles down to 10, one per pitcher summarized across four years.

Pitcher performance modeling provided an opportunity to highlight an important data transformation requirement. Specifically, whenever a statistic signifies better performance with a lower value as opposed to a greater one, a data transformation step is required. Therefore for the statistics WHIP, ERA-, FIP-, and BABIP, we used a reciprocal transformation (e.g., 1/WHIP, etc.).

To evaluate the performance of the model, a ground truth-type metric was required. As with the position player models, we used the sum of WAR for the years under consideration. The R^2 value was used to assess how closely the ToxPi model correlated with WAR.

References

[i]    Keith Law (2017), Smart Baseball (New York: Harper Collins Publishers, 2017).

[ii]   Stephen D. Dertinger, Yukari Totsuka, Jason H. Bielas, Ann T. Doherty, Jos Kleinjans, Masamitu Honma, Francesco Marchetti, Maik J. Schuler, Veronique Thybaud, Paul White, Carole L. Yauk, “High information content assays for genetic toxicology testing: A report of the International Workshops on Genotoxicity Testing (IWGT)”, Mutation Research 847 (2019) 403022.

[iii]   Skylar W. Marvel, Kimberly To, Fabian A. Grimm, Fred A. Wright, Ivan Rusyn, David M. Reif, “ToxPi Graphical User Interface 2.0: Dynamic exploration, visualization, and sharing of integrated data models”, BMC Bioinformatics 19 (2018) 80.





8 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Broken Batmember
2 years ago

Hugh?

Jackeldermember
2 years ago

This struck me as akin to Einstein’s Theory of Relativity … I will never under it. You went to a lot of trouble and there may be MLB gigs waiting for you out there. But jeez, this is dense.

Jackeldermember
2 years ago

I will never under(stand) it.

Serbian to Vietnamese to French is back
2 years ago
Reply to  Jackelder

Conclusions:

Tokpi is analytical software useful for combining many types of data and record sources. There are two main features of software that we find particularly useful when applied to statistics.

First of all, the platform allows us to easily synthesize data into images with a high content of information. While war data and records may be certain to have value, it is not clear. When discussing military merit, it is not clear which aspects of a player’s performance contribute to and reduce the score. On the other hand, when someone studies player performance using Tokpi cumulative estimator software, it is always accompanied by visual effects, conveying information about each part of the basic model. For example, our example with you in 24 positions clearly shows that at the peak of the current Tokpi scheme we are free to perform extremely well in attack, defense and running on the basis. On the contrary, his then teammate J. D. Martinez pointed to a high overall score, only thanks to the skill of free kicks (see Figure 11). We doubt that the staff of many teams, fans and other stakeholders would benefit from this level of detail.

A key feature of Tokpi is that it has a mechanism for transferring uncertainty from sample to chart and beyond. This is important because it will help protect against human nature, which strives to interpret point by point and to be precise and accurate to the multitude of decimal places that the algorithm provides. This is a misinterpretation of the results, and models like Tokpi meanwhile provide news or other signs of uncertainty that help protect against the temptation of over-interpreting the results.

Finally, we wanted to address readers who we may have offended because they had to be slandered because of our first attempt at modeling. Please note that we are not claiming that this work is complete and final. Instead, it was used to demonstrate the capabilities of the Tokpi platform. Subject experts have differing opinions on which statistics should be included in the model and how they should be weighed. We can’t wait to see what interesting analysis they can do with these versatile data collection and visualization tools.

pbarstonmember
2 years ago

This is fantastic stuff. I love the unique application.

I am going to check out your SABR presentation shortly.