Improving WPS

“All happy families are alike; each unhappy family is unhappy in its own way.”  — L. Tolstoy

You can say something similar about baseball games. All boring games are alike; but exciting games are interesting in their own ways. Every boring game has one team building up a big early lead, which is never threatened. But there are many ways to have an exciting game: the pitcher’s duel, the slugfest, the late-inning comeback, extra innings, all in various combinations. And in between them are the bulk of games that are simply ordinary.

All of which makes ranking exciting games a tricky process, at least compared to ranking how boring they are. How does one compare Game 7 of the 1991 WS (1-0 in 10 innings) to Game 4 of the 1993 WS (15-14 in 9 innings) on the same scale? They’re great in different ways.

Back in 2005 I created a system to do just that, a rating system based simply on the runs scored in the line score. I may have been the Christopher Columbus of that new world. And ranking the games allows you to rate post-season series-es.

The line-score system did work in the sense that it could tell the difference between a great game and a good one, and between a good one and an ordinary one. But while the line score gives you the basic outline of the game, it is blind to the details of what happens DURING each inning. Zero runs scored in the top of the 1st rates exactly the same whether there were three pop-ups or three singles followed by a triple play.

Eventually I realized that Baseball-Reference.com (ALL HAIL BBREF) has the play-by-play data for all playoff games, which includes a probability of victory after each play (anything that changes the outs, baserunners or score). Plotted, you can easily see if a game was good: it looks like an earthquake. If it was bad, it looks like the EKG of a corpse. Using those probabilities, we can create a much more accurate game rating. I fiddled with many rating schemes over the last 10 years before settling on one that is conceptually simple and yields reasonable results.

Of course, by then I had been beaten to the basic concept by Dave Studeman (WPA) and Shane Tourtellotte (WPS). Twelve years is too long for laurels resting.

WPA = Sum(absolute change in probability between plays)

Modified WPS = Sum(absolute change in probability between plays) + top three plays + final play
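For anyone who wants to play along at home, here is a minimal Python sketch of the two formulas. It assumes a game is available as a list of win probabilities for one team, one value after each play, and that the sums use the absolute size of each change. The function names and the toy numbers are made up for illustration; they are not BBRef data or anyone's published code.

```python
def swings(win_probs):
    """Absolute change in win probability between consecutive plays."""
    return [abs(b - a) for a, b in zip(win_probs, win_probs[1:])]


def wpa_sum(win_probs):
    """Base score: add up the size of every swing."""
    return sum(swings(win_probs))


def modified_wps(win_probs, top_n=3):
    """Base score + the top_n biggest swings + the final play's swing."""
    s = swings(win_probs)
    if not s:
        return 0.0
    return sum(s) + sum(sorted(s, reverse=True)[:top_n]) + s[-1]


# Hypothetical five-play game: home team's win probability after each play.
toy_game = [0.50, 0.55, 0.40, 0.45, 0.20, 1.00]
print(round(wpa_sum(toy_game), 2))       # 1.3
print(round(modified_wps(toy_game), 2))  # 3.3
```

In the toy game the final play is also the biggest play, so it ends up counted three times: once in the base sum, once among the top three, and once as the last play.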

What I have developed is similar to their work, but I think it has some small advantages. Generally, my ratings will be quite close to Shane’s (R-squared > 99.5%). He correctly realized that simply summing the probability changes doesn’t quite work, which is why he modified it. An example…

There are seven post-season games with a WPA of exactly 4.52. Among them are:

1995 NLCS Game 2

Braves beat the Reds 6-2 in ten innings.

95 plays, 13 plays changed the odds by at least 10%

Top play: a Mark Portugal bases-loaded wild pitch, +18%

70 plays with the odds in the 30% to 70% range

compared to

1960 WS Game 7

Pirates 10 Yankees 9 in nine innings

77 plays, 15 plays changed the odds by at least 10%

Of those 4 changed the odds by at least 20%

Of those 3 changed the odds by at least 30%

Of those 1 changed the odds by more than 50%

25 plays with the odds in the 30% to 70% range

 

There is simply no way those games are equal. The 1960 game has five different plays better than any play in the 1995 game. The 1995 game makes up the ground by (1) having 18 more plays and (2) having fewer plays where nothing happened, because the game was usually within one run.

1960 is still better because one +40% play is more exciting than two +20% plays put together. Bill Mazeroski’s game-ending homer rates as +37%. Bobby Richardson’s game-starting line-out rates at +2%. A straight sum makes a walk-off homer the equal of about 3 ½ innings with zero hits. NOPE. WRONG.

Shane accounted for this with his modified method. By counting the top three plays twice and Mazeroski’s walk-off homer three times, the ratings are now

1960: 6.49

1995: 5.19

And science prevails.

Of course, there is nothing magical about TOP THREE plays or LAST play. You could try using the top five plays and last five plays (believe me, I did).  But I do think that using Top-3 + Last can sometimes lead you astray. I will now present exhibits A and B to demonstrate where it can swing and miss.

Exhibit A: 1988 WS Game 1

Exhibit B: 1985 NLCS Game 6

I expect you to know them. The two biggest home runs in terms of changing the odds in post-season history courtesy of Mr. Clark and Mr. Gibson.

1985: WPA 4.48 in 83 plays and 9 innings

1988: WPA 3.98 in 82 plays and 9 innings

The 1985 game had more action in nearly the same number of plays, which you can easily see in the line scores:

StL  0 0 1 0 0 0 3 0 3  (7)
LA   1 1 0 0 2 0 0 1 0  (5)

Compared to

Oak  0 4 0 0 0 0 0 0 0  (4)
LA   2 0 0 0 0 1 0 0 2  (5)

 

The ’85 game has a tie in the 7th, a broken tie in the 8th and a lead change in the 9th

The ’88 game has a lead change in the 2nd and a lead change in the 9th

Modified WPS says

1985: 4.48 + 1.34 + 0.01 = 5.83 (Tied for 94th best game)

1988: 3.98 + 1.43 + 0.87 = 6.28 (Tied for 58th best game)

I don’t think you can argue that the 1988 game is much better than the 1985 game; I don’t think it’s a better game at all. And it’s the last-play bonus that is to blame. Had the 1985 game been played in St. Louis then Clark’s homer would have been a walk-off and the game would have rated 6.56, well ahead of the 1988 game.
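The arithmetic is easy to check. This small sketch just re-adds the three Modified WPS components, using the 1988 base total of 3.98 listed in the tables further down and the .74 swing quoted for Clark's homer later in this piece; the function name is invented for illustration.

```python
def mod_wps(base, top_three, last_play):
    """Modified WPS as described here: base sum + top three plays + last play."""
    return base + top_three + last_play


g1985 = mod_wps(4.48, 1.34, 0.01)          # actual game, played in Los Angeles
g1988 = mod_wps(3.98, 1.43, 0.87)          # Gibson's homer was the last play
g1985_walkoff = mod_wps(4.48, 1.34, 0.74)  # what-if: Clark's homer ends the game

print(f"{g1985:.2f} {g1988:.2f} {g1985_walkoff:.2f}")  # 5.83 6.28 6.56
```

Nothing about the 1985 game changes in the what-if except which dugout the winning team sat in, yet the score moves by 0.73.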

If you think about it, a last-play bonus is biased towards games won by the home team. If the home team loses, the last play will rarely amount to anything.

When the road team wins, the last play has been worth at least 20% only 23 times. When the home team wins, it is at least 20% 122 times.

When the road team wins, the last play has been worth at least 30% only 11 times. When the home team wins, it is at least 30% 96 times.

I also know this because I tried last play, last five plays, and last ten plays in trying to construct a rating system. I also tried top five plays, top ten plays, all plays over 10%, WPA – .03 per play (yielding the bizarre result of games with negative excitement).

Eventually I tried a simple power transformation on EVERY play. First, I tried summing the squares of the probability changes, like any good statistician would.

When I did that, the 1985 game rated 10th and the 1988 game rated 5th. That is the wrong order, and both games are rated too high. Then I tried other powers: the Goldilocks approach, looking for the one that was just right.

 

Power    1985 rank    1988 rank
2.0      10th         5th
1.9      12th         8th
1.8      15th         20th
1.7      23rd         25th
1.6      32nd         36th
1.5      38th         51st
1.4      53rd         76th
1.3      61st         104th
1.2      79th         133rd
1.1      100th        158th
1.0      116th        185th

Everything above 1.7 was eliminated since it rated 1988 better than 1985
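To see why the choice of power matters so much, here is a toy sweep over the same range of exponents. The two "games" are invented: one is a steady grind of small plays, the other has fewer plays but a single Gibson-sized swing. None of this is real play-by-play data.

```python
def power_score(swings, exponent):
    """Sum of each swing raised to the given power (no cosmetic scaling)."""
    return sum(s ** exponent for s in swings)


# Hypothetical games, expressed as lists of win-probability swings.
steady = [0.10] * 40           # 40 modest plays, linear sum 4.00
spiky = [0.10] * 20 + [0.87]   # 20 modest plays plus one monster, linear sum 2.87

leaders = {}
for tenths in range(10, 21):   # exponents 1.0, 1.1, ..., 2.0
    p = tenths / 10
    a, b = power_score(steady, p), power_score(spiky, p)
    leaders[p] = "steady" if a > b else "spiky"
    print(f"{p:.1f}  steady={a:.3f}  spiky={b:.3f}  -> {leaders[p]}")
```

At power 1.0 the steady game wins easily; at 2.0 the single big swing swamps everything else. Somewhere in between the order flips, which is the same tug-of-war the table above shows for the 1985 and 1988 games.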

 

Here’s some shorthand I’m going to use:

Game 6 of the 1985 NLCS: STL 7, LA 5 in 9 innings — WPA 4.48 (9-4-2-1)

Game 1 of the 1988 WS: LA 5, OAK 4 in 9 innings — WPA 3.98 (5-2-2-1)

The 1985 game had 9 plays rated >= 0.1, 4 plays rated >= 0.2, 2 plays rated >= 0.3 and 1 play rated >= 0.5

The 1988 game had 5 plays rated >= 0.1, 2 plays rated >= 0.2, 2 plays rated >= 0.3 and 1 play rated >= 0.5

For a sense of scale, the average game is WPA 2.67 (4.89-0.88-0.33-0.03)

(You can check the examples listed below on BBRef to get more detail on each game)
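If you have a game's swings in a list, the four-number shorthand is just a count at each threshold. A minimal sketch, with hypothetical function names and a made-up list of swings:

```python
def profile(swings, thresholds=(0.10, 0.20, 0.30, 0.50)):
    """Count the plays at or above each threshold (counts are cumulative)."""
    return tuple(sum(1 for s in swings if s >= t) for t in thresholds)


def shorthand(swings):
    """Format the counts the way they are written here, e.g. (9-4-2-1)."""
    return "(" + "-".join(str(n) for n in profile(swings)) + ")"


# Hypothetical swings for a very short game.
toy = [0.02, 0.12, 0.25, 0.04, 0.33, 0.74, 0.10]
print(round(sum(toy), 2), shorthand(toy))  # 1.6 (5-3-2-1)
```

Note that a play counted at 0.5 is also counted at 0.3, 0.2 and 0.1, which is why the numbers never increase from left to right.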

 

Checking 1.7, both exhibits rated higher than

Game 2 of the 2017 WS: HOU 7, LA 6 in 11 innings — WPA 5.30 (10-5-3-0)

Game 1 of the 2015 WS: KC 5, NYM 4 in 14 innings — WPA 6.36 (16-3-1-0)

1.7 weights the big plays too much

 

Checking 1.6, both test games rated higher than

Game 6 of the 1986 WS: NYM 6, BOS 5 in 10 innings — WPA 5.14 (16-3-3-0)

Game 6 of the 1986 NLCS: NYM 7, HOU 6 in 16 innings — WPA 5.80 (11-3-2-0)

1.6 weights the big plays too much

 

Checking 1.5,

the 1985 game rated higher than

Game 6 of the 1986 WS: NYM 6, BOS 5 in 10 innings — WPA 5.14 (16-3-3-0)

The 1988 game rated higher than

Game 4 of the 2001 WS: NYY 4, ARI 3 in 10 innings — WPA 4.58 (10-3-2-0)

1.5 weights the big plays too much, but it’s getting hard to find clear mistakes

 

Checking 1.4,

the 1985 game rated higher than

Game 3 of the 1976 NLCS: CIN 7, PHI 6 in 9 innings — WPA 4.72 (14-3-2-0)

Lead changes in the 7th, 8th and 9th innings.

The 1988 game rated higher than

Game 4 of the 1986 ALCS: CAL 4, BOS 3 in 11 innings — WPA 4.64 (7-4-2-0)

1.4 weights the big plays too much, but I’m now splitting hairs

 

Checking 1.3, I like this one. Let me check 1.2

 

Checking 1.2,

the 1985 game rated lower than

Game 2 of the 1996 ALDS: NYY 5, TEX 4 in 12 innings — WPA 5.02 (8-2-0-0)

Game 2 of the 1990 WS: CIN 5, OAK 4 in 10 innings — WPA 4.50 (10-2-0-0)

1.2 weights the big plays too little. Famous games are losing to games without any highlights.

 

So, I think 1.3 is the sweet spot.

My rating score = Sum((absolute change in probability between plays)^1.3) * 2

The *2 at the end is purely cosmetic. It allows the very best game to score close to ten.
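In code, the whole rating is a one-liner. This is a sketch under the assumption that each play's swing is taken as an absolute value before being raised to the power; the function name and the two sample games are invented for illustration.

```python
def power_wps(swings, exponent=1.3, scale=2.0):
    """scale * sum(|swing| ** exponent); scale=2.0 is the cosmetic *2."""
    return scale * sum(abs(s) ** exponent for s in swings)


# Two hypothetical games with the same linear total (1.00) but different shapes.
many_small = [0.05] * 20
few_big = [0.25] * 4

small_rating = power_wps(many_small)
big_rating = power_wps(few_big)
print(f"many small plays: {small_rating:.2f}")
print(f"few big plays:    {big_rating:.2f}")
print(f"ratio:            {big_rating / small_rating:.2f}")
```

Each play in the second game is five times the size of a play in the first, and there are one-fifth as many of them, so the ratio of the two ratings works out to 5^0.3, a little over 1.6. Equal sums, unequal excitement.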

 

With base WPA, Gibson’s homer (.87) is worth about 25x a normal play (.035). With WPS it’s worth about 75x a normal play. Raising all the plays to the 1.3 power means that Gibson’s homer is now worth about 65x a typical play.

With base WPA, Clark’s homer (.74) is worth about 21x a normal play (.035). With WPS it’s worth about 42x a normal play. Raising all the plays to the 1.3 power means that Clark’s homer is now worth about 53x a typical play.

With a little algebra,

WPA:  Gibson = 1.18 * Clark

WPS: Gibson = 1.76 * Clark

Power 1.3: Gibson = 1.23 * Clark
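Here is the "little algebra" spelled out as a short script. It assumes, per the description of the modified method above, that Gibson's homer is counted three times (base sum, top three, last play) and Clark's twice (base sum and top three, but not last play). The names are illustrative only.

```python
TYPICAL = 0.035            # swing of a "normal" play, as quoted above
GIBSON, CLARK = 0.87, 0.74


def multiples(swing, times_counted):
    """How many typical plays one swing is worth under each scheme."""
    linear = swing / TYPICAL
    modified = times_counted * swing / TYPICAL
    power = (swing / TYPICAL) ** 1.3
    return linear, modified, power


g_lin, g_mod, g_pow = multiples(GIBSON, times_counted=3)
c_lin, c_mod, c_pow = multiples(CLARK, times_counted=2)

print(round(g_lin), round(g_mod), round(g_pow))  # 25 75 65
print(round(c_lin), round(c_mod), round(c_pow))  # 21 42 53
print(round(g_lin / c_lin, 2), round(g_mod / c_mod, 2), round(g_pow / c_pow, 2))
# 1.18 1.76 1.23
```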

A nice property of the transformation is that when the change in odds doubles, the play is worth about two and a half times as much (2.46x).

 

EXCITEMENT IS NOT LINEAR

 

A 10% play is now worth 2.46 times as much as a 5% play

A 20% play is now worth 2.46 times as much as 10% play

A 50% play is now worth 2.46 times as much as a 25% play
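The 2.46 is the same at every size of play because the starting swing cancels out. Written out, with x as a placeholder for any swing and k for any multiple (both symbols are just notation for this note):

```latex
\frac{(2x)^{1.3}}{x^{1.3}} = 2^{1.3} \approx 2.46,
\qquad\text{and in general}\qquad
\frac{(kx)^{1.3}}{x^{1.3}} = k^{1.3}.
```

So tripling the swing is worth about 3^1.3, or roughly 4.17 times as much, again regardless of where you start.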

The system has a single parameter applied to ALL plays, so a game isn’t screwed if it has four great plays or the best play comes in the 8th inning. Ranking games this way, here are the five games better than, and worse than, my two test cases.

 

Series         Road team     Home team     Inn  (WPA^1.3)*2   WPA  Top play  # Plays  P>=.1  P>=.2  P>=.3  P>=.5
2014 ALCS G1   Royals 8      Orioles 6      10         5.00  5.14     35.0%       96     13      3      2      –
1935 WS G3     Tigers 6      Cubs 5         11         4.97  5.02     36.0%       96     15      5      1      –
1976 NLCS G3   Phillies 6    Reds 7          9         4.95  4.72     46.0%       82     14      3      2      –
2015 ALDS2 G2  Rangers 6     Blue Jays 4    14         4.93  5.46     37.0%      115      7      2      1      –
1997 ALCS G4   Orioles 7     Indians 8       9         4.92  4.92     38.0%       88     16      4      1      –
1985 NLCS G6   Cardinals 7   Dodgers 5       9         4.92  4.48     74.0%       83      9      4      2      1
1975 NLCS G3   Reds 5        Pirates 3      10         4.88  4.52     55.0%       81     14      3      3      1
1933 WS G4     Giants 2      Senators 1     11         4.87  4.94     55.0%       92      9      3      1      1
2011 ALCS G2   Tigers 3      Rangers 7      11         4.86  5.10     34.0%       92     13      3      1      –
2012 ALDS2 G2  Athletics 4   Tigers 5        9         4.86  4.86     41.0%       85     11      4      1      –
1999 NLCS G6   Mets 9        Braves 10      11         4.85  5.12     26.0%      108     14      3      –      –

 

 

Series         Road team     Home team     Inn  (WPA^1.3)*2   WPA  Top play  # Plays  P>=.1  P>=.2  P>=.3  P>=.5
1952 WS G5     Dodgers 6     Yankees 5      10         4.51  4.70     44.0%       92     10      4      1      –
1923 WS G1     Giants 5      Yankees 4       9         4.51  4.54     40.0%       78     12      2      2      –
1984 NLCS G4   Cubs 5        Padres 7        9         4.51  4.54     37.0%       83     10      4      2      –
1992 WS G2     Blue Jays 5   Braves 4        9         4.50  4.40     65.0%       85     11      1      1      1
1998 ALCS G2   Indians 4     Yankees 1      12         4.48  4.78     33.0%       96     11      3      1      –
1988 WS G1     Athletics 4   Dodgers 5       9         4.47  3.98     87.0%       82      5      2      2      1
2000 NLCS G2   Mets 6        Cardinals 5     9         4.46  4.66     32.0%       91     13      3      2      –
2016 NLDS2 G5  Dodgers 4     Nationals 3     9         4.46  4.66     21.0%       84     14      1      –      –
1977 WS G1     Dodgers 3     Yankees 4      12         4.45  4.80     30.0%       97     11      2      1      –
1954 WS G1     Indians 2     Giants 5       10         4.43  4.74     29.0%       89     11      1      –      –
1958 WS G1     Yankees 3     Braves 4       10         4.43  4.56     40.0%       88     10      3      2      –

 

 

I hope you’ll look at these and see that while they have different shapes, they all contain a similar ‘volume’ of excitement.

Another way to evaluate the method is to look at games with the same WPA. Going back to where I began in this article, here are the seven games with a base WPA of 4.52 (No promises that BBRef has not revised the scores since I captured the data…). They are each tied for the 108th highest WPA. But after using the 1.3 power factoring, you get this:

Game           Outcome               Rank  (WPA^1.3)*2   WPA  # Plays  Top-5 sum  Plays 30-70%  P>=.1  P>=.2  P>=.3  P>=.5
1960 WS G7     Pit 10 NYY 9 in 9       52         5.10  4.52       77       1.74            25     15      4      3      1
1975 NLCS G3   Cin 5 Pit 3 in 10       63         4.88  4.52       81       1.60            49     14      3      3      1
1911 WS G3     A’s 3 Giants 2 in 11   110         4.41  4.52       86       1.10            58     15      3      1      –
1998 NLCS G1   SD 3 Atl 2 in 10       117         4.36  4.52       84       1.10            59     11      2      1      –
2011 NLDS2 G5  Ari 3 Mil 2 in 10      119         4.35  4.52       85       1.05            71     13      2      1      –
1926 WS G5     NYY 3 StL 2 in 10      130         4.20  4.52       86       0.84            66     16      1      –      –
1995 NLCS G2   Atl 6 Cin 2 in 10      139         4.12  4.52       95       0.75            70     13      –      –      –

 

1960 gets the love it deserves, moving up 56 spots to the 52nd best game. That’s despite having the fewest plays in the 30%-70% victory range. Games with more plays do worse, since more plays for the same total means smaller-impact plays on average. Think of the Top 5 plays as the highlight reel for the game. 1995 NLCS Game 2 has no play >= 0.2 and therefore drops 31 spots in the rankings.

Adjusted WPS? Weighted WPS? Power WPS? I really do need to give it a proper name.

 

A final example, from among the greatest playoff games ever.

2000 NLDS G3: Mets 3, Giants 2 in 13 innings — ModWPS Rank = 11, PowerWPS Rank = 22

1986 ALCS G5: Red Sox 7, Angels 6 in 11 innings — ModWPS Rank = 22, PowerWPS Rank = 12

1980 NLCS G5: Phillies 8, Astros 7 in 10 innings — ModWPS Rank = 25, PowerWPS Rank = 14

 

The 2000 game had the higher WPS, partly because it had more plays. ModWPS likes it more due to the additional action and walk-off homer, which the better top-three plays in 80/86 could not overcome.

 

Year   WPS    Plays   Last play   Top-3   ModWPS
2000   6.34   109     0.42        0.98    7.74
1986   5.86   97      0.05        1.42    7.33
1980   6.06   93      0.04        1.11    7.21

 

So why do I think 1986/1980 are better?

Because the deeper you go beyond the top three, the better the other two are revealed to be.

 

                        2000             1986             1980
Sum of top-5 plays      1.28             1.94             1.61
Top-5 plays (%)         42-31-25-16-14   73-35-34-32-20   40-38-35-26-24
Sum of top-10 plays     1.88             2.77             2.43
10%-20%-30%-50% plays   16-3-2-0         14-5-4-1         17-6-3-0
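The top-3 and top-5 sums can be rebuilt from the listed top-five plays. A quick sketch, using only the rounded percentages shown above (the helper name is made up):

```python
# Top-five swings for each game, in whole percentage points, as listed above.
top5 = {
    "2000": [42, 31, 25, 16, 14],
    "1986": [73, 35, 34, 32, 20],
    "1980": [40, 38, 35, 26, 24],
}


def top_k_sum(percent_swings, k):
    """Sum of the k largest swings, converted from percent to a 0-1 scale."""
    biggest = sorted(percent_swings, reverse=True)[:k]
    return sum(biggest) / 100


for year, plays in top5.items():
    print(year, top_k_sum(plays, 3), top_k_sum(plays, 5))
# 2000 0.98 1.28
# 1986 1.42 1.94
# 1980 1.13 1.63
```

The 2000 and 1986 sums match the listed figures exactly. The 1980 sums come out about 0.02 higher than the listed 1.11 and 1.61, presumably because the listed sums were taken from unrounded swings. Either way, the gap between the 2000 game and the other two widens as you go from three plays to five.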

 

Or simply check the line scores.

2000

0 0 0 2 0 0 0 0 0 0 0 0 0 (2) Giants

0 0 0 0 0 1 0 1 0 0 0 0 1 (3) Mets

1986

0 2 0 0 0 0 0 0 4 0 1         (7) Red Sox

0 0 1 0 0 2 2 0 1 0 0         (6) Angels

1980

0 2 0 0 0 0 0 5 0 1             (8) Phillies

1 0 0 0 0 1 3 2 0 0             (7) Astros

 

The 2000 game IS a fabulous game. But the 1986 and 1980 games are more epic, with all the late-inning heroics. The 2000 game has exactly the required three big plays and the walk-off. It checks all the boxes.

I do kinda feel bad writing this. It sounds like I’m just picking on modified WPS here. LOOK AT WHAT ELSE IT GOT WRONG…

But as I said before, Power WPS is barely better. And to show that it’s better at all, I need to show those rare cases where it makes a better call. Modified WPS was also an excellent benchmark: comparing differences between it and my sixty-eleven schemes helped me identify the flaws in sixty-ten of them.

Of course, even this is not the perfect system. Any play-by-play method will still fail to capture the in-play action. A bases-empty foul pop-out rates exactly the same as a bases-empty thrown-out-at-home-trying-to-stretch-a-triple. But it is the best we can do for now.

Whereas I used to guess my line score method captured maybe 70% of the excitement of a game, PBP ratings must be capturing upwards of 90%. Which means greater confidence in game rankings and playoff series ratings.

Anyway, if anyone has any thoughts, feedback, or questions I’d love to hear them. If no one can shoot the idea full of holes, or even one hole, then next comes ranking and lists of games and series.




