Improving WPS
“All happy families are alike; each unhappy family is unhappy in its own way.” — L. Tolstoy
You can say something similar about baseball games. All boring games are alike; but exciting games are interesting in their own ways. Every boring game has one team building up a big early lead, which is never threatened. But there are many ways to have an exciting game: the pitcher’s duel, the slugfest, the late-inning comeback, extra innings, all in various combinations. And in between them are the bulk of games that are simply ordinary.
All of which makes ranking exciting games a tricky process, at least compared to ranking how boring they are. How does one compare Game 7 of the 1991 WS (1-0 in 10 innings) to Game 4 of the 1993 WS (15-14 in 9 innings) on the same scale? They’re great in different ways.
Back in 2005 I created a system to do just that, a rating system based simply on the runs scored in the line score. I may have been the Christopher Columbus of that new world. And ranking the games allows you to rate post-season series-es.
The line-score system did work in the sense that it could tell the difference between a great game and a good one, and between a good one and an ordinary one. But while the line score gives you the basic outline of the game, it is blind to the details of what happens DURING each inning. Zero runs scored in the top of the 1st rates exactly the same whether there were three pop-ups or three singles followed by a triple play.
Eventually I realized that Baseball-Reference.com (ALL HAIL BBREF) has the play-by-play data for all playoff games, which includes a probability of victory after each play (anything that changes the outs, baserunners or score). Plotted, you can easily see if a game was good: it looks like an earthquake. If it was bad, it looks like the EKG of a corpse. Using those probabilities, we can create a much more accurate game rating. I fiddled with many rating schemes over the last 10 years before settling on one that seems conceptually simple and yields reasonable results.
Of course, by then I had been beaten to the basic concept by Dave Studeman (WPA) and Shane Tourtellotte (WPS). Twelve years is too long for laurels resting.
WPA = Sum(change in probability between plays)
Modified WPS = Sum(change in probability between plays) + top three plays + Final play
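The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each "change" is taken as an absolute value (a swing in either direction counts as excitement), and the bonus terms add the sizes of the three largest swings and of the final play one more time, so a final play that is also a top-three play ends up counted three times. The function names and the sample probabilities are invented here; they are not Baseball-Reference data.

```python
def swings(win_probs):
    """Absolute change in win probability between consecutive plays."""
    return [abs(after - before) for before, after in zip(win_probs, win_probs[1:])]


def wpa_total(win_probs):
    """Base rating: the sum of every swing in the game."""
    return sum(swings(win_probs))


def modified_wps(win_probs):
    """Base rating plus a bonus for the top three swings and the final play."""
    s = swings(win_probs)
    top_three = sum(sorted(s, reverse=True)[:3])
    last_play = s[-1] if s else 0.0
    return sum(s) + top_three + last_play


# Hypothetical win probabilities after each play (made-up numbers).
sample = [0.50, 0.55, 0.45, 0.30, 0.60, 1.00]

print(round(wpa_total(sample), 2))
print(round(modified_wps(sample), 2))
```

In the sample, the last swing (0.40) is also the largest, so it contributes to the base sum, the top-three bonus, and the final-play bonus.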
What I have developed is similar to their work, but I think it has some small advantages. Generally, my ratings will be quite close to Shane’s (R-squared > 99.5%). He correctly realized that simply summing the probabilities doesn’t quite work, which is why he modified it. An example…
There are seven post-season games with a WPA of exactly 4.52. Among them are:
1995 NLCS Game 2
Braves beat the Reds 6-2 in ten innings.
95 plays, 13 plays changed the odds by at least 10%
Top play: a Mark Portugal bases-loaded wild pitch, +18%
70 plays with the odds in the 30% to 70% range
compared to
1960 WS Game 7
Pirates 10 Yankees 9 in nine innings
77 plays, 15 plays changed the odds by at least 10%
Of those 4 changed the odds by at least 20%
Of those 3 changed the odds by at least 30%
Of those 1 changed the odds by more than 50%
25 plays with the odds in the 30% to 70% range
There is simply no way those games are equal. The 1960 game has five different plays better than any play in the 1995 game. The 1995 game makes up the ground by (1) having 18 more plays (2) having fewer plays where nothing happened because the game was usually within one run.
1960 is still better because a +40% play isn’t twice as exciting as two +20% plays. Bill Mazeroski’s game-ending homer rates as +37%. Bobby Richardson’s game-starting line-out rates at +2%. Making a walk-off homer the equal of about 3 ½ innings with zero hits. NOPE. WRONG.
Shane accounted for this with his modified method. By counting the top three plays twice and Mazeroski’s walk-off homer three times, the ratings are now
1960: 6.49
1995: 5.19
And science prevails.
Of course, there is nothing magical about TOP THREE plays or LAST play. You could try using the top five plays and last five plays (believe me, I did). But I do think that using Top-3 + Last can sometimes lead you astray. I will now present exhibits A and B to demonstrate where it can swing and miss.
Exhibit A: 1988 WS Game 1
Exhibit B: 1985 NLCS Game 6
I expect you to know them. The two biggest home runs in terms of changing the odds in post-season history courtesy of Mr. Clark and Mr. Gibson.
1985: WPA 4.48 in 83 plays and 9 innings
1988: WPA 3.98 in 82 plays and 9 innings
The 1985 game had more action in nearly the same number of plays, which you can easily see in the line scores:
StL 0 0 1 0 0 0 3 0 3 (7)
LA 1 1 0 0 2 0 0 1 0 (5)
Compared to
Oak 0 4 0 0 0 0 0 0 0 (4)
LA 2 0 0 0 0 1 0 0 2 (5)
The ‘85 game has a game tie in the 7th, broken tie in the 8th and lead change in the 9th
The ‘88 game has a lead change in the 2nd and a lead change in the 9th
Modified WPS says
1985: 4.48 + 1.34 + 0.01 = 5.83 (Tied for 94th best game)
1988: 3.98 + 1.43 + 0.87 = 6.28 (Tied for 58th best game)
I don’t think you can argue that the 1988 game is much better than the 1985 game; I don’t think it’s a better game at all. And it’s the last-play bonus that is to blame. Had the 1985 game been played in St. Louis then Clark’s homer would have been a walk-off and the game would have rated 6.56, well ahead of the 1988 game.
If you think about it, a last-play bonus is biased towards games won by the home team. If the home team loses, the last play will rarely amount to anything.
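When the home team loses, the last play has been worth at least 20% only 23 times. When the home team wins, it has been worth at least 20% 122 times.
When the home team loses, the last play has been worth at least 30% only 11 times. When the home team wins, it has been worth at least 30% 96 times.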
I also know this because I tried last play, last five plays, and last ten plays in trying to construct a rating system. I also tried top five plays, top ten plays, all plays over 10%, WPA – .03 per play (yielding the bizarre result of games with negative excitement).
Eventually I tried a simple power transformation on EVERY play. First, I tried summing the squares of the probability changes, like any good statistician would.
When I did that, the 1985 game rated 10th and the 1988 game rated 5th. That is the wrong order, and both games are rated too high. Then I tried other powers: the Goldilocks approach, looking for the one that was just right.
Power | 1985 NLCS G6 rank | 1988 WS G1 rank |
2.0 | 10th | 5th |
1.9 | 12th | 8th |
1.8 | 15th | 20th |
1.7 | 23rd | 25th |
1.6 | 32nd | 36th |
1.5 | 38th | 51st |
1.4 | 53rd | 76th |
1.3 | 61st | 104th |
1.2 | 79th | 133rd |
1.1 | 100th | 158th |
1.0 | 116th | 185th |
Everything above 1.7 was eliminated since it rated 1988 better than 1985
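To see why the choice of power can reorder two games, here is a small sketch that scores two invented games across several exponents. Both lists of swings are hypothetical, chosen only to contrast a game built on one huge play with a game built on steady medium-sized plays; they are not real play-by-play data, and `power_score` is a name made up for this example.

```python
def power_score(swings, power):
    """Sum of each swing raised to the given power."""
    return sum(abs(s) ** power for s in swings)


# Hypothetical games: one monster play versus ten solid ones,
# each padded with twenty routine plays worth 0.03 apiece.
one_big_play = [0.80] + [0.03] * 20
steady_action = [0.15] * 10 + [0.03] * 20


def better_game(power):
    """Name of the invented game that scores higher at this power."""
    a = power_score(one_big_play, power)
    b = power_score(steady_action, power)
    return "one_big_play" if a > b else "steady_action"


for power in (1.0, 1.3, 1.6, 2.0):
    print(power, better_game(power))
```

At a power of 1.0 the steady game wins on sheer volume, while squaring the swings lets the single big play dominate. The exponent is the dial between those two extremes.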
Here’s some shorthand I’m going to use:
Game 6 of the 1985 NLCS: STL 7, LA 5 in 9 innings — WPA 4.48 (9-4-2-1)
Game 1 of the 1988 WS: LA 5, OAK 4 in 9 innings — WPA 3.98 (5-2-2-1)
The 1985 game had 9 plays rated >= 0.1, 4 plays rated >= 0.2, 2 plays rated >= 0.3 and 1 play rated >= 0.5
The 1988 game had 5 plays rated >= 0.1, 2 plays rated >= 0.2, 2 plays rated >= 0.3 and 1 play rated >= 0.5
For a sense of scale, the average game is WPA 2.67 (4.89-0.88-0.33-0.03)
(You can check the examples listed below on BBRef to get more detail on each game)
Checking 1.7, both exhibits rated higher than
Game 2 of the 2017 WS: HOU 7, LA 6 in 11 innings — WPA 5.30 (10-5-3-0)
Game 1 of the 2015 WS: KC 5, NYM 4 in 14 innings — WPA 6.36 (16-3-1-0)
1.7 weights the big plays too much
Checking 1.6, both test games rated higher than
Game 6 of the 1986 WS: NYM 6, BOS 5 in 10 innings — WPA 5.14 (16-3-3-0)
Game 6 of the 1986 NLCS: NYM 7, HOU 6 in 16 innings — WPA 5.80 (11-3-2-0)
1.6 weights the big plays too much
Checking 1.5,
the 1985 game rated higher than
Game 6 of the 1986 WS: NYM 6, BOS 5 in 10 innings — WPA 5.14 (16-3-3-0)
The 1988 game rated higher than
Game 4 of the 2001 WS: NYY 4, ARI 3 in 10 innings — WPA 4.58 (10-3-2-0)
1.5 weights the big plays too much, but it’s getting hard to find clear mistakes
Checking 1.4,
the 1985 game rated higher than
Game 3 of the 1976 NLCS: CIN 7, PHI 6 in 9 innings — WPA 4.72 (14-3-2-0)
Lead changes in the 7th, 8th and 9th innings.
The 1988 game rated higher than
Game 4 of the 1986 ALCS: CAL 4, BOS 3 in 11 innings — WPA 4.64 (7-4-2-0)
1.4 weights the big plays too much, but I’m now splitting hairs
Checking 1.3, I like this one. Let me check 1.2
Checking 1.2,
the 1985 game rated lower than
Game 2 of the 1996 ALDS: NYY 5, TEX 4 in 12 innings — WPA 5.02 (8-2-0-0)
Game 2 of the 1990 WS: CIN 5, OAK 4 in 10 innings — WPA 4.50 (10-2-0-0)
1.2 weights the big plays too little. Famous games are losing to games without any highlights.
So, I think 1.3 is the sweet spot.
My rating score = Sum((change in probability between plays)^1.3) * 2
The *2 at the end is purely cosmetic. It allows the very best game to score close to ten.
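A minimal sketch of that rating, assuming the change on each play is taken as an absolute value before being raised to the power. The function name and the two toy games are invented for illustration; they simply show that two games with the same base WPA can receive different power ratings.

```python
def power_rating(win_probs, power=1.3, scale=2.0):
    """scale * sum(|change in win probability| ** power) over every play."""
    changes = (abs(after - before) for before, after in zip(win_probs, win_probs[1:]))
    return scale * sum(change ** power for change in changes)


# Two hypothetical games with the same base WPA of 0.40:
one_swing = [0.5, 0.9]                      # a single 40% play
four_swings = [0.5, 0.6, 0.5, 0.6, 0.5]     # four 10% plays

print(round(power_rating(one_swing), 3))
print(round(power_rating(four_swings), 3))
```

With `power=1.0` and `scale=1.0` the function reduces to the base WPA sum, so the two toy games tie. At 1.3 the single large swing rates higher.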
With base WPA, Gibson’s homer (.87) is worth about 25x a normal play (.035). With modified WPS it’s worth about 75x a normal play. Raising all the plays to the 1.3 power means that Gibson’s homer is now worth about 65x a typical play.
With base WPA, Clark’s homer (.74) is worth about 21x a normal play (.035). With modified WPS it’s worth about 42x a normal play. Raising all the plays to the 1.3 power means that Clark’s homer is now worth about 53x a typical play.
With a little algebra,
WPA: Gibson = 1.18 * Clark
WPS: Gibson = 1.76 * Clark
Power 1.3: Gibson = 1.23 * Clark
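The multiples and ratios quoted above can be reproduced with a short script. It assumes that under modified WPS a typical play is counted once, Gibson's homer three times (base, top-three bonus, last-play bonus) and Clark's homer twice (base and top-three bonus only, since it did not end the game). The constant and function names are made up for this check.

```python
TYPICAL_PLAY = 0.035  # "normal play" swing quoted in the text
GIBSON = 0.87         # 1988 WS Game 1 homer (also the final play)
CLARK = 0.74          # 1985 NLCS Game 6 homer (not the final play)


def multiples(play, times_counted_by_mod_wps, power=1.3):
    """How many typical plays one big play is worth under each scheme."""
    return {
        "WPA": play / TYPICAL_PLAY,
        "ModWPS": times_counted_by_mod_wps * play / TYPICAL_PLAY,
        "Power": (play / TYPICAL_PLAY) ** power,
    }


gibson = multiples(GIBSON, 3)  # base + top-three bonus + last-play bonus
clark = multiples(CLARK, 2)    # base + top-three bonus only

ratios = {name: gibson[name] / clark[name] for name in gibson}
for name in ratios:
    print(f"{name}: Gibson {gibson[name]:.0f}x, Clark {clark[name]:.0f}x, "
          f"ratio {ratios[name]:.2f}")
```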
A nice property of the transformation is that when the change in odds doubles, the play is worth about two and a half times as much (2.46x).
EXCITEMENT IS NOT LINEAR
A 10% play is now worth 2.46 times as much as a 5% play
A 20% play is now worth 2.46 times as much as a 10% play
A 50% play is now worth 2.46 times as much as a 25% play
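The constant 2.46 does not depend on the size of the play, because a power function scales multiplicatively. Writing x for the change in win probability on a play (a symbol introduced here only for illustration):

```latex
\[
\frac{(2x)^{1.3}}{x^{1.3}} = 2^{1.3} \approx 2.46
\]
```

The x cancels, which is why the same factor applies to a 5% play, a 10% play, or a 25% play.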
The system has a single parameter applied to ALL plays, so a game isn’t screwed if it has four great plays or the best play comes in the 8th inning. Ranking games this way, here are the five games better than, and worse than, my two test cases.
Series | Road Team | Home Team | Innings | (WPA^1.3)*2 | WPA | Top Play | # Plays | P>=.1 | P>=.2 | P>=.3 | P>=.5 |
2014 ALCS G1 | Royals 8 | Orioles 6 | 10 | 5.00 | 5.14 | 35.0% | 96 | 13 | 3 | 2 | – |
1935 WS G3 | Tigers 6 | Cubs 5 | 11 | 4.97 | 5.02 | 36.0% | 96 | 15 | 5 | 1 | – |
1976 NLCS G3 | Phillies 6 | Reds 7 | 9 | 4.95 | 4.72 | 46.0% | 82 | 14 | 3 | 2 | – |
2015 ALDS2 G2 | Rangers 6 | Blue Jays 4 | 14 | 4.93 | 5.46 | 37.0% | 115 | 7 | 2 | 1 | – |
1997 ALCS G4 | Orioles 7 | Indians 8 | 9 | 4.92 | 4.92 | 38.0% | 88 | 16 | 4 | 1 | – |
1985 NLCS G6 | Cardinals 7 | Dodgers 5 | 9 | 4.92 | 4.48 | 74.0% | 83 | 9 | 4 | 2 | 1 |
1975 NLCS G3 | Reds 5 | Pirates 3 | 10 | 4.88 | 4.52 | 55.0% | 81 | 14 | 3 | 3 | 1 |
1933 WS G4 | Giants 2 | Senators 1 | 11 | 4.87 | 4.94 | 55.0% | 92 | 9 | 3 | 1 | 1 |
2011 ALCS G2 | Tigers 3 | Rangers 7 | 11 | 4.86 | 5.10 | 34.0% | 92 | 13 | 3 | 1 | – |
2012 ALDS2 G2 | Athletics 4 | Tigers 5 | 9 | 4.86 | 4.86 | 41.0% | 85 | 11 | 4 | 1 | – |
1999 NLCS G6 | Mets 9 | Braves 10 | 11 | 4.85 | 5.12 | 26.0% | 108 | 14 | 3 | – | – |
Series | Road Team | Home Team | Innings | (WPA^1.3)*2 | WPA | Top Play | # Plays | P>=.1 | P>=.2 | P>=.3 | P>=.5 |
1952 WS G5 | Dodgers 6 | Yankees 5 | 10 | 4.51 | 4.70 | 44.0% | 92 | 10 | 4 | 1 | – |
1923 WS G1 | Giants 5 | Yankees 4 | 9 | 4.51 | 4.54 | 40.0% | 78 | 12 | 2 | 2 | – |
1984 NLCS G4 | Cubs 5 | Padres 7 | 9 | 4.51 | 4.54 | 37.0% | 83 | 10 | 4 | 2 | – |
1992 WS G2 | Blue Jays 5 | Braves 4 | 9 | 4.50 | 4.40 | 65.0% | 85 | 11 | 1 | 1 | 1 |
1998 ALCS G2 | Indians 4 | Yankees 1 | 12 | 4.48 | 4.78 | 33.0% | 96 | 11 | 3 | 1 | – |
1988 WS G1 | Athletics 4 | Dodgers 5 | 9 | 4.47 | 3.98 | 87.0% | 82 | 5 | 2 | 2 | 1 |
2000 NLCS G2 | Mets 6 | Cardinals 5 | 9 | 4.46 | 4.66 | 32.0% | 91 | 13 | 3 | 2 | – |
2016 NLDS2 G5 | Dodgers 4 | Nationals 3 | 9 | 4.46 | 4.66 | 21.0% | 84 | 14 | 1 | – | – |
1977 WS G1 | Dodgers 3 | Yankees 4 | 12 | 4.45 | 4.80 | 30.0% | 97 | 11 | 2 | 1 | – |
1954 WS G1 | Indians 2 | Giants 5 | 10 | 4.43 | 4.74 | 29.0% | 89 | 11 | 1 | – | – |
1958 WS G1 | Yankees 3 | Braves 4 | 10 | 4.43 | 4.56 | 40.0% | 88 | 10 | 3 | 2 | – |
I hope you’ll look at these and see that while they have different shapes, they all contain a similar ‘volume’ of excitement.
Another way to evaluate the method is to look at games with the same WPA. Going back to where I began in this article, here are the seven games with a base WPA of 4.52 (No promises that BBRef has not revised the scores since I captured the data…). They are each tied for the 108th highest WPA. But after using the 1.3 power factoring, you get this:
Game | Outcome | Rank | (WPA^1.3)*2 | WPA | # Plays | Top 5 Plays | # Plays 30-70% | P>=.1 | P>=.2 | P>=.3 | P>=.5 |
1960 WS G7 | Pit 10 NYY 9 in 9 | 52 | 5.10 | 4.52 | 77 | 1.74 | 25 | 15 | 4 | 3 | 1 |
1975 NLCS G3 | Cin 5 Pit 3 in 10 | 63 | 4.88 | 4.52 | 81 | 1.60 | 49 | 14 | 3 | 3 | 1 |
1911 WS G3 | A’s 3 Giants 2 in 11 | 110 | 4.41 | 4.52 | 86 | 1.10 | 58 | 15 | 3 | 1 | – |
1998 NLCS G1 | SD 3 Atl 2 in 10 | 117 | 4.36 | 4.52 | 84 | 1.10 | 59 | 11 | 2 | 1 | – |
2011 NLDS2 G5 | Ari 3 Mil 2 in 10 | 119 | 4.35 | 4.52 | 85 | 1.05 | 71 | 13 | 2 | 1 | – |
1926 WS G5 | NYY 3 StL 2 in 10 | 130 | 4.20 | 4.52 | 86 | 0.84 | 66 | 16 | 1 | – | – |
1995 NLCS G2 | Atl 6 Cin 2 in 10 | 139 | 4.12 | 4.52 | 95 | 0.75 | 70 | 13 | – | – | – |
1960 gets the love it deserves, moving up 56 spots to the 52nd best game. That is despite having the fewest plays in the 30%-70% victory range. Games with more plays do worse, since that means they have smaller-impact plays on average. Think of the Top 5 plays as the highlight reel for the game. 1995 NLCS Game 2 has no play of 0.2 or more and therefore drops 31 spots in the rankings.
Adjusted WPS? Weighted WPS? Power WPS? I really do need to give it a proper name.
A final example, from among the greatest playoff games ever.
2000 NLDS G3: Mets 3, Giants 2 in 13 innings — ModWPS Rank = 11, PowerWPS Rank = 22
1986 ALCS G5: Red Sox 7, Angels 6 in 11 innings — ModWPS Rank = 22, PowerWPS Rank = 12
1980 NLCS G5: Phillies 8, Astros 7 in 10 innings — ModWPS Rank = 25, PowerWPS Rank = 14
The 2000 game had the highest base WPA of the three, partly because it had more plays. ModWPS likes it more due to the additional action and the walk-off homer, which the better top-three plays in 1980 and 1986 could not overcome.
Year | WPA | Plays | Last Play | Top-3 | ModWPS |
2000 | 6.34 | 109 | 0.42 | 0.98 | 7.74 |
1986 | 5.86 | 97 | 0.05 | 1.42 | 7.33 |
1980 | 6.06 | 93 | 0.04 | 1.11 | 7.21 |
So why do I think 1986/1980 are better?
Because the deeper you go beyond the top three, the better the other two are revealed to be.
Measure | 2000 | 1986 | 1980 |
Sum of Top-5 Plays | 1.28 | 1.94 | 1.61 |
Top-5 Plays (%) | 42-31-25-16-14 | 73-35-34-32-20 | 40-38-35-26-24 |
Sum of Top-10 Plays | 1.88 | 2.77 | 2.43 |
Plays at 10%-20%-30%-50% | 16-3-2-0 | 14-5-4-1 | 17-6-3-0 |
Or simply check the line scores.
2000
0 0 0 2 0 0 0 0 0 0 0 0 0 (2) Giants
0 0 0 0 0 1 0 1 0 0 0 0 1 (3) Mets
1986
0 2 0 0 0 0 0 0 4 0 1 (7) Red Sox
0 0 1 0 0 2 2 0 1 0 0 (6) Angels
1980
0 2 0 0 0 0 0 5 0 1 (8) Phillies
1 0 0 0 0 1 3 2 0 0 (7) Astros
The 2000 game IS a fabulous game. But the 1986 and 1980 games are more epic, with all the late-inning heroics. The 2000 game has exactly the required three big plays and the walk-off. It checks all the boxes.
I do kinda feel bad writing this. It sounds like I’m just picking on modified WPS here. LOOK AT WHAT ELSE IT GOT WRONG…
But as I said before, Power WPS is barely better. And to show that it’s better at all, I need to show those rare cases where it makes a better call. Modified WPS was also an excellent benchmark: comparing differences between it and my sixty-eleven schemes helped me identify the flaws in sixty-ten of them.
Of course, even this is not the perfect system. Any play-by-play method will still fail to capture the in-play action. A bases-empty foul pop-out rates exactly the same as a bases-empty thrown-out-at-home-trying-to-stretch-a-triple. But it is the best we can do for now.
Whereas I used to guess my line score method captured maybe 70% of the excitement of a game, PBP ratings must be capturing upwards of 90%. Which means greater confidence in game rankings and playoff series ratings.
Anyway, if anyone has any thoughts, feedback, or questions, I’d love to hear them. If no one can shoot the idea full of holes, or even one hole, then comes ranking and lists of games and series.