Measuring tank performance

Updated 2020-08-11

Over-Powered (OP) tanks are maybe the 2nd most popular topic in Blitz YouTube videos and online chats, right after “the Matchmaker”. I have long held the view that if a tank is “OP”, this has to be visible in statistics. Otherwise there are only qualitative / subjective views left, and those come in all sorts.

[Figure: plot of chunk fig_tanks_tier_WR_topN]

People are susceptible to all kinds of biases in their thinking. Getting ammoracked by a Death Star will raise suspicion that the Death Star is an OP tank even though it is actually the worst tier X tank. Anecdotal experiences distort opinions and no one remembers those countless battles where a Death Star was at the bottom of the list. Therefore, tank performance is not a thing that we should vote about, but a thing we can measure with statistics.

How to measure performance in the game?

Let’s first discuss how to measure performance in the game. I am a proponent of win rate (WR) being the best measure of performance (player or tank): not average damage, not average kills, not speed, not alpha damage, nor any other attribute or characteristic. The reason for choosing win rate over other variables is that winning is the objective of the game, and all the damage, kills, spotting etc. are just means to win the game. Why measure proxy variables when you can measure the final variable itself?

There are some caveats in using WR as a performance measure:

  1. It requires many battles for the WR to settle near one’s performance level due to both MM and RNG: it takes 400 battles to reach +/- 5% accuracy, 10 000 battles to reach +/- 1% accuracy and a whopping 1 million battles to reach +/- 0.1% accuracy at the 95% confidence level. Check this link at PC WoT forums for details.
  2. Platoon rate impacts WR but cannot be separated well from the statistics since WG does not publish platoon rate per tank played, only as an aggregate over all the tanks. Platooning with a good player can lift one’s WR by 5-15%.
  3. Career WR measures the historical average, not one’s present performance level, and it reacts slowly once the player has a lot of (10k+) battles.
  4. WG’s new “newbie MM queue” has distorted the Career stats for rerollers big time. This distorts both tank and player average WR. (Just ignore global & career WR).
  5. Some tanks are more powerful than others. Comparing different players’ WR in different tanks or global WR does not tell us much.
  6. Different tiers have different levels of difficulty. Global / Average WR is close to useless for measuring player / tank performance.
  7. Stock tanks’ performance is significantly lower compared to when maxed-out.
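The battle counts in caveat 1 can be reproduced with the usual normal-approximation margin of error for a proportion. This is only a minimal sketch: it assumes every battle is an independent coin flip with a fixed win probability (50% by default) and ignores draws. The function name `wr_margin_of_error` is made up for this example.

```python
import math
from statistics import NormalDist


def wr_margin_of_error(battles: int, win_rate: float = 0.5, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation confidence interval for a win rate.

    Assumption: battles are independent and the true win probability is constant.
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(win_rate * (1 - win_rate) / battles)


for n in (400, 10_000, 1_000_000):
    print(f"{n:>9} battles: +/- {wr_margin_of_error(n):.2%}")
```

With these assumptions the margins come out at roughly 4.9%, 0.98% and 0.098%, in line with the rounded figures above.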

But other performance measures have issues too and can be gamed. The easiest way to increase average damage is to play more high tiers and more TDs. WN8 can be gamed by playing popular tech tree tanks that are difficult for below-average players and not too popular among the unicums.

Despite all the issues related to WR, I consider it the best performance measure over a large number of battles and, in the case of tanks, over a large number of players, since it directly measures the objective of the game (= to win battles). It is also a more understandable measure than somewhat abstract indexes. But I believe performance indicators like WN8, which are based on input stats (average damage, kills, spots), give a more accurate view of players’ short-term performance (< 100 battles) than WR.

Now going back to the tank performance.

So Average WR it is then, right?

Not so fast. Average WR of a tank is a good starter, but it has its own biases. Let’s have a look at two tier IX mediums: AMX 30 1er prototype and Prototipo Standard B. Everything here is based on 6.9 data.

[Figure: plot of chunk fig_Tank_WR]

Tank                    Average WR   Players
Prototipo Standard B    52.5%        9 060
AMX 30 1er prototype    56.4%        4 370

Both tanks have been played by thousands of players, but the AMX has a significantly higher average WR. Many would be tempted to claim the AMX is a better tank than Standard B. But is it?

As the t-test below shows, the difference in average WR between the tanks is statistically significant at any reasonable confidence level.

## 
## 	Welch Two Sample t-test
## 
## data:  res[Tank == tanks.post.200608.tank_perf[[1]], `Average WR`] and res[Tank == tanks.post.200608.tank_perf[[2]], `Average WR`]
## t = -20.276, df = 7143.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04359091 -0.03590513
## sample estimates:
## mean of x mean of y 
## 0.5245230 0.5642711
## p-value: 6.4e-89
## Null hypothesis of the tanks's Relative WR being equal has to be rejected (i.e. there is a statistically significant difference at 99% confidence level

Tank performance is a different variable from player skill

Let us first consider a single player and the factors affecting the player’s chances to win one battle:

Chance of winning ~ player skill * tank performance + RNG + team performance - enemy team performance

Over a large number of battles the variables team performance and enemy team performance approach their respective averages and the impact of RNG approaches zero (“What RNG gives, RNG takes”). The more battles one plays in a tank, the more one’s WR in it depends on just two factors:

WR ~ player skill * tank performance

And to be more precise, player skill is average player skill in the tank in question. Platooning has been ignored here since unfortunately WG does not publish very usable platooning stats via their API.
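The averaging-out argument can be illustrated with a toy simulation. Everything in it is an assumption made for the sketch: the base winning chance stands in for “player skill * tank performance”, and the combined team / enemy team / RNG effect is modelled as a zero-mean random swing per battle. It is not a model of the real matchmaker.

```python
import random


def simulate_wr(base_chance: float, battles: int, swing: float = 0.2, seed: int = 1) -> float:
    """Simulate a win rate when each battle's winning chance is base_chance
    plus a zero-mean random swing (hypothetical stand-in for teams and RNG)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(battles):
        # Own team minus enemy team minus bad luck: uniform in [-swing, +swing]
        battle_chance = base_chance + rng.uniform(-swing, swing)
        if rng.random() < battle_chance:
            wins += 1
    return wins / battles


for n in (100, 10_000, 200_000):
    print(f"{n:>7} battles: WR = {simulate_wr(0.55, n):.2%}")
```

With few battles the simulated WR can land well away from 55%, but as the battle count grows it settles close to the base chance, because the per-battle swings cancel out.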

Saying “any tank is OP in good players hands” is the same as saying “any car is fast in the hands of a good driver”. While a good driver can make better-than-average lap times with a slow car, a good driver does not make the car any faster, but is just … a good driver. Give the driver a faster car and they can drive even faster.

TL;DR. Tank performance and player skill are two separate things

Tanks have different playerbases

Let’s now go back to the AMX 30 1er prototype vs. Standard B example and compare the playerbases. I have chosen player average WR at the tier as a measure for “player skill”, and more precisely as measured during the update under study (i.e. not career WR). This eliminates a couple of biases:

  • WR during the update measures the player’s current performance unlike Career WR that measures the average historical performance (re-rollers vs. normal players)
  • It measures player performance at the tier in question unlike average WR over all the tiers, and thus is not distorted by low-tier stats-padding

The plot below shows the player WR distribution at tier IX (in any tank) for both AMX 30 1er prototype and Standard B players. It is clear that the AMX 30 1er prototype is played by better players than Standard B. Stock tank battles (Standard B) have been filtered out based on an estimate of battles required to max-out the tank (139).

[Figure: plot of chunk fig_player_WR_histogram]

Tank                    Avg player WR at Tier IX
Prototipo Standard B    52.7%
AMX 30 1er prototype    56.3%

While both tanks have been played by thousands of players, the AMX is a premium tank whereas Standard B is a tech tree tank. The players of both tanks are roughly equally experienced (see the histogram below), but the AMX players are significantly better on average.

[Figure: plot of chunk fig_player_battle_count_histogram]

But do the differences in the tanks’ playerbases explain the difference in Average WR or not?

Introducing Relative WR

To separate the players’ skill-level distribution from tank performance, we need to compare players’ WR in a tank to their skill level. As explained above, I have chosen Average WR at the Tier as the measure for player skill. Blitzstars’ Tank-Compare uses players’ average WR (on any tier) as a measure for player skill in its Relative WR graphs. I have chosen to use the WR at the tier in question to eliminate the impact of low-tier stat padding.

Relative WR(tank) = Average(WR in a tank - Player’s WR at the Tier)

In a nutshell, Relative WR shows how much more/less the players are winning with the tank vs. their tier average. The higher the Relative WR, the stronger the tank is.
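As a minimal sketch, the formula above could be computed like this. The record layout (`tank_wr`, `tier_wr`) and the sample numbers are made up for the example. The sketch also assumes an unweighted average over players, i.e. a player with 50 battles in the tank counts as much as one with 500, which may or may not match the real analysis.

```python
from statistics import mean


def relative_wr(players):
    """Average of (WR in the tank - WR at the tier) over all players.

    Assumption: every player counts equally, regardless of battle count.
    """
    return mean(p["tank_wr"] - p["tier_wr"] for p in players)


# Hypothetical players: WR in the tank vs. WR at the tier in any tank
players = [
    {"tank_wr": 0.54, "tier_wr": 0.52},  # does better in the tank
    {"tank_wr": 0.61, "tier_wr": 0.63},  # does worse in the tank
    {"tank_wr": 0.48, "tier_wr": 0.45},  # does better in the tank
]
print(f"Relative WR: {relative_wr(players):+.2%}")
```

For these three made-up players the differences are +2, -2 and +3 percentage points, so the Relative WR is about +1 percentage point.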

[Figure: plot of chunk fig_relativeWR_avg]

Tank                    Average WR   WR at Tier   Relative WR   Players
Prototipo Standard B    52.6%        52.2%        0.43%         7 960
AMX 30 1er prototype    56.9%        56.7%        0.25%         3 570

The Relative WR graph above shows that players perform a bit better with Standard B than the AMX when compared to their tier IX average. The AMX’s average WR is far higher than the Standard B’s, but its playerbase is far better too, on average. When the differences in the playerbase are taken into account, the difference becomes small and to the Standard B’s advantage. The difference is not statistically significant at the 95% confidence level, though (see the significance testing below).

Premium tanks and new higher tier tank lines are often played by better players than average. This distorts the average WR of those tanks and makes people regard the tanks as OP whereas the fundamental reason can be that the tanks are just being played by better players. Yes, there are many borderline-OP & ridiculously-broken premium tanks, but the Relative WR analysis allows us to separate tank performance from skill-level differences in the tanks’ playerbases.

From a statistical perspective the Standard B’s Relative WR is higher than the AMX’s on average, but the p-value of 0.081 is above the limit of 0.05 (95% confidence level), so the difference is not statistically significant at that level.

## 
## 	Welch Two Sample t-test
## 
## data:  res[Tank == tanks.post.200608.tank_perf[[1]], `Relative WR`] and res[Tank == tanks.post.200608.tank_perf[[2]], `Relative WR`]
## t = 1.7469, df = 5971.7, p-value = 0.0807
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0002202523  0.0038259103
## sample estimates:
##   mean of x   mean of y 
## 0.004291730 0.002488901
## p-value: 0.081
## Null hypothesis of the tanks's Relative WR being equal cannot be rejected (i.e. there is not statistically significant difference at 95% confidence level

TL;DR: Relative WR measures how much higher/lower WR players achieve in a tank vs. their average WR at the same tier.

OK, is this wall of text over now? Nope.

Performance within player skill category

The data shows that the Standard B’s Relative WR is slightly higher than the AMX 30 1er prototype’s on average, although the difference is not statistically significant at the 95% confidence level. The caveat here is the words “on average”. How about performance in the hands of below/above average players? Some tanks are known to be difficult for less-skilled players but perform well in the hands of more skilled players.

Let’s see how the Standard B and the AMX perform when played by different player skill categories.
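The split by skill category can be sketched as a simple bucketing step before averaging. The bucket edges follow the categories used below. How boundary values are handled (e.g. whether exactly 50% belongs to the lower or upper category) is my assumption, and the record layout and sample numbers are again hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (lower bound inclusive, upper bound exclusive, label); boundary handling is an assumption
BUCKETS = [
    (0.0, 0.4, "<40%"),
    (0.4, 0.5, "40-50%"),
    (0.5, 0.6, "50-60%"),
    (0.6, 0.7, "60-70%"),
    (0.7, 1.01, ">70%"),
]


def bucket_label(tier_wr: float) -> str:
    """Map a player's WR at the tier to a skill category label."""
    for low, high, label in BUCKETS:
        if low <= tier_wr < high:
            return label
    raise ValueError(f"WR out of range: {tier_wr}")


def relative_wr_by_bucket(players):
    """Return {label: (relative WR, player count)} per skill category."""
    diffs = defaultdict(list)
    for p in players:
        diffs[bucket_label(p["tier_wr"])].append(p["tank_wr"] - p["tier_wr"])
    return {label: (mean(values), len(values)) for label, values in diffs.items()}


# Hypothetical players (made-up numbers)
sample = [
    {"tank_wr": 0.46, "tier_wr": 0.45},
    {"tank_wr": 0.44, "tier_wr": 0.47},
    {"tank_wr": 0.66, "tier_wr": 0.64},
]
for label, (rel_wr, count) in relative_wr_by_bucket(sample).items():
    print(f"{label:>7}: Relative WR {rel_wr:+.2%} ({count} players)")
```

Keeping the player count next to each average matters, since the smallest categories are the ones where a difference is least likely to be statistically significant.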

Performance in the hands of super-good players (WR at tier >70%)

[Figure: plot of chunk fig_relativeWR_gt70]

Tank                    Relative WR   Players
AMX 30 1er prototype    0.22%         267
Prototipo Standard B    0.48%         158

Standard B seems to perform better than the AMX in the hands of super-good players. However, the player sample is on the small side and the difference is not statistically significant (see below).

## 
## 	Welch Two Sample t-test
## 
## data:  res[Tank == tanks.post.200608.tank_perf[[1]], `Relative WR`] and res[Tank == tanks.post.200608.tank_perf[[2]], `Relative WR`]
## t = 0.56165, df = 313.91, p-value = 0.5748
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.006480443  0.011658220
## sample estimates:
##   mean of x   mean of y 
## 0.004814335 0.002225446
## p-value: 0.57
## Null hypothesis of the tanks's Relative WR being equal cannot be rejected (i.e. there is not statistically significant difference at 95% confidence level

Performance in the hands of very good players (WR at tier 60-70%)

[Figure: plot of chunk fig_relativeWR_60_70]

Tank                    Relative WR   Players
Prototipo Standard B    0.4%          1 030
AMX 30 1er prototype    0.35%         1 020

When considering only players with 60-70% WR at tier IX, Standard B is still slightly better, but the margin is minuscule. The p-value is 0.8 so the difference is not statistically significant at any reasonable confidence level.

## 
## 	Welch Two Sample t-test
## 
## data:  res[Tank == tanks.post.200608.tank_perf[[1]], `Relative WR`] and res[Tank == tanks.post.200608.tank_perf[[2]], `Relative WR`]
## t = 0.24816, df = 2029.4, p-value = 0.804
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.003669481  0.004732673
## sample estimates:
##   mean of x   mean of y 
## 0.004038950 0.003507354
## p-value: 0.8
## Null hypothesis of the tanks's Relative WR being equal cannot be rejected (i.e. there is not statistically significant difference at 95% confidence level

Performance in the hands of good players (WR at tier 50-60%)

[Figure: plot of chunk fig_relativeWR_50_60]

Tank                    Relative WR   Players
Prototipo Standard B    0.49%         3 570
AMX 30 1er prototype    0.44%         1 480

For this player category too, Standard B performs a bit better than the AMX. This is a large player category in the overall dataset, but still the difference is not statistically significant (p-value is 0.78).

## 
## 	Welch Two Sample t-test
## 
## data:  res[Tank == tanks.post.200608.tank_perf[[1]], `Relative WR`] and res[Tank == tanks.post.200608.tank_perf[[2]], `Relative WR`]
## t = 0.27339, df = 2353.5, p-value = 0.7846
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.002847024  0.003769455
## sample estimates:
##   mean of x   mean of y 
## 0.004875794 0.004414578
## p-value: 0.78
## Null hypothesis of the tanks's Relative WR being equal cannot be rejected (i.e. there is not statistically significant difference at 95% confidence level

Performance in the hands of below-average players (WR at tier 40-50%)

[Figure: plot of chunk fig_relativeWR_40_50]

Tank                    Relative WR   Players
Prototipo Standard B    0.37%         2 900
AMX 30 1er prototype    -0.16%        682

When analyzing players with 40-50% WR at tier IX, Standard B performs clearly better. Considering that this is a large player group, it is easy to understand why Standard B’s average Relative WR is higher than the AMX’s. My guess is that it is the Standard B’s burst DPM that helps the below-average players, whereas the AMX requires more skill in e.g. ridge fighting to perform, even though it is a very good tank in the hands of a skilled player. Also, for this player group, the difference in Relative WR is statistically significant (p-value 0.015).

## 
## 	Welch Two Sample t-test
## 
## data:  res[Tank == tanks.post.200608.tank_perf[[1]], `Relative WR`] and res[Tank == tanks.post.200608.tank_perf[[2]], `Relative WR`]
## t = 2.4359, df = 903.2, p-value = 0.01505
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.001042643 0.009689074
## sample estimates:
##    mean of x    mean of y 
##  0.003716970 -0.001648889
## p-value: 0.015
## Null hypothesis of the tanks's Relative WR being equal has to be rejected (i.e. there is a statistically significant difference at 95% confidence level

Performance in the hands of well below-average players (WR at tier <40%)

[Figure: plot of chunk fig_relativeWR_lt40]

Tank                    Relative WR   Players
Prototipo Standard B    0.35%         292
AMX 30 1er prototype    -0.66%        112

Here the difference in Relative WR between the tanks grows even larger. Standard B seems to perform significantly better for players with well below-average skills. Again, I suspect the reason is the burst damage. Even though the sample size is getting small, the difference is statistically significant (p-value 0.044).

## 
## 	Welch Two Sample t-test
## 
## data:  res[Tank == tanks.post.200608.tank_perf[[1]], `Relative WR`] and res[Tank == tanks.post.200608.tank_perf[[2]], `Relative WR`]
## t = 2.0256, df = 188, p-value = 0.04422
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0002632513 0.0198760920
## sample estimates:
##    mean of x    mean of y 
##  0.003496051 -0.006573621
## p-value: 0.044
## Null hypothesis of the tanks's Relative WR being equal has to be rejected (i.e. there is a statistically significant difference at 95% confidence level

Final words

As you can see, the question of tanks’ performance is not that straightforward. Even though average WR is such a common measure for tanks’ performance, it fails to separate the impact of the different playerbases from the underlying tank performance. And even a tank’s (average) Relative WR only answers the question on average, not for a very good player or a well below-average player.

To understand how a tank performs in the hands of players of a certain skill level, the Relative WR analysis can still come in handy, as it can be run for players of a specific skill level only. Again, the numbers are averages and there are always players who are relatively better in one tank vs. another. But the Relative WR analysis per skill level can help a player choose tanks they are more likely to perform better with.

In the case of the AMX 30 1er prototype and Standard B, the differences in tank performance are mostly not statistically significant, although the overall difference is not far from it. As any skilled player who has played them knows, both tanks are good. I would call it a draw here since there are far larger imbalances in the game.
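To keep the verdicts straight, the p-values reported by the t-tests above can be checked against a chosen confidence level in one go. This is just a bookkeeping sketch: the helper name `is_significant` is made up, and the p-values are copied from the test outputs in this post.

```python
# p-values copied from the Welch t-test outputs above
P_VALUES = {
    "Average WR, all players": 6.4e-89,
    "Relative WR, all players": 0.0807,
    "Relative WR, >70%": 0.5748,
    "Relative WR, 60-70%": 0.804,
    "Relative WR, 50-60%": 0.7846,
    "Relative WR, 40-50%": 0.01505,
    "Relative WR, <40%": 0.04422,
}


def is_significant(p_value: float, confidence: float = 0.95) -> bool:
    """True when the p-value is below the significance limit (1 - confidence)."""
    return p_value < 1 - confidence


for label, p in P_VALUES.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{label:<26} p = {p:<8.3g} {verdict} at 95%")
```

At the 95% level only the Average WR comparison and the two below-average categories come out as significant. The overall Relative WR difference (p = 0.0807) would pass only at a looser 90% level.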

The graph below shows Relative WR for both tanks as a function of players’ average WR at the tier (IX). The grey area shows the share of players with a particular WR at the tier in the data set. Small sample sizes are likely to cause errors at both ends of the graph.

[Figure: plot of chunk fig_player_RelativeWR]

I find it surprising there were differences between the tanks’ performance in the 60-70% and 70%+ WR player segments. The absolute differences are not that big and this could fall within statistical error tolerances. What may explain the result is the fact that player skill broadly follows a normal distribution, thus there are more players with (tier) WR closer to 50% than further away from it. Therefore the 60-70% WR player segment consists mostly of 60-65% players rather than 65-70% players. And there is still quite a difference in skill between 60-65% and 70%+ players, as I can observe every time I watch e.g. Juicy Tender Steak or HisRoyalFatness playing tanks.

That’s all folks - this time