I agree, but trying to apply statistical principles to "absolutely" prove something in a wargame is a fool's errand. There is no standard player, standard list, standard terrain set-up, standard strategy, standard set of schemes, or standardized card randomization. Even if you normalized a data set for all of the above, people would argue over some issue with the normalization.
Moreover, it's a frickin' game. I'd rather paint my models than crunch R-squareds and p-values in order to further argue an opinion on what is, arguably, a subjective matter.
My thoughts on this generally: there's a difference between a model that is game-breakingly overpowered and a model that is very, very good in certain situations and, despite being beatable, takes a disproportionately large amount of resources for the opposing player to overcome, making the game a negative play experience.
To make an analogy to another wargame: Sorcha1's (from Warmachine) ability set before Prime Remix was almost unarguably game-breakingly powerful (especially in conjunction with the tournament rule-set at the time). In contrast, Vlad2 later in Mk I wasn't game-breakingly powerful, but was powerful enough that he dominated the national tournament scene. The model was beatable, but if two players of near-equal skill faced each other, Vlad2 gave one player an edge, all else being equal.
With that in mind, many an internet argument started by "ZoMG! EVlad is teh borkens!" played out similarly to the arguments in this thread. Interestingly enough, Vlad2 was toned down a bit in the 2nd edition of Warmachine; he remains viable but no longer dominates the tournament circuit. At least in my metagame, this has been considered a good thing for the health of the game.
******
I'm not going to argue the specifics of the NB models against that metric, but I think it's a fairer metric to go by than searching for a perfectly objective statistical analysis.