Statistics — A Mathematician’s View

I know I’ve said this before, but it bears repeating.  I am not a statistician.  I am a mathematician.  My field of study is Algebraic Topology, which I usually describe as ‘that field string theorists use’.  I also call it ‘abstract.  I mean, really abstract’, also known as ‘my mathematical happy place’.

Yes, mathematicians are weird.

Anyway, back to my point.  I prove things.  I take theorems and write proofs.  For example, the Pythagorean Theorem.  Everyone who has taken Geometry has heard of this one:

In a right triangle, the area of the square whose side is the hypotenuse (the side opposite the right angle) is equal to the sum of the areas of the squares on the other two sides.

This has a proof, which isn’t relevant to this discussion so I’m going to leave it out.  But it means that for every right triangle, this is true.  The square of the hypotenuse is equal to the sum of the squares of the other two sides.  If you have a right triangle, it’s true.  It was proven to be true.

I’m going to emphasize that again.  All right triangles have this property.  It’s a proven truth.

Statistics isn’t like that.  Statistics isn’t “If these conditions hold, this WILL happen” at all.  Statistics says “Well, within this margin of error, it could happen.  But we could be wrong.”

Statistics isn’t math.  Math says “Given these conditions, we can prove this will result”.  Statistics says “I took this sample of this population, and within these errors, we expect this to be true”.  See that ‘expect’?  It isn’t a law.  It isn’t a fact.  It’s a well-reasoned guess.

Statistics isn’t reality.  It’s our attempt to make sense of reality but it shouldn’t be confused with a mathematical proof that says “This is what will happen”.

I read a great quote in a paper recently:  “a core problem is that both scientists and the public confound statistics with reality”.  People assume that because statistics showed something, it must be true, when this isn’t necessarily the case.

Am I saying statistics is useless?  Not at all.  It’s very useful but we shouldn’t assign more meaning to it than what it is.  It is our approximation of what could happen, not what will happen.

Statistics is also key in research.  We can’t always research an entire population, but we can research (hopefully) a representative subset… of the right size.  We can then use statistics to tell us what could (notice the could) be true about the entire population.  And note the part about it being a “representative subset… of the right size”.  That part is very important for any inference about the entire population.  If the sample is too small or isn’t representative, you can’t reliably infer things about the population as a whole.

I’m going to use a Cybersecurity example since this is a blog for DTRAP.  Suppose I’ve got a huge set of domains.  Huge.  Enormous, even.  And I’m going to pretend (because I’m writing this) that it’s a representative sample of all the possible domains in the world.  That is, pretend I somehow had access to the entire population of domains and drew my set from it at random.

With a new method I’ve devised through research, I’ve found that 2% of the set I have is malicious.  (This isn’t true.  This is made up.  No one should take that example as real.  This isn’t the fact you’re looking for.)

Statistics lets me generalize that to the entire population, where the generalization is dependent on the size of my sample.  Without stats, all I can say is that within my magic data set, this is true.  If you don’t have my data set, you can’t say the same thing.
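To make “dependent on the size of my sample” a bit more concrete, here is a minimal sketch using my made-up 2%.  It assumes a truly random sample and uses the textbook normal approximation for a 95% confidence interval on a proportion.  The function name and the sample sizes are mine, picked purely for illustration, and this is only one common way of putting error bars on a proportion.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Approximate confidence interval for a proportion.

    Normal approximation; z=1.96 corresponds to roughly 95% confidence.
    Assumes the n items were sampled at random from the population.
    """
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical sample sizes, all with the same made-up 2% malicious rate.
for n in (1_000, 100_000, 10_000_000):
    low, high = proportion_interval(0.02, n)
    print(f"n = {n:>10,}: plausibly between {low:.3%} and {high:.3%}")
```

The observed 2% is the same in every row, but the range of plausible values for the whole population narrows as the sample grows.  Even the narrowest range is still a “could”, not a “will”.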

So while statistics is key to research, remember that it isn’t a fact.  You’re not creating reality with statistics; you’re using it to describe what reality could be like.
