I’m going to start this discussion by reminding everyone that I’m a mathematician, not a statistician. We’re two different creatures. A mathematician determines whether the conclusions can be deduced from the initial conditions; a statistician determines whether the initial conditions support the conclusions. (I’m more or less quoting a statistician friend of mine. I’m still not a statistician.)
That being said, I want to talk about a statistical problem I often see in cybersecurity: the reliance on convenience samples. In statistics, you usually work with a subset of the population, called a sample. That’s because, well, imagine trying to study the entire population of domains. That’s a lot, considering there’s at least a googol of possible domain names. Or every IP address… including the IPv6 addresses. In mathematical terms, that’s ‘a big number’, also known as ‘a lot of data’.
That subset should be representative of the population. I don’t mean elected representatives; I mean that the subset should look as much like the population as possible. When it does, we can use statistics to infer things about the population from the sample. We can’t do that if the sample isn’t representative.
Meaning that if my population is university students, my sample shouldn’t come only from the math department. Or if my population is all software, my sample can’t be ‘what I have running on my computer’. Or if my population is the ‘set of the most popular websites in the world’, then my sample can’t be ‘those websites my employees go to’.
All three of those samples are called convenience samples. They’re data I happen to have access to. I happened to be part of a math department, so I could walk out the door and ask some math students. Or I happen to have my computer, so I have a list of software I can use. Or I happen to work at a place and have access to local traffic, so I can get a list of domains.
Convenience samples aren’t representative of the entire population. The software I happen to have on my computer isn’t representative of all software out there; it’s just the software I happen to use.
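To make that concrete, here’s a small Python sketch of the university example. Everything in it is made up for illustration: the population size, the 5% share of math students, and the ‘math courses taken’ numbers are all assumptions I picked, not real data. It compares the average from a simple random sample with the average from a convenience sample of math students only.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic population (all numbers invented for illustration):
# 10,000 students, roughly 5% of them in the math department.
# Each student is a pair (in_math_department, math_courses_taken).
population = []
for _ in range(10_000):
    in_math = rng.random() < 0.05
    courses = rng.gauss(12, 2) if in_math else rng.gauss(3, 2)
    population.append((in_math, max(0.0, courses)))

# The quantity we actually care about: the average over everyone.
true_mean = statistics.mean(c for _, c in population)

# Simple random sample: every student is equally likely to be picked.
random_sample = rng.sample(population, 200)
random_mean = statistics.mean(c for _, c in random_sample)

# Convenience sample: the first 200 math students I can walk up to.
math_students = [s for s in population if s[0]]
convenience_sample = math_students[:200]
convenience_mean = statistics.mean(c for _, c in convenience_sample)

print(f"population mean:         {true_mean:.2f}")
print(f"random sample mean:      {random_mean:.2f}")
print(f"convenience sample mean: {convenience_mean:.2f}")
```

Under these assumptions the random sample should land close to the population average, while the math-only sample lands far above it. Both samples have 200 students; the only difference is how they were chosen.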
If you use a convenience sample in your research, all you can talk about is what that sample says. It isn’t indicative of the entire world; it’s just what you happened to see. If you use one, make sure you write down why you chose it. Explain in what ways you think it might resemble your population, but also acknowledge that it isn’t a representative sample.