Base Rate Fallacy

This blog is about DTRAP, but I also want to talk about doing research. This is one of those “doing research” topics that I think is important to effective research.

Cognitive fallacies can be a major problem in research. Knowing what they are and how to avoid them is important. One that is very common in cybersecurity research is the Base Rate Fallacy.

Rather than jumping straight to the definition, I’m going to start with an example. I once had a data set of 100 million domain names that I was using for research. I wanted to find malicious domains, so I designed a method to find them in this data set. I’m pretty good at this, so let’s say my method is 99% correct. In other words, it flags 99 out of every 100 malicious domains, and it wrongly flags only 1 out of every 100 benign domains. That 1 is a false positive: I say the domain has the behavior, but it doesn’t.

Now let’s suppose I know that only 1 out of every 10,000 domains in that set of domains is malicious. That’s a pretty small number, right? But I have my method and I’m right 99% of the time.

Well, actually, if I use some probability (it’s an application of Bayes’ Rule), I can compute the probability that a domain really is malicious given that my method flagged it. It’s about 1%.
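To make the arithmetic concrete, here is a small sketch of that Bayes’ Rule calculation. It assumes that “99% correct” means the method flags 99% of malicious domains and wrongly flags 1% of benign ones. The function and variable names are mine, purely for illustration.

```python
def probability_malicious_given_flagged(base_rate, true_positive_rate, false_positive_rate):
    """Bayes' Rule: P(malicious | flagged by the method)."""
    # P(flagged and malicious) = P(flagged | malicious) * P(malicious)
    flagged_and_malicious = true_positive_rate * base_rate
    # P(flagged and benign) = P(flagged | benign) * P(benign)
    flagged_and_benign = false_positive_rate * (1.0 - base_rate)
    # Divide by the total probability of being flagged at all.
    return flagged_and_malicious / (flagged_and_malicious + flagged_and_benign)


posterior = probability_malicious_given_flagged(
    base_rate=1 / 10_000,       # 1 in 10,000 domains is malicious
    true_positive_rate=0.99,    # assumed: 99% of malicious domains are flagged
    false_positive_rate=0.01,   # assumed: 1% of benign domains are flagged
)
print(f"{posterior:.2%}")  # prints 0.98%
```

Try changing the base rate to see how quickly the answer moves. The 99% never changes, but the result does.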

That sounds wrong, but it isn’t, I promise. It’s because the base rate of maliciousness is so small in the set of domains. It also means that if someone uses my method to find maliciousness in that set of domains, they’re going to find far more false positives than true positives: roughly 100 false positives for every true positive.
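Counting domains instead of computing probabilities may make this easier to believe. The sketch below uses the numbers from my example (100 million domains, 1 in 10,000 malicious) and again assumes that the method flags 99% of malicious domains and 1% of benign ones. The variable names are illustrative.

```python
TOTAL_DOMAINS = 100_000_000

# Base rate: 1 out of every 10,000 domains is malicious.
malicious = TOTAL_DOMAINS // 10_000
benign = TOTAL_DOMAINS - malicious

# Assumed behavior of the method: flags 99% of malicious domains,
# and wrongly flags 1% of benign domains.
true_positives = malicious * 99 // 100
false_positives = benign * 1 // 100

print(f"malicious domains: {malicious:,}")        # 10,000
print(f"true positives:    {true_positives:,}")   # 9,900
print(f"false positives:   {false_positives:,}")  # 999,900
print(f"false positives per true positive: {false_positives / true_positives:.0f}")  # 101
```

So out of roughly a million flagged domains, fewer than ten thousand are actually malicious.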

This is a common problem in cybersecurity. There’s almost a googol of potential domains. So even if 10 million of them were malicious, that’s a very tiny percentage. Any method that attempts to find maliciousness based on the domain name (whether or not it was registered) is going to give more false positives than true positives.

If you have someone trying to use that method, like a member of a Security Operations Center, then they’re going to get really annoyed by the number of false results compared to the number of real results.

There’s so much data to analyze out there that we have to be careful of this fallacy and aware of how it can affect the consumption of our results. It sounds really nice that my method can find malicious domains 99% of the time, but with the amount of data that is available in this field, it’s actually far less help to the practitioner than that number suggests.

We want our research to be used, but if we’re annoying the practitioner more than helping, it won’t be.
