False Positive/False Negative

I posted about the Base Rate Fallacy earlier this summer. I want to expand on that discussion and talk about why I think False Positives and False Negatives are very important to report.

To summarize the Base Rate Fallacy: it bites when the base rate of occurrence in the data is very small. Even if my method is very good, when the pattern I’m looking for occurs very rarely in the data, most of the things the method flags can easily be false alarms. So when it says “found one!”, I’m more likely to be wrong than to be correct.
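To put numbers on that, here is a minimal sketch of the arithmetic using Bayes' rule. The function name and the rates are made up for illustration; they don't come from any real detector.

```python
def precision_from_rates(base_rate, true_positive_rate, false_positive_rate):
    """Fraction of flagged items that really are what we were looking for."""
    # Share of all items that are real hits and get flagged.
    true_alarms = true_positive_rate * base_rate
    # Share of all items that are not hits but get flagged anyway.
    false_alarms = false_positive_rate * (1.0 - base_rate)
    return true_alarms / (true_alarms + false_alarms)


# Assumed numbers: 1 file in 1,000 is malware, the method catches 99% of
# malware and wrongly flags 1% of clean files.
chance_flag_is_real = precision_from_rates(0.001, 0.99, 0.01)
print(f"{chance_flag_is_real:.1%}")  # prints 9.0%
```

With those assumed rates, a method that sounds 99% accurate is right about only one flag in eleven, because the clean files outnumber the malware so heavily.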

I want to talk about the ‘Being Wrong’ part of research. No method is 100% accurate, unfortunately. If anyone says ‘My method is perfect!’ then worry about the method. Be concerned about the data they used. I can write a ‘perfect’ method if I pick my data set carefully and keep out anything that might confuse the issue, or if I just report the results wrong. “It’s perfect, nothing to see here, move along. This is the method you’re looking for, ‘cause, y’know, perfect”.

For this example, I’m going to assume the method is about finding malware. If the method tags a piece of software as malware when it isn’t malware, then that’s a false positive. If it misses a piece of malware and says “That isn’t malware!” then it’s a false negative. Both results are important. The false negative rate tells you how good the method is at finding malware: the lower the rate, the less malware it misses. The false positive rate tells you how good it is at leaving alone the things that aren’t malware: the lower the rate, the fewer clean files it wrongly tags.
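As a rough sketch of how those two rates fall out of a labeled data set, assuming we know the truth for every sample (the function name and the toy data are mine, purely for illustration):

```python
def error_rates(is_malware, flagged):
    """Return (false_positive_rate, false_negative_rate) for paired booleans."""
    pairs = list(zip(is_malware, flagged))
    # Clean files the method tagged as malware.
    false_positives = sum(1 for truth, flag in pairs if flag and not truth)
    # Malware the method let through.
    false_negatives = sum(1 for truth, flag in pairs if truth and not flag)
    clean_total = sum(1 for truth, _ in pairs if not truth)
    malware_total = sum(1 for truth, _ in pairs if truth)
    return false_positives / clean_total, false_negatives / malware_total


# Toy data: 4 pieces of malware followed by 6 clean files.
is_malware = [True, True, True, True, False, False, False, False, False, False]
flagged = [True, True, True, False, True, False, False, False, False, False]

fpr, fnr = error_rates(is_malware, flagged)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Note that each rate has its own denominator: false positives are divided by the number of clean files, false negatives by the number of malware samples.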

I could create a method to find malware that essentially runs as ‘Is it malware? YES!’. Just say everything is malware and I’m perfect! I find all the malware, so my false negative rate is 0%. Awesome. I’m done.

Except… this is a bad method. Everything that isn’t malware is now tagged as malware. Oops.
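Here is that ‘YES!’ method as a small sketch, run against a made-up set of 2 malware samples and 8 clean ones (all names and counts are hypothetical):

```python
def always_malware(sample):
    """The 'Is it malware? YES!' detector: ignores its input entirely."""
    return True


# Ground truth for a toy data set: 2 malware samples, 8 clean files.
truth = [True] * 2 + [False] * 8
flagged = [always_malware(None) for _ in truth]

missed = sum(1 for t, f in zip(truth, flagged) if t and not f)
false_alarms = sum(1 for t, f in zip(truth, flagged) if f and not t)

false_negative_rate = missed / truth.count(True)
false_positive_rate = false_alarms / truth.count(False)

print(false_negative_rate)  # 0.0, nothing was missed
print(false_positive_rate)  # 1.0, every clean file was tagged
```

Reporting only the first number makes the method look perfect. The second number is what shows it is useless.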

The false positive rates are important. Telling me how often I’m going to tag something as malware that isn’t tells me how much of my work is going to be wasted. False negative rates are important too. Telling me how often I’m going to miss a piece of malware tells me the probability of getting malware on my system when I don’t want it.
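To make that concrete, here is a back-of-the-envelope sketch that turns the two rates into expected counts. The file counts and rates are invented for the example, and the function name is mine.

```python
def expected_mistakes(clean_count, malware_count,
                      false_positive_rate, false_negative_rate):
    """Expected number of wasted investigations and of missed malware."""
    # Clean files that get tagged and that somebody has to look at for nothing.
    wasted_investigations = clean_count * false_positive_rate
    # Malware that slips through untagged.
    missed_malware = malware_count * false_negative_rate
    return wasted_investigations, missed_malware


# Assumed scenario: 10,000 files, 50 of them malware,
# a 2% false positive rate and a 10% false negative rate.
wasted, missed = expected_mistakes(9950, 50, 0.02, 0.10)
print(f"wasted investigations: {wasted:.0f}, missed malware: {missed:.0f}")
```

In this made-up scenario I would chase about 199 false alarms to catch 45 real pieces of malware, and 5 would still get through.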

Report your rates. These are important for the usability of a method. Report the data you used to compute the rates, too. Otherwise I just have to guess how you got your results.
