Transparency in an Opaque World

One of the principles DTRAP was founded on is scientific rigor.  Transparency in data and transparency in method are two principles that are important to us.  They are realized very differently in different studies, though: what makes sense in an observational study is not the same as what is used in a simulation.  Keep that in mind as you do your research.

Both are core to repeatability and reproducibility, and if we want our research to be used, we need to make sure it can be repeated and reproduced.

Transparency in method means “Tell me every step you used to create this algorithm.”  I’ve seen too many papers where the middle of the algorithm was left out.  You know, the important steps that make the algorithm work.  Don’t skip those steps; omitting them makes the paper fail repeatability and reproducibility tests.

As tempting as it is to say that the wizard in the corner solved everything, don’t.  We want to know your method.

A reader may want to repeat your results, but they also want to understand how you achieved them.  If there is a black box in the middle of the method, that comprehension is lost.  I can’t understand how you analyzed the malware if the wizard in the corner is your middle step.

It’s not just the algorithm, though.  The environment the algorithm runs in can matter as well.  For example:

  • Using a proxy for your network connection but failing to specify it
  • Using a loopback for the networking code on LAN and assuming that specifying the loopback is enough
  • Using a virtual cluster for your code rather than the real thing but not stating it
  • Failing to specify the exact library used for a specific function.  Some libraries have similarly named functions, so the exact library matters
  • Using a non-default value for a parameter with a default but not specifying the parameter

All of these can change the outcome of the experiment; it is important to detail the environment completely.
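To make the last two pitfalls concrete, here is a minimal Python sketch using only the standard library (the data values are made up for illustration).  Two similarly named functions in the same module give different answers, and a non-default parameter value silently changes a result — exactly the details a reader cannot recover if the paper omits them:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Similarly named functions, different answers:
# stdev() is the *sample* standard deviation (divides by n - 1);
# pstdev() is the *population* standard deviation (divides by n).
sample_sd = statistics.stdev(data)       # ≈ 2.138
population_sd = statistics.pstdev(data)  # = 2.0
print(f"stdev  = {sample_sd:.3f}")
print(f"pstdev = {population_sd:.3f}")

# A non-default parameter value changes the outcome:
# quantiles() defaults to method='exclusive'.
q_default = statistics.quantiles(data, n=4)
q_inclusive = statistics.quantiles(data, n=4, method='inclusive')
print("exclusive:", q_default)
print("inclusive:", q_inclusive)
```

A paper that reports only “we computed the standard deviation” or “we computed the quartiles” leaves the reader guessing which of these it was.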

Transparency in data means “We want to know what data you used”.  It’s possible that you collected the data, in which case we want to know how you collected it.  Was it a honeypot or a honeynet?  Was it pulled out of a hat originally inhabited by a rabbit?  (Don’t use that data, my magic hat broke.)  This can be key to replicating and understanding your work.  If I want to reproduce it, I need to know what kind of data I need.  Do I need purple malware?  Green?  Puce?  (Yes, I made that up.)  Or is it supposed to be malware in general?  Android malware or iPhone?  Maybe I need passive DNS, or perhaps active DNS.

On the other hand, you may have acquired private or proprietary data.  The question then is “How do we maintain transparency?”.  Academic papers aren’t marketing.  This shouldn’t be a version of “I bought this data, look how great it is!”, but at the same time, cybersecurity data is often behind a paywall or constrained by privacy issues.  Sometimes, for a researcher, the most appropriate data available for their research is the data set they purchase.

At DTRAP, we want our readers to be able to increase their knowledge in Digital Threats by reading our papers. It is very hard to learn from results that are based on opaque data or methods.  Transparency is important not just because it fosters better research but because it helps in the spread of knowledge in the field.

If you decide to use proprietary data in your research, explain the trade-off and why you think it is worth it.  Your data shouldn’t be a black box with no description.  Describe the data in a way that not only fosters reproducibility, repeatability, and corroboration, but also lets the reader understand your results in the context of your data.

For more information, read ACM’s policies on repeatability and reproducibility.  This post is specific to DTRAP; it isn’t a general post for all journals.
