I’ve occasionally done research on domain names. The goal of a lot of that research is to find a method that will determine whether a domain is malicious. It’s a good goal: we’d like to be able to look at domains and say ‘bad, good, bad, good, good, good’ and perhaps block users or at least label traffic.
If we could stop our users from going to the malicious domains we could probably stop a lot of problems.
I’m going to sound a little philosophical here. If we can label something as ‘bad’, we must be able to compare it to something that isn’t bad. In other words, bad can’t exist without good.
Which leads me to my question: How do I create a set of good domains? And what do I mean by ‘good’? Do I mean domains that have never hosted malware? Or domains that have never been used maliciously?
I need to be able to define what I mean by ‘good’ before I can define that list of domains. Once I decide that, I have another problem.
Any domain can be hijacked and misused. A good security team will do their best to stop that, but if defense is your only play, the other side only has to get it right once. Given a set of domains, I can’t say for sure that none of them have been used for malicious ends.
Which takes me back to my original question: How do you create a set of good domains?
This is an important step in domain research that should be tackled. Creating a set of ground-truth domains allows us to say ‘these aren’t those’; in other words, it lets us find the malicious domains.
Of course there is always the “Alexa top N” or “sites that we have not observed trying to drop malware” (I’m recalling Kathy Wang’s ‘war driver’ (wrong name) of many years ago, which drove web browsers to access sites and watched for malware drops and registry key changes)… but the problem is even more subtle. Bad to whom? One person’s advertising server can be both good (to the advertiser) and bad (to the person being served the ads).
The philosophical problem deepens to include perspective and intent.
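To make the “Alexa top N” and “not observed dropping malware” ideas concrete, here is a minimal sketch in Python. The names `Label`, `label_domain`, `popular`, and `observed_bad` are hypothetical, and the sample domains are placeholders rather than real data. The point it tries to show is that two lists give three outcomes, not two: a domain on neither list is unknown, not good, and a popular domain that has been observed misbehaving is still bad.

```python
from enum import Enum


class Label(Enum):
    BAD = "bad"
    PRESUMED_GOOD = "presumed good"  # popular, and not (yet) observed misbehaving
    UNKNOWN = "unknown"              # on neither list: no evidence either way


def label_domain(domain, popular, observed_bad):
    """Label a domain using a popularity list and a list of observed-bad domains.

    Both lists are assumed to hold lowercase names without a trailing dot.
    """
    name = domain.strip().rstrip(".").lower()
    # An observation of bad behavior outranks popularity: a top-N site
    # can be hijacked like any other.
    if name in observed_bad:
        return Label.BAD
    if name in popular:
        return Label.PRESUMED_GOOD
    return Label.UNKNOWN


# Hypothetical sample data for illustration only.
popular = {"example.com", "example.org"}
observed_bad = {"malware.example", "example.org"}

for d in ["Example.COM.", "example.org", "unseen.example"]:
    print(d, "->", label_domain(d, popular, observed_bad).value)
```

Note that the “good” label here is only ever “presumed good”, which is the limitation the post is describing: the sketch cannot prove a negative.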
Is characterizing domain names a classification problem or a clustering problem? To label a given domain name as good, one could factor in the following:
(a) The web site is mathematically provable, via formal software verification, to be unhackable, i.e. the attack surface of the web site and its associated DNS is 0 or tends to 0. There can be a situation where a given domain name exists (is registered) but does not host a web site. There is no proof that the Alexa top N sites are not vulnerable or not hosting malware; appearing in the Alexa top N is therefore not proof of being good, although such domains are socially accepted and trusted. (Technical proof)
(b) The strings that form the domain name do not typosquat or cybersquat, and the name was not registered for domain tasting purposes. (Law compliance proof)
(c) Linguistically acceptable/trustable, e.g. questionable pronunciation could cause a human not to trust a domain name/web site. (Linguistic proof)
(d) Socially acceptable/trustable, e.g. a domain name meeting conditions a, b, and c may still not be socially acceptable because it operates on the dark web, because of the sensational nature of the name, or because of the business operating behind the web site. (Social proof)
(e) The domain name is untainted by crowdsourced data or metadata. When a given domain name shows up in crowdsourced reputation databases such as VirusTotal, it may not be readily trusted, even though the domain name/web site meets conditions a, b, c, and d.
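Of the conditions above, the typosquatting part of (b) is the easiest to approximate mechanically. A rough sketch, assuming that “typosquat” can be reduced to “within a small edit distance of a known name”: the functions `edit_distance` and `possible_typosquat`, the `known_brands` list, and the threshold of 1 are all illustrative assumptions, not an established test. Real typosquats also use homoglyphs, extra labels, and different TLDs, none of which this catches.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def possible_typosquat(domain, known_brands, max_distance=1):
    """Return the brand this domain appears to imitate, or None.

    Simplification: only the leftmost label is compared, so the input is
    assumed to be a registered domain such as "examp1e.com", not a full
    hostname with subdomains.
    """
    label = domain.lower().split(".")[0]
    for brand in known_brands:
        distance = edit_distance(label, brand)
        # Distance 0 is the brand itself, which is not a squat.
        if 0 < distance <= max_distance:
            return brand
    return None


known_brands = ["example"]  # hypothetical list of names worth protecting
print(possible_typosquat("examp1e.com", known_brands))
print(possible_typosquat("example.com", known_brands))
```

A check like this only produces candidates. Whether a near-miss name is actually squatting is a question of intent, which is the same problem of perspective raised earlier.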
One theoretical approach to formulating an absolutely good domain name could then be f(a,b,c,d)=1.
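One way to read f(...)=1 is as a conjunction: f returns 1 only when every condition holds, and 0 otherwise. A minimal sketch under that reading, assuming each condition can be reduced to a true/false answer (a strong assumption, since several of them are matters of judgment). The function name `f` follows the formula above; the dictionary of condition results is hypothetical.

```python
def f(conditions):
    """Return 1 if every listed condition holds, else 0.

    `conditions` maps a condition name (e.g. "a") to a boolean verdict.
    How each verdict is reached is outside the scope of this sketch.
    """
    if not conditions:
        # With no evidence at all, refuse to call anything good.
        raise ValueError("at least one condition is required")
    return 1 if all(conditions.values()) else 0


# Hypothetical verdicts for one domain.
verdicts = {"a": True, "b": True, "c": True, "d": False}
print(f(verdicts))  # one failed condition is enough to give 0
```

Under this reading a single failed or unprovable condition gives 0, which suggests that very few real domains, if any, would ever score 1.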