We use the word “prevalence” a lot at F-Secure Labs. And what’s prevalence?
The prevalence of an executable file is defined as the number of times it’s been seen across our entire customer base. Malicious executables tend to be rare over time: most live and die quickly, and thus the number of times we’ve seen a binary can give us an indication as to its suspiciousness. Since our protection technologies connect to the cloud and use it as part of how they work, prevalence numbers are easy to acquire.
In order to understand why malicious executables are rare, we need to take a trip into the distant past. Many of the very first viruses were actually not malicious at all. They were created by hackers purely to show off their leet skillz. If you’re interested in seeing some of them in action, check out the Malware Museum that Mikko helped set up on archive.org.
As time passed, more and more viruses were created with the sole purpose of performing malicious actions. As the number of malicious files in the world grew large enough to pose a problem to the general public, the AV industry was born (and, eventually, the term “malware”, a portmanteau of the words malicious and software, was coined).
Back then, only a handful of new malware families or variants were created every week, and because of this, detection approaches were really simple. Match a few bytes here or there, or search for a specific string, and you were good to go. Since the malware existed as a single binary, your detection would catch it every time.
Malware authors soon noticed that the hard work they’d put into their shiny new creations was being nullified within hours of the sample being discovered and shared. They had to come up with a new bag of tricks to evade these simple AV detection methods. And so they did.
One of the most successful tricks employed, starting in the early ’90s, was the use of polymorphic code. By wrapping their code in a layer of encryption, malware authors were able to generate new copies of their payload that were functionally identical, but structurally different enough that they evaded simple detection methods. Each new copy of the malware was created using a slightly different key, and the creation of the binaries themselves was automated. Every new malicious sample was generated on-demand from a server-side distribution point. Since every new sample found in the wild was unique, the file-based detection methods of the late ’80s and early ’90s fell short.
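To make the effect concrete, here is a toy sketch in Python of why a simple byte-match detection misses re-keyed copies. Everything in it is an assumption for illustration: the "payload" is a harmless placeholder string, single-byte XOR stands in for the encryption layer, and the names `xor_encode` and `signature_matches` are made up. Real polymorphic samples also carry their own decryption routine, which is left out here.

```python
import hashlib


def xor_encode(payload: bytes, key: int) -> bytes:
    """Encode every byte with a single-byte XOR key (toy stand-in for encryption)."""
    return bytes(b ^ key for b in payload)


def signature_matches(data: bytes, signature: bytes) -> bool:
    """Naive signature check: does the byte pattern appear anywhere in the data?"""
    return signature in data


# Harmless placeholder "payload" and a byte signature lifted from it.
payload = b"HELLO-THIS-IS-A-HARMLESS-DEMO-PAYLOAD"
signature = b"HARMLESS-DEMO"

# Three copies of the same payload, each wrapped with a different key.
copies = {key: xor_encode(payload, key) for key in (0x21, 0x5A, 0xC3)}

for key, blob in copies.items():
    digest = hashlib.sha256(blob).hexdigest()[:16]
    hit = signature_matches(blob, signature)
    print(f"key=0x{key:02X} sha256={digest}... signature_hit={hit}")
    # XOR is its own inverse: decoding with the same key recovers the payload.
    assert xor_encode(blob, key) == payload
```

The signature matches the original payload, but none of the three encoded copies, and each copy has a different hash even though all of them decode to the same bytes.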
The practice of flooding the world with multiple functionally similar, but structurally different binaries still exists to this day. Malware authors still need to protect their binaries against the simplest of detection approaches (such as signature-based approaches). This constant moving target for AV companies is part of the cat-and-mouse game that attackers and defenders continue to play.
So why does prevalence work so well? It all comes down to clean files. The range of legitimate software out there and in use by our customer base is fairly well known. And we constantly scour the Internet for new, legitimate clean files that we can add to our list of trusted binaries. On top of that, many legitimate binaries are signed by trusted providers, making them relatively easy to whitelist. This “known-clean” file set changes rather infrequently. In contrast, we see a deluge of new malicious binaries every day (tens of thousands). Any binary not falling into either of our known-clean or known-malicious sets is simply considered unknown. And of those, the rarer the file, the more likely it is to be malicious. Endpoint behavioral analysis very often confirms that.
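The logic described above can be sketched roughly as follows. This is only an illustration, not how our backend is implemented: the `classify` function, the `Verdict` names, the short placeholder hashes and the `RARE_THRESHOLD` cut-off are all hypothetical, and a real system would combine the prevalence count with behavioral analysis rather than rely on it alone.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    CLEAN = "known clean"
    MALICIOUS = "known malicious"
    UNKNOWN_RARE = "unknown, rare (more suspicious)"
    UNKNOWN_COMMON = "unknown, common"


# Hypothetical cut-off: fewer sightings than this counts as "rare".
RARE_THRESHOLD = 5


def classify(file_hash, known_clean, known_malicious, sightings,
             rare_threshold=RARE_THRESHOLD):
    """Classify a file hash using the known sets first, then its prevalence."""
    if file_hash in known_clean:
        return Verdict.CLEAN
    if file_hash in known_malicious:
        return Verdict.MALICIOUS
    # Counter returns 0 for hashes that have never been seen at all.
    if sightings[file_hash] < rare_threshold:
        return Verdict.UNKNOWN_RARE
    return Verdict.UNKNOWN_COMMON


# Made-up sighting log: one entry per time a hash was seen in the customer base.
sightings = Counter(["aaa"] * 1000 + ["bbb"] * 2 + ["ccc"] * 40 + ["ddd"])
known_clean = {"aaa"}
known_malicious = {"ddd"}

for h in ("aaa", "bbb", "ccc", "ddd", "zzz"):
    verdict = classify(h, known_clean, known_malicious, sightings)
    print(f"{h}: seen {sightings[h]} time(s) -> {verdict.value}")
```

Checking the known sets before looking at prevalence mirrors the order in the text: only files that fall into neither set are judged by how rarely they have been seen.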
As an example, according to statistics generated from F-Secure’s internal systems monitoring known threats, in a random sample of malicious programs found in the first four months of 2013, 99.7% of real threats were rarely seen in our user base.
As you can probably imagine, prevalence is just as effective at stopping the sort of unique binaries employed in sophisticated targeted attacks. By utilizing a combination of prevalence and behavioral analysis, we typically do pretty well against real-world threats.