We’ve recently received quite a few questions regarding the use of machine learning techniques in cyber security. I figured it was time for a blog post. Interestingly, while I was writing this post, we got asked even more questions, so the timing couldn’t be better.
It seems that there are quite a few companies out there making noise about using machine learning techniques in their security products like it’s a new thing. It’s not. We’ve been using machine learning techniques since 2005, and nowadays you’ll find machine learning being used almost everywhere.
Machine learning techniques were first used by the security industry to train anti-spam engines. That fact prompted us to experiment with machine learning in an attempt to identify malicious files. In late 2005, we developed an engine designed to rate the suspiciousness of files based on both structural and behavioral characteristics. This engine was originally designed to suppress false positives generated by our new behavioral blocking technology, but since then has cemented itself as a solid piece of detection technology. Both of these components were introduced into our product line in 2006.
As I mentioned, we’re using machine learning all over the place. Here are a few examples of what we’re doing with it.
Sample analysis and categorization – We’re using expert systems and machine learning to automatically categorize the 500,000 new samples we receive each day. These systems generate a lot of high-quality metadata that is transformed into actionable threat intelligence.
URL reputation and categorization – We feed content from URLs into a machine learning system in order to categorize sites both for maliciousness and type of content (such as adult content, shopping, bank, et cetera).
Client-side detection logic – We use machine learning to train client-side components to identify suspicious files based on file structure and behavioral characteristics. We refer to these components as heuristic engines. On August 25th, Sven Krasser at CrowdStrike published an informative and detailed blog post on how these techniques work that I recommend reading if you’d like to know more.
Breach detection – This is something I haven’t covered much yet, but plan to in the future. We use machine learning techniques to identify suspicious behavior on networks. These signals are sent to security experts working in our Rapid Detection Center, who investigate the incident and alert the customer if the information is valid. Naturally, the same techniques that uncover signs of breaches can also alert us to malicious insider activity.
Machine learning can be quite prone to false positives. This is why we prefer a hybrid approach that combines human and machine. Combining machine learning with expert-developed rules and extensive automation allows us to reduce false positives and make much more accurate determinations of threats and suspicious behavior. For instance, in our sample categorization systems, machine learning techniques do a good job of clustering incoming samples. However, when the system encounters samples it has never seen before, we still rely on human analysts to identify, label, and categorize the resulting clusters.
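To make the hybrid idea concrete, here is a minimal sketch in Python. It is not our production logic: the feature vectors, the cluster names, and the `REVIEW_DISTANCE` cut-off are all invented for illustration, and simple nearest-centroid matching stands in for a real clustering system. The point is only the routing: samples close to a cluster an analyst has already labeled get that label automatically, and everything else goes to a human.

```python
from math import dist

# Hypothetical feature vectors for clusters that analysts have already labeled.
LABELED_CENTROIDS = {
    "ransomware": (0.9, 0.8),
    "adware": (0.2, 0.7),
    "clean": (0.1, 0.1),
}

# Assumed cut-off: a sample farther than this from every known cluster
# is treated as something the system has not seen before.
REVIEW_DISTANCE = 0.25


def triage(features):
    """Return (label, needs_human) for one sample's feature vector."""
    # Find the nearest already-labeled cluster.
    label, centroid = min(
        LABELED_CENTROIDS.items(), key=lambda item: dist(features, item[1])
    )
    # Too far from anything known: hand it to an analyst instead of guessing.
    if dist(features, centroid) > REVIEW_DISTANCE:
        return ("unknown", True)
    return (label, False)


if __name__ == "__main__":
    for sample in [(0.88, 0.79), (0.5, 0.45)]:
        print(sample, triage(sample))
```

In this toy setup the first sample lands next to the "ransomware" cluster and is labeled automatically, while the second sits between clusters and is queued for review. Tightening or loosening the cut-off trades analyst workload against the risk of a wrong automatic label.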
We’ve found machine learning to be extremely useful. However, it’s not a substitute for real human expertise just yet. As one colleague of mine put it, if you treat machine learning as a silver bullet, you’ll very quickly find that bullet in your foot. And that’s our advice to everyone out there: it’s critical that you don’t rely solely on machine learning to protect your systems, and especially not on solutions that can only identify file-based threats.
There are a couple of reasons why you shouldn’t do that. Firstly, you won’t be protected against scams, phishing, and social engineering. For that, you need a URL blocking component. If you don’t have one, you can still easily end up on a site designed to steal your credentials, identity, or banking information. A solution designed to identify malicious files won’t be enough to keep you properly protected on the Internet.
Secondly, you definitely want protection against exploits. Exploits are the choke-point in the kill chain. There are hundreds of thousands of compromised or malicious sites out there, and hundreds of thousands of unique malicious files. However, there aren’t all that many unique exploits. Blocking all known exploits is much easier than ensuring every bad site out there and every single payload is handled. Here at F-Secure, we frequently gather the threat intelligence needed to find these exploits from in-house automation that relies on machine learning. However, the rules are still hand-written by our experts. This is one example of a client-side protection technology that simply doesn’t lend itself all that well to machine learning.
Finally, here are some questions @kevtownsend asked us, and my answers.
Absolutely not! Attackers, be they malware writers or actors looking to breach corporate networks, are humans. They think creatively and design attacks that can easily bypass purely automated solutions. Because of this, defenders need to be able to think creatively, too. Until artificial intelligence is capable of human-level creativity, humans will continue to be crucial in the field.
Behavioral engines are difficult to integrate into VirusTotal’s system. Every sample run through their system would need to be executed in an environment containing each vendor’s protection solution. Practically speaking, this means bringing up a virtual machine, installing or updating a vendor’s product, injecting the sample into the VM, executing it, extracting the product’s verdict, and then destroying the VM. This all has to happen under special network conditions to ensure malware is not spread further.
This whole process is not only extremely resource-intensive, it’s hell to maintain, especially when you consider that VT’s systems already contain over 50 products. Even if VT had the infrastructure available to do this for 500,000 samples times 50 vendors per day, they’d still need to hire a fleet of people to maintain the environment and keep the products up to date.
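For a rough sense of scale, the arithmetic can be worked through in a few lines of Python. The sample and vendor counts come from the paragraph above; the five minutes per VM cycle is purely an assumption picked for illustration, not a measured figure.

```python
SAMPLES_PER_DAY = 500_000  # figure used in the text
VENDOR_PRODUCTS = 50       # "over 50 products", taken as exactly 50
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption for illustration only: one full cycle (boot the VM, update the
# product, run the sample, collect the verdict, destroy the VM) takes 5 minutes.
ASSUMED_SECONDS_PER_RUN = 5 * 60

# One VM cycle per sample per product.
runs_per_day = SAMPLES_PER_DAY * VENDOR_PRODUCTS
runs_per_second = runs_per_day / SECONDS_PER_DAY

# VMs that must be alive at any moment to keep up with the incoming rate.
concurrent_vms = runs_per_second * ASSUMED_SECONDS_PER_RUN

if __name__ == "__main__":
    print(f"{runs_per_day:,} VM runs per day")
    print(f"{runs_per_second:.0f} VM runs started per second")
    print(f"{concurrent_vms:,.0f} VMs running concurrently")
```

That is 25 million VM runs per day, or roughly 289 started every second. Under the assumed five-minute cycle, tens of thousands of virtual machines would need to be running at any given moment, before counting the people needed to keep 50 products installed and updated.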
This is an apples and oranges comparison. Machine learning techniques are used to “train” client-side detection logic. The actual machine learning process is run on heavy back-end infrastructure, since it requires large volumes of samples and a significant amount of processing power. The logic bundle, once generated, is delivered to the client via product updates. Although some vendors don’t specifically talk about rules, signatures, or databases, you can be sure their products do contain them, one way or another. If a database is bundled into the binary itself, it’s still a database. Machine learning can be used to train logic designed to detect suspiciousness based on the structure of a file, its behavior, or both.
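The split between back-end training and client-side logic can be sketched in Python as follows. Everything here is hypothetical: the three features, the training data, and the function names are invented, and a perceptron is used only because it fits in a few lines, not because it is what any vendor actually ships. What the sketch shows is that the output of training is just data, which the client loads and evaluates.

```python
import json

# Hypothetical training data: (features, label), where label 1 = suspicious.
# Invented features: [is_packed, writes_to_autorun, has_valid_signature].
TRAINING_SET = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 0], 1),
    ([0, 0, 1], 0),
    ([1, 0, 1], 0),
    ([0, 0, 0], 0),
]


def train_backend(samples, epochs=20):
    """Back end: fit perceptron weights and serialize them as a logic bundle."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = bias + sum(w * x for w, x in zip(weights, features))
            # error is +1, 0, or -1; weights only move on a misclassification.
            error = label - (1 if activation > 0 else 0)
            bias += error
            weights = [w + error * x for w, x in zip(weights, features)]
    # The "logic bundle" is plain data that can ship in a product update.
    return json.dumps({"weights": weights, "bias": bias})


def client_is_suspicious(bundle, features):
    """Client: no training happens here, only evaluation of the shipped bundle."""
    logic = json.loads(bundle)
    score = logic["bias"] + sum(w * x for w, x in zip(logic["weights"], features))
    return score > 0


if __name__ == "__main__":
    bundle = train_backend(TRAINING_SET)
    print("bundle:", bundle)
    print("packed, unsigned file:", client_is_suspicious(bundle, [1, 0, 0]))
    print("signed file:", client_is_suspicious(bundle, [0, 0, 1]))
```

Whether that JSON string arrives as a separate update file or is compiled into the binary, the client is still consulting a set of learned values, which is the sense in which it remains a database.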
We strongly warn people against reading into the marketing hype out there. Most “AV” vendors have been using machine learning techniques to create rules and logic for years already.