What’s The Deal With Machine Learning?

We’ve recently received quite a few questions regarding the use of machine learning techniques in cyber security. I figured it was time for a blog post. Interestingly, while I was writing this post, we got asked even more questions, so the timing couldn’t be better.

It seems that there are quite a few companies out there making noise about using machine learning techniques in their security products like it’s a new thing. It’s not. We’ve been using machine learning techniques since 2005, and nowadays you’ll find machine learning being used almost everywhere.

Machine learning techniques were first used by the security industry to train anti-spam engines. That fact prompted us to experiment with machine learning in an attempt to identify malicious files. In late 2005, we developed an engine designed to rate the suspiciousness of files based on both structural and behavioral characteristics. This engine was originally designed to suppress false positives generated by our new behavioral blocking technology, but since then has cemented itself as a solid piece of detection technology. Both of these components were introduced into our product line in 2006.

Skynet

I couldn’t resist. (Source: https://theaviationist.com)

As I mentioned, we’re using machine learning all over the place. Here are a few examples of what we’re doing with it.

Sample analysis and categorization – We’re using expert systems and machine learning to automatically categorize the 500,000 new samples we receive each day. These systems generate a lot of high-quality metadata that is transformed into actionable threat intelligence.

URL reputation and categorization – We feed content from URLs into a machine learning system in order to categorize sites both for maliciousness and for type of content (such as adult content, shopping, or banking).

Client-side detection logic – We use machine learning to train client-side components to identify suspicious files based on file structure and behavioral characteristics. We refer to these components as heuristic engines.  On August 25th, Sven Krasser at CrowdStrike published an informative and detailed blog post on how these techniques work that I recommend reading if you’d like to know more.

Breach detection – This is something I haven’t covered much yet, but plan to in the future. We use machine learning techniques to identify suspicious behavior on networks. These signals are sent to security experts working in our Rapid Detection Center, who investigate the incident and alert the customer if the threat turns out to be real. Naturally, the same techniques that uncover signs of breaches can also alert us to malicious insider activity.

Machine learning can be quite prone to false positives. This is why we prefer a hybrid approach that utilizes both humans and machines. Combining machine learning with expert-developed rules and extensive automation allows us to reduce false positives and make much more accurate determinations about threats and suspicious behavior. For instance, in our sample categorization systems, machine learning techniques do a good job of clustering incoming samples. However, for samples unlike anything the system has seen before, we still rely on real humans to identify, label, and categorize those clusters.
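
To make the clustering-plus-humans idea a bit more concrete, here’s a minimal Python sketch using scikit-learn. The feature vectors, cluster count, and workflow are placeholders for illustration – this is not our actual categorization pipeline.

# Minimal sketch (not our actual pipeline): cluster incoming samples by
# structural feature vectors, then hand unlabeled clusters to analysts.
import numpy as np
from sklearn.cluster import KMeans

def cluster_samples(feature_vectors, n_clusters=10):
    """Group samples so a human only has to label a cluster, not every file."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(feature_vectors)
    return labels, model

# Hypothetical usage: 1,000 samples described by 16 structural features each.
X = np.random.rand(1000, 16)
labels, model = cluster_samples(X)
for cluster_id in np.unique(labels):
    members = np.where(labels == cluster_id)[0]
    print(f"cluster {cluster_id}: {len(members)} samples -> queue one exemplar for analyst review")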

We’ve found machine learning to be extremely useful. However, it’s not a substitute for real human expertise just yet. As one colleague of mine put it, if you treat machine learning as a silver bullet, you’ll very quickly find that bullet in your foot. And that’s our advice to everyone out there – it’s critical that you don’t rely solely on machine learning to protect your systems, and especially not on solutions that can only identify file-based threats.

And there are a couple of reasons why you shouldn’t do that. Firstly, you won’t be protected against scams, phishing, and social engineering. For that, you need a URL blocking component. If you don’t have one, you can still easily end up on a site designed to steal your credentials, identity, or banking information. A solution designed only to identify malicious files won’t be enough to keep you properly protected on the Internet.

Secondly, you definitely want protection against exploits. Exploits are the choke-point in the kill chain. There are hundreds of thousands of compromised or malicious sites out there, and hundreds of thousands of unique malicious files. However, there aren’t all that many unique exploits. Blocking all known exploits is much easier than ensuring every bad site out there and every single payload is handled. Here at F-Secure, we frequently gather the threat intelligence needed to find these exploits from in-house automation that relies on machine learning. However, the rules are still hand-written by our experts. This is one example of a client-side protection technology that simply doesn’t lend itself all that well to machine learning.

Finally, here are some questions @kevtownsend asked us, and my answers.

Will machine learning make jobs in the cyber security industry obsolete?

Absolutely not! Attackers, be they malware writers or actors looking to breach corporate networks, are humans. They think creatively and design attacks that can easily bypass purely automated solutions. Because of this, defenders need to be able to think creatively, too. Until artificial intelligence is capable of human-level creativity, humans will continue to be crucial in the field.

If machine-learning engines can be integrated into VirusTotal, why can’t behavioral analysis engines be integrated?

Behavioral engines are difficult to integrate into VirusTotal’s system. Every sample run through their system would need to be executed in an environment containing each vendor’s protection solution. Practically speaking, this means bringing up a virtual machine, installing or updating a vendor’s product, injecting the sample into the VM, executing it, extracting the product’s verdict, and then destroying the VM. This all has to happen under special network conditions to ensure malware doesn’t spread further.

This whole process is not only extremely resource-intensive, it’s hell to maintain, especially when you consider that VT’s systems already contain over 50 products. Even if VT had the infrastructure available to do this for 500,000 samples times 50 vendors per day, they’d still need to hire a fleet of people to maintain the environment and keep the products up to date.

Is there an intrinsic difference between machine learning detection engines and behavioral detection engines?

This is an apples and oranges comparison. Machine learning techniques are used to “train” client-side detection logic. The actual machine learning process is run on heavy back end infrastructure, since it requires large volumes of samples and a significant amount of processing power. The logic bundle, once generated, is delivered to the client via product updates. Although some vendors don’t specifically talk about rules, signatures, or databases, you can be sure their products do contain them, one way or another. If a database is bundled into the binary itself, it’s still a database. Machine learning can be used to train logic designed to detect suspiciousness based on the structure of a file or its behavior, or both.

We strongly warn people against reading into the marketing hype out there. Most “AV” vendors have been using machine learning techniques to create rules and logic for years already.



Coming Soon: iOS 10

I’ve been testing iOS 10 Beta for several weeks (on a secondary iPad mini 2 of mine) and so far, so good. I’m enjoying Swift Playgrounds and looking forward to the final release.

Most of the changes I’ve noticed have been surface (i.e., UI) changes. But today I read an interesting blog post by @nabla_c0d3 regarding iOS 10 security and privacy – under-the-hood stuff that sounds very promising.

Full post here: Security and Privacy Changes in iOS 10


If you don’t already use “Limit Ad Tracking”, you’ll find the option under Settings > Privacy > Advertising > Limit Ad Tracking.

Enabling the option in iOS 10 will cause apps to see your Advertiser ID as all 0s, putting a limit on third-party tracking.

Apps on iOS have long been designed to ask for various permissions as needed, rather than all up front (à la Android), but with iOS 10, Apple will enforce the use of “purpose strings”, which are used to communicate the reason why a permission is needed.



Got Ransomware? Negotiate

ICYMI: we recently published a customer service study of various crypto-ransomware families. Communication being a crucial element of ransomware schemes, we decided to put it to a comparative test.

The biggest takeaway? If you find yourself compromised – negotiate.

Our Findings – In A Nutshell

You have little to lose: the majority of extortionists appear to be willing to work with their “customers”.

Our report (download) also contains a fascinating email conversation as an appendix…



NanHaiShu: RATing the South China Sea

Since last year, we have been following a Remote Access Trojan that we refer to as NanHaiShu. The threat actors behind this malware target government and private-sector organizations that were directly or indirectly involved in the international territorial dispute centering on the South China Sea. Hence the name nán hǎi shǔ (南海鼠), which means “South China Sea rat”.

Based on our observations, the timings of the attacks indicated political motivation, as they occurred either within a month following notable news reports related to the dispute, or within a month leading up to publicly-known political events featuring the said issue.

Timeline of events

The white paper is the culmination of our research into the motivation behind NanHaiShu. To learn more about our analysis and other interesting details, please read the white paper here.

NanHaiShu white paper cover



Bye Bye Flash! Part 2.5. Microsoft Edge Is Going “Click To Flash”

After last Thursday’s article on how Firefox will start reducing support for Flash, I received some comments pointing me to an announcement from Microsoft, back in April, where they stated that their Edge browser would also move towards a “Click to Flash” approach. The announcement notes that Flash plugins not central to the web page will be intelligently paused, and that content such as games and video will continue to run normally. This change to Edge will be delivered in the anniversary update of Windows 10.

I’d like to point out that we did notice this news back in April, and kudos to Microsoft, and the Edge team, for making this happen.

Microsoft Edge Logo (source: microsoft.com)

Why didn’t we talk about this at the time? Well, Edge only works on newer Windows versions. It seems that Microsoft won’t make their 1 billion target for Windows 10 installs, and at current count, Windows 7 still has about 50% market share. So, we’re still waiting for that all-important announcement about Flash and Microsoft Internet Explorer.



Bye Bye Flash! Part 2 – Firefox Plans To “Reduce” Support For Flash

Earlier this year, in our 2015 Threat Report, our own Sean Sullivan predicted that Chrome, Firefox, and Microsoft would announce an iterative shift away from supporting Flash in the browser by 2017. Last month, we covered the announcement made by Google.

As predicted, just yesterday, the Firefox developers made a similar announcement on their blog.

Mozilla Firefox logo. Source: https://www.mozilla.org/

Firefox will begin dropping Flash support by blocking specific SWF files via a blocklist. The list will initially contain just plugins designed for “fingerprinting”. As stated by the Firefox developers, the criteria for adding content to the blocklist are:

  • Blocking the content will not be noticeable to the Firefox user.
  • It is possible to reimplement the basic functionality of the content in HTML without Flash.

The blocklist will be expanded to cover more types of content throughout this year, and by the beginning of next year, Firefox will require click-to-activate approval from users before a website activates the Flash plugin for any content. The next major Firefox ESR (Extended Support Release), scheduled for March 2017, will, unfortunately, continue to support plugins such as Silverlight and Java until early 2018.

The guys at Mozilla state that these changes will improve browsing stability, battery life, and performance. For us, the great news is that these changes will improve browsing safety, by greatly reducing the attack surface exploit kits have to work with.

And with that announcement, it’s two down, one to go.



Malware History: Code Red

Fifteen years (5479 days) ago… Code Red hit its peak. An infamous computer worm, Code Red exploited a vulnerability in Microsoft Internet Information Server (IIS) to propagate.

Infected servers displayed the following message.

Welcome to http://www.worm.com !

Description: Worm:W32/CodeRed

See @mikko‘s Tweet below for a visualization.



A New High For Locky

After a drop during the first weeks of June, the spam campaigns distributing Locky crypto-ransomware have returned as aggressive as ever. Normally, we see around 4,000-10,000 spam hits a day during such campaigns.

Last week, from Wednesday to Friday, we observed a notable increase in the amount of spam distributing Locky. At its peak we saw 30,000 hits per hour, pushing the daily total to 120,000 hits.

Yesterday, Tuesday, we saw two new campaigns with a totally different magnitude: more than 120,000 spam hits per hour. In other words, over 200 times more than on normal days, and 4 times more than on last week’s campaigns.

Import stats July 5-14 linegraph

The two campaigns were distributed simultaneously; they initially spiked yesterday afternoon at 2pm (here in Helsinki), and spiked a second time around midnight.

The spam subject in one campaign is seemingly empty, “Fw:”, with a zip file attachment named xls_convert_recipientname_randomnumber.zip. The body of the message indicates that the attachment contains requested invoices in Excel format. With these social engineering techniques, the attacker tries to lure the user into opening the attached file. In reality, the zip archive contains a JScript file that downloads and executes the Locky ransomware.

Screenshot: the “Fw:” spam message (2016-07-12)

The other campaign was sent with the subject “Profile” and contains a similar zip file attachment. The name of the attached file is recipientname_profile_randomnumber.zip.

Screenshot: the “Profile” spam message (2016-07-12)

We block these samples with the following detections:

  • Trojan-Downloader:JS/Kavala.S
  • Trojan-Downloader:JS/Locky.T
  • Trojan:W32/Locky.X!DeepGuard

SHA1s:

0117ad48e414813709940af1514db5944c4da5eb
8aada8b162b47f27e332c4ccc9a9b5e36594d034
01c99e8ca77851295b840e01ae3ff6ae7faa8d46
08788c185f8af2c4bce08af948daeb09c0d340d9
4d1c0884d9f63e9f361b77b5e6cb4e907e901480



Black Hat USA 2016 Briefings

We get a fair amount of requests from journalists and media organizations asking our opinion on a whole range of tech topics. And when Black Hat rolls around, the pace of those requests often picks up considerably. So, I spent some time last week reading through the Black Hat USA 2016 briefings.

That was a lot of reading.

BlackHat Logo

Source: blackhat.com

I won’t be going to Black Hat USA this year, but if I were, here are some of the talks I’d be most interested in seeing.


$hell on Earth: From Browser to System Compromise – Details on the eight winning browser-to-super-user exploit chains from this year’s Pwn2Own contest? What’s not to like?

Account Jumping Post Infection Persistency & Lateral Movement in AWS – With more and more services moving to hosted cloud services such as AWS, it’s important to understand how attackers will approach these targets. This briefing not only talks about how to breach these systems, it goes on to explain how to gain persistence and move laterally within AWS.

Adaptive Kernel Live Patching: An Open Collaborative Effort to Ameliorate Android N-Day Root Exploits – Android systems often don’t get patched against new vulnerabilities, mostly because hardware vendors have a strong incentive to put out new devices and little incentive to maintain those already in circulation. This talk is about a system being designed to live-patch Android kernels, regardless of which vendor manufactured the device.

AMSI: How Windows 10 Plans to Stop Script-Based Attacks and How Well It Does It – Microsoft’s Antimalware Scan Interface is a really interesting piece of technology. It allows third parties to plug into a framework designed to monitor script execution for malicious behavior. It works with PowerShell, VBScript, and JScript. Unfortunately, it’s only available on Windows 10. This talk includes a bunch of live demonstrations of AMSI.

An Insider’s Guide to Cyber-Insurance and Security Guarantees – Cyber-Insurance is a rapidly growing service sector. Getting to know more about how it works could be interesting.

Augmenting Static Analysis Using Pintool: Ablation – This looks like a powerful tool for reverse engineers. It will be made open source during the conference.

AVLeak: Fingerprinting Antivirus Emulators for Advanced Malware Evasion – Anti-emulation tricks are used by a lot of malware. By not functioning correctly in virtual environments, they can evade automated dynamic analysis techniques and create problems for researchers. This talk details a framework that allows executables to upstream data about the environments where they’re running in order to help authors improve their anti-emulation tricks.

Blunting the Phisher’s Spear: A Risk-Based Approach for Defining User Training and Awarding Administrative Privileges – As much as people have tried to fix PEBKAC and train users not to do things that will get them owned, the problem still exists. It’s one of the biggest reasons breaches happen so easily. These guys are detailing yet another approach for training users to be more security-aware.

Call Me: Gathering Threat Intelligence on Telephony Scams to Detect Fraud – These guys set up a telephony honeypot to gather threat intelligence on unwanted and scam phone calls. They used automation to fingerprint these calls and found that a majority of these bad calls came from just a few actors.

Captain Hook: Pirating AVs to Bypass Exploit Mitigations – Protection components that perform behavioral analysis rely on hooking. If you can find vulnerabilities in these hooking engines, you can bypass these mechanisms. This talk details some research done into just this.

Cyber War in Perspective: Analysis from the Crisis in Ukraine – Nation-state cyber war? Always an interesting topic.

Does Dropping USB Drives in Parking Lots and Other Places Really Work? – Remember Mr. Robot? This talk explains how effective dropping USB sticks actually is.

Dungeons Dragons and Security – How to teach people about security using Dungeons and Dragons.

Exploiting Curiosity and Context: How to Make People Click on a Dangerous Link Despite Their Security Awareness – Even more on the social engineering theme. This talk examines some research into how to craft messages that even the most security-savvy people would click on.

I Came to Drop Bombs: Auditing the Compression Algorithm Weapon Cache – A talk about decompression bombs (small compressed files that decompress into massive amounts of data).

Iran’s Soft-War for Internet Dominance – More nation state stuff.

Keystone Engine: Next Generation Assembler Framework – For the reverse engineering community, these guys have created a new assembler. Looks pretty cool, and it’s going to be open sourced at the show.

Next-Generation of Exploit Kit Detection by Building Simulated Obfuscators – These guys are looking at tackling exploit kit obfuscation by looking at the obfuscation techniques themselves. They’ve built an open source obfuscator for use by the community.

Pay No Attention to That Hacker Behind the Curtain: A Look Inside the Black Hat Network – A talk from the guys who run the network infrastructure at Black Hat. This talk probably includes a lot of fun stories.

Secure Penetration Testing Operations: Demonstrated Weaknesses in Learning Material and Tools – New pen testers are being trained with widely available material. Attackers know this, and can actually hijack a penetration test being performed by a new guy.

Subverting Apple Graphics: Practical Approaches to Remotely Gaining Root – A talk about how to exploit Apple’s various graphical subsystems in OS X.

The Linux Kernel Hidden Inside Windows 10 – As of the Windows 10 Anniversary Update, Windows includes a Linux-compatible subsystem (the Windows Subsystem for Linux) in the core of the operating system. This has implications for both security and tooling.

Towards a Holistic Approach in Building Intelligence to Fight Crimeware – These guys are going after crimeware infrastructure in order to identify and stop attacks quicker, and even find the folks behind these campaigns.

Unleash the Infection Monkey: A Modern Alternative to Pen-Tests – Automated pen testing of an organization’s network infrastructure using an Infection Monkey. It’s an open source testing tool that spins up infected virtual machines inside your network perimeter and can even perform non-malicious lateral movement.

Using EMET to Disable EMET – Microsoft’s EMET (Enhanced Mitigation Experience Toolkit) is a utility that helps prevent vulnerabilities in software from being successfully exploited. This is a talk on how to bypass that.

Weaponizing Data Science for Social Engineering: Automated E2E Spear Phishing on Twitter – Spear phishing on Twitter. Performed by a neural network.

When Governments Attack: State Sponsored Malware Attacks Against Activists Lawyers and Journalists – Probably a good place to learn some OPSEC.

When the Cops Come A-Knocking: Handling Technical Assistance Demands from Law Enforcement – A couple of well-versed lawyers will explain what to do when law enforcement turn up asking for technical assistance.


To be honest, there are a lot more talks that I’d like to see, but with nine separate simultaneous tracks going on, I doubt I’d even get to see all of the above. If you’re going to Black Hat this year, have fun!



What’s The Deal With Detection Logic?

Detection logic is used by a variety of different mechanisms in modern endpoint protection software. It is also known by many different names in the cyber security industry. Similar to how the term “virus” is used by laypeople to describe what security people call “malware” (technically, “virus” is the term used to describe a program that spreads by making a copy of itself in another program, data file, or boot sector), detection logic has been called everything from “signatures” to “fingerprints” to “patterns” to “IOCs”. So, when someone talks about a virus, they’re usually actually referring to malware. And when someone talks about signatures, they’re frequently referring to detection logic. Well, unless they’re specifically talking about the simple detections that were used back in the 80s and 90s.

Here at F-Secure, we often refer to detection logic as, simply, “detections”.

In previous installments of this series, I wrote about scanning engines, behavioral engines, and network reputation – and how they work together to block malicious threats. Now, I’d like to explain what the detection logic used by these engines looks like and how it’s created.

This is a slightly longer article, so bear with me. I found that, in order to explain things in enough detail to make any sense, I needed to cover quite a few different areas. There are many touch-points between these different concepts, and hence I opted for one long post instead of a series of separate ones. I’ve passed a draft of this text through quite a few folks, but some people are on their summer vacation, so I’m hoping it’s okay for me to reveal all of this stuff.

There are many different types of detection logic in modern security products. I’ll cover each one in its own section.

Research poster

What was that thing again? Oh yeah, complex mathematics and artificial intelligence.

Cloud-Based Detection Logic

Most modern protection software includes one or more client components that perform queries via the Internet. Detection logic is often shared between client and server. In really simple cases, a unique object identifier (such as its cryptographic hash) is sent to a server, a simple good/bad/unknown verdict is delivered back, and the client factors that result into its final decision.

In more complex cases, the client extracts a set of metadata from an object, sends that to a server, which then applies its own set of rules, and returns a verdict, or a new set of values that the client then processes before reaching a verdict.
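
As a rough illustration of the simple case, here’s a Python sketch of a hash-based cloud reputation query. The endpoint URL and JSON schema are made up for the example; they are not a real API.

# Minimal sketch of a hash-based cloud reputation query: the endpoint sends a
# file hash, the server returns a good/bad/unknown verdict. The URL and JSON
# schema here are hypothetical.
import hashlib
import json
import urllib.request

REPUTATION_URL = "https://reputation.example.com/query"  # hypothetical endpoint

def sha1_of_file(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def cloud_verdict(path):
    payload = json.dumps({"sha1": sha1_of_file(path)}).encode()
    req = urllib.request.Request(REPUTATION_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        answer = json.load(resp)
    # The client folds the verdict into its own decision; "unknown" typically
    # means the other local protection layers decide.
    return answer.get("verdict", "unknown")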

Even the simplest of cloud lookups are quite powerful. Back end systems have access to our full sample storage systems, as well as information and metadata acquired across our entire customer base. This additional information is something the endpoint alone wouldn’t have access to. By performing a cloud query, this metadata can be provided to the client, where it becomes available to all of the different protection technologies on the system. The use of prevalence scores to determine suspiciousness is one good example of this.

Cloud-based detections are, for the most part, generated automatically. This provides an extremely fast response time from sample discovery to protection. Turnaround times from sample to detection are on the order of seconds or minutes.

I’d be remiss if I didn’t mention cloud scanning in this section. Samples can be uploaded from the endpoint to the cloud in order to perform complex analysis operations that wouldn’t be possible on the client-side. These analysis steps can include passing the sample through multiple scanning engines, static analysis, and dynamic analysis procedures. Detections on the back end are designed to deliver a verdict by processing the metadata generated by these analyses. I’ll cover this in more detail in the section on back end detections.

Heuristic Detection Logic

Heuristic detection logic is designed to look for patterns typically present in malicious files in order to determine suspiciousness. Heuristic detections can be generated by machine learning techniques, or by hand.

A hand-generated piece of heuristic detection logic might, for instance, check how often a “+” character occurs in a script, which could be indicative of common script obfuscation techniques. More complex manually generated heuristic detections may search for multiple patterns. A suspiciousness score is generally calculated based on the presence and type of these patterns, and it is then combined with other metadata in order to reach a final verdict.
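
Here’s roughly what such a hand-written heuristic might look like, as a Python sketch. The threshold and weighting are purely illustrative.

# Minimal sketch of a hand-written heuristic of the kind described above:
# count "+" characters in a script, which string-concatenation-heavy
# obfuscation tends to inflate. The threshold and weight are illustrative only.
def plus_density_score(script_text):
    """Return a small suspiciousness score based on '+' density."""
    if not script_text:
        return 0.0
    density = script_text.count("+") / len(script_text)
    # Obfuscated scripts often stitch strings together with hundreds of "+".
    return 1.0 if density > 0.02 else 0.0

def heuristic_verdict(script_text, other_scores=()):
    # A real engine combines many such weighted signals plus other metadata.
    total = plus_density_score(script_text) + sum(other_scores)
    return "suspicious" if total >= 1.0 else "clean"

print(heuristic_verdict('document.write("a"+"b"+"c"+"d"+"e"+"f"+"g"+"h")'))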

By training a machine learning system to recognize structural patterns that commonly occur in malicious files, but not in clean files, complex heuristic logic can be built. This is usually achieved by feeding large sets of expert-vetted malicious and clean files into a machine learning system, and then thoroughly testing the output. When the resulting bundle of detection logic is performant and produces acceptably low false positive and false negative rates, it is deployed to the endpoint. This process should be repeated periodically in order to adapt to changes in the threat landscape. Again, these detection bundles work by looking for characteristics and patterns in samples and applying a set of mathematical models to calculate a level of suspiciousness.
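
As an illustration of that train-and-vet loop, here’s a minimal Python sketch using scikit-learn with placeholder feature vectors and labels. The model type, data, and false positive threshold are examples only, not what we actually ship.

# Minimal sketch: train a classifier on structural features extracted from
# expert-vetted clean and malicious files, then only ship the model if the
# false positive rate on held-out clean files is acceptable.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5000, 32))          # placeholder structural feature vectors
y = rng.integers(0, 2, size=5000)   # 0 = clean, 1 = malicious (vetted labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Measure the false positive rate on held-out clean files before deployment.
clean = X_test[y_test == 0]
fp_rate = clf.predict(clean).mean() if len(clean) else 0.0
if fp_rate < 0.001:
    print("ship the model in the next detection logic update")
else:
    print(f"false positive rate {fp_rate:.4f} is too high; retrain with better data")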

The structure of a portable executable

Information contained in the structure of a PE file can sometimes be used to determine a file’s suspiciousness.

The quality of heuristic detections depends largely on how well-trained they are. This, in turn, depends on the quality of samples used in the training and testing steps, which, in turn, depends on a mixture of automation and human guidance. Great results can be achieved if you have enough expertise, time, and resources to get the process right.

Behavioral Rules

Engines that behaviorally analyze code execution contain a set of rules designed to determine whether observed behavioral patterns are suspicious or malicious.

Some behavioral rules can be created by automation. Others are meticulously hand-crafted. By researching new exploit techniques and attack vectors, behavioral rules can be created before new techniques become widespread in malware. Generic behavioral rules can then be designed to catch all manner of malicious activity, allowing us to stay ahead. Our researchers in Labs are always on the lookout for future trends in the malware landscape. When they identify new methodologies, they go to work on creating new rules based on this research.

So, how do behavioral rules work? When something executes (this can mean running an executable, opening a document in its appropriate reader, and so on), various hooks in the system generate an execution trace. This trace is passed to detection routines that are designed to trigger on certain sequences of events. Examples of the types of events a behavioral rule might check include:

  • Process creation, destruction or suspension.
  • Code injection attempts.
  • File system manipulation.
  • Registry operations.

Each event passed to a behavioral rule includes a set of useful metadata relevant to that event. Global metadata for the execution trace, which includes things like PID, file path, and parent process, is also made available to rules. Metadata acquired from cloud queries and other protection components is also made available. This includes things like prevalence score and file origin tracking (a history of the file from the moment it appeared on the system, which can include things like the URL it was downloaded from). As you can imagine, some pretty creative logic can be built with all of this information at hand.
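
To give a flavor of how such a rule might be expressed, here’s a small Python sketch. The event names, metadata fields, and thresholds are illustrative, not our actual rule schema.

# Minimal sketch of a behavioral rule: trigger on a sequence of events in an
# execution trace, weighted by metadata such as prevalence and file origin.
SUSPICIOUS_SEQUENCE = ["process_create", "code_injection", "registry_write"]

def rule_matches(trace, global_meta):
    """trace: ordered list of {'event': str, ...} dicts for one execution."""
    events = [e["event"] for e in trace]
    # Require the suspicious events to appear in order (not necessarily adjacent).
    idx = 0
    for name in events:
        if name == SUSPICIOUS_SEQUENCE[idx]:
            idx += 1
            if idx == len(SUSPICIOUS_SEQUENCE):
                break
    sequence_hit = idx == len(SUSPICIOUS_SEQUENCE)
    # Low prevalence and a download origin make the same behavior more alarming.
    rare = global_meta.get("prevalence", 10**6) < 50
    downloaded = global_meta.get("origin", "").startswith("http")
    return sequence_hit and (rare or downloaded)

trace = [{"event": "process_create"}, {"event": "file_write"},
         {"event": "code_injection"}, {"event": "registry_write"}]
print(rule_matches(trace, {"prevalence": 3, "origin": "http://example.com/payload.exe"}))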

Of course, we constantly update the logic and functionality of our behavioral detection components as the threat landscape changes, and as we make improvements. Who wouldn’t?

Generic Detection Logic

By creating complex programs designed to be run on the endpoint, we can automate certain reverse engineering techniques in order to detect malware. This is where things get interesting, and also a little long-winded to explain.

There are many file types out there that can contain malicious code – Windows executables (PE files), PDFs, Microsoft Office documents, Flash objects, Android packages, and scripts such as JavaScript, to name a few.

In order to work with all of these different file formats, parsers are frequently needed. Parsers identify and extract useful embedded structures from their respective containers. Examples of embedded structures include things like resources (sounds, graphics, videos and data), PE headers, PE sections, Authenticode signatures, Android manifests, and compiled code.

In simple terms, a parser breaks an input file down into its constituent parts, and provides them as an easily accessible data structure to detection logic. A lot of file formats are complex and contain dozens of corner cases. By parsing a file and then providing a data structure to detection logic, we avoid having to replicate complex parsing code over and over. We can also make sure that the code is robust and well-performing.
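
As a rough illustration of the parser idea, here’s a Python sketch built on the third-party pefile module. The fields extracted, and the toy detection layered on top of them, are examples only.

# Minimal sketch of the parser idea: turn a PE file into a simple structure
# that many detections can share, instead of each detection re-parsing the file.
import pefile  # third-party module: pip install pefile

def parse_pe(path):
    pe = pefile.PE(path, fast_load=True)
    return {
        "machine": pe.FILE_HEADER.Machine,
        "entry_point": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        "sections": [(s.Name.rstrip(b"\x00").decode(errors="replace"),
                      s.SizeOfRawData) for s in pe.sections],
    }

def detection_tiny_sections(parsed):
    # A toy detection that works purely on the parsed structure: flag files
    # whose sections are all suspiciously small, a pattern some droppers show.
    return bool(parsed["sections"]) and all(size < 512 for _, size in parsed["sections"])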

Files (of varying type) are frequently embedded inside malicious files. For example, a malicious PDF may contain embedded flash code, executable code, shellcode, or scripts. Often these embedded files are encrypted or obfuscated. Hence, when the outermost malicious file is opened or executed, it decrypts objects contained within itself, writes them to memory or disk, and then executes them.

Malicious executables themselves often use off-the-shelf packers or protectors, such as UPX or ASPack, to obfuscate their code and structures. In most cases, multiple packers are applied on top of each other, and in many cases, custom obfuscation, written by the malware author, is also used. Getting to the actual code inside a malicious executable is akin to peeling away layers of an onion.

The problem with packed files is that they all look quite similar. And some non-malicious executables also utilize packers. In order to properly identify maliciousness, you often have to remove those layers.

Removing off-the-shelf packers is pretty straightforward – recognizing standard protectors is simple, as are the methods for decrypting them. Unpacking custom obfuscation is trickier. One way to unravel custom obfuscation is to extract the decryption code and data blocks from the malware itself, run the whole thing in a sandbox, and allow the malware’s own code to unpack the image for you. Another common trick is to write a routine that analyzes disassembly of the malware’s deobfuscation code, extracts the encryption key and location of the data to be worked on, and then extracts the image.
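
To make one of those tricks concrete, here’s a toy Python sketch that peels a single-byte XOR layer by brute-forcing the key and checking whether an embedded executable falls out. Real packers take considerably more work than this.

# Minimal sketch of one layer-peeling trick: if a blob is hidden under a
# single-byte XOR (a very common, very weak custom obfuscation), brute-force
# the key and check whether a PE image ("MZ" header) appears.
def xor_bytes(data, key):
    return bytes(b ^ key for b in data)

def peel_single_byte_xor(blob):
    for key in range(256):
        if xor_bytes(blob[:2], key) == b"MZ":   # looks like an embedded executable
            return key, xor_bytes(blob, key)
    return None, blob

# Hypothetical usage: a payload "encrypted" with key 0x5A inside a dropper.
hidden = xor_bytes(b"MZ\x90\x00...rest of embedded PE...", 0x5A)
key, recovered = peel_single_byte_xor(hidden)
print(hex(key), recovered[:2])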

Peeling away all the defensive layers of a malicious executable allows detection logic to analyze the actual malware code hidden deep inside. From here, the sky’s the limit in terms of how you further process the resulting image. These same techniques are used to pull objects out of non-PE files, such as PDFs and MS Office documents, and similar decryption tricks can be used to unobfuscate things like malicious JavaScript.

Creating this sort of detection logic can sometimes be time-consuming. However, the rewards are definitely worth it. Generic detection logic written in this way is not only capable of catching large numbers of malicious files, it has a pretty good shelf-life. In many cases, new variants of the same malware family are detected by a single well-written generic, with no modifications. If you check out our world map, you’ll notice that most, if not all of our top 10 detections are usually manually written generics.

To be honest, what I’ve written here only scratches the surface of what can be, and is being, done with these technologies. Bear in mind that detection logic also has access to information from things like cloud queries and file origin tracking, and can work on any sort of data stream, including memory space and incoming network data – and you can probably understand why the topic is so complex. An in-depth explanation would probably fill a book.

Network Detection Logic

Blocking attacks on the network, such as exploit kits, is something we focus on pretty heavily. Stopping threats at an early point in the attack chain is an effective way of keeping machines free from infection.

As mentioned in the previous section, data streams arriving over the network can be analyzed in a similar way to how files on disk are inspected. The difference is that, on the network, as information arrives, mechanisms exist to block or filter access to further attack vectors on-the-fly.

Network detection logic on the endpoint gets access to IP addresses, URLs, DNS queries, TLS certificates, HTTP headers, HTTP content, and a whole host of other metadata. It also gets access to network reputation information, including URL, certificate, and IP reputation (via cloud queries). With access to all of this information, you can do some interesting stuff. Here are some examples.

  • The behavior of a web site can be examined in real-time, allowing exploit kits to be detected and blocked.
  • Communication between a bot and its command and control server can be blocked while its IP address is still being looked up (the DNS query), or based on the IP the bot is attempting to contact. We can also block C&C communications based on network traffic patterns.
  • Phishing sites can be blocked based on the content of the HTTP headers, the body, or both.
  • We can block multiple types of malicious redirections, including flash redirections.

Network-level interception technologies allow us to extract and store metadata about objects arriving on a system. That metadata is made available to other protection components on the system, where it is factored into their decision processes.
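
As an example of the kind of logic this enables, here’s a small Python sketch that combines a hypothetical reputation lookup result with simple checks on an HTTP response body to decide whether to block a suspected phishing page. The markers and field names are made up for illustration.

# Minimal sketch of a network-side rule: combine URL reputation (from a cloud
# query) with simple checks on the HTTP response body to block a phishing page
# before the browser renders it.
PHISHING_MARKERS = ("verify your account", "confirm your password", "unusual sign-in")

def should_block(url, http_body, reputation):
    """reputation: dict from a cloud lookup, e.g. {'category': 'unknown', 'score': 0.4}."""
    if reputation.get("category") in ("malicious", "phishing"):
        return True
    body = http_body.lower()
    marker_hits = sum(marker in body for marker in PHISHING_MARKERS)
    has_password_form = 'type="password"' in body
    # An unknown site asking for credentials with classic phishing wording gets blocked.
    return reputation.get("category") == "unknown" and has_password_form and marker_hits >= 1

print(should_block("http://examp1e-bank.com/login",
                   '<form><input type="password"></form> Please verify your account',
                   {"category": "unknown", "score": 0.4}))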

Memory Detection Logic

Hunting for signs of malicious code in active memory is a useful technique, especially when installing onto a system that wasn’t previously protected. Certain types of malware, especially rootkits, can only be detected using this method.

Some malware actively prevents the installation of endpoint protection software onto a system. Cleaning a system prior to running our installer is, therefore, an important step. We run forensics routines, which include memory scanning capabilities, early in the installation phase of our product in order to remove any infections that may have occurred in the past. These same forensics routines can be run periodically or manually on the system to ensure nothing slipped past our protection layers.

As I mentioned earlier, getting through layers of obfuscation can sometimes be a complex task. By looking at the address space that malware is using, you can often find a deobfuscated image. However, this isn’t always the case. Off-the-shelf packers usually dump the program into memory and kick it off, but if the malicious program is using its own custom obfuscator, there are various tricks it can use to keep itself obfuscated, even in memory. A simple way to do this would be to obfuscate strings. In a more complex case, code is converted into a proprietary, one-off, compilation-time-generated instruction set that is run under a custom virtual machine.
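
Here’s a toy Python sketch of the memory scanning idea, with plain byte buffers standing in for dumped memory regions and made-up indicator strings.

# Minimal sketch: once the packer has unfolded the real code in a process's
# address space, plain-text indicators (C&C domains, ransom notes, mutex names)
# that never appear in the on-disk file can show up in memory. Here the
# "memory regions" are just byte buffers standing in for a dumped address space.
INDICATORS = [b"evil-c2.example.com", b"Your files have been encrypted"]

def scan_regions(regions):
    hits = []
    for base, data in regions:                     # (base address, raw bytes)
        for marker in INDICATORS:
            offset = data.find(marker)
            if offset != -1:
                hits.append((hex(base + offset), marker.decode()))
    return hits

# Hypothetical dump: the on-disk sample only holds an encoded string, but the
# running process holds the decoded version.
regions = [(0x400000, b"\x00" * 64 + b"connect to evil-c2.example.com\x00")]
print(scan_regions(regions))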

Back End Detection Logic

The large amount of processing power available in our back end systems grants us the perfect opportunity to do the sort of expensive, time-consuming examination of samples that wouldn’t be possible on the endpoint. By performing a series of analysis steps and combining the output of those operations with metadata collected from client queries and our sample storage systems, we can automatically process and categorize large volumes of samples in a way that wouldn’t be possible by hand. We also get a huge amount of good threat intelligence out of these processes.

Some of the same technologies used in our endpoint products can be re-purposed to perform rigorous static analysis procedures. This is achieved by writing special detection code designed to extract the metadata and characteristics of a sample, instead of simply delivering a verdict. This is just one method we use to dissect samples. Other systems, designed to process particular types of files, also provide our decision logic with relevant metadata. Additionally, we process samples through systems designed to heuristically determine suspiciousness based on structural features.

Malware recognition automation

Classy and Gemmy, two malware recognition automation components developed as part of a research project with Aalto University about five years ago.

URLs and samples that can be executed are sent into sandbox environments for dynamic analysis. A trace of the sample’s behavior is obtained by instrumenting the environment where the sample is executed. That trace provides additional metadata used to categorize the sample.

Sometimes these types of analyses provide us with new samples. For instance, a malicious sample may attempt to connect to a command and control server. When this behavior is observed, the address of the server is fed back into the system for analysis. Another example, and one I mentioned earlier, relates to how some malware drop embedded payloads. When this happens, the dropped sample is fed back into the system. This is why you’ll find that we have detection logic not just for initial malicious payloads, but also for all subsequent pieces in the attack chain.

Since our products query the cloud, we are able to gather metadata over our entire customer base. This metadata can be used to determine the suspiciousness of a sample. For instance, the prevalence of a sample can give us clues as to whether it might be malicious.

Once we have processed a file or URL fully, we feed all gathered metadata into a rules engine that we call Sample Management Automation (SMA). The rules in this system are partially hand-written and partially adaptive, based on changes in the threat environment. This system is the brain of our whole operation. It does everything from categorizing and tagging samples in our storage systems to determining verdicts and alerting on new threats.
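
To illustrate the general shape of such a rules engine – not our actual SMA rules – here’s a small Python sketch that runs gathered metadata through an ordered list of rules and returns a verdict and tags.

# Minimal sketch of a rules engine: all the metadata gathered for a sample
# (static, sandbox, prevalence) is fed through a list of rules, and the first
# matching rule decides the verdict and tags. Rule contents are illustrative.
RULES = [
    {"name": "dropped_by_exploit_kit",
     "when": lambda m: m.get("origin") == "exploit_kit",
     "verdict": "malicious", "tags": ["auto", "exploit-kit-payload"]},
    {"name": "rare_and_beacons_out",
     "when": lambda m: m.get("prevalence", 0) < 10 and m.get("contacts_c2", False),
     "verdict": "malicious", "tags": ["auto", "c2"]},
    {"name": "widespread_signed_clean",
     "when": lambda m: m.get("prevalence", 0) > 100000 and m.get("valid_signature", False),
     "verdict": "clean", "tags": ["auto"]},
]

def categorize(metadata):
    for rule in RULES:
        if rule["when"](metadata):
            return {"verdict": rule["verdict"], "tags": rule["tags"], "rule": rule["name"]}
    return {"verdict": "unknown", "tags": ["needs-analyst"], "rule": None}

print(categorize({"prevalence": 3, "contacts_c2": True}))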

Beta Detections

No good software is written without proper testing. Sometimes our researchers figure out creative ways to detect malware, but aren’t sure how they’ll work in the real world. For instance, a new type of detection method might end up being too aggressive and trigger a lot of false positives, but we won’t really know that until it’s out in the wild. In these cases, we use beta detections.

A beta detection reports back to us what it would have done, without actually triggering any actions in the product or system itself. By collecting upstream data from these pieces of code, they can be tuned to behave optimally, and released once they’re working as intended.
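
Here’s a minimal Python sketch of that idea: a wrapper that lets a new detection report what it would have done, without ever actually blocking anything.

# Minimal sketch of a "beta detection": wrap a new detection so it only
# reports what it *would* have done, letting us tune it on real-world
# telemetry before it is allowed to block anything.
import logging

logging.basicConfig(level=logging.INFO)

def beta(detection_fn, name):
    def wrapped(sample_meta):
        verdict = detection_fn(sample_meta)
        if verdict != "clean":
            # Upstream this event instead of acting on it.
            logging.info("BETA %s would have flagged %s as %s",
                         name, sample_meta.get("sha1", "?"), verdict)
        return "clean"                      # never block while in beta
    return wrapped

def new_aggressive_rule(meta):
    # Hypothetical new rule that may be too trigger-happy on packed files.
    return "suspicious" if meta.get("entropy", 0) > 7.5 else "clean"

check = beta(new_aggressive_rule, "high-entropy-rule")
print(check({"sha1": "abc123", "entropy": 7.9}))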

Beta detections are often pretty cutting-edge stuff. In many cases, they’re new pieces of technology designed to deal with threats we’re already catching, only by using different, more efficient methods. We’re always looking to catch as many real samples as possible with every hand-crafted detection we deploy, so beta detections also provide us with a nice test bed for new theories on that front. We utilize beta detections both in our back ends, and on endpoints.

Conclusion

As you’ve probably gathered, detection logic comes in many forms and can be used to do a whole bunch of different things. The technologies behind all of these different detection methods are designed to work together to protect machines and users against a range of different attack vectors. The whole system is rather complex, but it’s also very powerful. And we’re evolving these technologies and methodologies constantly as the threat landscape changes. By using multiple different protection layers, and not putting all of our eggs into one basket, we make it difficult for attackers to bypass our technology stack. So the next time you hear someone talking about “signatures” in the context of modern endpoint protection products, you can be sure they’re either rather uninformed, or they’re peddling fiction.

P.S. In case you’re wondering, I was joking above when I wrote about not knowing if it was okay for me to reveal this information. We’ve been getting lots of nice feedback on this explainer series, and we’re more than happy to openly describe our technologies and processes to everyone.



What’s The Deal With Network Reputation?

Drive-by downloads or, more accurately, drive-by installations are some of the scariest threats on the Internet. Exploit kits provide the underlying mechanisms for this behavior. They work by examining your browser’s environment – browser type, browser version, installed plugins, and plugin versions, looking for a vulnerable piece of software. If the exploit kit finds any […]

2016-06-23

Out of Office OPSEC

A “found object” from my Inbox (with sundry modifications). A vacation greeting from our CSS OPSEC experts! It’s absolutely fantastic that you’re soon going on holiday and are not at the office. And we’re sure it’s very well deserved! But before you go, consider this – you don’t have to tell the world where you […]

2016-06-23

What’s The Deal With Threat Intelligence

The term “threat intelligence” is quite trendy right now. For many, threat intelligence is a term used to describe IOC feeds that are plugged into security infrastructure to identify suspicious or malicious activity. For us, it describes a whole lot more. As a company, we’ve been actively gathering and assimilating threat intelligence for over 25 […]

2016-06-14

What’s The Deal With Prevalence

We use the word “prevalence” a lot at F-Secure Labs. And what’s prevalence? The prevalence of an executable file is defined as the number of times it’s been seen across our entire customer base. Malicious executables tend to be rare over time, most live and die quickly, and thus the number of times we’ve seen […]

2016-06-08

Qarallax RAT: Spying On US Visa Applicants

Travelers applying for a US Visa in Switzerland were recently targeted by cyber-criminals linked to a malware called QRAT. Twitter user @hkashfi posted a Tweet saying that one of his friends received a file (US Travel Docs Information.jar) from someone posing as USTRAVELDOCS.COM support personnel using the Skype account ustravelidocs-switzerland (notice the “i” between “travel” […]

2016-06-07

“UltraDeCrypter” Wants To Speak Your Language

There’s a new crypto-ransomware brand in-the-wild called “UltraDeCrypter”. It’s an evolution of CryptXXX that is being dropped by the Angler exploit kit. In our tests, using an older CryptXXX “identification code” with UltraDeCrypter’s decryption service portal redirected to an older CryptXXX portal. So there’s evidence the back ends are interlinked. Regarding the payment support pages… […]

2016-06-03

IC3’s Internet Crime Report

I’ve spent part of my day reading through the Internet Crime Complaint Center’s 2015 Internet Crime Report, and the numbers… are impressive. There were 288,012 complaints received by IC3 in 2015 and more than one billion dollars in losses reported. Hot topics? Business Email Compromise (BEC), Email Account Compromise (EAC), and ransomware. On a positive […]

2016-05-27

CVE Security Vulnerability Data Pr0n

This year’s Adobe related CVE security vulnerabilities are well on track to surpass 2015 levels. Sorting through the data at cvedetails, so far, 2016 is at 51% compared to 2015. And it’s still May. Adobe produced a bumper crop of code execution vulnerabilities (335) in 2015. The trend is repeating itself in 2016. And what’s […]

2016-05-26

What’s The Deal With Behavioral Engines?

I recently wrote a post on how scanning engines evolved from their primitive, signature-based roots in the 1980s to the present day. In that article, I touched upon how file scanning itself is just a small piece of the puzzle when it comes to protecting endpoints from threats such as malware and exploits. Today, I focus on […]

2016-05-23

AV-Comparatives Real-World Test Results

AV-Comparatives runs a monthly “Whole Product Dynamic Real-World Protection” test. The organization just released its third set of results covering April 2016. And we’re pretty happy about how our products have been faring so far this year! The guys at AV-Comparatives run extremely thorough tests. In order to properly ascertain how security products function against […]

2016-05-18