Description of the Tutorial
Length: half-day, i.e., 3 hours plus breaks
Abstract An overview of the field of malware analysis with emphasis on issues related to data science. We discuss the various types of malware, including executable binaries, malicious PDFs, and exploit kits. The most popular tools used for analyzing malicious binaries will be presented and demonstrated, including IDA, Binary Ninja, and x64dbg. Concepts and tools from static and dynamic analysis will be discussed. We will discuss cluster analysis, malware attribution, and the problems caused by polymorphic malware. We will conclude with our view of important research questions in the field.
Target audience, prerequisites, and benefits The intended audience will be those with some knowledge of database or IR, and computer systems in general. We do not expect attendees to have any prior experience with malware analysis or cyber in general. CIKM is by no means a computer security conference, but knowledge of malware analysis may be useful to data scientists at any level of experience, and there are research issues in malware analysis that pertain to data science.
- About the presenter Charles Nicholas is a Professor of Computer Science at UMBC. He has been involved in the CIKM conference for many years, and has recently turned his attention to the problems of malware analysis “in the large”. His recent work has considered questions related to storing, searching, and finding patterns in large collections of malware. He has taught a combined graduate-undergraduate course in malware analysis at UMBC for several years.
- If using VMware Workstation, you’ll need the commercial version: Workstation Pro for Windows and Linux or Fusion Pro for macOS. The free versions don’t support snapshots. You’ll want snapshots when examining malware, so you can revert the VM’s state to start a new investigation or backtrack an analysis step.
Before the Tutorial
- While you're at home, with your own Internet connection, you can install any or all of these packages, and perhaps get more out of the tutorial.
- However, people who don't do so will be at no disadvantage.
- Download and install Virtual Box or VMWare Player. Instructions can be found on the web site, and YouTube as well!
- If you have access to the appropriate ISO files, install virtual machines that run Windows XP and Windows 7. Be advised that some XP malware doesn't work on Windows 7. (And even less works on Windows 8 or 10)
- Install a VM running a Linux of your choice. I like Ubuntu for the Desktop.
- It is also sometimes convenient to have a UNIX-workalike on your Windows VMs, even if it's not strictly necessary if Linux is handy.
- Download and run setup.exe from www.cygwin.org, which gives you a working UNIX-like environment on Windows.
- Download and install a disassembler such as IDA Pro. The free version is fine for our purposes.
- Download and install a debugger. Olly is still widely used, but other debuggers are available, such as Immunity (available at the Olly site and elsewhere perhaps) and x64dbg.
- Want a good book on the subject of malware analysis? Consider Practical Malware Analysis, from No Starch Press. Paper and electronic formats, of course. Includes exercises on real malware, but some of the malicious code doesn't work on newer versions of Windows. One or two other books are more recent, but not as good.
Introduction
- This tutorial is based on a semester-length course on malware analysis that has been offered at UMBC several times.
- Cyber attacks are in the news all the time! Malware is a factor in many if not most cyber attacks. (User blunders being the other factor.)
- See, for example, the latest issue of Cyberwire
- Or the May 15, 2015 issue of Newsweek
- For great fun, check out The Norse Attack Map
- Cyber includes many different subjects, including malware analysis. But many cyber attacks tend to rely on malware to work. Ransomware, for example, is a form of malware that has gotten lots of attention recently.
- Cyber in general, and malware analysis specifically, is an active area of research.
- See for example the Springer Journal of Computer Virology and Hacking Techniques
- and the various relevant Usenix Conferences
- and Defcon
- and the IEEE Conference on Malware and Unwanted Software
- and the occasional Dagstuhl seminar, such as this workshop on Analysis of Executables
- and there are other meetings for industry and government groups, such as the Malware Technical Exchange Meeting
- Current research topics (not an exhaustive list)
- Malware analysis is aided by advances in machine learning; see for example Using Machine Learning to Detect Malware Similarity and even this article
- Spotting malware by string matching is no longer effective. Research is under way to spot malware by methods that rely on more abstract patterns of characters, rather than specific strings.
- There are techniques to hinder or defeat analysis, and research on overcoming these is in progress.
- Look at Symantec and F-Secure and McAfee and Microsoft lab sites. There are many other such labs.
- There is no shortage of data to work with:
- A number of malware collections are available for research purposes. Some noteworthy examples:
- Seymour has recently used VirusTotal to label the very large VirusShare collection.
- VX Heaven is quite dated, but it's still pretty big, and easily accessed. Many malware specimens categorized by type, and lots of related material.
- Zeus Tracker see the FAQ for a link to a zip file with many specimens.
- The CERT malware catalog is big, multiple TBs, and growing. Submitting a specimen to CERT for analysis isn't hard, but that has advantages and disadvantages.
- Anti-virus vendors have large collections of malware.
- Google's archive of Android malware is probably the biggest malware repository of them all. Not easily accessed from the outside.
- The variety of malware may surprise you!
- Executable files, whether binaries (.exe or .dll files) or scripts (.bat or .scr). These files tend to be targeted towards the Windows platform. Executable binaries for Windows will be the focus in this tutorial.
- Much more malware is becoming available for the Android platform. Mobile phones are a huge target. Android especially, but also iPhone. More on that later, perhaps.
- Macs are not immune! But Mac malware is still a small subset of the whole. A (somewhat dated) overview.
- Web-based malware is now a big deal.
- Exploit kits can attack a variety of platforms.
- Exploit kits such as Blackhole among many others serve to automate the distribution of malware.
- A blog post about the creator of Black Hole.
- We can talk about exploit kits at greater length if there is audience interest.
- PDF files can contain executable content - which can escape the PDF viewer sandbox and cause damage.
- There are even malicious LaTeX files! A word to the wise: Don’t Take LATEX Files from Strangers (pdf)
- We'll look at static vs. dynamic analysis, and consider the applications of data science in each.
- Feel free to follow along! This tutorial is intended to be interactive, within our severe time constraints. I encourage students to use their laptops in class, as appropriate.
- Practical Malware Analysis is focused on Windows XP, but may still be the best (but no longer the only) book available. Published by No Starch Press, which owns the image below. Paper and electronic formats, of course. Includes exercises on real (declawed) malware. Notice the alien peeking.
What does Malware Analysis have to do with Data Science?
Those concerned with Malware Analysis tend to ask a lot of the same questions that our community has been working with for years, such as:
- Malware can be viewed as a particular type of document. Hence we can consider questions related to creation, whether manual or automatic. Dissemination of malware is an interesting social and technical problem. Malware is usually designed to be stealthy, and not easily read and understood. To be more specific:
- Specific malware specimens may require significant system-level knowledge to understand.
- Malware analysis tends to produce documents related to the specimen, such as disassembler output, debugging logs, execution traces, network logs, and so forth. Systems for dealing with large sets of related data are our cup of tea, are they not?
- When are objects similar? Are there families of objects? How can we characterize them? How can we classify them? We will demonstrate visualization of malware and malware families.
- Who created this object, and how? Attribution is an interesting and hard question.
- Malware analysts (like all analysts) make their living by writing reports. Can the data in those reports be mined?
- Hence the tutorial's subtitle: the problems one encounters when dealing with large sets of malware are data science problems!
Tools of the Trade
- Use of virtual machine software such as Virtual Box is essential, but is not without trade-offs.
- There are people who do malware analysis on bare metal...
- The VirusTotal utility is often (but not always) a good first step.
- Testing VirusTotal on one of the lab exercises from PMA, we see that the various A/V scanners fail to agree!
- Since VirusTotal keeps a record of every file it sees, it gives users the option of redoing an analysis or just returning the earlier results.
- When would analysts want to use such a tool?
- When would malware authors want to use it?
- What does VirusTotal do with all this data? I wish I knew!
- Discuss use of Virtual Box.
- You may need to purchase more RAM for your laptop.
- Keep host OS as uncluttered as possible.
- Keep copies of clean installs, as snapshots as well as exported appliances
- Shared folders are convenient, but have their risks
- Make backups of VMs using the clone function
- Don't use the same VM for malware analysis and on-line banking :-)
- Become comfortable with building new VMs.
- Become comfortable with running two VMs at once, e.g. a Windows VM for running the malware and a UNIX for simulating the Internet
- Dropbox is useful! Especially since the Dropbox folder can be shared between the host and one or more VMs.
- Screen shot of VirtualBox's main menu
- Tools for malware analysis fall into several categories
- Platform specific utilities for quick inspection, e.g. Microsoft Sysinternals. Useful for triage as well as in-depth.
- You'll need to put the Sysinternals directory on your path, or type the full pathname of the executable.
- I recommend Russinovich's books on Windows Internals.
- What do I mean by triage and in-depth?
- A disassembler such as IDA Pro. Please feel free to get a copy of the freeware version of IDA Pro.
- Binary Ninja is an alternative to traditional disassemblers. It can show the program in graphical format, as does IDA, and has a scripting feature. The commercial version of Binary Ninja supports Jupyter Notebooks...
- Other tools
- A debugger such as Olly, Immunity, or x64dbg, or all of the above.
- A network monitor such as Wireshark. Use sudo apt-get install wireshark to get wireshark for Ubuntu and other flavors of Linux. Virtual Box has some network monitoring of its own.
- Reference databases, such as MSDN Documentation
- Ordinary system utilities, such as IDEs for C and perhaps assembly. I'm used to emacs and make, but you may prefer CodeBlocks or Eclipse.
- [De]compression utilities.
- Malware is usually saved in compressed and encrypted form.
- I usually have 7-Zip installed on my malware analysis VMs.
- A Zip file with the password 'infected' is safe to email, or so one would think.
- You might like to configure a VM or two with these tools installed. Once you like it, make a copy in a safe place, so that it can be cloned as needed later.
- Demonstrate taking a snapshot of a VM, as appropriate.
- Isn't a good anti-virus program enough? Not so!
- What are the strengths and weaknesses of AV signatures?
- Do make a habit of installing and updating AV software on your host machine
- Some good AV programs are available for free, according to PC Magazine, such as AVG Antivirus Free.
- Windows Defender seems to work well enough.
- Don't run AV on your VMs for malware analysis.
- The trouble with AV as such is that the bad guys always have the initiative :-(
- Malware is an arms race! Many malware actors work hard to make their malware hard to analyze.
- There is a learning curve!
- You will probably need to dig into details that non-geeks don't care about.
- It would take at least a full-day tutorial to learn it all :-)
Platform-specific Utilities
- For computing MD5, SHA-1, SHA-2*, and more we suggest QuickHash. Feel free to download, and unzip that.
- Example of running QuickHash on itself.
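If you would rather script the hashing step than use a GUI, here is a minimal sketch in Python using the standard hashlib module. This is not how QuickHash works internally; the function name `file_hashes` and the chunk size are my own choices for illustration.

```python
import hashlib

def file_hashes(path, algorithms=("md5", "sha1", "sha256"), chunk_size=65536):
    """Return {algorithm_name: hex_digest} for the file at `path`.

    The file is read in chunks so that large specimens need not fit in memory.
    `file_hashes` is a hypothetical helper name, not part of any tool above.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Calling `file_hashes("WinMD5.exe")` (or any other file you have on hand) returns a dictionary of digests that can be pasted into a report or into a VirusTotal search.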
- Some hash functions that preserve similarity exist, such as ssdeep and sdhash.
- People are also using compression-based similarity for this purpose. (see Raff and Nicholas, KDD 2017 as well as Raff and Nicholas on arxiv.org)
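To give a feel for compression-based similarity, here is a small sketch of the classic normalized compression distance (NCD), using zlib as the compressor. Please note that this is a stand-in I chose for brevity; it is not the method from the papers cited above, and zlib's 32 KB window makes it unsuitable for large binaries.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # shared content compresses away in the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

The idea is that if two files share a lot of content, compressing them together costs little more than compressing one of them alone.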
- What can we see in a binary?
- Demonstrate the strings command from a cygwin (or UNIX) shell, using WinMD5.exe as the file being inspected. System calls, registry keys, and web sites that seem out of place usually are!
- Recall that Strings is one of several utilities bundled up in Sysinternals. You'll need to put the Sysinternals directory on your path...
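For those who want to see what a strings utility does under the hood, here is a rough Python sketch. It is my own simplified approximation, not the Sysinternals or UNIX implementation: it looks for runs of printable ASCII, and separately for the 16-bit "wide" strings that Windows programs often contain. The function names and the default minimum length of 4 are assumptions.

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of at least `min_len` printable ASCII characters found in `data`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

def extract_wide_strings(data: bytes, min_len: int = 4):
    """Return UTF-16LE strings: printable ASCII characters each followed by a zero byte."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]
```

Read the file with `open(path, "rb").read()` and pass the bytes to either function; then scan the output for API names, registry keys, and URLs that look out of place.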
- A hex editor such as 010 Editor is a useful addition to your tool kit, although IDA and Binary Ninja provide similar functionality.
- Malware is usually packed, to avoid A/V, to make analysis harder, and to make a smaller footprint.
- Obfuscation is widely used in malware, especially crimeware.
- There are a variety of pack/unpack utilities available, and sometimes other tools know about them. UPX is a widely used pack/unpack utility. (packing is not the same as compression)
- Good overview of unpacking and patching an executable binary.
- Being able to measure the entropy of a file, or part of a file, is useful. See for example “Using Entropy Analysis to Find Encrypted and Packed Malware.” IEEE Security & Privacy Magazine, 2007, pages 40-45. It turns out that entropy can tell you a lot. Calculating the entropy of a file is a useful first programming exercise, suitable for Python or C or maybe even assembler.
- Calculating the entropy of a PE file on a section by section basis has also proven useful.
- For more on entropy, see Sorokin's paper on structural entropy, (UMBC only: with some highlighting pdf)
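Here is one way the entropy exercise might look in Python. It computes Shannon entropy in bits per byte, so results range from 0 (a single repeated byte) to 8 (all byte values equally likely); values close to 8 are typical of compressed or encrypted data. The fixed-size block version is a simple stand-in for section-by-section analysis, since it does not parse the PE section table. Function names and the block size are my own.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    n = len(data)
    # Sum -p * log2(p) over the byte values that actually occur.
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def block_entropy(data: bytes, block_size: int = 256):
    """Entropy of each consecutive `block_size`-byte block, for spotting packed regions."""
    return [shannon_entropy(data[i:i + block_size])
            for i in range(0, len(data), block_size)]
```

Plotting the output of `block_entropy` for a packed and an unpacked copy of the same program makes the difference easy to see.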
- Knowledge of x86 assembler and Windows system internals can be really useful.
- The focus in this tutorial will be on Windows more than any other platform.
- The Portable Executable File Format is described in detail at this Wikipedia article which refers to this spec from Microsoft and this PE poster and this article which describes the smallest possible PE file.
- The PE header can tell us several things, and along with the strings command, we can tell if perhaps the file has been packed or obfuscated.
- Several utilities for working with the PE header are available. PEViewer is free, and seems adequate.
- Demonstrate PEViewer, using a program called WinMD5.exe as an example.
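To make the PE header less mysterious, here is a minimal Python sketch that reads a few header fields using only the standard struct module. It follows the published PE layout (the MZ signature, the e_lfanew pointer at offset 0x3C, the PE signature, the 20-byte COFF header, then the section table), but it is far from a complete parser, and the function name and the returned dictionary layout are my own.

```python
import struct

def parse_pe_header(data: bytes) -> dict:
    """Extract a few COFF header fields and the section table from a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: file offset of the PE signature, stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF file header: 20 bytes immediately after the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    # The section table follows the optional header; each entry is 40 bytes.
    table = pe_offset + 24 + opt_size
    sections = []
    for i in range(n_sections):
        name, vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<8sIIII", data, table + 40 * i)
        sections.append({
            "name": name.rstrip(b"\x00").decode("ascii", errors="replace"),
            "virtual_size": vsize,
            "virtual_address": vaddr,
            "raw_size": raw_size,
            "raw_pointer": raw_ptr,
        })
    return {"machine": machine, "timestamp": timestamp,
            "characteristics": characteristics, "sections": sections}
```

Unusual section names, or a section whose raw size is tiny compared to its virtual size, are the kind of hints that suggest a packer has been at work.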
- If time permits, which is unlikely, we can do demos of other tools from the Sysinternals suite, and of separate utilities such as Dependency Walker and Resource Hacker.
- The PEiD utility described in PMA is still available, but no longer supported.
- A tool called Detect It Easy has lots of features usually found together in more complex packages like IDA.
- As mentioned above, entropy can sometimes be quite informative...
- ...but what the program imports can often tell you about its functionality.
- In case you need more PE tools, see this post from Malwarebytes Unpacked. Anecdotal evidence suggests that people pick their favorites, and use them. I happen to prefer DiE over many others.
Static Analysis: Disassemblers and Such
We can demonstrate IDA Pro, but before using IDA, a triage step using VirusTotal or pestudio is in order.
- Here is a simple C program, compiled with Code::Blocks
#include <stdio.h>
#include <windows.h>

int main()
{
    SYSTEMTIME lt;      /* receives the current local date and time */
    GetLocalTime(&lt);  /* Win32 call that fills in the structure */
    printf("The local time is %02d:%02d\n", lt.wHour, lt.wMinute);
    return 0;
}
- A link to this code, in case you don't want to type it in yourself. The program should compile and run as expected.
- An overview from pestudio
- The fact that pestudio looks for malware indicators is handy.
- We can also look at the strings.
Moral of the story: one can sometimes learn a lot from the PE header. We now know the programmer's name!
- Opening the file in IDA, we see
- and a little lower, we see code we recognize. (Windows and CodeBlocks put a bunch of library code in as well, making the executable larger than the raw .o file would suggest.) The red area indicates the program's end.
- and we can see the call graph
- and a graphical view is also available
- Of course IDA also lets us look at strings.
- But you won't see much if the file is packed, which is something that the PE utilities can tell us. So IDA provides the ability to unpack some of the common packers.
- The hex dump will take you back to your undergraduate days, perhaps. May also indicate where buffers might be located later, if and when the file unpacks itself.
- The libraries the binary imports may tell you a great deal.
This is obviously a C program, with no remarkable system calls. But if we had seen low-level keyboard hooks, or registry access, we'd be more suspicious.
- Now compare to a file we know to be malicious! Let's look at Lab03-04.exe from the PMA book. (PMA comes with an ensemble of sample binaries for analysis.)
- You may see references to another disassembler, PEBrowsePro. PEBrowsePro is worth trying if you don't need a system as complex as IDA.
- Using PEBrowsePro, we can take a quick look at Lab03-04.exe
- Is there anything suspicious? If not, this screen shot wouldn't be here!
- In IDA, we can see some other malware indicators, apart from the strings mentioned above. The program has a mix of system calls, including file system, registry manipulation, and socket calls, and then... it builds an HTTP header without being a browser? That suggests an HTTP backdoor, which is malware that sends information to a web server run (or at least controlled) by the attacker.
- and a call to sleep, without any obvious reason. Sleep is sometimes used to hide (or delay the appearance of) functionality that would otherwise appear under dynamic analysis.
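The kind of import triage described above can be roughly automated. The sketch below checks a list of imported function names against a small watchlist. The watchlist is purely illustrative and of my own choosing (the names are real Win32 or Winsock functions, but the selection and the category labels are assumptions), and a hit is only a reason to look closer, not proof of malice.

```python
# Illustrative watchlist only: real Windows API names, hypothetical category labels.
SUSPICIOUS_APIS = {
    "SetWindowsHookExA": "keyboard or mouse hook",
    "SetWindowsHookExW": "keyboard or mouse hook",
    "GetAsyncKeyState": "keystroke polling",
    "RegSetValueExA": "registry write",
    "RegCreateKeyExA": "registry write",
    "InternetOpenA": "HTTP client",
    "HttpSendRequestA": "HTTP client",
    "connect": "socket",
    "send": "socket",
    "Sleep": "delay",
    "IsDebuggerPresent": "anti-debugging",
}

def flag_imports(imports):
    """Return {function_name: category} for imported names that are on the watchlist."""
    return {name: SUSPICIOUS_APIS[name] for name in imports if name in SUSPICIOUS_APIS}
```

Feed it the import names shown by IDA, pestudio, or any PE tool; our little time-of-day program should come back empty.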
- IDA Pro has debugger capabilities, as well as static program analysis. Probably the single most important tool for malware analysis.
- IDA is a big, complex system. The IDA Pro Book by Chris Eagle is available from No Starch.
- An IDA Pro Cheat Sheet (pdf)
- Other alternatives to IDA exist, such as Hopper for OS X and Linux.
Dynamic Analysis
- Make a snapshot. Make a clone and a snapshot.
- Disconnect your VM from the network before beginning dynamic analysis. Make sure you know how to do this!
- The procmon utility can tell you what's going on, in part.
- The ProcessExplorer program gives even more detail.
- Process Explorer may also let us watch what happens when documents are opened using Word or a PDF viewer. If you open such a document and see unexplained activity, a malicious document may be the explanation.
- Look at Norman Sandbox
- PMA refers to the GFI Sandbox and we have an analysis of Lab03-04.exe (pdf) (html). (We just looked at this program with IDA.)
- GFI Sandbox has been acquired by ThreatTrack Security, and the public sandbox may still be available.
- Dynamic analysis may involve just running the program, to see what network activity or file system changes can be noted. This includes changes to the Windows Registry. Do we all know what that is?
- Registry snapshots can be made using regshot.
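Conceptually, comparing two snapshots is just a dictionary diff. The sketch below is not how regshot is implemented; it assumes each snapshot has already been loaded into a Python dict mapping a key path to its value, and the function name is my own.

```python
def diff_snapshots(before: dict, after: dict) -> dict:
    """Compare two snapshots (path -> value) taken before and after running a specimen."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {k: (before[k], after[k])
               for k in before.keys() & after.keys()
               if before[k] != after[k]}
    return {"added": added, "removed": removed, "changed": changed}
```

The same function works for file system listings (path mapped to hash), which is handy when you want to know what a specimen dropped or deleted.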
- In case you haven't done this...
- Feel free to download and install Ollydbg, which is available here
- a summary of Olly commands
- Feel free to download and install x64_dbg, which is available here
- The Immunity Debugger was inspired by Olly, but allows for plug-ins written in Python.
- You can download Immunity starting from here.
- Careful! Some unpackers have to execute the suspect program in order to have it unpack itself.
- Make a copy of Lab 3-4 on the desktop. Let's just run it and see what happens!
- Now open the file with Olly and see what we can see
- Eventually the process terminates
- But the program acts differently when being debugged...since the file is still where it was. Can we figure out how the file deletes itself on termination? Or how it knows to behave differently when being debugged?
Malware Analysts Write Reports!
- Description of the malware
- name, size, date acquired and how
- MD5 and/or SHA hash
- results from VirusTotal and similar utilities
- what kind of malware? Windows executable? VBscript? Exploit kit?
- Results of analysis, whether static or dynamic
- Excerpts from tools like PEStudio and IDA, such as
- What does the malware do?
- How does it achieve execution?
- How does it achieve persistence?
- Does it communicate with the outside? How? What IP addresses are involved?
- Is there anything unusual about this specimen?
- Is this specimen similar to anything seen before?
- What damage is done? How can the damage be repaired?
- How does this malware spread?
- Who produced it, and why?
- Such malware reports are the format I use for exam questions in the semester-length course. Take home tests.
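If reports are to be mined later, it helps to keep them in a structured form from the start. Here is one possible sketch of a report record as a Python dataclass that serializes to JSON. The class name and field names are hypothetical, loosely following the outline above; they are not a standard report format.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class MalwareReport:
    """Hypothetical record mirroring the report outline; not a standard schema."""
    name: str
    size_bytes: int
    acquired: str                 # date acquired and how
    sha256: str
    kind: str                     # e.g. "Windows executable", "VBScript", "exploit kit"
    av_labels: dict = field(default_factory=dict)        # scanner name -> label
    behavior: str = ""            # what does the malware do?
    persistence: str = ""         # how does it achieve persistence?
    network_indicators: list = field(default_factory=list)
    notes: str = ""

def to_json(report: MalwareReport) -> str:
    """Serialize a report so that a collection of reports can be searched or mined."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)
```

A directory full of such JSON files is easy to load into a search engine or a data frame.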
Malware Analysis in the Large vs. Malware Analysis in the Small
- You will have seen how malware analysis zooms down into details very quickly.
- In my opinion,
- study of families of malware has received relatively little attention
- visualization tools are not yet used as widely as they should be
- Here we have a graph using a subset of the Zeus family, notice the outliers
- Here is an example of the charts those guys at UCSB use. See this blog post. Quoting from them,
'Here, we consider 68 malware samples which were assigned a single family name (Kolik.A) by an Anti Virus (AV) software. When we cluster these samples and view the distance matrix, we can see that there are 4 smaller tight clusters and many singletons. The singletons could be the possible outliers and could be sent back for re-labeling.'
- Raff, Nicholas, and various colleagues continue their work on malware similarity
- Tensor decomposition lets us gain insight from all kinds of malware data, we hope!
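To illustrate the idea of finding tight clusters and singletons within a family, here is a small Python sketch of threshold-based (single-linkage) clustering over any pairwise distance function. This is a deliberately simple technique of my own choosing, not the method behind the charts mentioned above; the function name, the threshold, and the distance function are all assumptions you would need to supply.

```python
def cluster_by_threshold(items, distance, threshold):
    """Group items whose pairwise distance is <= threshold (single linkage).

    Returns a list of clusters, largest first; clusters of size 1 are the singletons.
    """
    n = len(items)
    parent = list(range(n))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(items[i])
    return sorted(clusters.values(), key=len, reverse=True)
```

With a similarity-preserving distance between specimens, the singletons returned by this function are natural candidates for a second look at their family label. Note that the pairwise loop is quadratic, so this only suits modest collections.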
For Further Study
- Android malware is becoming quite important. Dr. Rob Brandon's slides.
- How can you protect yourself from malware? Live off the grid, or
- Use separate VMs for work, personal activity.
- Practice good cyber hygiene: don't reuse passwords, and make them hard to guess
- Keep your software up-to-date: AV, but everything else too
- Beginning malware analysts (and experienced ones too) can find the variety of tools for malware analysis daunting, especially for the Windows environment. Learn what you need, if and when you need it.
- What separates the best malware analysts from the wannabes?
- Experience!
- both yours and others
- Tenacity!
- Willingness to learn new stuff.
- Willingness to invent (or invest in) new tools.
- Lots of security blogs deal with malware analysis topics from time to time.
- New tools come out from time to time. On my list of things to read
- I like Dr. Fu's site. He's got a tutorial on malware analysis.
- An analysis tool called Truman
- Here's a discussion of Sandbox Overloading
- Here's an interesting report from FireEye
- Comments, corrections, and suggestions to improve this tutorial are welcome! Send email to nicholas at umbc dot edu
- Thanks!