To be published in RAID'20 (PDF)

Welcome to SourceFinder. Our goal is to create the largest reference database of malware source code. We are off to a great start. As of August 2020, we have:

Large Dataset:

Curated dataset: manually validated

Overview of our work:

Where can we find malware source code? This question is motivated by a real need: there is a dearth of malware source code, which impedes various types of security research. Our work is driven by the following insight: public archives, like GitHub, host a surprising number of malware repositories. Capitalizing on this opportunity, we propose SourceFinder, a supervised-learning approach for efficiently identifying repositories of malware source code. We evaluate and apply our approach using 97K repositories from GitHub. First, we show on a labeled dataset that our approach identifies malware repositories with 89% precision and 86% recall. Second, we use SourceFinder to identify 7504 malware source code repositories, which arguably constitute the largest malware source code database. Finally, we study the fundamental properties and trends of the malware repositories and their authors. The number of such repositories appears to be growing by an order of magnitude every 4 years, and 18 malware authors seem to be "professionals" with a well-established online reputation. We argue that our approach and our large repository of malware source code can be a catalyst for research studies that are currently not possible.
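For readers less familiar with the evaluation metrics quoted above, the following minimal Python sketch shows how precision and recall are computed from classifier counts. The function names and the counts (89 true positives, 11 false positives, 14 false negatives) are invented for illustration, chosen only so the arithmetic lands near the reported figures; they are not the actual confusion matrix from our evaluation.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of repositories flagged as malware that truly are malware."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true malware repositories that the classifier flagged."""
    actual = tp + fn
    return tp / actual if actual else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts (assumption, for illustration only):
# tp = malware repositories correctly flagged
# fp = benign repositories wrongly flagged
# fn = malware repositories that were missed
tp, fp, fn = 89, 11, 14

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_score(p, r):.2f}")
```

In this setting, high precision means few benign repositories end up in the malware database, while high recall means few malware repositories are missed.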

The opportunity:

Surprisingly, software archives like GitHub host many publicly accessible malware repositories, but this resource has not yet been tapped to provide security researchers with malware source code. In this work, we focus on GitHub, which is arguably the largest platform for storing and sharing software. As of October 2019, GitHub reports more than 34 million users and more than 32 million public repositories. Thousands of these repositories contain malware source code, and they seem to have stayed under the radar of the research community so far. We use a broad definition of malware that includes any repository containing software that can participate in compromising devices or in supporting offensive, undesirable, and parasitic activities.

Potential impact:

Security research could greatly benefit from an extensive database of malware source code, which is currently unavailable. This is the assertion that motivates this work. First, security researchers can use malware source code to: (a) understand malware behavior and techniques, and (b) evaluate security methods and tools. In the latter case, having the source code can provide the ground truth for assessing the effectiveness of different techniques, such as reverse engineering methods. Second, a malware source code database is currently not readily available. By contrast, there are several databases of malware binaries, collected via honeypots, but even those are often limited in size and not widely available.

Malware Dataset: If you are interested in doing research on malware source code in GitHub repositories, please email us to request the malware source code dataset.