Identifying Shared Software Components Through Code

Identifying Shared Software Components to Support Malware Forensics

Brian Ruttenberg, Charles River Analytics
Craig Miles, University of Louisiana at Lafayette
Lee Kellogg, Charles River Analytics
Vivek Notani, University of Louisiana at Lafayette
Michael Howard, Charles River Analytics
Charles LeDoux, University of Louisiana at Lafayette
Arun Lakhotia, University of Louisiana at Lafayette
Avi Pfeffer, Charles River Analytics


Recent reports from the anti-malware industry indicate that similarities between malware samples resulting from code reuse can aid in developing a profile of the attackers. We describe a method for identifying shared components in a large corpus of malware, where a component is a collection of code, such as a set of procedures, that implements a unit of functionality. We develop a general architecture for identifying shared components in a corpus using a two-stage clustering technique. While our method can be parametrized on any features extracted from a binary, our implementation uses features that abstract the semantics of blocks of instructions. Our system has been found to identify shared components with extremely high accuracy in a rigorous, controlled experiment conducted independently by MIT Lincoln Laboratory (MITLL). Our technique provides an automated method for finding functional relationships between malware code, which may be used to establish evolutionary relationships and aid in forensics.
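The two-stage clustering idea can be illustrated with a minimal Python sketch. This is not the authors' implementation. The Jaccard similarity measure, the single-link grouping, the 0.5 threshold, the rule that the second stage merges procedure clusters appearing in the same set of samples, and all names (`find_shared_components`, the toy sample and feature labels) are assumptions chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def single_link_clusters(items, similar):
    """Group items into connected components of the `similar` relation."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if similar(items[i], items[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def find_shared_components(procedures, feature_threshold=0.5):
    """procedures: list of (sample_id, procedure_name, frozenset_of_features).

    Returns a dict mapping each set of samples (size >= 2) to the sorted
    (sample_id, procedure_name) pairs of the component those samples share.
    """
    # Stage 1 (assumption): cluster procedures with similar feature sets.
    proc_clusters = single_link_clusters(
        procedures,
        lambda p, q: jaccard(p[2], q[2]) >= feature_threshold,
    )

    # Stage 2 (assumption): procedure clusters that occur in exactly the
    # same set of samples are merged into one candidate component.
    by_samples = {}
    for cluster in proc_clusters:
        samples = frozenset(p[0] for p in cluster)
        by_samples.setdefault(samples, []).extend(cluster)

    return {
        samples: sorted((p[0], p[1]) for p in procs)
        for samples, procs in by_samples.items()
        if len(samples) >= 2  # keep only code shared across samples
    }


# Hypothetical toy corpus: three samples with made-up feature labels.
procedures = [
    ("A", "send", frozenset({"f1", "f2", "f3"})),
    ("B", "send2", frozenset({"f1", "f2", "f3"})),
    ("A", "recv", frozenset({"f4", "f5"})),
    ("B", "recv_x", frozenset({"f4", "f5", "f6"})),
    ("A", "main", frozenset({"f7"})),
    ("C", "keylog", frozenset({"f8", "f9"})),
    ("B", "keylog", frozenset({"f8", "f9"})),
]

if __name__ == "__main__":
    for samples, procs in find_shared_components(procedures).items():
        print(sorted(samples), procs)
```

In this toy corpus, the send and receive procedures of samples A and B end up in one candidate component because both procedure clusters occur in the same pair of samples, while the unshared `main` procedure is discarded. A real system would replace the toy feature sets with features extracted from binaries, such as the semantic abstractions of instruction blocks described above.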

Full Citation: Ruttenberg, Brian, Craig Miles, Lee Kellogg, Vivek Notani, Michael Howard, Charles LeDoux, Arun Lakhotia, and Avi Pfeffer. “Identifying shared software components to support malware forensics.” In Detection of Intrusions and Malware, and Vulnerability Assessment: 11th International Conference, DIMVA 2014, Egham, UK, July 10-11, 2014. Proceedings 11, pp. 21-40. Springer International Publishing, 2014.

Link to Research Paper: