Just before the summer I announced that the student Rohit Dua would dedicate his time to improving peepdf and adding a scoring system to its output. This was possible thanks to Google and its Google Summer of Code (GSoC) program, for which I presented several projects as a member of The Honeynet Project. A beta version was presented at Black Hat Europe Arsenal 2015 last November, where I introduced the new functionality.
The goal of the scoring system is to give useful advice about how likely it is that the PDF file being analyzed is malicious. The first step is identifying the elements that help distinguish a malicious PDF file from a benign one, like JavaScript code, lonely objects, huge gaps between objects, detected vulnerabilities, etc. The next step is calculating a score from these elements and testing it against a large collection of malicious and benign PDF files in order to tweak it.
The scoring is based on several indicators, such as:
- Number of pages
- Number of stream filters
- Broken or missing cross-reference table
- Obfuscated elements: names, strings, JavaScript code
- Malformed elements: garbage bytes, missing tags…
- Encryption with default password
- Suspicious elements: JavaScript, event triggers, actions, known vulnerabilities…
- Big streams and strings
- Objects not referenced from the Catalog
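
To make the idea more concrete, here is a minimal sketch of how indicators like the ones above could be combined into a single score and then checked against labeled samples. It is not peepdf's actual implementation: the indicator names, the weights, the 0 to 10 scale and the `score_pdf` and `accuracy` helpers are all hypothetical, chosen only for illustration. It also assumes each indicator has already been reduced to a boolean (True when it looks suspicious), which is a simplification.

```python
# Hypothetical weights, for illustration only; these are NOT peepdf's values.
WEIGHTS = {
    "page_count": 1.0,
    "stream_filters": 1.0,
    "broken_xref": 1.5,
    "obfuscated_elements": 2.0,
    "malformed_elements": 1.0,
    "default_password_encryption": 1.5,
    "suspicious_elements": 3.0,
    "big_streams_or_strings": 1.0,
    "unreferenced_objects": 1.5,
}
MAX_SCORE = 10.0  # assumed scale


def score_pdf(indicators):
    """Return a 0-10 score from a dict mapping indicator name -> bool.

    Missing indicators count as not triggered; unknown names are rejected.
    """
    unknown = set(indicators) - set(WEIGHTS)
    if unknown:
        raise KeyError("unknown indicators: %s" % sorted(unknown))
    triggered = sum(WEIGHTS[name] for name, hit in indicators.items() if hit)
    # Normalize by the sum of all weights so the result stays within 0-10.
    return round(MAX_SCORE * triggered / sum(WEIGHTS.values()), 2)


def accuracy(samples, threshold):
    """Fraction of (indicators, is_malicious) samples classified correctly
    when a score >= threshold is treated as malicious."""
    correct = sum(
        (score_pdf(indicators) >= threshold) == is_malicious
        for indicators, is_malicious in samples
    )
    return correct / len(samples)


if __name__ == "__main__":
    # Made-up samples, only to show how the helpers are called.
    samples = [
        ({"suspicious_elements": True, "obfuscated_elements": True}, True),
        ({}, False),
    ]
    for indicators, label in samples:
        print(score_pdf(indicators), label)
    print(accuracy(samples, threshold=3.0))
```

Tweaking the system would then mean adjusting the weights and the threshold until `accuracy` (or a better metric, such as the false positive rate) looks acceptable on a large labeled collection.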