Portable Document Format (PDF) Basics

Some months ago at Black Hat Europe, Eric Filiol gave a talk about the features of the PDF format. Filiol said that, thanks to some of these features, a simple PDF could become malicious code that executes an attacker's instructions. Besides this, exploiting vulnerabilities in this type of document is more and more common nowadays. This is why I'm going to write about the basics of the PDF structure and how it works internally. Maybe this will be a bit boring, but I promise that the next posts on this subject will be more practical ;) To make it more enjoyable, you can open a PDF file in a text or hexadecimal editor and take a look at what I mention in the next paragraphs.

A PDF file consists of multiple objects that reference one another. Each object belongs to one of eight types: booleans, numbers (integer and real), strings, names, arrays, dictionaries, streams and the null object. Apart from the familiar types, names (written with a leading "/", such as /Type) are a kind of tag for the different elements that make up an object; dictionaries, delimited by "<<" and ">>", are collections of key-value pairs; and streams, delimited by "stream" and "endstream", are byte sequences that PDF readers can read incrementally, unlike ordinary strings. Any object can be declared as an indirect object, which assigns it an identifier (an object number and a generation number) so that it can be referenced from anywhere in the file. Indirect objects are delimited by the words "obj" and "endobj".
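To make the "obj"/"endobj" delimiters concrete, here is a minimal Python sketch that lists the indirect objects in a chunk of raw bytes. The SAMPLE fragment is hand-written for illustration (it is not a complete PDF), and the names OBJ_RE, list_indirect_objects and kind_of are my own. A regular expression like this is only an approximation: it would be fooled by a stream whose binary data happens to contain the word "endobj".

```python
import re

# Hand-written fragment for illustration only; not a complete, valid PDF.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"4 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)

# "N G obj ... endobj": object number, generation number, then the body.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_indirect_objects(data):
    """Return (object number, generation, body) for each indirect object found."""
    return [(int(n), int(g), body.strip()) for n, g, body in OBJ_RE.findall(data)]


def kind_of(body):
    """Very rough classification: does the body carry a stream or not?"""
    return "stream" if b"endstream" in body else "dictionary"


for number, generation, body in list_indirect_objects(SAMPLE):
    print(number, generation, kind_of(body))
```

The "2 0 R" inside object 1 is an indirect reference: it points at the object declared as "2 0 obj".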

The physical structure of a PDF file is divided into a header, a body, a cross-reference table and a trailer:

 

  • Header: the first lines of the file. It gives the version of the PDF specification that the file follows ("%PDF-X.Y") and indicates whether the file contains binary data.
     
  • Body: a sequence of indirect objects that make up the content of the PDF.
     
  • Cross-Reference Table: it begins with the word "xref" and stores the byte offset of each indirect object in the file, so objects can be accessed without reading the whole file.
     
  • Trailer: it's located at the end of the file and lets applications read the document quickly. It contains a dictionary, introduced by the word "trailer", that is essential for interpreting the file correctly. After the word "startxref" it also gives the number of bytes from the beginning of the file to the cross-reference table, and it ends with the string "%%EOF".
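The four parts above can be sketched in Python. This is only an illustration under some assumptions: build_minimal_pdf assembles a tiny hand-made document in memory (no page content, no fonts), and read_version, find_startxref and read_xref_table are helper names I made up. The parser handles only a classic xref table with a single subsection, not cross-reference streams or updated files.

```python
import re


def build_minimal_pdf():
    """Assemble a tiny PDF in memory, recording the byte offset of each object."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                  # header
    offsets = []
    for number, body in enumerate(objects, start=1):  # body
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)     # cross-reference table
    out += b"0000000000 65535 f \n"                 # entry 0 is the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset         # each entry is 20 bytes long
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out), offsets


def read_version(data):
    """Return the X.Y part of the %PDF-X.Y header, or None if it is missing."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None


def find_startxref(data):
    """Return the offset written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    return int(data[pos + len(b"startxref"):].split()[0])


def read_xref_table(data, xref_offset):
    """Parse a single-subsection xref table into {object number: byte offset}."""
    lines = data[xref_offset:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("offset does not point at an xref table")
    first, count = (int(value) for value in lines[1].split())
    table = {}
    for index in range(count):
        offset, _generation, flag = lines[2 + index].split()
        if flag == b"n":                            # 'n' = in use, 'f' = free
            table[first + index] = int(offset)
    return table


pdf, offsets = build_minimal_pdf()
table = read_xref_table(pdf, find_startxref(pdf))
print("version:", read_version(pdf))
for number, offset in sorted(table.items()):
    print(number, offset, pdf[offset:offset + 7])
```

Note that the reader starts from the end of the file: it finds "startxref", jumps to the table, and from there can jump straight to any object without scanning the body.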
     

It's important to know that these elements can appear more than once if the original document has been updated. That is, a PDF file stores both the original information and each successive modification. Maybe this is something the American troops in Iraq overlooked when they declassified a relevant report...
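A rough way to see this for yourself is to cut the file at each "%%EOF" marker: every prefix is, in principle, an earlier revision of the document. The Python sketch below assumes that each update simply appends a new section ending in "%%EOF". ORIGINAL and UPDATE are toy byte strings I wrote for the example, with placeholder offsets and no xref tables, so they are not valid PDFs; split_revisions is a hypothetical helper name.

```python
def split_revisions(data):
    """Return one prefix of the file per %%EOF marker, oldest revision first."""
    marker = b"%%EOF"
    revisions = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        revisions.append(data[:end])   # everything up to and including this marker
        pos = data.find(marker, end)
    return revisions


# Toy data: the "update" redefines object 1, but the old bytes are still there.
ORIGINAL = (
    b"%PDF-1.4\n1 0 obj\n(secret draft)\nendobj\n"
    b"trailer\n<< /Size 2 >>\nstartxref\n0\n%%EOF\n"
)
UPDATE = (
    b"1 0 obj\n(public text)\nendobj\n"
    b"trailer\n<< /Size 2 /Prev 0 >>\nstartxref\n0\n%%EOF\n"
)

revisions = split_revisions(ORIGINAL + UPDATE)
print(len(revisions), "revisions found")
print(b"secret draft" in revisions[-1])   # the old text survives in the final file
```

In a real file you would save each prefix to disk and open it in a reader to see the document as it was before the later changes.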

Before finishing, I want to comment on the logical structure of a PDF document. It follows a hierarchy in which the catalog is the root of all other nodes and can be reached through the /Root entry in the trailer. The catalog contains references to other objects that define the document's content and its presentation, among other things. For instance, the /Pages entry points to a tree made up of all the document's pages, which are defined in other objects. This would be the logical structure:
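The same hierarchy can also be followed in code. Here is a small Python sketch that walks from the trailer's /Root entry to the catalog, then to /Pages, and finally to the page objects listed in /Kids. SAMPLE is a hand-written fragment, the helper names are mine, and the regular expressions assume a flat page tree with uncompressed objects (real files can nest page-tree nodes and store objects inside streams).

```python
import re

# Hand-written fragment for illustration only; not a complete, valid PDF.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"trailer\n<< /Size 5 /Root 1 0 R >>\n"
)


def load_objects(data):
    """Map each object number to the raw bytes between 'obj' and 'endobj'."""
    pattern = rb"(\d+)\s+\d+\s+obj\b(.*?)\bendobj"
    return {int(n): body for n, body in re.findall(pattern, data, re.DOTALL)}


def follow(body, key):
    """Return the object number that '/key N G R' refers to, or None."""
    match = re.search(rb"/" + key + rb"\s+(\d+)\s+\d+\s+R", body)
    return int(match.group(1)) if match else None


def page_numbers(data):
    """Walk trailer -> /Root (catalog) -> /Pages -> /Kids and list the page objects."""
    objects = load_objects(data)
    trailer = data[data.rfind(b"trailer"):]
    catalog = objects[follow(trailer, b"Root")]
    pages = objects[follow(catalog, b"Pages")]
    kids = re.search(rb"/Kids\s*\[(.*?)\]", pages, re.DOTALL).group(1)
    return [int(n) for n in re.findall(rb"(\d+)\s+\d+\s+R", kids)]


print(page_numbers(SAMPLE))  # prints [3, 4]
```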

 

 

If you want to know more about this, the full PDF specification is available online, but maybe it's not very fun :)