Long Form Birth Certificate – The gory details – Part 1

The following is an analysis of the actual PDF document of President Obama’s birth certificate, allowing us to understand how the resulting PDF is built from its 39 objects. Only 9 of these are real image objects; the other objects deal with layout, formatting and more.

The PDF file in question contains 385,354 bytes, which shows that significant compression is being used, most likely to make the document suitable for internet viewing. For instance, an 8.5×11 inch page at 300 DPI (dots per inch), stored as an uncompressed bitmap with three 8-bit color channels, would require about 25 MBytes. PDF, the Portable Document Format, is an Adobe specification that comes in different versions; this document uses version 1.3.
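The 25 MByte figure can be checked with a few lines of arithmetic. This is only a sketch of the estimate above; it assumes an uncompressed bitmap with exactly 3 bytes per pixel and no file header.

```python
# Uncompressed size of a letter-size page scanned at 300 DPI, 3 bytes per pixel.
WIDTH_IN, HEIGHT_IN = 8.5, 11.0
DPI = 300
BYTES_PER_PIXEL = 3  # three 8-bit color channels

width_px = round(WIDTH_IN * DPI)    # 2550 pixels
height_px = round(HEIGHT_IN * DPI)  # 3300 pixels
raw_bytes = width_px * height_px * BYTES_PER_PIXEL

PDF_BYTES = 385_354  # size of the PDF quoted above
ratio = raw_bytes / PDF_BYTES

print(f"{raw_bytes:,} bytes uncompressed, roughly {ratio:.0f}x the PDF size")
```

The ratio is only indicative, since the resolution of the original scan is an assumption here.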

See also this introduction to PDF and Intro to PDF

The PDF document has 39 objects.

There is 1 /Catalog object, 1 /Page object, 1 /Pages object, 9 /XObjects and 27 other objects:

         27: 4, 5, 6, 21, 23, 25, 10, 15, 8, 19, 13, 17, 27, 28, 26, 29, 30, 11, 
            32, 33, 34, 35, 36, 37, 38, 39, 1
 /Catalog 1: 31
 /Page    1: 2
 /Pages   1: 3
 /XObject 9: 20, 22, 24, 9, 14, 7, 18, 12, 16
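As a quick sanity check, the listing above can be tallied in Python. The dictionary below is simply the table retyped by hand; the label "other" for the untyped objects is my own.

```python
# Object numbers per type, copied from the listing above.
objects_by_type = {
    "other": [4, 5, 6, 21, 23, 25, 10, 15, 8, 19, 13, 17, 27, 28, 26, 29, 30, 11,
              32, 33, 34, 35, 36, 37, 38, 39, 1],
    "/Catalog": [31],
    "/Page": [2],
    "/Pages": [3],
    "/XObject": [20, 22, 24, 9, 14, 7, 18, 12, 16],
}

counts = {kind: len(numbers) for kind, numbers in objects_by_type.items()}
total = sum(counts.values())

# Every object number from 1 to 39 should appear exactly once.
all_numbers = sorted(n for numbers in objects_by_type.values() for n in numbers)

print(counts, total)
```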

The /XObjects are the 9 layers which are encoded as a bitmap (/FlateDecode), or JPEG. All the other objects contain data, formatting instructions, color tables etc.

When a PDF file is opened, the application reads from the end to quickly find the trailer and the startxref offset.

The trailer contains all the relevant information to reconstruct the document:

The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end. The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.) The two preceding lines contain the keyword startxref and the byte offset from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line is preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets. Thus the trailer has the following overall structure:

The trailer contains all the relevant references:

    /Size 40     : One higher than the total number of objects (see below)
    /Root 31 0 R : Object 31 is our Root Node
    /Info 1 0 R  : Object 1 is our Information Node
    /ID            [<d6fc2758ceb2f98f54abce9a4b28fc1c><d6fc2758ceb2f98f54abce9a4b28fc1c>] 
                 : Two hash values, the second one changes when the file is updated
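To make the "read from the end" idea concrete, here is a minimal sketch that finds startxref and pulls the references out of a trailer. The sample bytes are hypothetical: the dictionary values match the listing above, but the startxref offset (12345) is a made-up placeholder, and `read_trailer` is my own helper, not part of any PDF library.

```python
import re

# Hypothetical file tail: dictionary values are from the listing above,
# the startxref offset is a placeholder and not the real one.
SAMPLE_TAIL = (
    b"trailer\n"
    b"<< /Size 40 /Root 31 0 R /Info 1 0 R >>\n"
    b"startxref\n"
    b"12345\n"
    b"%%EOF\n"
)

def read_trailer(tail: bytes) -> dict:
    """Search backwards for %%EOF, startxref and trailer, then read the keys."""
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker found")
    start = tail.rfind(b"startxref", 0, eof)
    if start == -1:
        raise ValueError("no startxref keyword found")
    xref_offset = int(tail[start + len(b"startxref"):eof].split()[0])

    trailer_pos = tail.rfind(b"trailer", 0, start)
    if trailer_pos == -1:
        raise ValueError("no trailer keyword found")
    trailer_dict = tail[trailer_pos:start]
    size = int(re.search(rb"/Size\s+(\d+)", trailer_dict).group(1))
    root = int(re.search(rb"/Root\s+(\d+)\s+\d+\s+R", trailer_dict).group(1))
    info = int(re.search(rb"/Info\s+(\d+)\s+\d+\s+R", trailer_dict).group(1))
    return {"xref_offset": xref_offset, "size": size, "root": root, "info": info}

trailer = read_trailer(SAMPLE_TAIL)
print(trailer)
```

A real reader would pass the last kilobyte or so of the file instead of `SAMPLE_TAIL`; the regular expressions are a shortcut and not a full PDF tokenizer.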

First Object 1, the Information Node

obj 1 0
 Type:
 Referencing: 32 0 R, 34 0 R, 35 0 R, 33 0 R, 36 0 R, 37 0 R, 37 0 R, 38 0 R, 39 0 R
<<
    /Title         32 0 R
    /Author        34 0 R
    /Subject       35 0 R
    /Producer      33 0 R
    /Creator       36 0 R
    /CreationDate  37 0 R
    /ModDate       37 0 R
    /Keywords      38 0 R
    /AAPL:Keywords 39 0 R
 >>

The object has a dictionary with 9 entries: the title, author, subject, producer, creator, creation date, modification date, keywords and Apple-specific keywords. The numbers indicate the objects that hold each value. Note that the creation date and the modification date both point to the same object, 37.

obj 32 0  : Title (Empty)
 Type:
 Referencing:
 [(1, '\n'), (2, '('), (2, ')'), (1, '\n')]
obj 33 0 : Producer
 Type:
 Referencing:
 [(1, '\n'), (2, '('), (3, 'Mac'), (1, ' '), (3, 'OS'), (1, ' '), (3, 'X'), (1, ' '), 
  (3, '10.6.7'), (1, ' '), (3, 'Quartz'), (1, ' '), (3, 'PDFContext'), (2, ')'), (1, '\n')]
obj 34 0 : Author (Empty)
 Type:
 Referencing:
 [(1, '\n'), (2, '('), (2, ')'), (1, '\n')]
obj 35 0 : Subject (Empty)
 Type:
 Referencing:
 [(1, '\n'), (2, '('), (2, ')'), (1, '\n')]
obj 36 0 : Creator
 Type:
 Referencing:
 [(1, '\n'), (2, '('), (3, 'Preview'), (2, ')'), (1, '\n')]
obj 37 0: Creation/Modification Date
 Type:
 Referencing:
 [(1, '\n'), (2, '('), (3, "D:20110427120924Z00'00'"), (2, ')'), (1, '\n')]
obj 38 0: Keywords (Empty)
 Type:
 Referencing:
 [(1, '\n'), (2, '('), (2, ')'), (1, '\n')]
obj 39 0: AAPL Keywords (Empty)
 Type:
 Referencing:
 [(1, '\n'), (2, '['), (1, ' '), (2, '('), (2, ')'), (1, ' '), (2, ']'), (1, '\n')]
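The date string in object 37 follows the PDF date convention D:YYYYMMDDHHmmSS, with Z marking UTC. A small sketch for turning it into a Python datetime is shown below; `parse_pdf_date` is a hypothetical helper that only handles the Z form seen here, not local-time offsets.

```python
import re
from datetime import datetime, timezone

def parse_pdf_date(text: str) -> datetime:
    """Parse a PDF date of the form D:YYYYMMDDHHmmSS followed by Z (UTC)."""
    match = re.match(r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})Z", text)
    if match is None:
        raise ValueError(f"unsupported PDF date: {text!r}")
    year, month, day, hour, minute, second = map(int, match.groups())
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)

created = parse_pdf_date("D:20110427120924Z00'00'")
print(created.isoformat())  # 2011-04-27T12:09:24+00:00
```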

The root node is Object 31, which has to be a Catalog object, and sure enough:

obj 31 0
 Type: /Catalog
 Referencing: 3 0 R
<<
    /Type /Catalog
    /Pages 3 0 R
 >>

It references the Page tree object which is Object 3

obj 3 0
 Type: /Pages
 Referencing: 2 0 R
<<
    /Type /Pages
    /MediaBox [0 0 612 792]
    /Count 1
    /Kids [ 2 0 R ]
 >>

The MediaBox describes a rectangle with the lower left corner at (0, 0) and the top right corner at (612, 792). Since PDF page coordinates default to units of 1/72 inch, this translates into 8.5” × 11”, the standard letter size.
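The conversion is simple enough to verify directly. The sketch below treats each MediaBox unit as 1/72 inch, the default for PDF page coordinates; `media_box_inches` is just an illustrative helper name.

```python
POINTS_PER_INCH = 72  # one default PDF unit is 1/72 inch

def media_box_inches(box):
    """Convert a MediaBox [llx lly urx ury] into (width, height) in inches."""
    llx, lly, urx, ury = box
    return ((urx - llx) / POINTS_PER_INCH, (ury - lly) / POINTS_PER_INCH)

width_in, height_in = media_box_inches([0, 0, 612, 792])
print(width_in, height_in)  # 8.5 11.0
```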

Okay, what information is on the page?

obj 2 0
 Type: /Page
 Referencing: 3 0 R, 6 0 R, 4 0 R
<<
    /Type      /Page
    /Parent    3 0 R
    /Resources 6 0 R
    /Contents  4 0 R
    /MediaBox  [0 0 612 792]
 >>

Resources, Contents and another MediaBox. Let’s first look at the Resources, or Object 6

obj 6 0
 Type:
 Referencing: 26 0 R, 11 0 R, 20 0 R, 22 0 R, 24 0 R, 9 0 R, 14 0 R, 7 0 R, 18 0 R, 12 0 R, 16 0 R
<<
   /ProcSet [ /PDF /ImageB /ImageC /ImageI ]
   /ColorSpace
   <<
      /Cs2 26 0 R
      /Cs1 11 0 R
   >>
   /XObject
   <<
      /Im7 20 0 R
      /Im8 22 0 R
      /Im9 24 0 R
      /Im2 9 0 R
      /Im4 14 0 R
      /Im1 7 0 R
      /Im6 18 0 R
      /Im3 12 0 R
      /Im5 16 0 R
   >>
>>

Two color space classes /Cs2 and /Cs1 as well as 9 /XObjects identified as /Im1 … /Im9, which represent the 9 image objects.
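The chain of references followed so far (Catalog, Pages, Page, Resources, XObjects) can be modeled with plain dictionaries. This is only a toy model of the objects shown above, retyped by hand, not a PDF parser; the dictionary layout and `image_objects` are my own.

```python
# Simplified copies of objects 31, 3, 2 and 6; references are bare object numbers.
objects = {
    31: {"Type": "/Catalog", "Pages": 3},
    3: {"Type": "/Pages", "Kids": [2]},
    2: {"Type": "/Page", "Resources": 6, "Contents": 4},
    6: {"XObject": {"Im7": 20, "Im8": 22, "Im9": 24, "Im2": 9, "Im4": 14,
                    "Im1": 7, "Im6": 18, "Im3": 12, "Im5": 16}},
}

def image_objects(objects, root):
    """Follow Catalog -> Pages -> Page -> Resources and collect the images."""
    images = {}
    pages = objects[objects[root]["Pages"]]
    for kid in pages["Kids"]:
        page = objects[kid]
        resources = objects[page["Resources"]]
        images.update(resources.get("XObject", {}))
    return dict(sorted(images.items()))

images = image_objects(objects, root=31)
for name, number in images.items():
    print(f"/{name} -> object {number}")
```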

First let’s get the color spaces out of the way. They define how particular colors map to device colors. Nothing too important here, but included for completeness’ sake.

/Cs2

obj 26 0
 Type:
 Referencing: 27 0 R
obj 27 0
 Type:
 Referencing: 28 0 R
 Contains stream
<<
   /Length    28 0 R
   /N         1
   /Alternate /DeviceGray
   /Filter    /FlateDecode
>>
obj 28 0
 Type:
 Referencing:
 [(1, '\n'), (3, '2905'), (1, '\n')]

and /Cs1

obj 11 0
 Type:
 Referencing: 29 0 R
obj 29 0
 Type:
 Referencing: 30 0 R
 Contains stream
<<
    /Length     30 0 R
    /N          3
    /Alternate /DeviceRGB
    /Filter    /FlateDecode
>>
obj 30 0
 Type:
 Referencing:
 [(1, '\n'), (3, '2615'), (1, '\n')]

Object 4 is the one doing all the layout work. But it contains many operators that need to be explained. The basic function of Object 4 is to insert all 9 layers, translate them, rotate them and scale them. Additionally, the clipping mask is set and the colors are defined for the relevant bitmaps. For the moment, I will skip a detailed description of this Object. The Stream is encoded by a FlateDecode filter which uses Zlib lossless compression.
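Since /FlateDecode data is ordinary zlib data, Python's zlib module can illustrate the round trip. The fragment below is a single line in the style of the content stream, not the actual compressed bytes from the file.

```python
import zlib

# One operator sequence in the style of the content stream (not the full stream).
fragment = b"q 0 -792.96 612.48 0 -0.24 792.48 cm /Im1 Do Q"

compressed = zlib.compress(fragment)    # the form a /FlateDecode stream stores
restored = zlib.decompress(compressed)  # what a PDF reader recovers

# Lossless: the restored bytes are identical to the input.
print(restored == fragment)
```

With the real file, the bytes between the stream and endstream keywords of Object 4 would be passed to `zlib.decompress` in the same way.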

obj 4 0
 Type:
 Referencing: 5 0 R
 Contains stream
 <<
    /Length 5 0 R
    /Filter /FlateDecode
 >>
'q Q 
   q 18 14.40002 576 763.2 re W n 
   q 0 -792.96 612.48 0  -0.24 792.48 cm /Im1 Do Q 
   /Cs1 cs 0.1059 0.17650 0.1216 sc 
   q 0 -348.96 436.56 0  89.28 581.28 cm /Im2 Do Q 
   0.34510 0.3922 0.3529 sc 
   q 0  -47.76 186.72 0 304.56 108.96 cm /Im3 Do Q 
   0.302 0.34510 0.3216 sc 
   q 0  -10.08  65.76 0 170.16  89.76 cm /Im4 Do Q
   0.2549 0.3373 0.2627 sc
   q 0  -29.52  54.72 0 440.4  274.08 cm /Im5 Do Q 
   0.3412 0.4353 0.3412 sc
   q 0  -11.28  51.84 0 103.44 254.88 cm /Im6 Do Q 
   0.2549 0.3373 0.2627 sc
   q 0   -8.16  16.8  0 349.68 322.08 cm /Im7 Do Q 
   0.9412 0.9725 0.9216 sc 
   q 0  -58.32  52.08 0 176.16 185.76 cm /Im8 Do Q 
   /Cs2 cs 0.9647 sc 
   q 0  -31.68  34.08 0 251.76 784.8 cm /Im9 Do Q 
Q'

obj 5 0
 Type:
 Referencing:
 [(1, '\n'), (3, '310'), (1, '\n')]
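Without anticipating the full analysis, the effect of a single cm operator can be sketched as follows. In PDF, the six numbers a b c d e f map a point (x, y) to (a·x + c·y + e, b·x + d·y + f), and an image is drawn into the unit square. The code applies the /Im1 matrix to the corners of that square; `apply_cm` is an illustrative helper and the rounding to two decimals is my own choice.

```python
def apply_cm(matrix, point):
    """Map a point through a PDF transformation matrix [a b c d e f]."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

# Matrix from the line "q 0 -792.96 612.48 0 -0.24 792.48 cm /Im1 Do Q".
im1_matrix = (0, -792.96, 612.48, 0, -0.24, 792.48)

# Images occupy the unit square before the matrix is applied.
corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
mapped = [apply_cm(im1_matrix, corner) for corner in corners]

xs = [round(x, 2) for x, _ in mapped]
ys = [round(y, 2) for _, y in mapped]
bounding_box = (min(xs), min(ys), max(xs), max(ys))

print(bounding_box)
```

The resulting box spans slightly more than the 612 × 792 MediaBox, which suggests /Im1 covers the whole page. The zero entries in the a and d positions mean the image axes are swapped, so the image is rotated by 90 degrees as well as scaled.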

The exact meaning of this object will be explored in the next part. Needless to say, it has to do with placing the 9 layers on the page.

8 thoughts on “Long Form Birth Certificate – The gory details – Part 1”

  1. Interesting… JBIG2 isn’t supported in 1.3. Explains why flate was used for the mono-chrome layers when Preview saved.

  2. Yes… Good observation. So JBIG2 is back in the picture in the compression step but the optimized characters are now saved as exact duplicates.

    More evidence of algorithmic processes and not a forgery.

  3. So are you all saying the LFBC was compressed twice? Once by the scanner software and then again by Preview?

  4. Absolutely. Xerox has a patent that uses MRC by creating one color background layer and multiple monochrome layers (which not only is good for compression rates, but solves the problem that some printers have in printing the traditional three-layer MRC). Xerox also has scanners that create multi-layer PDFs using JBIG2, which can only be used for monochrome compression. The artifacts in the text layers can be explained by the use of JBIG2 compression. Preview can read JBIG2, but can’t save using it, and defaults to flate.

    Put that all together, and it’s clear that there were two steps in the compression – the initial compression by the scanner, plus the compression by saving in Preview (which also left some other artifacts). I personally suspect that Preview further compressed the background image, which would explain why gsgs couldn’t get exact matches to the coefficients (the other possibility is that he didn’t account for the quantization step size). If you compress an image as a JPEG (which uses DCT compression), decompress it, recompress it, and decompress again, the lossy nature of the DCT compression algorithm means that the final image will be subtly different from the middle picture.

  5. So are you all saying the LFBC was compressed twice? Once by the scanner software and then again by Preview?

    Actually I am saying that the LFBC was compressed by the scanner or scanning software and that some of it was lost when using preview to save as PDF. I do not think that Preview did much of anything other than ignoring the jbig2 compression and saving it as flatedecode which is less efficient.

    The problem with Preview is that it takes an input PDF and rearranges its internals when saved/printed. I looked at documents I saved with Illustrator and what happened when I applied Preview. These tell-tale signs, perhaps even the ordering of the objects, may tell us something about its history.

    I have seen few efforts of people going back to the gory details of the actual PDF document. Yet, I believe we may find some answers.

  6. I concur with NBC’s statement. Any compression done by Preview was most likely incidental to using the Save As PDF command. The bulk of the compression was from the scanner.

  7. Looking forward to Part 2. I was able to translate what’s happening in Object 4, and I’ll just say: yet more evidence of algorithm, not forgery.

Comments are closed.