Our friend Hermitian has provided his latest ‘analysis’ of the document that was submitted to the Court in the case Taitz v Democrat Party in Mississippi. For some inexplicable reason, our friend believes that a scanned version of a printed PDF containing President Obama’s birth certificate shows evidence of a ‘forger’. His argument is that a forger somehow explains the data better than a workflow does.
As I will show, he ignores more likely workflow scenarios, relies on comparisons between different scanning programs and document resolutions, and believes that the OCR text layer was somehow added by a forger. Of course, a much simpler and more elegant explanation exists: the documents were scanned on a Fujitsu ScanSnap S1500 scanner into ScanSnap Manager, the software package used for scanning documents with it. The scanner in question does not have TWAIN support, and therefore it is logical that the document was scanned in as follows:
The letter to Fuddy was scanned in as a PDF and the printout of the Hawaiian long form birth certificate was scanned in separately. Both documents were combined into a single PDF which showed some ‘artifacts’, such as an OCR text layer and lines and blocks, all of which remain invisible. The OCR text is an invisible layer on top of the page image, which means that you can select certain words on the image and copy the OCR’ed word. However, due to the mediocre quality of the scan, the document shows few successfully OCR’ed words, and many of the words are misspelled. Only the large-font words were consistently captured accurately, which makes sense in the above workflow.
I used my own OCR software to extract the text from the image and, while more words were OCR’ed, the quality of the recognized words was incredibly poor, with minor exceptions such as the large-font portions.
To properly understand a PDF, one cannot rely on studying it in Illustrator; instead, one has to do the somewhat harder work of decoding the original document into its objects and instructions. Not trivial, but also not that hard.
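To give an idea of what such decoding involves, here is a minimal sketch in Python, using only the standard library. It pulls the streams out of the raw PDF bytes, inflates the Flate-compressed ones, and counts the `3 Tr` operators, which set text rendering mode 3 (invisible text), the mode typically used for an OCR text layer. The function names are my own, and the regex-based parsing is an assumption that only holds for simple, unencrypted files; it is not a full PDF parser and not the tool used for this post.

```python
import re
import zlib

# Naive stream finder: assumes unencrypted data and ignores /Length entries.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# "3 Tr" selects text rendering mode 3: text is neither filled nor stroked.
INVISIBLE_TEXT_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def decoded_streams(pdf_bytes):
    """Yield each stream body, inflated when it is Flate-compressed."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing or slightly truncated data
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # not Flate data (e.g. a JPEG image); keep as-is


def count_invisible_text_ops(pdf_bytes):
    """Count invisible-text operators across all decodable streams."""
    return sum(len(INVISIBLE_TEXT_RE.findall(s)) for s in decoded_streams(pdf_bytes))


if __name__ == "__main__":
    # Synthetic one-stream example; a real file would be read with open(path, "rb").
    content = b"BT 3 Tr /F1 10 Tf (STATE) Tj ET"
    sample = (b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
              + zlib.compress(content) + b"\nendstream\nendobj\n")
    print(count_invisible_text_ops(sample))
```

A non-zero count on a scanned page is what one would expect when an OCR layer sits invisibly over the image.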
Hermitian: Assuming that the Obama LFCOLB PDF image on page 4 of the court document 35-1.pdf was created by means of a human operator scanning a printout of page 2 of court document 15-1.pdf in a Fugitsu ScanSnap #S1500 scanner into Acrobat 9 (with OCR turned on) then OCR would assign each word to one of the three following categories:
NBC: Note that the Creator was not the scanner but rather the software, PFU ScanSnap Manager 5.0.21 #S1500, which is what is used to scan. Some quick research reveals that:
The scanner driver does not support TWAIN, which means that you cannot scan directly into Acrobat. This explains why the Paper Capture plug-in was used.
1. Those words which are deciphered and made selectable
2. Those words which are deciphered but are flagged as suspect for errors – these words are also made selectable
3. Those words which are not deciphered – these words are not made selectable
The Obot claim is that this assumed work flow produced the LFCOLB image which comprises page 4 of court document 35-1.pdf. However, much of the text appearing on page 4 of court document 35-1.pdf was not deciphered and thus fell into category 3.
NBC: For good reason. The quality of the scan was pretty poor: the page was scanned at 150 DPI because it was mixed color/gray, while the letter was scanned at a higher DPI setting.
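A quick back-of-the-envelope calculation shows why resolution matters so much for small print. A point is 1/72 of an inch, so the nominal pixel height of a font is its point size times the DPI, divided by 72. The point sizes below are hypothetical, chosen only for illustration; I have not measured the fonts on the certificate.

```python
POINTS_PER_INCH = 72  # standard typographic point


def glyph_height_px(point_size, dpi):
    """Nominal font height in pixels at a given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    # Hypothetical sizes: small form text vs. a large heading.
    for label, pts in [("small form text", 6), ("large heading", 14)]:
        for dpi in (150, 300):
            print(f"{label} ({pts} pt) at {dpi} DPI: {glyph_height_px(pts, dpi):.1f} px")
```

Under these assumptions, 6 pt text gets only about a dozen pixels of nominal height at 150 DPI, and the letters themselves are smaller still. That is consistent with only the largest text being recognized.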
Hermitian: Of the certificate words on the page 4 LFCOLB that were deciphered, most were marked as suspect. Those words (or characters) which were made selectable but were not flagged as suspect include “OF”, “61” (in the certificate number) and the typed Roman numeral “II”. The words (or numbers) “Case”, “Filed 05/04/12”, “Page” which are part of the original case label (i.e. the Green label) were also marked as suspect. However all of the text of both case labels was deciphered and made selectable.
A significant finding of the inspection (of page 4 of document 35-1.pdf) within Adobe Acrobat XI Pro was that none of the form text was deciphered by the purported OCR except for the words “STATE”, “HAWAII”, “CERTIFICATE OF LIVE BIRTH”, and “DEPARTMENT OF HEALTH”. The deciphered words are in the largest font printed on the certificate form. None of the smaller text printed on the form was deciphered and made selectable.
NBC: All of this points to OCR of a low-resolution document.
Hermitian continues to describe various scenarios that have little relevance to the workflow in question. He decides, for no logical reason, to scan another document using another program and concludes that, under his scenario, the OCR works better.
Hermitian: These results are atypical because the OCR algorithms included with the various versions of Adobe Acrobat typically detect more words than not – as do most of the popular OCR programs. Two popular programs are ABBY PDF Transformer Pro 3.0, and PDF-XChange Viewer Pro version 2.5.
For reference, I applied the ABBY PDF Transformer 3.0 program to the original WH LFCOLB PDF image. This PDF utility does both OCR and MRC. I turned the MRC off and scanned for OCR only. The ABBY OCR algorithm deciphered all of the typed text except for the word “Male”. The OCR scan also failed to decipher the form text “Sex”, “6a.”, “6c.”, “8.”, “20.”, ”Other”, and in box 22 ”Date Accepted by Reg.”, and the date stamp “AUG -8 1961″. The WH LFCOLB file is a one-page PDF file.
I also applied PDF-XChange Viewer Pro version 2.5 to scan the WH LFCOLB PDF image for OCR. All of the typed text was made selectable. The form text that was not made selectable included “Sex”, “6a.”, “6c.”, “8.”, “20.” and the Reg. General’s date stamp “AUG -8 1961″. All of the smallest form text was made selectable.
I also applied the ABBY PDF Transformer 3.0 OCR algorithm to document 15-1.pdf. Page 2 of document 15-1.pdf is identical to the WH LFCOLB image except for the case label added to the top edge of the page. The OCR algorithm deciphered the entire case label, and all of the typed text except for the one word “Male”. Additionally the form text (or numbers) “Sex”,“6a.”, “6c.”, “8.”, “20.”,“Other”, “Date Accepted by Reg.” and the associated date AUG -8 1961 were not deciphered. These OCR results (except for the added case label) are the same as for the WH LFCOLB image. Both pages of document 15-1.pdf were scanned for OCR.
Finally I also applied the ABBY PDF Transformer 3.0 OCR program to the four-page document 35-1.pdf. The scan deciphered both case labels and found all of the typed text with the exception of the “X” in the No box within form box 7g. The form text that was not deciphered included “5a. Month”, “5b. Hour”, “6b. Island”, “Town Limits”, “Island”,“7d. Street Address”, “ district”, “7g. Is Residence on a Farm or Plantation?”, “Mother”, “17b. Date Last Worked”, “Signature of Parent”, “Informant”, “Parent”, “Other”, “18b. Date of Signature”, “hour stated”, “M.D.”, “22. Date Accepted by Reg. General”, “AUG -8 19″. Additionally, the following warning was returned by the scan: “Page 4 Warning Check the document language”.
The difference in image resolution of the mostly text layer of the WH LFCOLB PDF image and the uniform resolution of the page 4 LFCOLB PDF image likely explains why less text was detected in this trial OCR scan of page 4 of document 35-1.pdf than the scans of the WH LFCOLB and the page 2 LFCOLB. The resolution (150 PPI) of the page 4 LFCOLB PDF image (last page of 35-1.pdf) is lower than the resolution (300 PPI) of the mostly text layer of the WH LFCOLB PDF image (and the page 2 LFCOLB PDF image). The smallest form text of the page 4 LFCOLB PDF image would be the most affected by the reduced resolution.
The Obot claim is that page 4 of document 35-1.pdf was created by a scan of a paper copy of page 2 of document 15-1.pdf.
NBC: A logical conclusion based on the evidence.
Hermitian: The METADATA from document 35-1.pdf indicates that the PDF document was created by PFU ScanSnap Manager 5.0.21 #S1500 and produced by the Adobe Acrobat 9.51 Paper Capture Plug-in. Thus the document would have been created by means of a Fugitsu ScanSnap S1500 scanner and Adobe Acrobat 9. The PDF document would have been created using the “PDF from scanner” mode in Acrobat 9 in a customized scan with “Make Searchable (Run OCR)” and “Optimized Scanned PDF” options selected.
NBC: You should really pay more attention to the Creator tag and look at the information about the scanner in question as this would lead you to reject your scenario.
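Reading these two tags does not require Acrobat. As a rough sketch, assuming the Info dictionary is stored uncompressed and its values are simple literal strings in parentheses, a few lines of Python can pull them from the raw bytes. The helper `info_field` is my own illustration, not part of any library, and it does not handle hex strings, nested parentheses, or encrypted files.

```python
import re


def info_field(pdf_bytes, key):
    """Return the literal-string value of an Info key such as Creator, or None."""
    pattern = (rb"/" + re.escape(key.encode("ascii"))
               + rb"\s*\(((?:\\.|[^\\)])*)\)")
    match = re.search(pattern, pdf_bytes)
    return match.group(1).decode("latin-1") if match else None


if __name__ == "__main__":
    # Synthetic Info dictionary using the values reported for document 35-1.pdf.
    sample = (b"<< /Creator (PFU ScanSnap Manager 5.0.21 #S1500) "
              b"/Producer (Adobe Acrobat 9.51 Paper Capture Plug-in) >>")
    print("Creator: ", info_field(sample, "Creator"))
    print("Producer:", info_field(sample, "Producer"))
```

The Creator names the application that made the original document, and the Producer names the software that wrote the PDF. Here that is the ScanSnap software first, with Paper Capture applied afterwards.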
Hermitian: If this indeed was the actual workflow, then the results from the trial OCR scans of the three LFCOLB PDF images reported herein do not explain how the assumed workflow could have yielded the observed poor results. The reported results from the trial scans indicate that OCR should have detected most of the text on the page 4/11 LFCOLB but it did not.
NBC: Hence the scenario about the workflow is suspect and in fact, likely wrong.
Hermitian: This was first detected when the page 4/11 LFCOLB PDF image was opened in Adobe Acrobat XI Pro and the “Find All Suspects” tool was applied. The “Select Text” tool was also utilized. Much of the text on page 4 of 35-1.pdf was found to be not selectable. Of the selectable text, most was also flagged as suspect. The words “Case”, “Filed 05/04/12” and “Page” in the original case label (i.e. the Green label) were flagged as suspect. However, both case labels were entirely selectable. Of the identified words and numbers that were made selectable by the purported OCR only the word “OF”, the number “61” and the Roman numeral “II” were not flagged as suspect.
The findings reported herein indicate that the particular words on page 4 of document 35-1.pdf that were made selectable did not result solely from the application of OCR. Rather it is more likely that human intervention also occurred. Otherwise, why was only the largest printed text on the certificate form made selectable? Then, more importantly, why was none of the smaller form text made selectable in this purported OCR scan?
NBC: Quite simple: only the largest print was readable by the OCR software. Why this requires human intervention, why that intervention would insert misspelled words, or why a human would even intervene in such a way is left totally unexplained. Again, a workflow different from the one imagined by Hermitian will be shown to be more likely.
Hermitian: If not this scenario, then the peculiar internal structure of the page 4 PDF image must have defeated the Adobe Acrobat OCR scan. This scenario is also unlikely assuming that the PDF file was created by first scanning a paper document to create a flattened bitmap image and then embedding this bitmap image into a single layer within a PDF document.
NBC: A simple scenario emerges: the scanner does not support TWAIN scanning but relies on its own software, identified as the Creator (note the ‘Manager’ in the Creator string…). The image is scanned to a color PDF with a likely resolution of 150 DPI. When that PDF is subsequently imported into Acrobat, Paper Capture fails to properly capture the information. So simple.