PDF scanned on Xerox Workcentre 7535 – Part 2

Okay, now the nitty gritty details.

pdf-parser.py -c -f WH\ LFBC\ Scanned\ Xerox\ 7535\ WC.pdf \
        > WH\ LFBC\ Scanned\ Xerox\ 7535\ WC.pdf.txt

pdf-parser.py (part of pdftools) creates a text file listing the objects contained in the PDF. The -f option passes stream objects through their filters. However, the parser supports neither the JBIG2Decode nor the DCTDecode filter.

Usage: pdf-parser.py [options] pdf-file|zip-file|url
pdf-parser, use it to parse a PDF document

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s SEARCH, --search=SEARCH
                        string to search in indirect objects (except streams)
  -f, --filter          pass stream object through filters (FlateDecode,
                        ASCIIHexDecode, ASCII85Decode, LZWDecode and
                        RunLengthDecode only)
  -o OBJECT, --object=OBJECT
                        id of indirect object to select (version independent)
  -r REFERENCE, --reference=REFERENCE
                        id of indirect object being referenced (version
  -e ELEMENTS, --elements=ELEMENTS
                        type of elements to select (cxtsi)
  -w, --raw             raw output for data and filters
  -a, --stats           display stats for pdf document
  -t TYPE, --type=TYPE  type of indirect object to select
  -v, --verbose         display malformed PDF elements
  -x EXTRACT, --extract=EXTRACT
                        filename to extract malformed content to
  -H, --hash            display hash of objects
  -n, --nocanonicalizedoutput
                        do not canonicalize the output
  -d DUMP, --dump=DUMP  filename to dump stream content to
  -D, --debug           display debug info
  -c, --content         display the content for objects without streams or
                        with streams without filters
                        string to search in streams
  --unfiltered          search in unfiltered streams
  --casesensitive       case sensitive search in streams
  --regex               use regex to search in streams
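
To make the filter limitation concrete, here is a minimal sketch that checks a list of filter names against the five filters named in the -f help text above. The function name undecoded_filters and the sample list are my own, for illustration only; they are not part of pdf-parser.py.

```python
# Filters that pdf-parser.py's -f option decodes, copied from its help text.
SUPPORTED_BY_F = {
    "FlateDecode",
    "ASCIIHexDecode",
    "ASCII85Decode",
    "LZWDecode",
    "RunLengthDecode",
}


def undecoded_filters(filter_names):
    """Return the filters in filter_names that -f would leave untouched."""
    return [name for name in filter_names if name not in SUPPORTED_BY_F]


# Hypothetical sample list: one supported filter plus the two unsupported ones.
print(undecoded_filters(["FlateDecode", "DCTDecode", "JBIG2Decode"]))
# prints ['DCTDecode', 'JBIG2Decode']
```

Any stream whose filter shows up in that result has to be extracted with another tool.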

The file begins with the PDF header comment ‘%PDF-1.4\r’, indicating that its format is PDF 1.4. There are 17 XObjects.

     Name        Object     Dimensions              Matrix              color
/XIPLAYER0     12 0 R    1280 x 1664  798.72 614.40  -3.36  -1.20 
/XIPLAYER_CM1  14 0 R    1749 x 1403  336.72 419.76 236.64  95.52    0.0824 0.1333 0.0980 
/XIPLAYER_CM10 15 0 R     369 x  131   31.44  88.56 444.00 344.16    0.2039 0.3216 0.2902 
/XIPLAYER_CM11 16 0 R     228 x  108   25.92  54.72 449.76 445.20    0.1765 0.2863 0.2392 
/XIPLAYER_CM12 17 0 R     130 x   85   22.80  31.20 739.68 257.52    0.9843 
/XIPLAYER_CM13 18 0 R      71 x  104   24.96  17.04 737.76 110.40    0.1137 0.2392 0.1804 
/XIPLAYER_CM14 19 0 R     181 x  172   41.28   8.16 687.84 113.52    0.1765 0.3451 0.2667 
/XIPLAYER_CM15 20 0 R     210 x  109   26.16  50.40 488.16 361.2     0.7804 0.8549 0.7765 
/XIPLAYER_CM16 21 0 R      50 x   79   18.96  12.00 735.84 301.68    0.9804 
/XIPLAYER_CM2  22 0 R     694 x  196   47.04 166.56  71.52 127.92    0.1216 0.2275 0.1922 
/XIPLAYER_CM3  23 0 R     264 x   43   10.32  63.36  88.80 373.20    0.9882 
/XIPLAYER_CM4  24 0 R     474 x  517  124.08 113.76 369.12 123.12    0.9804 
/XIPLAYER_CM5  25 0 R     214 x  120   28.80  51.36 248.16 126.00    0.8000 0.9059 0.7686 
/XIPLAYER_CM6  26 0 R     207 x   51   12.24  49.68 246.24 452.16    0.8118 0.9137 0.8000 
/XIPLAYER_CM7  27 0 R     349 x  133   31.92  83.76 280.80 444.96    0.8118 0.9216 0.8314 
/XIPLAYER_CM8  28 0 R     179 x  279   66.96  42.96 246.24 138.24    0.9804 0.9961 1.0000 
/XIPLAYER_CM9  29 0 R      91 x   32    7.68  21.84 315.36 243.84    0.8196 0.9176 0.8039
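
The table also lets you estimate the resolution of each layer. The sketch below is only that, a sketch: it assumes the Dimensions column is height x width in pixels (which matches the Height and Width values quoted for object 14 in the comments) and that the first two Matrix values are the horizontal and vertical size on the page in points (1/72 inch). Neither assumption is stated in the table itself.

```python
def effective_dpi(pixels, points):
    """Pixels per inch for an image side that spans `points` (1/72 inch) on the page."""
    return pixels * 72.0 / points


# (name, height_px, width_px, size_x_pt, size_y_pt) copied from the table.
# ASSUMPTION: Dimensions are height x width; Matrix starts with x size, y size.
layers = [
    ("XIPLAYER0", 1280, 1664, 798.72, 614.40),
    ("XIPLAYER_CM1", 1749, 1403, 336.72, 419.76),
    ("XIPLAYER_CM10", 369, 131, 31.44, 88.56),
]

for name, height_px, width_px, size_x, size_y in layers:
    dpi_x = round(effective_dpi(width_px, size_x))
    dpi_y = round(effective_dpi(height_px, size_y))
    print(name, dpi_x, dpi_y)
```

Under those assumptions the JPEG background works out to 150 dpi and the two CM layers shown to 300 dpi in both directions.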

XIPLAYER objects are something I have so far found only in Xerox PDFs. CM1 through CM16 are monochrome bitmaps. To extract the objects I use pdfimages.

pdfimages  -j WH\ LFBC\ Scanned\ Xerox\ 7535\ WC.pdf WH7535

pdfimages is a command-line tool, part of the xpdf-3.03 package, which extracts all the images within a PDF. The -j switch writes any DCTDecode-encoded streams out as JPEG files. All image file names are prefixed with WH7535-.

pdfimages version 3.03
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
  -f <int>       : first page to convert
  -l <int>       : last page to convert
  -j             : write JPEG images as JPEG files
  -opw <string>  : owner password (for encrypted files)
  -upw <string>  : user password (for encrypted files)
  -q             : don't print any messages or errors
  -cfg <string>  : configuration file to use in place of .xpdfrc
  -v             : print copyright and version info
  -h             : print usage information
  -help          : print usage information
  --help         : print usage information
  -?             : print usage information
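
If you want to script the extraction, something along these lines might do. It is a sketch that uses only the -j, -f and -l flags from the usage text above; the helper names pdfimages_command and extract_images are hypothetical. Passing the arguments as a list avoids having to escape the spaces in the file name.

```python
import shutil
import subprocess


def pdfimages_command(pdf_path, image_root, jpeg=True, first_page=None, last_page=None):
    """Build an argv list for pdfimages using only flags shown in its usage text."""
    argv = ["pdfimages"]
    if jpeg:
        argv.append("-j")                     # write JPEG images as JPEG files
    if first_page is not None:
        argv += ["-f", str(first_page)]       # first page to convert
    if last_page is not None:
        argv += ["-l", str(last_page)]        # last page to convert
    return argv + [pdf_path, image_root]


def extract_images(pdf_path, image_root, **options):
    """Run pdfimages if it is on PATH; return True when it exits with status 0."""
    if shutil.which("pdfimages") is None:
        return False
    result = subprocess.run(pdfimages_command(pdf_path, image_root, **options))
    return result.returncode == 0


print(pdfimages_command("WH LFBC Scanned Xerox 7535 WC.pdf", "WH7535"))
```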

There are obvious differences. For instance, the signature field is partially extracted. Since a total of 16 images are created, it is clear that the software is more aggressive in segmentation. I cannot determine if this is due to the original sample, or because of differences in the software itself.

.pbm files cannot be viewed within WordPress, but they can be converted to other formats.
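
One way to do such a conversion without extra tools is sketched below, using only the Python standard library. It assumes the input is a raw (P4) PBM with no comment lines in its header. The function name pbm_to_png is my own, and any image viewer or converter would serve equally well.

```python
import re
import struct
import zlib


def _chunk(tag, payload):
    """Assemble one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def pbm_to_png(pbm_bytes):
    """Convert a raw (P4) PBM image to a 1-bit grayscale PNG; both are bytes."""
    # ASSUMPTION: plain "P4 <width> <height>" header without comment lines.
    header = re.match(rb"P4\s+(\d+)\s+(\d+)\s", pbm_bytes)
    if header is None:
        raise ValueError("not a raw PBM (P4) file, or the header has comments")
    width, height = int(header.group(1)), int(header.group(2))
    stride = (width + 7) // 8                 # PBM rows are padded to whole bytes
    raster = pbm_bytes[header.end():header.end() + stride * height]
    if len(raster) != stride * height:
        raise ValueError("truncated raster data")
    rows = []
    for y in range(height):
        row = raster[y * stride:(y + 1) * stride]
        # PBM stores 1 as black; 1-bit grayscale PNG stores 1 as white, so invert.
        # The leading zero byte selects PNG filter type 0 (None) for the row.
        rows.append(b"\x00" + bytes(b ^ 0xFF for b in row))
    # IHDR: width, height, bit depth 1, colour type 0 (grayscale), no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 1, 0, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(b"".join(rows)))
            + _chunk(b"IEND", b""))
```

To use it, read one of the extracted .pbm files in binary mode, pass the bytes to pbm_to_png, and write the result to a .png file.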

13 thoughts on “PDF scanned on Xerox Workcentre 7535 – Part 2”

  1. They are masks. Black is transparent, white is color set by colorspace

    The simplest form of Mask is an ImageMask. This is a one bit image where each pixel is either transparent (and you can see what is behind it) or painted in the current colour. It is rather like a stencil (which lets paint through the holes but stops paint appearing elsewhere).

    An ImageMask is easy to create – it is a 1bit XObject image which has a value of /ImageMask set to true. They are limited to a single colour however. So you can also have a proper Mask for more control.

    A Mask allows you to specify an image which can contain transparent elements. The Mask is an object attached to the XObject. You can either specify a range of colours (which will become transparent in the image) or add a separate bitmap which can define each pixel as transparent or visible (a bit like an ImageMask but offering more flexibility).


    All Objects except the jpeg are ImageMasks

  2. Yes, I understand the mask concept. The mask lets the single color “shine through” onto the background JPEG. I am just saying that when I opened the reader’s file in Illustrator, the objects showed up as black text on a white background when looked at individually.

  3. Yes, I extracted the bitmaps using pdfimages, which created bitmap files where black indicates 0 and white 1. A “bit” confusing. Pun intended.

  4. gorefan

    Was this PDF opened and saved by Preview?

    No, it was emailed directly from the WorkCentre. Look at the metadata. It would be a good exercise to run it through Preview.

  5. Here is what you get when you open the current pdf in WORDPAD:

    For the 8-bit background layer,

    12 0 obj

    And for one of the 15 1-bit layers,

    14 0 obj
    <</BitsPerComponent 1/DecodeParms<>/Filter/JBIG2Decode/Height 1749/ImageMask true/Length 12345/Subtype/Image/Type/XObject/Width 1403>>

    What changes would Preview make if you hit the button “Print – Save as PDF”?

  6. Just post it without the double brackets. Maybe use square brackets instead?

  7. I will post the changes. The metadata would change, JBIG2 would become FlateDecode (as PDF 1.3 does not support JBIG2), and the objects would be reorganized. It would also add a margin.
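
The stencil behaviour described in the first comment can be sketched in a few lines. This is a toy model, not how any PDF viewer is actually implemented: images are plain lists of rows, and the name paint_stencil is hypothetical. Which bit value lets the color through depends on the image's /Decode array; as far as I can tell from the PDF reference, the default is that a sample of 0 paints and 1 leaves the background alone, so that is the default for the paint_value parameter here.

```python
def paint_stencil(background, mask, color, paint_value=0):
    """Return a copy of background with color painted wherever mask == paint_value.

    background: rows of (r, g, b) tuples; mask: rows of 0/1 bits, same shape.
    """
    result = [list(row) for row in background]
    for y, mask_row in enumerate(mask):
        for x, bit in enumerate(mask_row):
            if bit == paint_value:
                result[y][x] = color          # stencil hole: the fill color goes through
            # any other bit value leaves the background pixel visible
    return result


white = (1.0, 1.0, 1.0)
cm1_color = (0.0824, 0.1333, 0.0980)          # color listed for /XIPLAYER_CM1
page = paint_stencil([[white, white], [white, white]],
                     [[0, 1], [1, 0]],
                     cm1_color)
print(page)
```

Each CM layer would be painted this way in turn, with its own color, on top of the JPEG background.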
