bug in PDFs

All: I am using PDF files generated by matplotlib with a PDF parser from ReportLab, Inc. Their tool found that these files violate the PDF specification. The company’s email to me follows:

…matplotlib is violating the PDF specification. There
is a structure near the end of the file shown below, and they have put
an ‘n’ instead of an ‘f’ which tells a (suitably pedantic) parser that
the first meaningful content is to be found at byte 0 in the file, not
byte 16 where it really lives.

xref
0 62
0000000000 65535 n <---- should be ‘f’
0000000016 00000 n
0000000065 00000 n
0000000218 00000 n

That row with the ‘0000000000 65535’ is present in all PDF files. I
change the ‘n’ to an ‘f’ in a good binary editor and it goes through
fine.

I have also added a special case to our code to correct for this. I
suspect other PDF viewers just skip the first row so were not bitten.
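
The manual fix described in the email can also be scripted. The following is a minimal Python sketch, not ReportLab's or matplotlib's code: the name `fix_first_xref_entry` is hypothetical, and it assumes a classic (non-stream) cross-reference table whose first subsection starts at object 0. Because it only swaps one byte per table, the byte offsets recorded in the file stay valid.

```python
import re

# Matches the first entry of an xref section that starts at object 0:
# the "xref" keyword, a subsection header "0 <count>", then a 10-digit
# offset, a 5-digit generation number and the one-letter entry type 'n'.
_FIRST_ENTRY = re.compile(rb"(xref\s+0\s+\d+\s+\d{10} \d{5} )n")


def fix_first_xref_entry(pdf_bytes: bytes) -> bytes:
    """Return a copy of pdf_bytes with every object-0 xref entry marked 'f'.

    Entries that are already 'f' do not match the pattern and are left alone.
    """
    return _FIRST_ENTRY.sub(rb"\g<1>f", pdf_bytes)
```

To use it, read the PDF in binary mode, pass the bytes through the function, and write the result to a new file.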

I was able to figure out which module contains the offending code, but not which lines actually print out that data.

I submitted a bug report here:

https://sourceforge.net/tracker/?func=detail&aid=2805455&group_id=80706&atid=560720

Thanks,

Mike

ReportLab's description of 'n' vs. 'f' (quoted in the original message) doesn't seem to match what the spec says, which is that 'n' marks in-use objects and 'f' marks free objects. However, the spec does say:

"The first entry in the table (object number 0) is always free and has a generation number of 65,535; it is the head of the linked list of free objects."

So it seems reasonable to make this change.
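
As an illustration of what a conforming writer emits, here is a minimal Python sketch. It is not the matplotlib code: `format_xref_section` is a hypothetical helper, and it assumes objects are numbered consecutively from 1, all have generation 0, and none besides object 0 is free. Each entry is exactly 20 bytes: a 10-digit offset, a 5-digit generation number, the type letter, and a two-byte end-of-line.

```python
def format_xref_section(offsets):
    """Build a classic xref section for objects 1..len(offsets).

    offsets[i] is the byte offset of object i + 1. Object 0 is emitted
    automatically as the head of the free list.
    """
    lines = [b"xref\n", b"0 %d\n" % (len(offsets) + 1)]
    # Object 0: next free object is 0 (no others), generation 65535, type 'f'.
    lines.append(b"%010d %05d f \n" % (0, 65535))
    for offset in offsets:
        # In-use objects: byte offset, generation 0, type 'n'.
        lines.append(b"%010d %05d n \n" % (offset, 0))
    return b"".join(lines)
```

Called with the offsets from the example above, `format_xref_section([16, 65, 218])` produces a table whose first entry ends in 'f' and whose remaining entries end in 'n'.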

This has now been fixed in the SVN 0.98.x maintenance branch and trunk, but not tested against ReportLab's parser. Mike: are you able to check out from SVN and test this for us?
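
For a quick check that does not depend on any particular parser, something like the following Python sketch could be run on a freshly generated file. The name `first_xref_entry` is hypothetical, and the code assumes a classic cross-reference table (not a cross-reference stream) located at the offset given after the last `startxref` keyword.

```python
import re

_STARTXREF = re.compile(rb"startxref\s+(\d+)")
_FIRST_ENTRY = re.compile(rb"xref\s+0\s+\d+\s+(\d{10}) (\d{5}) ([nf])")


def first_xref_entry(pdf_bytes):
    """Return (offset, generation, kind) for object 0 in the last xref table."""
    start = pdf_bytes.rfind(b"startxref")
    if start < 0:
        raise ValueError("no startxref keyword found")
    pointer = _STARTXREF.match(pdf_bytes, start)
    if pointer is None:
        raise ValueError("startxref is not followed by an offset")
    # The number after startxref is the byte offset of the xref keyword.
    entry = _FIRST_ENTRY.match(pdf_bytes, int(pointer.group(1)))
    if entry is None:
        raise ValueError("no xref table starting at object 0 at that offset")
    offset, generation, kind = entry.groups()
    return int(offset), int(generation), kind.decode("ascii")
```

A file written with the fix should give `(0, 65535, 'f')`; a file written before the fix would give `'n'` as the last element.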

Mike

_______________________________________________
Matplotlib-users mailing list
Matplotlib-users@lists.sourceforge.net
  
--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA