Bug 29083

Summary:	Cannot detect PDF file type if binary data exists before magic
Product:	shared-mime-info	Reporter:	Philippe Gauthier <philippe.gauthier>
Component:	freedesktop.org.xml	Assignee:	Shared Mime Info group <shared_mime_info>
Status:	RESOLVED FIXED	QA Contact:
Severity:	minor
Priority:	medium
Version:	unspecified
Hardware:	All
OS:	All
Whiteboard:
Attachments:	Test PDF file with prepended binary data

Description Philippe Gauthier 2010-07-15 11:50:38 UTC

The implementation notes of the PDF Reference document state that the %PDF- magic header does not have to be strictly at the start of the file [1]:

3.4.1, “File Header”
13. Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
14. Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m


If I understand correctly, this could be represented in the following way:

    <match value="%PDF-" type="string" offset="0:1024"/>
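
For illustration, here is a minimal Python sketch of what such a rule is meant to accept. It is not how shared-mime-info implements matching. The names `looks_like_pdf`, `PDF_MAGIC` and `MAX_START_OFFSET` are hypothetical, and the sketch assumes the `0:1024` range is inclusive, i.e. that the magic may start at any offset from 0 to 1024.

```python
PDF_MAGIC = b"%PDF-"
# Assumption: mirrors offset="0:1024", read as an inclusive range of start offsets.
MAX_START_OFFSET = 1024


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF magic starts at any offset in 0..MAX_START_OFFSET."""
    # bytes.find(sub, start, end) only reports matches that lie entirely
    # inside data[start:end], so the end bound has to include the full
    # length of the magic when it starts at the last allowed offset.
    end = MAX_START_OFFSET + len(PDF_MAGIC)
    return data.find(PDF_MAGIC, 0, end) != -1
```

With this sketch, a buffer that starts with `%PDF-` and one that has 100 bytes of binary data in front of it are both accepted, while a buffer whose magic starts beyond offset 1024 is rejected.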


It should be easy to create a test case file (the file I have with this problem is my bank account slip). If such a file does not have a .pdf extension, Evince refuses to open the document because it sees it as application/octet-stream.
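
One possible way to build such a test case is sketched below in Python. The helper `prepend_binary_data` is hypothetical, and the filler bytes are an arbitrary choice that cannot contain the `%PDF-` magic by accident. Note that prepending data shifts every byte offset in the file, so this only exercises magic detection; whether a given viewer still opens the result is a separate question.

```python
def prepend_binary_data(pdf_bytes: bytes, prefix_len: int = 64) -> bytes:
    """Return pdf_bytes with prefix_len bytes of binary filler in front."""
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("input does not start with the %PDF- magic")
    if not 0 < prefix_len <= 1024:
        raise ValueError("prefix_len must be between 1 and 1024")
    # Alternating 0x00/0xFF filler, truncated to exactly prefix_len bytes.
    filler = b"\x00\xff"
    prefix = (filler * (prefix_len // 2 + 1))[:prefix_len]
    return prefix + pdf_bytes
```

Writing the returned bytes to a file without a .pdf extension gives a file whose type has to be detected from its content alone.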


[1] http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf Appendix H, section 3.4.1, page 1102.

Comment 1 Bastien Nocera 2010-07-20 08:11:06 UTC

Any chance you could provide such a test file then?

Comment 2 Philippe Gauthier 2010-07-20 08:45:56 UTC

Created attachment 37245
Test PDF file with prepended binary data

Before :
$ gvfs-info testcase.is-really-a-pdf | grep content-type
  standard::content-type: application/octet-stream
  standard::fast-content-type: application/octet-stream

After :
$ gvfs-info testcase.is-really-a-pdf | grep content-type
  standard::content-type: application/pdf
  standard::fast-content-type: application/octet-stream

Comment 3 Bastien Nocera 2010-07-20 10:34:24 UTC

commit 7d42fc0da8068df8892842cc4005395471f4d2b0
Author: Bastien Nocera <hadess@hadess.net>
Date:   Tue Jul 20 18:33:05 2010 +0100

    Fix PDF magic detection
    
    As per spec, the first 1024 bytes can contain binary garbage, before
    the actual PDF magic header.
    
    With help from Philippe Gauthier <philippe.gauthier@deuxpi.ca>
    
    https://bugs.freedesktop.org/show_bug.cgi?id=29083
