Bug 29083 - Cannot detect PDF file type if binary data exists before magic
Status: RESOLVED FIXED
Alias: None
Product: shared-mime-info
Classification: Unclassified
Component: freedesktop.org.xml
Version: unspecified
Hardware: All All
Importance: medium minor
Assignee: Shared Mime Info group
 
Reported: 2010-07-15 11:50 UTC by Philippe Gauthier
Modified: 2010-07-20 10:34 UTC



Attachments
Test PDF file with prepended binary data (13.63 KB, application/octet-stream)
2010-07-20 08:45 UTC, Philippe Gauthier

Description Philippe Gauthier 2010-07-15 11:50:38 UTC
The implementation notes of the PDF Reference state that the %PDF- magic header need not be at the very start of the file [1]:

3.4.1, “File Header”
13. Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
14. Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m


If I understand correctly, this could be expressed as the following magic rule:

    <match value="%PDF-" type="string" offset="0:1024"/>
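To make the intent of the rule concrete, here is a minimal Python sketch of the check it describes. It assumes the offset range is inclusive, so the magic may start at any byte from 0 through 1024; the names PDF_MAGIC, MAX_OFFSET and looks_like_pdf are made up for illustration and are not part of shared-mime-info.

```python
PDF_MAGIC = b"%PDF-"
MAX_OFFSET = 1024  # assumption: last offset at which the magic may start (inclusive)


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF magic starts at any offset in 0..MAX_OFFSET."""
    # bytes.find(sub, start, end) only reports matches that lie entirely
    # inside data[start:end], so the end bound is extended by the length of
    # the magic to allow a match that starts exactly at MAX_OFFSET.
    return data.find(PDF_MAGIC, 0, MAX_OFFSET + len(PDF_MAGIC)) != -1
```

A file that starts with a few hundred bytes of garbage followed by %PDF-1.4 would pass this check, while one whose header only appears after offset 1024 would not.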


It should be easy to create a test case file (the file I have with this problem is my bank account slip). If such a file does not have a .pdf extension, Evince refuses to open the document because it sees it as application/octet-stream.
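A test case along those lines could be generated with something like the Python sketch below. The helper name make_prefixed_pdf, the prefix bytes and the stub body are all assumptions chosen for illustration; the result is not a valid PDF and is only meant to exercise magic sniffing, not a viewer.

```python
import os
import tempfile


def make_prefixed_pdf(path: str, prefix_len: int = 100) -> int:
    """Write binary junk followed by a PDF header; return the header offset."""
    # Fixed non-ASCII junk, so the prefix can never contain "%PDF-" by chance.
    prefix = bytes([0x00, 0xFF, 0x1B, 0x80]) * (prefix_len // 4)
    # Stub body: just enough for magic detection, not a renderable document.
    body = b"%PDF-1.4\n%%EOF\n"
    with open(path, "wb") as f:
        f.write(prefix + body)
    return len(prefix)
```

Calling make_prefixed_pdf("testcase.is-really-a-pdf") would produce a file with no .pdf extension whose header sits 100 bytes in, which can then be checked with gvfs-info.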


[1] http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf Appendix H, section 3.4.1, page 1102.
Comment 1 Bastien Nocera 2010-07-20 08:11:06 UTC
Any chance you could provide such a test file then?
Comment 2 Philippe Gauthier 2010-07-20 08:45:56 UTC
Created attachment 37245
Test PDF file with prepended binary data

Before:
$ gvfs-info testcase.is-really-a-pdf | grep content-type
  standard::content-type: application/octet-stream
  standard::fast-content-type: application/octet-stream

After:
$ gvfs-info testcase.is-really-a-pdf | grep content-type
  standard::content-type: application/pdf
  standard::fast-content-type: application/octet-stream
Comment 3 Bastien Nocera 2010-07-20 10:34:24 UTC
commit 7d42fc0da8068df8892842cc4005395471f4d2b0
Author: Bastien Nocera <hadess@hadess.net>
Date:   Tue Jul 20 18:33:05 2010 +0100

    Fix PDF magic detection
    
    As per spec, the first 1024 bytes can contain binary garbage, before
    the actual PDF magic header.
    
    With help from Philippe Gauthier <philippe.gauthier@deuxpi.ca>
    
    https://bugs.freedesktop.org/show_bug.cgi?id=29083

