| Summary: | [PATCH] Add magic detection Office Open XML (.xlsx, .pptx, .docx) | ||
|---|---|---|---|
| Product: | shared-mime-info | Reporter: | Stephen Pike <steve> |
| Component: | freedesktop.org.xml | Assignee: | Shared Mime Info group <shared_mime_info> |
| Status: | RESOLVED MOVED | QA Contact: | |
| Severity: | normal | ||
| Priority: | medium | ||
| Version: | unspecified | ||
| Hardware: | Other | ||
| OS: | Mac OS X (All) | ||
| Whiteboard: | |||
| i915 platform: | i915 features: | ||
| Attachments: | Patch for ooxml support | ||
Description
Stephen Pike
2014-05-16 19:51:34 UTC
Comment on attachment 99171
Patch for ooxml support

Review of attachment 99171:
-----------------------------------------------------------------

I expect shorter offsets for the magic, and longer magic.

::: freedesktop.org.xml.in
@@ +6001,5 @@
>      <glob pattern="*.docx"/>
>      <sub-class-of type="application/zip"/>
>      <generic-icon name="x-office-document"/>
> +    <magic priority="50">
> +      <match type="string" value="word/" offset="0:3000" />

That's not good enough. First, we can't go searching 3kB inside files, the maximum we currently allow is 256 bytes (that's one of the reasons why we don't have magic for ISO9660 images for example).

@@ +6043,5 @@
>      <glob pattern="*.xlsx"/>
>      <sub-class-of type="application/zip"/>
>      <generic-icon name="x-office-spreadsheet"/>
> +    <magic priority="50">
> +      <match type="string" value="xl/" offset="0:3000"/>

Furthermore that's much too small a magic. There's bound to be false positives with such a small magic.

Thanks. Longer magic may be possible - is it possible to do OR logic? The purpose of "xl/" is to find a folder inside the zip archive called xl/, so we could instead look for any of a few things:
- xl/worksheets
- xl/styles.xml
- xl/workbook.xml
- xl/_rels/workbook.xml
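
To illustrate why member names such as these are visible to string magic at all: a ZIP archive stores each member's name uncompressed in its local file header, right after a fixed 30-byte header. The following is only a rough sketch using a tiny stand-in archive built in memory; the helper `name_offsets` is a hypothetical name, and a real .xlsx has more members, so its offsets will differ.

```python
import io
import zipfile


def name_offsets(data: bytes, names: list[bytes]) -> dict[bytes, int]:
    """Return the first byte offset of each name in data (-1 if absent)."""
    return {name: data.find(name) for name in names}


# Stand-in archive (assumption: NOT a valid spreadsheet, just two members).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
data = buf.getvalue()

candidates = [
    b"xl/worksheets",
    b"xl/styles.xml",
    b"xl/workbook.xml",
    b"xl/_rels/workbook.xml",
]
offsets = name_offsets(data, candidates)

for name, off in offsets.items():
    print(name.decode(), off)
```

Where a given name lands depends on how many members precede it and how large they are, which is why a fixed short offset range is hard to pick.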
I'm worried that the max range of 256 bytes might make this impossible to get perfect. I tested 17 .xlsx files I had lying around, and the best single range was the 200-byte range 1750-1950, which worked for 15 of them. I couldn't find an overlapping set of 256 bytes that worked for all the files.
Alternatively, I may be able to find a couple of smaller ranges which when combined are less than 256 bytes total and match all files. Is that useful?
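
For anyone wanting to repeat this kind of survey, here is a minimal sketch of one way to search for the best window. It is not the script used for the numbers above; `hit_offsets` and `best_window` are hypothetical helpers, and a "hit" is assumed to mean that a candidate string starts somewhere inside the window.

```python
def hit_offsets(data: bytes, needles: list[bytes]) -> list[int]:
    """Collect every offset at which any needle starts in data."""
    hits = []
    for needle in needles:
        pos = data.find(needle)
        while pos != -1:
            hits.append(pos)
            pos = data.find(needle, pos + 1)
    return hits


def best_window(files: list[bytes], needles: list[bytes], width: int = 256):
    """Return (start, count) for the window [start, start + width) that
    contains a needle start in the largest number of files."""
    per_file = [hit_offsets(d, needles) for d in files]
    # Any window can be slid right until it begins at its first hit without
    # losing hits, so only hit offsets need to be tried as window starts.
    starts = sorted({o for hits in per_file for o in hits})
    best = (0, 0)
    for s in starts:
        count = sum(any(s <= o < s + width for o in hits) for hits in per_file)
        if count > best[1]:
            best = (s, count)
    return best


if __name__ == "__main__":
    import sys

    blobs = []
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            blobs.append(fh.read(4096))  # only the head of each file matters
    names = [b"xl/worksheets", b"xl/styles.xml",
             b"xl/workbook.xml", b"xl/_rels/workbook.xml"]
    print(best_window(blobs, names))
```

Run with a list of .xlsx paths as arguments, it prints the best window start and how many of the files it covers.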
(In reply to Stephen Pike from comment #2)
> Thanks. Longer magic may be possible - is it possible to do OR logic? The
> purpose of "xl/" is to find a folder inside the zip archive called xl/, so
> we could instead look for any of a few things:
>
> - xl/worksheets
>
> - xl/styles.xml
>
> - xl/workbook.xml
>
> - xl/_rels/workbook.xml

You can do something like:
<match type="string" value="foo" offset="0">
  <match type="string" value="bar" offset="3"/>
  <match type="string" value="baz" offset="3"/>
</match>

Which should match both foobar and foobaz.

> I'm worried that the max range of 256 bytes might make this impossible to
> get perfect. I tested 17 .xlsx files I had lying around, and the best
> single range was the 200-byte range 1750-1950, which worked for 15 of them. I
> couldn't find an overlapping set of 256 bytes that worked for all the files.

Then it's probably not a good identifying magic.

> Alternatively, I may be able to find a couple of smaller ranges which when
> combined are less than 256 bytes total and match all files. Is that useful?

No, some sources will not support seeking, so looking for 256 bytes 2k inside the file will still download 2k.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xdg/shared-mime-info/issues/21.
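
For readers following the nested <match> suggestion in the reply above, here is a small sketch of the evaluation it implies: a match succeeds when its own string is found at its offset and, if it has children, at least one child also succeeds. The `Match` class is a hypothetical model written for illustration, not code from shared-mime-info, and it handles only fixed offsets (no ranges or masks).

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """Toy model of a <match type="string"> element with a fixed offset."""
    value: bytes
    offset: int
    children: list["Match"] = field(default_factory=list)

    def matches(self, data: bytes) -> bool:
        end = self.offset + len(self.value)
        if data[self.offset:end] != self.value:
            return False
        # Children act as alternatives: any one of them is enough.
        return not self.children or any(c.matches(data) for c in self.children)


# Mirrors the foo/bar/baz example: "foo" at 0, then "bar" OR "baz" at 3.
rule = Match(b"foo", 0, [Match(b"bar", 3), Match(b"baz", 3)])

for sample in (b"foobar", b"foobaz", b"fooqux"):
    print(sample, rule.matches(sample))
```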