Created attachment 99171
Patch for ooxml support

This is my first attempt at submitting code to a freedesktop project, so let me know if there's anything I'm doing wrong here.

.docx, .pptx, and .xlsx files are currently detected as application/zip. This patch adds magic rules to detect them, based on what the `file` command (http://www.darwinsys.com/file/) does.

I updated the 3 test lines to expect passing rather than failure, and all tests pass when I run `make`.

There is a small bit of mess in the tests/list diff because my emacs stripped whitespace from a few lines. I think that's for the best, though, so I left it in.
Comment on attachment 99171
Patch for ooxml support

Review of attachment 99171:
-----------------------------------------------------------------

I expect shorter offsets for the magic, and longer magic.

::: freedesktop.org.xml.in
@@ +6001,5 @@
>      <glob pattern="*.docx"/>
>      <sub-class-of type="application/zip"/>
>      <generic-icon name="x-office-document"/>
> +    <magic priority="50">
> +      <match type="string" value="word/" offset="0:3000" />

That's not good enough. First, we can't go searching 3 kB inside files; the maximum we currently allow is 256 bytes (that's one of the reasons why we don't have magic for ISO9660 images, for example).

@@ +6043,5 @@
>      <glob pattern="*.xlsx"/>
>      <sub-class-of type="application/zip"/>
>      <generic-icon name="x-office-spreadsheet"/>
> +    <magic priority="50">
> +      <match type="string" value="xl/" offset="0:3000"/>

Furthermore, that's much too short a magic. There are bound to be false positives with such a short magic.
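To illustrate the false-positive concern raised in the review, here is a minimal Python sketch. It assumes, as a simplification, that offset="0:3000" means the string may begin at any byte offset from 0 to 3000. The helpers `naive_match` and `make_zip` are hypothetical and are not part of shared-mime-info. An ordinary ZIP archive that merely contains a folder named xl/ satisfies the proposed rule, because ZIP stores entry names uncompressed in the local file header.

```python
import io
import zipfile


def naive_match(data: bytes, needle: bytes, start: int = 0, end: int = 3000) -> bool:
    """Return True if needle begins at any offset in [start, end].

    Assumption: this mirrors a string match with offset="start:end".
    """
    # A match that begins at `end` extends len(needle) bytes past it.
    window = data[start:end + len(needle)]
    return needle in window


def make_zip(names):
    """Build an in-memory ZIP archive with one small entry per name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"placeholder")
    return buf.getvalue()


# Not a spreadsheet: just an archive with a folder that happens to be named xl/.
plain_zip = make_zip(["xl/notes.txt"])

# The entry name sits right after the 30-byte local file header,
# so the short magic "xl/" is found and the archive is misdetected.
false_positive = naive_match(plain_zip, b"xl/")
```

A longer string such as xl/workbook.xml would not match this archive, which is the reason the review asks for longer magic.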
Thanks. Longer magic may be possible. Is it possible to do OR logic? The purpose of "xl/" is to find a folder called xl/ inside the zip archive, so we could instead look for any of a few things:

- xl/worksheets
- xl/styles.xml
- xl/workbook.xml
- xl/_rels/workbook.xml

I'm worried that the maximum range of 256 bytes might make this impossible to get perfect. I tested 17 .xlsx files I had lying around, and the best single range was the 200 bytes from 1750 to 1950, which worked for 15 of them. I couldn't find a single 256-byte range that worked for all the files.

Alternatively, I may be able to find a couple of smaller ranges which, when combined, total less than 256 bytes and match all files. Is that useful?
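The range search described in this comment can be sketched as follows. This is only an illustration: `best_window` is a hypothetical helper, and the offsets in the example are made up rather than taken from the 17 files mentioned above. For each file it takes the offsets at which a candidate string was found, and it returns the fixed-width byte range that contains a hit in the most files.

```python
def best_window(offsets_per_file, width=256):
    """Return (start, hits) for the range [start, start + width) that
    contains at least one hit offset in the largest number of files.

    offsets_per_file: one list of hit offsets per file.
    """
    # An optimal range can always be shifted to begin at some hit offset,
    # so only those offsets need to be tried as range starts.
    candidates = sorted({off for offs in offsets_per_file for off in offs})
    best = (0, 0)
    for start in candidates:
        hits = sum(
            any(start <= off < start + width for off in offs)
            for offs in offsets_per_file
        )
        if hits > best[1]:
            best = (start, hits)
    return best


# Made-up example: four files, each with the offsets where a name was found.
example_offsets = [[1760], [40, 1800], [1940], [2300]]
start, hits = best_window(example_offsets, width=200)
```

With these invented offsets the best 200-byte range starts at 1760 and covers three of the four files, so no single range of that width matches every file.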
(In reply to Stephen Pike from comment #2)
> Thanks. Longer magic may be possible - is it possible to do OR logic? The
> purpose of "xl/" is to find a folder inside the zip archive called xl/, so
> we could instead look for any of a few things:
>
> - xl/worksheets
>
> - xl/styles.xml
>
> - xl/workbook.xml
>
> - xl/_rels/workbook.xml

You can do something like:

    <match type="string" value="foo" offset="0">
      <match type="string" value="bar" offset="3"/>
      <match type="string" value="baz" offset="3"/>
    </match>

This should match both foobar and foobaz.

> I'm worried that the max range of 256 bytes might make this impossible to
> get perfect. I tested 17 .xlsx files I had laying around, and the best
> single range was the 200 bytes 1750-1950, which worked for 15 of them. I
> couldn't find an overlapping set of 256 bytes that worked for all the files.

Then it's probably not a good identifying magic.

> Alternatively, I may be able to find a couple of smaller ranges which when
> combined are less than 256 bytes total and match all files. Is that useful?

No. Some sources do not support seeking, so checking a 256-byte range that starts 2 kB into the file still means downloading those first 2 kB.
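The nested match suggested in this comment combines AND and OR: the outer match must succeed, and then any one of its child matches must succeed. A minimal Python sketch of that evaluation, assuming fixed offsets only and using a hypothetical `Match` class rather than the real shared-mime-info implementation:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Match:
    """Hypothetical model of a <match type="string"> element."""
    value: bytes
    offset: int
    children: List["Match"] = field(default_factory=list)


def matches(rule: Match, data: bytes) -> bool:
    """Parent AND (any child): sibling matches act as alternatives."""
    end = rule.offset + len(rule.value)
    if data[rule.offset:end] != rule.value:
        return False
    if not rule.children:
        return True
    return any(matches(child, data) for child in rule.children)


# Model of the example above: "foo" at 0, then "bar" OR "baz" at 3.
rule = Match(b"foo", 0, [Match(b"bar", 3), Match(b"baz", 3)])

results = {
    sample: matches(rule, sample)
    for sample in (b"foobar", b"foobaz", b"fooqux", b"barfoo")
}
```

Here foobar and foobaz match, while fooqux fails the child alternatives and barfoo fails the outer match.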
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xdg/shared-mime-info/issues/21.