Bug 78797 - [PATCH] Add magic detection Office Open XML (.xlsx, .pptx, .docx)
Status: RESOLVED MOVED
Alias: None
Product: shared-mime-info
Classification: Unclassified
Component: freedesktop.org.xml
Version: unspecified
Hardware: Other Mac OS X (All)
Importance: medium normal
Assignee: Shared Mime Info group
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-16 19:51 UTC by Stephen Pike
Modified: 2018-10-13 10:37 UTC (History)
0 users

See Also:


Attachments
Patch for ooxml support (3.87 KB, patch)
2014-05-16 19:51 UTC, Stephen Pike

Description Stephen Pike 2014-05-16 19:51:34 UTC
Created attachment 99171
Patch for ooxml support

This is my first attempt at submitting code to a freedesktop project,
so let me know if there's anything I'm doing wrong here. 

.docx, .pptx, and .xlsx files are currently detected as
application/zip. This adds magic logic to detect them based on what's
done in the `file` command (http://www.darwinsys.com/file/).
    
I updated the 3 test lines to expect passing rather than failure, and
all tests pass when I run `make`.

There is a small bit of mess in the tests/list diff because my emacs
stripped whitespace from a few lines. I think that's for the best, 
though, so I left it in.
Comment 1 Bastien Nocera 2014-06-03 14:56:37 UTC
Comment on attachment 99171
Patch for ooxml support

Review of attachment 99171:
-----------------------------------------------------------------

I expect shorter offsets for the magic, and longer magic.

::: freedesktop.org.xml.in
@@ +6001,5 @@
>      <glob pattern="*.docx"/>
>      <sub-class-of type="application/zip"/>
>      <generic-icon name="x-office-document"/>
> +    <magic priority="50">
> +      <match type="string" value="word/" offset="0:3000" />

That's not good enough. First, we can't go searching 3 kB into files: the maximum we currently allow is 256 bytes (that's one of the reasons why we don't have magic for ISO9660 images, for example).
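
To make the objection concrete, here is a rough Python sketch of how a ranged string match behaves. It assumes that an offset range such as "0:3000" means every start offset in the range (inclusive) is tried; match_string is a hypothetical helper, not part of any shared-mime-info tooling.

```python
def match_string(data: bytes, value: bytes, start: int, end: int) -> bool:
    """Return True if `value` begins at any offset in [start, end] (inclusive)."""
    # Only the first end + len(value) bytes can take part in the match,
    # so that is how much of the file has to be available.
    window = data[: end + len(value)]
    return window.find(value, start) != -1


# A ZIP local file header has a 30-byte fixed part before the entry name.
sample = b"PK\x03\x04" + b"\x00" * 26 + b"word/document.xml"

print(match_string(sample, b"word/", 0, 3000))  # wide range, as in the patch
print(match_string(sample, b"word/", 0, 20))    # narrow range that stops before the name
```

With the patch's range the matcher may have to look at roughly 3000 bytes before giving up, which is far beyond the 256 bytes mentioned above.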

@@ +6043,5 @@
>      <glob pattern="*.xlsx"/>
>      <sub-class-of type="application/zip"/>
>      <generic-icon name="x-office-spreadsheet"/>
> +    <magic priority="50">
> +      <match type="string" value="xl/" offset="0:3000"/>

Furthermore, that magic is much too short. There are bound to be false positives with such a short magic.
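
A quick way to see the false-positive risk is to build an ordinary zip that merely contains a directory named xl/. This is only a sketch: make_zip and short_magic_hits are hypothetical helpers, and the archive contents are made up.

```python
import io
import zipfile


def make_zip(names):
    """Build an in-memory zip (stored, uncompressed) with one small entry per name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"placeholder")
    return buf.getvalue()


def short_magic_hits(data, value=b"xl/", limit=3000):
    """True if `value` starts at some offset in [0, limit]."""
    return data.find(value, 0, limit + len(value)) != -1


plain = make_zip(["xl/notes.txt"])  # an ordinary zip, not a spreadsheet

print(short_magic_hits(plain))                      # the 3-byte magic matches anyway
print(short_magic_hits(plain, b"xl/workbook.xml"))  # a longer string does not
```

A longer string such as a full member name is much less likely to collide with unrelated archives, though it does nothing about the offset problem.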
Comment 2 Stephen Pike 2014-06-03 16:13:31 UTC
Thanks. Longer magic may be possible - is it possible to do OR logic? The purpose of "xl/" is to find a folder inside the zip archive called xl/, so we could instead look for any of a few things:

  - xl/worksheets
    
  - xl/styles.xml

  - xl/workbook.xml

  - xl/_rels/workbook.xml


I'm worried that the max range of 256 bytes might make this impossible to get perfect. I tested 17 .xlsx files I had lying around, and the best single range was the 200 bytes 1750-1950, which worked for 15 of them. I couldn't find an overlapping set of 256 bytes that worked for all the files.

Alternatively, I may be able to find a couple of smaller ranges which when combined are less than 256 bytes total and match all files. Is that useful?
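
For anyone wanting to repeat this kind of survey, a sketch along the following lines lists where each member name sits in the raw file. It relies on the 30-byte fixed part of a ZIP local file header; entry_name_offsets and build are hypothetical helpers, and the archive layout (a [Content_Types].xml entry followed by xl/workbook.xml) is an assumption for illustration only.

```python
import io
import zipfile

LOCAL_HEADER_FIXED_SIZE = 30  # fixed part of a ZIP local file header, before the name


def entry_name_offsets(data: bytes) -> dict:
    """Map each entry name to the byte offset where it is stored in its local header."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {
            info.filename: info.header_offset + LOCAL_HEADER_FIXED_SIZE
            for info in zf.infolist()
        }


def build(first_entry_size: int) -> bytes:
    """Build a stored (uncompressed) zip whose first entry has the given size."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", b"x" * first_entry_size)
        zf.writestr("xl/workbook.xml", b"<workbook/>")
    return buf.getvalue()


for size in (100, 2000):
    print(size, entry_name_offsets(build(size))["xl/workbook.xml"])
```

The offset of xl/workbook.xml moves with the size of whatever is stored before it, which would explain why no single fixed window fits every file.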
Comment 3 Bastien Nocera 2015-01-28 15:55:36 UTC
(In reply to Stephen Pike from comment #2)
> Thanks. Longer magic may be possible - is it possible to do OR logic? The
> purpose of "xl/" is to find a folder inside the zip archive called xl/, so
> we could instead look for any of a few things:
> 
>   - xl/worksheets
>     
>   - xl/styles.xml
> 
>   - xl/workbook.xml
> 
>   - xl/_rels/workbook.xml

You can do something like:
<match type="string" value="foo" offset="0">
  <match type="string" value="bar" offset="3"/>
  <match type="string" value="baz" offset="3"/>
</match>

Which should match both foobar and foobaz.
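
The nesting rule can be modelled in a few lines of Python. This is a sketch of the semantics as described here (a match succeeds if its own value matches and, when it has children, at least one child matches); the Match class is hypothetical and is not the real parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Match:
    value: bytes
    offset: int
    children: List["Match"] = field(default_factory=list)

    def matches(self, data: bytes) -> bool:
        # The element's own value must be present at its offset (AND) ...
        if data[self.offset:self.offset + len(self.value)] != self.value:
            return False
        # ... and, if it has children, at least one of them must match (OR).
        return not self.children or any(c.matches(data) for c in self.children)


rule = Match(b"foo", 0, [Match(b"bar", 3), Match(b"baz", 3)])

for candidate in (b"foobar", b"foobaz", b"fooqux"):
    print(candidate, rule.matches(candidate))
```

So sibling elements give OR, and nesting gives AND.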

> I'm worried that the max range of 256 bytes might make this impossible to
> get perfect. I tested 17 .xlsx files I had lying around, and the best
> single range was the 200 bytes 1750-1950, which worked for 15 of them. I
> couldn't find an overlapping set of 256 bytes that worked for all the files.

Then it's probably not a good identifying magic.

> Alternatively, I may be able to find a couple of smaller ranges which when
> combined are less than 256 bytes total and match all files. Is that useful?

No. Some sources do not support seeking, so checking 256 bytes located 2 kB into the file still means downloading those 2 kB.
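
As a rough illustration of the cost, the sketch below reads from a stream sequentially, as a source without seek support would have to. read_for_match is a hypothetical helper, and the 1750-1950 range is the one from comment #2.

```python
import io


def read_for_match(stream, start: int, end: int, value_len: int):
    """Read sequentially from a non-seekable stream; return (window, bytes_consumed)."""
    needed = end + value_len
    # Bytes before `start` cannot be skipped without seek(), so they are read too.
    data = stream.read(needed)
    return data[start:], len(data)


stream = io.BytesIO(b"\x00" * 5000)  # stands in for a remote, sequential source
window, consumed = read_for_match(stream, 1750, 1950, len(b"xl/"))
print(len(window), consumed)
```

The window that is actually inspected is small, but everything in front of it has to be transferred first.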
Comment 4 GitLab Migration User 2018-10-13 10:37:09 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xdg/shared-mime-info/issues/21.

