Bug 94409 - update-mime-database writes wrong literal count to mime.cache
Summary: update-mime-database writes wrong literal count to mime.cache
Status: RESOLVED FIXED
Alias: None
Product: shared-mime-info
Classification: Unclassified
Component: general (show other bugs)
Version: unspecified
Hardware: Other All
: high major
Assignee: Shared Mime Info group
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-05 23:16 UTC by Roman
Modified: 2016-09-20 13:38 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
example which produces wrong literal count (424 bytes, text/plain)
2016-03-05 23:16 UTC, Roman
Details
Write the correct length for literal and glob lists to the cache (1.72 KB, patch)
2016-07-11 20:01 UTC, Matthias Clasen
Details | Splinter Review
Fix for literal/glob patterns counting with no breaking other code (955 bytes, patch)
2016-07-16 21:03 UTC, Roman
Details | Splinter Review
upated patch (1.81 KB, patch)
2016-07-21 17:45 UTC, Matthias Clasen
Details | Splinter Review

Description Roman 2016-03-05 23:16:45 UTC
Created attachment 122122 [details]
example which produces wrong literal count

It seems like update-mime-database writes the count of different literals to mime.cache, while it should write its number in conjunction with mime types (LiteralEntry in spec) 

E.g. if many mime types have the same literal pattern, they will be counted as one.

See example in the attachment. It has two mime types.
text/x-cmake has glob-deleteall and CMakeLists.txt, hence two literal patterns.
text/x-dsrc has glob-deletall, hence one literal pattern.
So 2 different literal patterns in total, but these must be 3 literal entries:

text/x-cmake __NOGLOBS__
text/x-dsrc  __NOGLOBS__
text/x-cmake CMakeLists.txt

literal count must be 3 upon updating.
Start update-mime-database with -V flag (verbose). It will show in which position literal globs were written. Open mime.cache in hex editor, go to this position, it will have 00 00 00 02 value.
Comment 1 Roman 2016-07-08 12:57:40 UTC
Can anyone look at it? This is serious bug for me.
Comment 2 Matthias Clasen 2016-07-11 12:49:09 UTC
Can you describe what misbehavior this causes ? E.g. do you have a program that crashes due to this ?
Comment 3 Roman 2016-07-11 13:52:40 UTC
(In reply to Matthias Clasen from comment #2)
> Can you describe what misbehavior this causes ? E.g. do you have a program
> that crashes due to this ?

I have a library (my own implementation of shared mime database spec) that detects MIME types via mime.cache files. For now it can't read literal patterns properly because update-mime-database writes wrong number of literals into generated mime.cache, hence it's unable to detect MIME type properly in some cases.

I found this bug by writing tests for my implementation of spec and using generated mime.cache as test input.

I hope you understood the problem. If not let's look at how literal entries are defined in spec:

LiteralList:
4			CARD32		N_LITERALS
12*N_LITERALS		LiteralEntry	

LiteralEntry:
4			CARD32		LITERAL_OFFSET
4			CARD32		MIME_TYPE_OFFSET
4			CARD32		WEIGHT in lower 8 bits
                                        FLAGS in rest:
                                        0x100 = case-sensitive

Implementations should read N_LITERALS and then this number of LiteralEntry. update-mime-database writes all existing pairs of literal/mimetype, but since it writes the wrong N_LITERALS, it's impossible to find out the real count of literal and read them all.

It's hard to find other affected program, because most don't respect glob-deleteall at all or detect types using other files (e.g. globs2 or MEDIA/TYPE.xml).
Comment 4 Roman 2016-07-11 14:32:12 UTC
(In reply to Matthias Clasen from comment #2)
> Can you describe what misbehavior this causes ? E.g. do you have a program
> that crashes due to this ?

Ok, I found at least one affected programs. It's dolphin for KDE4 and pcmanfm.

Having source file:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="application/x-mymakefile">
    <comment>My make file</comment>
    <glob pattern="myfile"/>
    <glob pattern="mymakefile"/>
    <icon name="image-png"/>
    <sub-class-of type="text/plain"/>
  </mime-type>
  <mime-type type="application/x-mycfile">
    <comment>My C file</comment>
    <glob pattern="myfile"/>
    <glob pattern="mycfile"/>
    <icon name="image-png"/>
    <sub-class-of type="text/plain"/>
  </mime-type>
</mime-info>

After update-mime-database generated mime.cache, it has 00 00 00 03 value in literal offset, but actually lists all 4 combinations:

mycfile:application/x-mycfile
myfile:application/x-mymakefile
myfile:application/x-mycfile
mymakefile:application/x-mymakefile

But pcmanfm and dolphin reads only the first three because of wrong literal count and fails to detect type for mymakefile.
Comment 5 Roman 2016-07-11 14:35:22 UTC
Nemo and Nautilous are affected too.
Comment 6 Matthias Clasen 2016-07-11 14:55:12 UTC
Not saying that there is no bug, but I will say that it is likely intentional that we don't store identical strings multiple times. I'll have to reread the spec and the implementation more carefully.
Comment 7 Roman 2016-07-11 15:06:11 UTC
(In reply to Matthias Clasen from comment #6)
> Not saying that there is no bug, but I will say that it is likely
> intentional that we don't store identical strings multiple times. I'll have
> to reread the spec and the implementation more carefully.

Strings are not stored multiple times. Offsets are. As they should. So nothing wrong with that.

In provided example I just followed offsets in hex editor to reproduce literal/mimetype relations and wrote them in human-friendly way.

The only problem is N_LITERALS which is written wrong by update-mime-database. It looks like it writes the number of different literal patterns instead of the number of combinations.
Comment 8 Matthias Clasen 2016-07-11 20:01:35 UTC
Created attachment 125012 [details] [review]
Write the correct length for literal and glob lists to the cache

As pointed out in bugzilla, we were not counting 'duplicates'
properly here.
Comment 9 Matthias Clasen 2016-07-11 20:02:09 UTC
Here is an untested patch; please let me know if it fixes the problem for you
Comment 10 Roman 2016-07-11 22:32:18 UTC
(In reply to Matthias Clasen from comment #9)
> Here is an untested patch; please let me know if it fixes the problem for you

I applied patch to shared-mime-info-1.6/update-mime-database.c, recompiled, then tried to run it on files from my first attachment, it gave me error:

$ export XDG_DATA_HOME="$(pwd)"
$ ./update-mime-database $XDG_DATA_HOME/mime
**
GLib:ERROR:/build/glib2.0-dCKQ11/glib2.0-2.44.1/./glib/ghash.c:373:g_hash_table_lookup_node: assertion failed: (hash_table->ref_count > 0)
Aborted
Comment 11 Roman 2016-07-15 20:24:23 UTC
(In reply to Matthias Clasen from comment #9)
> Here is an untested patch; please let me know if it fixes the problem for you

Looks like you forgot assign map to count_data as data. After adding this line

count_data.data = map;

patch seems to work.
Comment 12 Roman 2016-07-16 19:30:09 UTC
(In reply to Matthias Clasen from comment #9)
> Here is an untested patch; please let me know if it fixes the problem for you

This breaks other things though. I suppose it's because you assumed that every entry type consists of 3 items, while e.g. alias entries have only 2.
Comment 13 Roman 2016-07-16 21:03:36 UTC
Created attachment 125115 [details] [review]
Fix for literal/glob patterns counting with no breaking other code

I made it work by ugly hack. Specifically I check if get_value is equal to get_glob_list_value and only then count items using CoundData.
Comment 14 Matthias Clasen 2016-07-21 17:45:19 UTC
Created attachment 125238 [details] [review]
upated patch

We can use the 'weighted' parameter to determine whether individual list entries are 3 words or 2 words each.
Comment 15 Roman 2016-08-01 20:13:09 UTC
(In reply to Matthias Clasen from comment #14)
> Created attachment 125238 [details] [review] [review]
> upated patch
> 
> We can use the 'weighted' parameter to determine whether individual list
> entries are 3 words or 2 words each.

Works for me. All my tests are passed with this patch.
Comment 16 Roman 2016-09-20 13:20:22 UTC
(In reply to Matthias Clasen from comment #14)
> Created attachment 125238 [details] [review] [review]
> upated patch
> 
> We can use the 'weighted' parameter to determine whether individual list
> entries are 3 words or 2 words each.

Does shared-mime-info 1.7 include this fix? I don't see it.
Will it be included in the next release?
Comment 17 Bastien Nocera 2016-09-20 13:37:43 UTC
(In reply to Roman from comment #16)
> (In reply to Matthias Clasen from comment #14)
> > Created attachment 125238 [details] [review] [review] [review]
> > upated patch
> > 
> > We can use the 'weighted' parameter to determine whether individual list
> > entries are 3 words or 2 words each.
> 
> Does shared-mime-info 1.7 include this fix? I don't see it.

It's not, otherwise the bug would have been marked as fixed.

> Will it be included in the next release?

It will, I'm committing it now.
Comment 18 Bastien Nocera 2016-09-20 13:38:14 UTC
Attachment 125238 [details] pushed as 0907448 - Write the correct length for literal and glob lists to the cache


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.