101204 – uchardet returns different results depending on the chunks size of the data passed to uchardet_handle_data()

Status:	RESOLVED FIXED

Alias:	None

Product:	uchardet
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Jehan
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2017-05-26 16:57 UTC by Sébastien Wilmet
Modified:	2017-05-28 13:31 UTC
CC List:	1 user

See Also:

Attachments
Example of a file in GIO for which the bug occurs (56.25 KB, text/x-csrc) 2017-05-26 16:57 UTC, Sébastien Wilmet

Description Sébastien Wilmet 2017-05-26 16:57:11 UTC

Created attachment 131525
Example of a file in GIO for which the bug occurs

See the attached file (it is a UTF-8 file coming from GIO; at line 496 it contains the character ’).

With uchardet git master:
$ uchardet task.c
MAC-CENTRALEUROPE

With uchardet 0.0.6:
$ uchardet task.c
UTF-8

With GtefFileLoader [1], by using the uchardet API (with uchardet git master or 0.0.6, same result):
chunk size: 8192
chunk size: 8192
chunk size: 8192
chunk size: 8192
chunk size: 8192
chunk size: 8192
chunk size: 8192
chunk size: 252
charset: ASCII

The same bug happens with several files in the GLib repository.

[1] GtefFileLoader: https://git.gnome.org/browse/gtef/tree/gtef/gtef-file-loader.c?h=2.0.1#n891

Comment 1 Jehan 2017-05-28 09:23:40 UTC

(In reply to Sébastien Wilmet from comment #0)
> With uchardet git master:
> $ uchardet task.c
> MAC-CENTRALEUROPE
> 
> With uchardet 0.0.6:
> $ uchardet task.c
> UTF-8

It is expected that MAC-CENTRALEUROPE could not be returned by 0.0.6, since support for it was added afterwards.

Still, it is a bit odd that Mac-CentralEurope is favored over UTF-8, since there are obviously no Central European languages in your file (though it is not really English either, strictly speaking. That is the problem with code and similar files that are not really natural language: uchardet does much better on natural languages, since its detection engines are trained on language statistics). But since UTF-8 detection is not based on language detection at all right now, the expected bias does not exist. I really need to start working on extending language detection to UTF-8. Hopefully it will improve a lot of things.


> With GtefFileLoader [1], by using the uchardet API (with uchardet git master
> or 0.0.6, same result):
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 252
> charset: ASCII
> 
> The same bug happens with several files in the GLib repository.
> 
> [1] GtefFileLoader:
> https://git.gnome.org/browse/gtef/tree/gtef/gtef-file-loader.c?h=2.0.1#n891

As for the difference between git master with the CLI tool and the API, there should not be one. The CLI tool uses the API only (no custom detection code nor "clever" shortcuts). Are you sure GtefFileLoader does not stop feeding the data before the end? In that case, it would return ASCII (the whole file except this one character is ASCII). I remember you used to have the same bug in the gedit code.

Take a look at the uchardet CLI tool code:
https://cgit.freedesktop.org/uchardet/uchardet/tree/src/tools/uchardet.cpp#n54

As far as I can see, it does exactly the same as your code. The only difference is that it uses lower-level IO functions. So I am guessing that either your IO functions do not feed the full file (could they return NULL early?), or they somehow corrupt the raw data. You could try recreating the file from the result of _gtef_file_content_loader_get_content() and diffing it against the original.

Comment 2 Sébastien Wilmet 2017-05-28 10:06:13 UTC

In src/tools/uchardet.cpp, the BUFFER_SIZE is equal to 65536. If I change that value to 8192, then it returns ASCII like in GtefFileLoader.

The documentation of uchardet_handle_data() says:

> The detector is able to shortcut processing when it reaches certainty
> for an encoding

Comment 3 Jehan 2017-05-28 10:34:27 UTC

(In reply to Sébastien Wilmet from comment #2)
> In src/tools/uchardet.cpp, the BUFFER_SIZE is equal to 65536. If I change
> that value to 8192, then it returns ASCII like in GtefFileLoader.

I see. I will look into this.
 
> The documentation of uchardet_handle_data() says:
> 
> > The detector is able to shortcut processing when it reaches certainty
> > for an encoding

Indeed, there are some shortcuts like this, and they may actually be wrong. As far as I can tell right now, the only certainty we can have is that a charset is wrong (i.e. when hitting an invalid byte or byte sequence). Other than that, we need to process the full data (of course, applications are free to take shortcuts of their own by not handling all the data, for instance for faster processing, but uchardet itself should not).

So I will have to look into this, in case we do anything that we shouldn't.
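
To illustrate the point about certainty, here is a minimal sketch in C of an incremental UTF-8 check that is fed chunk by chunk. It is not uchardet code: the names utf8_check, utf8_check_feed() and utf8_check_is_valid_at_end() are hypothetical, and the validation is deliberately simplified (it does not reject every overlong form or surrogate code points). What it shows is that "invalid" can be concluded as soon as a bad byte appears, while "valid" can only be concluded once all the data has been seen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical incremental UTF-8 checker, for illustration only. */
typedef struct {
    int pending;   /* continuation bytes still expected for the current sequence */
    int invalid;   /* set once an impossible byte is seen; never cleared */
    int non_ascii; /* set once any byte >= 0x80 is seen */
} utf8_check;

static void utf8_check_init(utf8_check *c)
{
    c->pending = 0;
    c->invalid = 0;
    c->non_ascii = 0;
}

/* Feed one chunk. State is kept between calls, so a multi-byte
 * sequence may be split across two chunks. */
static void utf8_check_feed(utf8_check *c, const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len && !c->invalid; i++) {
        unsigned char b = data[i];
        if (b >= 0x80)
            c->non_ascii = 1;
        if (c->pending > 0) {
            if ((b & 0xC0) == 0x80)
                c->pending--;        /* valid continuation byte */
            else
                c->invalid = 1;      /* certainty: this is not UTF-8 */
        } else if (b < 0x80) {
            /* plain ASCII byte, nothing to track */
        } else if (b >= 0xC2 && b <= 0xDF) {
            c->pending = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            c->pending = 2;
        } else if (b >= 0xF0 && b <= 0xF4) {
            c->pending = 3;
        } else {
            c->invalid = 1;          /* byte that can never start a sequence */
        }
    }
}

/* Only meaningful once every chunk has been fed. */
static int utf8_check_is_valid_at_end(const utf8_check *c)
{
    return !c->invalid && c->pending == 0;
}
```

After a first chunk that is pure ASCII, non_ascii is still 0, but that says nothing about the chunks that follow, which is why a verdict of "ASCII" is only safe at the end of the data.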

Comment 4 Sébastien Wilmet 2017-05-28 10:56:39 UTC

Thanks.

For now I've added this workaround in GtefFileLoader:
https://git.gnome.org/browse/gtef/commit/?id=997216f974eeefce0c3c717e5da45921a8654133

It works fine for what I wanted to do.

task.c is 57596 bytes. With a chunk size of 65536, uchardet is maybe forced to look at all the data, while with a smaller chunk size it takes a shortcut.
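
As a quick check of the numbers, the small C sketch below computes how a reader loop would split a file of a given size into calls. The helper plan_chunks() is hypothetical and only does the arithmetic; it does not call uchardet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how "total" bytes split into reads of "chunk_size". */
typedef struct {
    size_t full_chunks; /* number of reads that fill the whole buffer */
    size_t last_chunk;  /* size of the final, shorter read (0 if none) */
} chunk_plan;

static chunk_plan plan_chunks(size_t total, size_t chunk_size)
{
    chunk_plan p;
    assert(chunk_size > 0);
    p.full_chunks = total / chunk_size;
    p.last_chunk = total % chunk_size;
    return p;
}
```

For 57596 bytes and a chunk size of 8192 this gives 7 full chunks plus a last chunk of 252 bytes, which matches the log in the description. With a chunk size of 65536 there is no full chunk and the whole file arrives in a single call of 57596 bytes.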

Comment 5 Jehan 2017-05-28 12:17:36 UTC

Fixed.

commit 98bf4d73fdc1400a16209cb55840fd7dd46632ab
Author: Jehan <jehan@girinstud.io>
Date:   Sun May 28 14:06:53 2017 +0200

    Bug 101204 - different results with different chunk sizes.
    
    ASCII and ISO-8859-1 should not be detected in
    nsUniversalDetector::HandleData() but in nsUniversalDetector::DataEnd()
    instead. Otherwise it creates an unwanted shortcut from the first call
    to uchardet_handle_data() if the input is broken into several pieces and
    if the first chunk happens to be ASCII (or ASCII + NBSP).

 src/nsUniversalDetector.cpp | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)
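
The effect described in the commit message can be reproduced with a toy model. The C sketch below is not uchardet's real code: toy_detector and its functions are hypothetical, and the "detector" only distinguishes ASCII from non-ASCII. With decide_early set to 1 it mimics the old behavior (a verdict taken inside the data-handling call), and with 0 it mimics the fixed behavior (the verdict is taken only when the data ends).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy detector, for illustration only. */
typedef struct {
    int done;            /* a verdict has been reached */
    int saw_high_byte;   /* some byte >= 0x80 was seen */
    int decide_early;    /* 1 = shortcut in handle_data, 0 = decide at data end */
    const char *charset; /* verdict, NULL until decided */
} toy_detector;

static void toy_init(toy_detector *d, int decide_early)
{
    memset(d, 0, sizeof *d);
    d->decide_early = decide_early;
    d->charset = NULL;
}

static void toy_handle_data(toy_detector *d, const unsigned char *buf, size_t len)
{
    if (d->done)
        return; /* once decided, later chunks are ignored */
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            d->saw_high_byte = 1;
    if (d->decide_early && !d->saw_high_byte) {
        /* Unwanted shortcut: everything seen so far is ASCII, so stop here. */
        d->charset = "ASCII";
        d->done = 1;
    }
}

static void toy_data_end(toy_detector *d)
{
    if (!d->done) {
        d->charset = d->saw_high_byte ? "non-ASCII" : "ASCII";
        d->done = 1;
    }
}

/* Feed buf in pieces of at most "chunk" bytes and return the verdict. */
static const char *toy_detect(const unsigned char *buf, size_t len,
                              size_t chunk, int decide_early)
{
    toy_detector d;
    assert(chunk > 0);
    toy_init(&d, decide_early);
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        toy_handle_data(&d, buf + off, n);
    }
    toy_data_end(&d);
    return d.charset; /* points to a string literal, safe to return */
}
```

With decide_early set to 1, an input whose first chunk is pure ASCII yields "ASCII" for a small chunk size and "non-ASCII" when the whole input is passed in one call. With decide_early set to 0, the result no longer depends on the chunk size.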

Comment 6 Jehan 2017-05-28 12:35:13 UTC

For info, I opened bug 101218 as a reminder to test with this file when I improve UTF-8 detection with language awareness.

Comment 7 Sébastien Wilmet 2017-05-28 13:31:52 UTC

Thanks a lot for the bug fix!

It works fine, I've reverted my commit in Gtef.
