|Summary:||uchardet returns different results depending on the chunks size of the data passed to uchardet_handle_data()|
|Product:||uchardet||Reporter:||Sébastien Wilmet <swilmet>|
|Status:||RESOLVED FIXED||QA Contact:|
|Attachments:||Example of a file in GIO for which the bug occurs|
Description Sébastien Wilmet 2017-05-26 16:57:11 UTC
Created attachment 131525: Example of a file in GIO for which the bug occurs

See the attached file (it is a UTF-8 file coming from GIO; at line 496 it contains the character ’).

With uchardet git master:
  $ uchardet task.c
  MAC-CENTRALEUROPE

With uchardet 0.0.6:
  $ uchardet task.c
  UTF-8

With GtefFileLoader, by using the uchardet API (with uchardet git master or 0.0.6, same result):
  chunk size: 8192
  chunk size: 8192
  chunk size: 8192
  chunk size: 8192
  chunk size: 8192
  chunk size: 8192
  chunk size: 8192
  chunk size: 252
  charset: ASCII

The same bug happens with several files in the GLib repository.

GtefFileLoader: https://git.gnome.org/browse/gtef/tree/gtef/gtef-file-loader.c?h=2.0.1#n891
Comment 1 Jehan 2017-05-28 09:23:40 UTC
(In reply to Sébastien Wilmet from comment #0)
> With uchardet git master:
> $ uchardet task.c
> MAC-CENTRALEUROPE
>
> With uchardet 0.0.6:
> $ uchardet task.c
> UTF-8

It is expected that MAC-CENTRALEUROPE could not be returned in 0.0.6 or earlier, since it was added afterwards. Still, it is a bit odd that Mac-CentralEurope is favored over UTF-8, since there is obviously no Central European language in your file (though it is not really English either, per se). This is the problem with code and similar files that are not really natural language: uchardet does much better on natural languages, since its detection engines are trained on language data statistics. Yet since UTF-8 detection is not based on language detection at all right now, the expected bias does not exist. I really need to start working on extending the language detection to UTF-8. Hopefully it will improve a lot of things.

> With GtefFileLoader , by using the uchardet API (with uchardet git master
> or 0.0.6, same result):
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 8192
> chunk size: 252
> charset: ASCII
>
> The same bug happens with several files in the GLib repository.
>
> GtefFileLoader:
> https://git.gnome.org/browse/gtef/tree/gtef/gtef-file-loader.c?h=2.0.1#n891

As for the difference between git master with the CLI tool and the API, there should not be one. The CLI tool uses only the API (no custom detection code nor "clever" shortcuts). Are you sure GtefFileLoader does not stop feeding the data before the end? In that case it would return ASCII (the whole file except this one character is ASCII). I remember you used to have the same bug in the gedit code.

Take a look at the uchardet CLI tool code: https://cgit.freedesktop.org/uchardet/uchardet/tree/src/tools/uchardet.cpp#n54

As far as I can see, it does exactly the same as your code. The only difference is that it uses some lower-level IO functions.

So I am guessing that either your IO functions do not feed the full file (could they return NULL earlier?), or they somehow corrupt the raw data. You could try recreating the file from the result of _gtef_file_content_loader_get_content() and diffing it against the original.
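The round-trip check suggested above can be sketched as follows. This is a minimal Python sketch, not Gtef code: `read_in_chunks`, `chunks_match_file` and `make_sample_file` are hypothetical helpers, with `read_in_chunks` standing in for whatever loader produces the chunks, and the 8192-byte chunk size is the one from the report.

```python
import os
import tempfile


def read_in_chunks(path, chunk_size=8192):
    """Yield the file's raw bytes in pieces of at most chunk_size bytes."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk


def chunks_match_file(path, chunk_size=8192):
    """Return True if the reassembled chunks equal the original bytes."""
    with open(path, "rb") as stream:
        original = stream.read()
    return b"".join(read_in_chunks(path, chunk_size)) == original


def make_sample_file():
    """Write a mostly-ASCII file with one non-ASCII character near the end."""
    data = b"/* ascii */\n" * 4000 + "’".encode("utf-8") + b"\n"
    handle, path = tempfile.mkstemp()
    with os.fdopen(handle, "wb") as stream:
        stream.write(data)
    return path
```

If the comparison holds for the real loader, truncated or corrupted input can be ruled out, and any difference in the detected charset has to come from how the detector treats the chunk boundaries.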
Comment 2 Sébastien Wilmet 2017-05-28 10:06:13 UTC
In src/tools/uchardet.cpp, BUFFER_SIZE is equal to 65536. If I change that value to 8192, then it returns ASCII, as in GtefFileLoader.

The documentation of uchardet_handle_data() says:
> The detector is able to shortcut processing when it reaches certainty
> for an encoding
Comment 3 Jehan 2017-05-28 10:34:27 UTC
(In reply to Sébastien Wilmet from comment #2)
> In src/tools/uchardet.cpp, the BUFFER_SIZE is equal to 65536. If I change
> that value to 8192, then it returns ASCII like in GtefFileLoader.

I see. I will look into this.

> The documentation of uchardet_handle_data() says:
>
> > The detector is able to shortcut processing when it reaches certainty
> > for an encoding

Indeed, there are some shortcuts like this. This may actually be wrong. As far as I can think right now, the only certainty we can have is that a charset is wrong (i.e. when hitting an invalid byte or byte sequence). Other than this, we need to process the full data (of course, software is allowed to take shortcuts of its own by not handling all the data, for instance for faster processing, but uchardet should not). So I will have to look into this, in case we do anything that we shouldn't.
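The asymmetry described above (a charset can be ruled out early, but only confirmed once all the data has been seen) can be illustrated with Python's standard incremental UTF-8 decoder. This is only an illustration of the principle, not how uchardet works internally; `utf8_verdict` is a hypothetical helper name.

```python
import codecs


def utf8_verdict(chunks):
    """Feed byte chunks to an incremental UTF-8 decoder.

    Returns "invalid" as soon as a chunk contains a byte sequence that
    cannot be UTF-8, and "valid" only after every chunk has been seen.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for chunk in chunks:
            decoder.decode(chunk)
        # Flush: a multi-byte sequence cut off at end of input is an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "invalid"
    return "valid"
```

Note that a multi-byte character split across two chunks is still handled correctly, because the decoder keeps the pending bytes between calls. A clean first chunk, on the other hand, proves nothing about the chunks that follow.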
Comment 4 Sébastien Wilmet 2017-05-28 10:56:39 UTC
Thanks. For now I've added this workaround in GtefFileLoader: https://git.gnome.org/browse/gtef/commit/?id=997216f974eeefce0c3c717e5da45921a8654133

It works fine for what I wanted to do.

task.c is 57596 bytes. With a chunk size of 65536, uchardet is maybe forced to look at all the data, while with a smaller chunk size it takes a shortcut.
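The chunk arithmetic behind this observation can be checked quickly. The sketch below only uses the file size and the two buffer sizes quoted in this thread; `chunk_sizes` is a hypothetical helper.

```python
def chunk_sizes(total_size, chunk_size):
    """Return the sizes of the successive chunks a reader would produce."""
    full, rest = divmod(total_size, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


FILE_SIZE = 57596  # size of task.c reported in this thread

print(chunk_sizes(FILE_SIZE, 8192))   # seven 8192-byte chunks, then 252 bytes
print(chunk_sizes(FILE_SIZE, 65536))  # a single 57596-byte chunk
```

With 8192-byte reads the detector is called eight times, matching the log in the description. With 65536-byte reads it sees the whole file in a single call, so a decision taken after the first call already covers all the data.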
Comment 5 Jehan 2017-05-28 12:17:36 UTC
Fixed.

commit 98bf4d73fdc1400a16209cb55840fd7dd46632ab
Author: Jehan <email@example.com>
Date:   Sun May 28 14:06:53 2017 +0200

    Bug 101204 - different results with different chunk sizes.

    ASCII and ISO-8859-1 should not be detected in
    nsUniversalDetector::HandleData() but in
    nsUniversalDetector::DataEnd() instead. Otherwise it creates an
    unwanted shortcut from the first call to uchardet_handle_data() if
    the input is broken into several pieces and if the first chunk
    happens to be ASCII (or ASCII + NBSP).

 src/nsUniversalDetector.cpp | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)
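The effect of moving the decision from the per-chunk call to the end-of-data call can be shown with a toy model. This is a minimal sketch under strong assumptions and is not uchardet code: `ToyDetector`, `split` and `detect` are hypothetical, and the toy only distinguishes ASCII from "anything else", which it simply labels UTF-8.

```python
def is_ascii(data):
    """True if every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in data)


class ToyDetector:
    """Toy model, not uchardet code: tells ASCII apart from non-ASCII only."""

    def __init__(self, shortcut):
        self.shortcut = shortcut  # True imitates the behaviour before the fix
        self.seen_non_ascii = False
        self.result = None

    def handle_data(self, chunk):
        if self.result is not None:
            return  # a verdict was already reached; later chunks are ignored
        if not is_ascii(chunk):
            self.seen_non_ascii = True
        if self.shortcut and not self.seen_non_ascii:
            self.result = "ASCII"  # premature: only the data so far was seen

    def data_end(self):
        if self.result is None:
            # Assumption of this toy: any non-ASCII input is called UTF-8.
            self.result = "UTF-8" if self.seen_non_ascii else "ASCII"
        return self.result


def split(data, size):
    """Cut data into consecutive chunks of at most size bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def detect(chunks, shortcut):
    detector = ToyDetector(shortcut)
    for chunk in chunks:
        detector.handle_data(chunk)
    return detector.data_end()
```

With a 57596-byte input whose only non-ASCII character sits near the end, the shortcut variant answers ASCII for 8192-byte chunks but UTF-8 for a single 65536-byte read, while the variant that decides in `data_end()` gives the same answer for both chunk sizes.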
Comment 6 Jehan 2017-05-28 12:35:13 UTC
For info, I opened bug 101218 as a reminder to test with this file when I improve UTF-8 detection with language awareness.
Comment 7 Sébastien Wilmet 2017-05-28 13:31:52 UTC
Thanks a lot for the bug fix! It works fine, I've reverted my commit in Gtef.