101218 – Improve robustness for UTF-8 with language awareness

Status:	RESOLVED MOVED

Alias:	None

Product:	uchardet
Classification:	Unclassified
Component:	language/encoding
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Jehan
Blocks:	101310 102292

Reported:	2017-05-28 12:34 UTC by Jehan
Modified:	2018-10-12 21:35 UTC
CC List:	3 users

Description Jehan 2017-05-28 12:34:11 UTC

Bug 101204 has an example file that is detected as MAC-CENTRALEUROPE although it is actually UTF-8 (entirely ASCII except for a single non-ASCII character). The point is that the file is technically valid in both encodings.

With the current code, the confidence for UTF-8 (without language awareness) is 0.505, whereas it is 0.535104 for MAC-CENTRALEUROPE. Both confidences are quite low, and which of the two encodings wins is mostly a matter of chance.

IMO the decision should be made on language detection, as is already the case for single-byte encodings.
The attached file is code, but it is still close enough to natural English that I believe the confidence should rise for the pair (UTF-8, English) rather than for a generic UTF-8 detection.
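
To illustrate the idea of scoring (encoding, language) pairs instead of bare encodings, here is a minimal Python sketch. It is not uchardet code: the stopword lists, the `enc_conf * (1 + language_score)` combination, the sample text, and the pairing of MAC-CENTRALEUROPE with Polish are all assumptions made for the example. Only the two confidence values come from this report. Python's `mac_latin2` codec stands in for MAC-CENTRALEUROPE.

```python
# Toy stopword lists (assumption). A real detector would more likely use
# per-language character or sequence frequency models.
STOPWORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "for", "this", "that", "with"},
    "pl": {"i", "w", "na", "z", "do", "nie", "to", "jest", "że", "się"},
}


def language_score(text, lang):
    """Fraction of words in `text` that are stopwords of `lang` (0.0 to 1.0)."""
    words = [w.strip(".,;:()*/").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in STOPWORDS[lang])
    return hits / len(words)


def rank_pairs(data, candidates):
    """Rank (encoding, language) pairs for the byte string `data`.

    `candidates` is a list of (encoding, language, encoding_confidence).
    The combined score boosts the encoding confidence by the language score;
    this formula is an illustrative assumption.
    """
    ranked = []
    for encoding, lang, enc_conf in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not even valid in this encoding
        score = enc_conf * (1.0 + language_score(text, lang))
        ranked.append((score, encoding, lang))
    ranked.sort(reverse=True)
    return ranked


# Hypothetical sample: English text, ASCII except for one character.
sample = (
    "This file is part of the project and is free for you to use. "
    "Copyright Sébastien"
).encode("utf-8")

ranking = rank_pairs(
    sample,
    [("utf-8", "en", 0.505), ("mac_latin2", "pl", 0.535104)],
)
```

With these toy inputs the (UTF-8, English) pair ends up ahead even though its raw encoding confidence is the lower of the two, which is the behaviour asked for above.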

Comment 1 Sébastien Wilmet 2017-10-15 12:24:16 UTC

I have other source code files which are, I think, entirely ASCII except for my name in the copyright line of the license header: Sébastien. Those should be detected as UTF-8, but they are instead recognized as other encodings, for example IBM852.
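
The ambiguity can be reproduced with a short Python sketch. The header line is a made-up example (only the name comes from this comment), and Python's `cp852` and `mac_latin2` codecs are used as stand-ins for IBM852 and MAC-CENTRALEUROPE. The same bytes decode without error in all three encodings, so validity alone cannot single out UTF-8.

```python
# Hypothetical license header line, encoded as UTF-8.
data = "Copyright (C) 2017 Sébastien".encode("utf-8")


def decodes_cleanly(raw, encoding):
    """Return True if `raw` is a valid byte sequence in `encoding`."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Number of bytes outside the ASCII range (the single "é" takes two in UTF-8).
non_ascii_bytes = sum(1 for b in data if b >= 0x80)

# Every candidate accepts the bytes; they only disagree on what the two
# non-ASCII bytes mean.
valid = {enc: decodes_cleanly(data, enc) for enc in ("utf-8", "cp852", "mac_latin2")}
decoded = {enc: data.decode(enc) for enc in valid}

for enc, text in decoded.items():
    print(f"{enc:12} {text!r}")
```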

UTF-8 is the encoding of my locale. For local files, I think it makes sense to prioritize the encoding of the current locale. In GtkSourceView, the file loader (which is not based on uchardet) takes a list of encodings to try one by one, sorted in decreasing order of priority. That list depends on the current locale (it can be different for each language):
https://git.gnome.org/browse/gtksourceview/tree/gtksourceview/gtksourceencoding.c?h=3.99.6#n624

Maybe uchardet could take such a list of encodings as input, so that if there is no clear winner, it chooses the candidate with the highest priority.
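
A minimal Python sketch of this tie-break follows. uchardet has no such API today: the function name `pick_encoding`, the `margin` threshold that defines "no clear winner", and the example priority list are hypothetical. The two confidence values are the ones quoted in the description.

```python
def pick_encoding(confidences, priority, margin=0.05):
    """Pick an encoding, using `priority` to break near-ties.

    confidences: dict mapping encoding name to detector confidence.
    priority:    encoding names in decreasing order of preference,
                 e.g. derived from the current locale.
    margin:      candidates within this distance of the best confidence
                 count as tied (assumed threshold).
    """
    if not confidences:
        return None
    best = max(confidences.values())
    contenders = [enc for enc, conf in confidences.items() if best - conf <= margin]
    if len(contenders) == 1:
        return contenders[0]  # clear winner, priority list not needed
    for enc in priority:
        if enc in contenders:
            return enc  # highest-priority encoding among the near-ties
    # None of the tied candidates is in the priority list: fall back to
    # the raw confidence.
    return max(contenders, key=confidences.get)


scores = {"UTF-8": 0.505, "MAC-CENTRALEUROPE": 0.535104}
locale_priority = ["UTF-8", "ISO-8859-15"]  # hypothetical list for a UTF-8 locale

chosen = pick_encoding(scores, locale_priority)
```

With a margin of 0.05 the two candidates count as tied and the locale's UTF-8 is chosen. With a much smaller margin the higher raw confidence wins, so the threshold would need tuning against real files.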

Comment 2 GitLab Migration User 2018-10-12 21:35:02 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/uchardet/uchardet/issues/2.
