Bug 102292

Summary: UTF-8 Italian text recognized as ISO-8859-1 Portuguese
Product: uchardet
Reporter: Jehan <jehan>
Component: language/encoding
Assignee: Jehan <jehan>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: uchardet
Version: unspecified
Hardware: Other
OS: All
Whiteboard:
Bug Depends on: 101218    
Bug Blocks:    
Attachments: UTF-8 text.

Description Jehan 2017-08-18 12:11:28 UTC
Created attachment 133604
UTF-8 text.

See: https://github.com/BYVoid/uchardet/issues/36#issuecomment-323316171

The attached text is UTF-8 Italian, but since commit e138839f0753e223f7aa2733e8ed829b47a67cac (Portuguese support for ISO-8859-1), this text is recognized as ISO-8859-1.

I am not sure there is a proper solution, apart from removing Portuguese support in the short term and adding actual language detection for UTF-8 in the longer term (see bug 101218).

Also, the file holds just 2 words, which obviously makes it a difficult guess for a system based on statistics.
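
To illustrate why such a short input is ambiguous, here is a minimal Python sketch. It is not uchardet's code, and the sample string is a made-up stand-in for the attachment, whose exact content is not reproduced here. It only shows that any UTF-8 byte sequence also decodes without error as ISO-8859-1, so a detector cannot rule out ISO-8859-1 on validity alone and has to fall back on statistics, whereas the reverse direction usually fails strict UTF-8 validation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical two-word Italian sample (NOT the actual attachment).
text = "perché città"
sample = text.encode("utf-8")

# ISO-8859-1 assigns a character to every byte value, so this never fails:
# the UTF-8 bytes are also "valid" ISO-8859-1, just with a different meaning.
as_latin1 = sample.decode("iso-8859-1")

# The opposite direction is stricter: an accented letter encoded as a single
# ISO-8859-1 byte is not a well-formed UTF-8 sequence.
latin1_bytes = "perché".encode("iso-8859-1")

print(is_valid_utf8(sample))        # UTF-8 input passes validation
print(as_latin1 != text)            # but reads as mojibake in ISO-8859-1
print(is_valid_utf8(latin1_bytes))  # ISO-8859-1 input fails validation
```

With only two words, the handful of non-ASCII bytes gives a statistical model very little to separate the two readings, which matches the difficulty described above.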
Comment 1 GitLab Migration User 2018-10-12 21:35:15 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/uchardet/uchardet/issues/6.
