Bug 102292 - UTF-8 Italian text recognized as ISO-8859-1 Portuguese
Alias: None
Product: uchardet
Classification: Unclassified
Component: language/encoding
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Jehan
QA Contact:
Depends on: 101218
Reported: 2017-08-18 12:11 UTC by Jehan
Modified: 2018-10-12 21:35 UTC (History)

UTF-8 text. (14 bytes, text/plain)
2017-08-18 12:11 UTC, Jehan

Description Jehan 2017-08-18 12:11:28 UTC
Created attachment 133604
UTF-8 text.

See: https://github.com/BYVoid/uchardet/issues/36#issuecomment-323316171

The attached text is Italian encoded as UTF-8, but since commit e138839f0753e223f7aa2733e8ed829b47a67cac (Portuguese support for ISO-8859-1), it is detected as ISO-8859-1 Portuguese.

I am not sure there is a proper solution, other than removing Portuguese support in the short term and adding actual language detection for UTF-8 in the longer term (see bug 101218).

Also, the file holds just 2 words, which obviously makes it a difficult guess for a system based on statistics.
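
To illustrate why such input is ambiguous, here is a minimal Python sketch. It is not uchardet code and does not reproduce its detection logic. The sample string is made up for the example and is not the content of attachment 133604; the names `sample` and `is_valid_utf8` are hypothetical. It only shows that a short UTF-8 text with accented letters is also a perfectly decodable ISO-8859-1 byte sequence, so a detector has to choose between the two on other grounds.

```python
# Hypothetical two-word Italian sample (NOT the bytes of attachment 133604).
sample = "città è".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a structurally valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ISO-8859-1 maps every byte value to a character, so decoding never fails.
# The result is mojibake, but nothing in the bytes themselves rules it out.
as_latin1 = sample.decode("iso-8859-1")

# The reverse does not hold: the same words encoded as ISO-8859-1 are
# not valid UTF-8, because a lone 0xE0 byte is an incomplete sequence.
latin1_bytes = "città è".encode("iso-8859-1")

if __name__ == "__main__":
    print(is_valid_utf8(sample))        # structural check on the UTF-8 bytes
    print(repr(as_latin1))              # what an ISO-8859-1 reading yields
    print(is_valid_utf8(latin1_bytes))  # structural check on the Latin-1 bytes
```

Structural validity is therefore only a one-way hint: valid multi-byte UTF-8 sequences make UTF-8 plausible, but with so few bytes a statistical language model for a single-byte charset can still win the comparison.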
Comment 1 GitLab Migration User 2018-10-12 21:35:15 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/uchardet/uchardet/issues/6.
