Bug 101310 - UTF-8 section symbol (0xC2A7) invokes TIS-620 decoding
Summary: UTF-8 section symbol (0xC2A7) invokes TIS-620 decoding
Status: RESOLVED MOVED
Alias: None
Product: uchardet
Classification: Unclassified
Component: language/encoding
Version: unspecified
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Jehan
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 101218
Blocks:
Reported: 2017-06-06 02:47 UTC by pokechu022+trackers-freedesktop
Modified: 2018-10-12 21:35 UTC

Attachments
File containing a single section sign in the midst of other text (1.23 KB, text/plain)
2017-06-06 02:47 UTC, pokechu022+trackers-freedesktop

Description pokechu022+trackers-freedesktop 2017-06-06 02:47:45 UTC
Created attachment 131730
File containing a single section sign in the midst of other text

A single occurrence of the section sign (§) encoded in UTF-8 (bytes 0xC2 0xA7) causes the file to be detected as TIS-620, even if the rest of the text is plain English. This can be reproduced with the attached file, which contains exactly one §. Curiously, adding more instances of § elsewhere in the file usually causes it to be correctly detected as UTF-8.

This may be a duplicate of bug 101218, but it's a more specific case.  This was first reported at https://github.com/notepad-plus-plus/notepad-plus-plus/issues/940, but I've narrowed it down to a bug in uchardet.
Comment 1 Jehan 2017-06-06 09:32:46 UTC
This is indeed very similar to bug 101218, in that the text is technically valid in both UTF-8 and TIS-620.
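
To illustrate the ambiguity, here is a minimal Python sketch (not uchardet code) that decodes the same bytes with Python's standard `utf-8`, `tis_620`, and `cp1252` codecs. The surrounding English sample text is made up for illustration; the attachment's actual contents are not reproduced here.

```python
# The two bytes that encode the section sign (U+00A7) in UTF-8.
SECTION_UTF8 = b"\xc2\xa7"

# Hypothetical sample text; only the two bytes above matter.
sample = b"See section " + SECTION_UTF8 + b" 4 for details."


def decode_all(data):
    """Decode data under each candidate encoding; None if decoding fails."""
    results = {}
    for name in ("utf-8", "tis_620", "cp1252"):
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


decoded = decode_all(sample)
for name, text in decoded.items():
    print(f"{name}: {text!r}")
```

All three decodes succeed: UTF-8 yields §, TIS-620 yields the two Thai letters U+0E22 and U+0E07, and Windows-1252 yields "Â§". Byte validity alone therefore cannot tell the encodings apart.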

For your attachment I get a confidence of 0.567169 for TIS-620, 0.5 for WINDOWS-1252 and 0.505 for UTF-8. These are all quite low confidences, so the choice between them is made essentially by chance. We should really add language detection for UTF-8.
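
As a rough illustration of how close these scores are, the sketch below ranks the three confidence values quoted in this comment and prints the gap between the top two. It assumes the detector simply reports the candidate with the highest confidence; it does not call uchardet, and the helper name `rank` is made up for this example.

```python
# Confidence values quoted in this comment for the attached file.
confidences = {
    "TIS-620": 0.567169,
    "WINDOWS-1252": 0.5,
    "UTF-8": 0.505,
}


def rank(scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank(confidences)
best, runner_up = ranked[0], ranked[1]
# Assumption: the winner is whichever candidate scores highest.
margin = best[1] - runner_up[1]

print(f"winner: {best[0]} ({best[1]:.6f})")
print(f"runner-up: {runner_up[0]} ({runner_up[1]:.6f})")
print(f"margin: {margin:.6f}")
```

TIS-620 leads UTF-8 by only about 0.06, which is consistent with the observation that a few more § characters are usually enough to change the result.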
Comment 2 GitLab Migration User 2018-10-12 21:35:09 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/uchardet/uchardet/issues/4.
