103280 – uchardet returns unknown encoding for empty/0-bytes files

Status:	RESOLVED NOTABUG

Alias:	None

Product:	uchardet
Classification:	Unclassified
Component:	language/encoding
Version:	unspecified
Hardware:	Other Linux (All)

Importance:	medium normal
Assignee:	Jehan

Reported:	2017-10-15 12:07 UTC by Sébastien Wilmet
Modified:	2017-11-06 19:30 UTC

Description Sébastien Wilmet 2017-10-15 12:07:12 UTC

uchardet returns unknown encoding for empty/0-bytes files.

As a result the file cannot be opened by the text editor if the file loader relies only on uchardet (with the Tepl library in my case).

I don't know if it's intentional. To fix it in Tepl it is possible to have a fallback mode in case uchardet fails: try each encoding from a list one by one, taking the first which returns 0 encoding conversion errors. But this is not yet implemented.
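The fallback described above could be sketched roughly as follows. This is a minimal Python illustration, not Tepl's actual code: the function name `first_clean_encoding` and the candidate list are hypothetical, and a real loader would choose its own candidates and order.

```python
from typing import Iterable, Optional

# Hypothetical candidate list, tried in order (an assumption for this sketch).
CANDIDATE_ENCODINGS = ("utf-8", "iso-8859-15", "windows-1252")


def first_clean_encoding(
    data: bytes, candidates: Iterable[str] = CANDIDATE_ENCODINGS
) -> Optional[str]:
    """Return the first candidate that decodes `data` with zero errors, else None."""
    for name in candidates:
        try:
            # Strict decoding raises on the first invalid byte sequence.
            data.decode(name, errors="strict")
        except (UnicodeDecodeError, LookupError):
            # Conversion error, or an encoding name Python does not know: try the next one.
            continue
        return name
    return None
```

Note that the order of the list decides the result: a single-byte encoding such as ISO-8859-15 accepts any byte sequence, so anything placed after it is never reached. An empty input is accepted by the first candidate.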

But maybe for empty files, uchardet can return ASCII or UTF-8, or (probably better) the encoding of the current locale.

Comment 1 Jehan 2017-11-06 01:28:52 UTC

> But maybe for empty files, uchardet can return ASCII or UTF-8, or (probably better) the encoding of the current locale.

Hmmm…
I'm not sure it is a very good idea to add too much "intelligence" inside uchardet. I think it is much more versatile if it doesn't try too much to guess what the application actually wants.

After all, "unknown encoding" feels like quite a good answer to me for an empty file (there is no data at all, so no way to guess the expected encoding). Of course, any encoding would be valid as well, so ASCII/UTF-8 or the current locale would indeed be acceptable results, but then we would be trying to do the application's job. In the case of a text editor, defaulting to the current locale when creating a new file is a good choice. But uchardet can be used by other kinds of software as well, and returning some encoding rather than a more honest "unknown" may completely mess up their logic.

So I think it makes much more sense to keep this result as dumb as possible and leave the logic of choosing a fallback encoding in such cases to the application developers.

Does that make sense?
Leaving the report open for now so that I don't forget to read your answer if you disagree and want to persuade me. But right now, I don't think returning a default encoding would be a good idea at all.

> As a result the file cannot be opened by the text editor if the file loader relies only on uchardet (with the Tepl library in my case).

Even when relying only on uchardet, I guess special-casing the 0-byte file would be a good idea. There is no need to run your fallback at all in such a case, which is not a real "failure". There is just no data, hence nothing to say.
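On the application side, that special case might look something like this. It is only a sketch under assumptions: `choose_encoding` is a hypothetical helper, `detect` stands in for whatever wrapper the application has around uchardet and is assumed to return an empty string for "unknown", and using the locale encoding for empty files is the text-editor policy mentioned above, not something uchardet does.

```python
import locale
from typing import Callable, Optional


def choose_encoding(data: bytes, detect: Callable[[bytes], str]) -> Optional[str]:
    """Pick an encoding for `data`.

    `detect` is a hypothetical detector wrapper that returns "" when the
    encoding is unknown. None means detection really failed on non-empty data.
    """
    if len(data) == 0:
        # No data at all: nothing to detect, so apply the application's own
        # default (here, the current locale's encoding) without calling detect.
        return locale.getpreferredencoding(False)

    detected = detect(data)
    if detected:
        return detected

    # Non-empty data that could not be identified: a genuine failure, left
    # to the caller to handle (for example with its own fallback strategy).
    return None
```

Keeping the two cases apart means the fallback only runs for real detection failures, while empty files are handled by policy alone.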

> To fix it in Tepl it is possible to have a fallback mode in case uchardet fails: try each encoding from a list one by one, taking the first which returns 0 encoding conversion errors. But this is not yet implemented.

I indeed remember this is also what you were doing in GtkSourceView, though as I already wrote back then, I don't think this is a very good solution either. I actually have some interesting algorithm ideas which could allow application developers to improve or customize detection. I will start working on these as soon as I release the next version of uchardet.

The new algorithms will be optional and won't change the historical API of uchardet since I want to keep it as simple and "dumb" as possible.

Comment 2 Sébastien Wilmet 2017-11-06 19:30:50 UTC

Yes, it makes sense not to have too much intelligence in uchardet and to be honest about the response it returns.

And now the fallback mode is implemented in TeplFileLoader, so opening empty files doesn't fail.

I'm looking forward to the new algorithms.
