Bugzilla – Bug 3867
Bad character encoding for submitted files, fix old files
Last modified: 2010-08-18 03:24:00 UTC
I was under the impression this bug was somehow addressed, but here it is: the
July submission includes some files (a dinosaur collection) by the author
"Moisés Rincón Maza".
Both Inkscape and librsvg (EOG, Nautilus) refuse to open the files because the
declared character encoding does not match the encoding actually used.
Created attachment 3138 [details]
xc/lib/X11/XInteractive.c (client side of extension protocol)
Attachment 3138 [details] appears to have gone missing, and I can't find anything by
anyone called Maza in release 0.16, so I'm not exactly sure what Nicu is
referring to here. But an example of an apparently different type of corruption
of non-ASCII characters is unsorted/starwalker_wilc_.svg, which contains
&amp;eacute;toile.
In this case a non-ASCII character has been converted to an HTML entity (instead
of UTF-8), and the & in the HTML entity has then been converted to &amp;.
OK, I found the files that Nicu was referring to. They didn't make it into
release 0.16, presumably because XML::Twig couldn't parse them either. Here's one:
The problem with these files is that the characters have been left in Latin-1
instead of being converted to UTF-8.
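For files in this state the repair is mechanical: re-decode the bytes as Latin-1 and re-encode them as UTF-8. A minimal sketch using Perl's Encode module (the sample bytes are illustrative, not taken from an actual file):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might appear in a broken file: "Moisés" stored as Latin-1.
my $latin1_bytes = "Mois\xE9s";

# Decode from Latin-1 into Perl's internal string form, then encode to
# the UTF-8 byte sequence that the SVG's XML declaration promises.
my $string     = decode('ISO-8859-1', $latin1_bytes);
my $utf8_bytes = encode('UTF-8', $string);

# Latin-1 0xE9 (é) becomes the two-byte UTF-8 sequence 0xC3 0xA9.
printf "%vX\n", $utf8_bytes;   # prints 4D.6F.69.73.C3.A9.73
```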
Yeah, f.d.o. has a problem with the server and the attachment is lost, but you
correctly identified the file, and indeed it didn't make it into the
release precisely because of this bug.
> I was under the impression this bug was somehow addressed
No, we back-burnered it until we got the hash bug fixed.
> But an example of an apparently different type of corruption
> of non-ASCII characters is unsorted/starwalker_wilc_.svg, which contains
*That* should no longer be happening, since Bryce's round of changes to
Metadata.pm circa the OCAL 0.14 era. The good thing about it is that at least
it's clear what the character is supposed to be.
Does SVGscan find these?
> OK, I found the files that Nicu was referring to. They didn't make it
> into release 0.16, presumably because XML::Twig couldn't parse them
> either. Here's one:
> The problem with these files is that the characters have been left in
> Latin-1 instead of being converted to UTF-8.
Yes, that is the encoding bug that we back-burnered because it was not
impacting as many files as the HASH bug. But the HASH bug is now fixed,
so it's time to look at this one again.
I suspect most or all of the files with this bug have probably ended up in
the failed-files archive for the 0.16 release. There will also be some in
the 0.17 failed files archive, no doubt, and for every release until we
fix the problem. However, the 0.16 failed files archive should include
all the ones from 0.13 through 0.16, and I think the ones from before 0.13
have all been repaired and reprocessed by this point, so there's no need
to go back any further than 0.16 I think.
The failed files from 0.16 can be found here:
I believe I know, too, what causes this bug: the web browser sends the form
contents in an encoding that is neither US-ASCII nor UTF-8 but some other one
(Latin-1 being a good example), yet when XML::Twig parses the submitted file,
it labels the encoding as UTF-8. What we need to do, probably in getforminput,
is convert the added metadata to UTF-8 as well, before it is inserted into the
metadata object. Someone told me that I should
look at the Encode module, the POD for which can be viewed here:
However, this documentation is hairy and scary and contains stuff like this:
> CAVEAT: When you run $octets = encode("utf8", $string), then $octets may
> not be equal to $string. Though they both contain the same data, the utf8
> flag for $octets is always off. When you encode anything, utf8 flag of
> the result is always off, even when it contains completely valid utf8
> string. See "The UTF-8 flag" below.
So it is obvious to me that anything I do with this module will need to be
tested thoroughly before we deploy it on the site. However, I do not have the
ability to test it here, because I don't have a Unicode keyboard; as far as I
am aware, I have no way to type non-ASCII characters. And if I did, I wouldn't
know what I was doing.
So it has occurred to me that what we really need is a separate "testing"
version of the upload script on the site, set up to write its "uploads" to a
separate directory from the main one (so we don't get them mixed up) and to
use a different file for the upload input log. This I can do. Then we can
experiment with encodings using the testing script until we figure out how to
make it do what we want, and at that point all we have to do is migrate the
changes over to the main upload script and Bob will be our uncle.
I'll call the testing script upload_test.cgi, and I'll try to get it
up this week, and we can go from there.
Created attachment 3004 [details]
Blank SVG file to use for testing.
Once we get the testing script in place, we can use this blank SVG file to test
it, placing the relevant characters into the form's metadata inputs.
Testing script is in place (see URI). Currently it doesn't do anything better
than the regular upload facility, though. But it places its results in a
different directory and keeps a separate log.
> > &eacute;toile
> Does SVGscan find these?
No, because there isn't really anything wrong with it; it's just not what was
intended. I suppose I could add a
looks_suspiciously_like_an_html_entity_with_the_ampersand_escaped test.
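A test along those lines could be little more than a regex over the file's text; a sketch, where the function name is shortened and the pattern itself is my guess at what "suspicious" should mean:

```perl
use strict;
use warnings;

# True if the text contains something like "&amp;eacute;" -- an HTML
# entity whose leading ampersand has itself been escaped.
sub looks_like_escaped_entity {
    my ($text) = @_;
    return $text =~ /&amp;[A-Za-z][A-Za-z0-9]*;/ ? 1 : 0;
}

print looks_like_escaped_entity('&amp;eacute;toile') ? "suspicious\n" : "ok\n";
# prints "suspicious"
```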
> Someone told me that I should look at the Encode module,
That was me.
> However, this documentation is hairy and scary and contains stuff like this:
> > CAVEAT: When you run $octets = encode("utf8", $string), then $octets may
> > not be equal to $string. Though they both contain the same data, the utf8
> > flag for $octets is always off. When you encode anything, utf8 flag of
> > the result is always off, even when it contains completely valid utf8
> > string. See "The UTF-8 flag" below.
Yes, but you shouldn't use encode(), so this isn't a problem.
The way I see it working is like this:
When you get metadata from the form, the encoding should be specified in the
HTTP header. Using this information with the decode() function you can decode
the metadata, that is, convert it to Perl's internal format. (The metadata you
get from SVG::Metadata should already be in Perl's internal format, unless
SVG::Metadata is doing something wrong.) When you write to the file, you do so
in UTF-8 mode:
open(OUT, ">:encoding(utf-8)", $file) or die;
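To make the decode() step concrete, here is a minimal sketch; the charset and field value are made up, and in the real getforminput the charset would come from the submitted form data rather than being hard-coded:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Suppose the browser declared charset=ISO-8859-1 and sent these raw
# bytes for the author field ("Moisés Rincón Maza" in Latin-1):
my $charset   = 'ISO-8859-1';
my $raw_bytes = "Mois\xE9s Rinc\xF3n Maza";

# decode() converts the bytes to Perl's internal format, which is the
# form SVG::Metadata should be handed.
my $author = decode($charset, $raw_bytes);

# Once decoded, encoding to UTF-8 yields valid multi-byte output.
my $utf8 = encode('UTF-8', $author);
print length($author), " characters, ", length($utf8), " bytes\n";
# prints "18 characters, 20 bytes"
```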
> I don't have a way, as far as I am aware, to type non-ASCII characters.
Try typing vowels with the AltGr key held down. Works for me, at least, although
it's a rather limited set of non-ASCII characters.
> When you write to the file, you do so in UTF-8 mode:
> open(OUT, ">:encoding(utf-8)", $file) or die;
I forgot that you write via XML::Twig now, so ignore this bit.
> > > &eacute;toile
> > Does SVGscan find these?
> No, because there isn't really anything wrong with it
Oh, yeah. I knew that...
> I suppose I could add a
> looks_suspiciously_like_an_html_entity_with_the_ampersand_escaped test.
How did you discover this in the first place?
> When you get metadata from the form, the encoding should be specified
> in the HTTP header.
Actually, I think it's specified in the separators, but anyway, it's
specified. I'd gotten that far...
> Using this information with the decode() function you can decode
> the metadata, that is, convert it to Perl's internal format.
> (The metadata you get from SVG::Metadata should already be in
> Perl's internal format, unless SVG::Metadata is doing something
> wrong.) When you write to the file, you do so in UTF-8 mode
Okay, that's starting to make sense to me now. I think.
> Try typing vowels with the AltGr key held down
I don't have any such key. I have left and right Alt, but that's not
what they do.
I'm telling you, I live in Ohio. I've got a standard US keyboard,
for practical purposes 104 keys. (Technically it has rather more than
104 keys, because it's a high-end model with function keys duplicated
on the left side, as well as across the top, and it's remappable, but
it still only generates the same characters as a 104-key keyboard.)
I cannot type any character on this keyboard that you can't type on an
IBM Model M keyboard from the seventies. I cannot buy a PC keyboard
around here that can type other characters. (I could get a used DEC
keyboard from a VT510, the sort of terminal used with a Vax, but that
will not do me a large amount of good with a PC. It does use a PS/2
connector, but it lacks too many keys that are important on the PC,
so it is not in practice usable with a PC.)
I do not have an AltGr key, a Compose key, or a cokebottle key. I can type
ASCII characters from 32 (decimal) through 126, as well as 9 (horizontal
tab) and 13 (carriage return; in some applications this comes out as 10
(linefeed) instead, or both). That's it. If I boot into DOS, I can type
IBM Extended ASCII characters with the Alt+nnn-on-the-numpad trick, but
those won't display correctly on any modern system.
This is pretty much the only kind of keyboard sold within several hundred
miles of here, in nearly any direction. (Except for USB and laptop
keyboards, which have even fewer keys, and Mac keyboards.) It is the
only kind of keyboard anyone living around here needs, because at *least*
four nines (probably five) of the population is either literate in English,
or else it's the only language they know at all (the latter being
substantially the more common). All words are spelled with only ASCII
characters here: "naive", "resume" (that's a three-syllable noun),
"Otterbein", "El Nino", "logos", "adelphos", "anthropos", "phileo".
They are spelled that way whether hand-written or typed, and frequently
even when they are professionally typeset.
People here say they know two languages if they had three years of
high-school Spanish and can just about manage to come up with "Hola, yo
soy de Americano, e tu? Hasta la vista, baby!". I live in a city of
ten thousand people, and I can say with a fair degree of confidence that
there are 0.00 people living in this city who do not speak some kind of
English. This is the reality of living in the middle of a large and
essentially mono-lingual nation. Other characters are a completely
foreign concept. There is no market for a keyboard that can type them.
A standard US keyboard with 104 keys is just fine; mine is just like that (only
with some multimedia keys) and I can type like this:
I only had to change the keyboard layout in software (GNOME). AltGr (the right
Alt on *any* keyboard) was used only for the first set, which was typed with a
French keyboard layout.
Created attachment 3011 [details]
this is what I really typed
It seems Bugzilla does not handle international text well; what I really typed
is in the attachment.
> How did you discover this in the first place?
I was examining your fixed version of index.xml, and happened to search for &.
> > When you get metadata from the form, the encoding should be
> > specified in the HTTP header.
> Actually, I think it's specified in the separators,
Yes, you're right.
> If I boot into DOS, I can type IBM Extended ASCII characters with
> the Alt+nnn-on-the-numpad trick, but those won't display correctly
> on any modern system.
But you could use GNU recode to convert them to UTF-8:
recode IBM437/CR-LF..u8 filename.txt
or to Latin-1 (when possible):
recode IBM437/CR-LF..lat1 filename.txt
or to some other appropriate modern encoding.
> I was examining your fixed version of index.xml, and happened to search
> for &.
Makes sense... I could have done that...
> I only had to change the keyboard layout from software (GNOME)
That actually *works*? I'll have to look into that. I had always assumed you
had to tell the _truth_ when you told the setup what keyboard layout you had, or
it wouldn't work correctly (like with the mouse; if you tell X11 you have an
IMPS/2 scrollmouse, and you don't, it's bad news).
And it looks like you changed it on-the-fly, too, without restarting X.
I definitely did not know that was possible.
> AltGr (the right Alt)
Oh. If you meant right alt, why didn't you just _say_ right alt? When you
called it AltGr, I assumed it was a completely different key.
I really have got to learn more about this character set stuff. I'm making
myself look dumb, repeatedly.
Oh, incidentally, I found another way this issue manifests itself. If a
zipfile or tarball is uploaded and its metadata has this problem,
propagate-metadata.pl cannot propagate it, because XML::Twig cannot
parse metadata.rdf due to the invalid character. Here is an example of
such a file:
This should work fine now with ccHost, but we still have to fix the broken ones eventually...
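Finding the broken files could start with a validity check: try to decode each metadata.rdf strictly as UTF-8 and flag the ones that fail, before XML::Twig ever sees them. A sketch (the helper and sample bytes are mine, not from the actual scripts):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# True if the raw bytes are well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;   # copy, since a strict decode may modify its input
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

print is_valid_utf8("Mois\xC3\xA9s") ? "valid\n" : "invalid\n";   # valid
print is_valid_utf8("Mois\xE9s")     ? "valid\n" : "invalid\n";   # invalid
```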
Closing all openclipart bugs as openclipart is now on launchpad, as per request from Jon Philips.