Bug 3867 - Bad character encoding for submitted files, fix old files
Status: RESOLVED NOTOURBUG
Product: openclipart.org
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: x86 (IA32) All
Importance: high major
Assigned To: default user for a product
http://openclipart.org/cgi-bin/upload...
Reported: 2005-07-25 23:49 UTC by Nicu Buculei
Modified: 2010-08-18 03:24 UTC (History)

Attachments
Blank SVG file to use for testing. (1.48 KB, image/svg+xml)
2005-08-23 04:59 UTC, Jonadab the Unsightly One
this is what I really typed (11.05 KB, image/png)
2005-08-23 22:45 UTC, Nicu Buculei

Description Nicu Buculei 2005-07-25 23:49:07 UTC
I was under the impression this bug had somehow been addressed, but here it is: in
the submissions for July there are some files (a dinosaur collection) by the
author "Moisés Rincón Maza".
Both Inkscape and librsvg (EOG, Nautilus) refuse to open the files because the
declared character encoding does not match the encoding actually used.
Comment 1 Nicu Buculei 2005-07-25 23:50:38 UTC
Created attachment 3138 [details]
xc/lib/X11/XInteractive.c (client side of extension protocol)
Comment 2 Stephen Silver 2005-08-23 01:36:49 UTC
Attachment 3138 [details] appears to have gone missing, and I can't find anything by
anyone called Maza in release 0.16, so I'm not exactly sure what Nicu is
referring to here. But an example of an apparently different type of corruption
of non-ASCII characters is unsorted/starwalker_wilc_.svg, which contains

   &amp;eacute;toile

In this case a non-ASCII character has been converted to an HTML entity (instead
of UTF-8), and the & in the HTML entity has then been converted to &amp;.
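
A minimal Python sketch (not the site's Perl code) of how this kind of double conversion can arise, and how it could be undone. The helper name `to_html_entities` is hypothetical; `codepoint2name`, `escape` and `html.unescape` are standard-library calls.

```python
import html
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def to_html_entities(text: str) -> str:
    """Replace each non-ASCII character with an HTML entity (hypothetical helper)."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # Prefer a named entity such as &eacute;, else fall back to a numeric one.
            name = codepoint2name.get(ord(ch))
            parts.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(parts)


original = "étoile"
as_entity = to_html_entities(original)  # first conversion: entity instead of UTF-8
double_escaped = escape(as_entity)      # second conversion: the "&" itself is escaped

# Undoing the damage takes two rounds of unescaping, one per conversion.
repaired = html.unescape(html.unescape(double_escaped))
```

The result is still well-formed XML, which is why an XML parser would not complain about it.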
Comment 3 Stephen Silver 2005-08-23 02:00:42 UTC
OK, I found the files that Nicu was referring to. They didn't make it into
release 0.16, presumably because XML::Twig couldn't parse them either. Here's one:

  http://www.openclipart.org/incoming-pre-0.16/corythosaurus_mois_s_rin_01.svg

The problem with these files is that the characters have been left in Latin-1
instead of being converted to UTF-8.
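
To illustrate why a parser gives up on such files, here is a small Python sketch using the standard-library XML parser rather than XML::Twig. The `<title>` element and the `parses` helper are made up for the example; only the author name comes from the report.

```python
import xml.etree.ElementTree as ET

# The author's name as raw Latin-1 octets, as they would sit in the broken file.
latin1_name = "Moisés Rincón Maza".encode("latin-1")

# Same octets, once inside a document claiming UTF-8 and once claiming Latin-1.
claims_utf8 = (b'<?xml version="1.0" encoding="UTF-8"?><title>'
               + latin1_name + b"</title>")
claims_latin1 = (b'<?xml version="1.0" encoding="ISO-8859-1"?><title>'
                 + latin1_name + b"</title>")


def parses(data: bytes) -> bool:
    """Return True if the document is well-formed for the parser."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


utf8_ok = parses(claims_utf8)      # Latin-1 octets are not valid UTF-8
latin1_ok = parses(claims_latin1)  # declaration and octets agree
```

The octets are identical in both documents; only the mismatch between declaration and content makes the first one unparseable.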
Comment 4 Nicu Buculei 2005-08-23 03:19:32 UTC
Yeah, f.d.o. has a problem with the server and the attachment is lost, but you
correctly identified the file, and indeed it didn't make it into the
release precisely because of this bug.
Comment 5 Jonadab the Unsightly One 2005-08-23 04:44:00 UTC
> I was under the impression this bug was somehow addressed

No, we back-burnered it until we got the hash bug fixed.

> But an example of an apparently different type of corruption
> of non-ASCII characters is unsorted/starwalker_wilc_.svg, which contains
>    &amp;eacute;toile

*That* should no longer be happening, since Bryce's round of changes 
to Metadata.pm circa the OCAL 0.14 era or thereabouts.  The good thing
about it is, at least it's clear what the character is supposed to be.

Does SVGscan find these?

> OK, I found the files that Nicu was referring to. They didn't make it
> into release 0.16, presumably because XML::Twig couldn't parse them
> either. Here's one:
> 
>  http://www.openclipart.org/incoming-pre-0.16/corythosaurus_mois_s_rin_01.svg
> 
> The problem with these files is that the characters have been left in 
> Latin-1 instead of being converted to UTF-8.

Yes, that is the encoding bug that we back-burnered because it was not
impacting as many files as the HASH bug.  But the HASH bug is now fixed,
so it's time to look at this one again.

I suspect most or all of the files with this bug have probably ended up in
the failed-files archive for the 0.16 release.  There will also be some in
the 0.17 failed files archive, no doubt, and for every release until we
fix the problem.  However, the 0.16 failed files archive should include
all the ones from 0.13 through 0.16, and I think the ones from before 0.13
have all been repaired and reprocessed by this point, so there's no need
to go back any further than 0.16 I think.

The failed files from 0.16 can be found here:
http://openclipart.org/downloads/0.16/openclipart-0.16-failed.zip
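
One possible way to repair old files of this kind, sketched in Python rather than in the site's Perl. It assumes the stray octets are Latin-1 and that the rest of the file is already valid UTF-8; the error-handler name `latin1_fallback` and the function `repair_to_utf8` are hypothetical.

```python
import codecs


def _latin1_fallback(err):
    """Decode octets that are not valid UTF-8 as Latin-1 instead of failing."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_to_utf8(data: bytes) -> bytes:
    """Keep valid UTF-8 sequences, reinterpret everything else as Latin-1."""
    return data.decode("utf-8", errors="latin1_fallback").encode("utf-8")


# A file body that is UTF-8 except for metadata left in Latin-1.
mixed = "étoile ".encode("utf-8") + "Moisés Rincón".encode("latin-1")
fixed = repair_to_utf8(mixed)
```

This is a heuristic: a genuine Latin-1 pair such as 0xC3 0xA9 happens to be valid UTF-8 and would be kept as a single character, so repaired files would still need a visual check.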

I believe I know, too, what causes this bug:  the web browser sends the
form contents in an encoding that is neither US-ASCII nor UTF-8, but some
other one (Latin-1 being a good example), but when XML::Twig parses the
submitted file, it makes the encoding UTF-8.  What we need to do, probably
in getforminput, is convert the added metadata also to UTF-8, before it
is inserted into the metadata object.  Someone told me that I should
look at the Encode module, the POD for which can be viewed here:

http://search.cpan.org/~dankogai/Encode-2.11/Encode.pm

However, this documentation is hairy and scary and contains stuff like this:
> CAVEAT: When you run $octets = encode("utf8", $string), then $octets may
> not be equal to $string. Though they both contain the same data, the utf8
> flag for $octets is always off. When you encode anything, utf8 flag of 
> the result is always off, even when it contains completely valid utf8
> string. See "The UTF-8 flag" below.

So it is obvious to me that anything I do with this module will need to be
tested thoroughly before we deploy it on the site.  However, I do not have
the ability to test it here, because I don't have a unicode keyboard; I
don't have a way, as far as I am aware, to type non-ASCII characters.
And if I did, I wouldn't know what I was doing.

So it has occurred to me now that what we really need to do is put up
a separate, "testing" version of the upload script, on the site, set
up to write its "uploads" to a separate directory from the main one,
so we don't get them mixed up, and to use a different file for the
upload input log.  This I can do.  Then we can play with encoding stuff
using the testing script until we figure out how to make it do what we
want, and at that point all we have to do is migrate the changes over
to the main upload script and Bob will be our uncle.

I'll call the testing script upload_test.cgi, and I'll try to get it
up this week, and we can go from there.
Comment 6 Jonadab the Unsightly One 2005-08-23 04:59:30 UTC
Created attachment 3004 [details]
Blank SVG file to use for testing.

Once we get the testing script in place, we can use this blank SVG file to test
it, placing the relevant characters into the form's metadata inputs.
Comment 7 Jonadab the Unsightly One 2005-08-23 05:17:52 UTC
Testing script is in place (see URI).  Currently it doesn't do anything better
than the regular upload facility, though.  But it places its results in a
different directory and keeps a separate log.
Comment 8 Stephen Silver 2005-08-23 08:14:46 UTC
Jonadab writes:

> >    &amp;eacute;toile

> Does SVGscan find these?

No, because there isn't really anything wrong with it, it's just not what was
intended. I suppose I could add a
looks_suspiciously_like_an_html_entity_with_the_ampersand_escaped test.
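
Such a test might look roughly like the following Python sketch (SVGscan itself is not shown in this report, so the pattern and function name are assumptions). It flags an escaped ampersand that is immediately followed by what looks like the rest of an entity.

```python
import re

# "&amp;" followed by an entity name or numeric reference and a semicolon.
ESCAPED_ENTITY = re.compile(
    r"&amp;(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
)


def looks_like_escaped_entity(xml_source: str) -> bool:
    """Return True if the text contains something like '&amp;eacute;'."""
    return ESCAPED_ENTITY.search(xml_source) is not None


suspicious = looks_like_escaped_entity("<dc:title>&amp;eacute;toile</dc:title>")
harmless = looks_like_escaped_entity("<dc:title>Tom &amp; Jerry</dc:title>")
```

It can only raise a suspicion: text such as "AT&amp;T;" would also match, so hits would need a human look.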

> Someone told me that I should look at the Encode module,

That was me.

> However, this documentation is hairy and scary and contains stuff like this:
> > CAVEAT: When you run $octets = encode("utf8", $string), then $octets may
> > not be equal to $string. Though they both contain the same data, the utf8
> > flag for $octets is always off. When you encode anything, utf8 flag of 
> > the result is always off, even when it contains completely valid utf8
> > string. See "The UTF-8 flag" below.

Yes, but you shouldn't use encode(), so this isn't a problem.

The way I see it working is like this:

When you get metadata from the form, the encoding should be specified in the
HTTP header. Using this information with the decode() function you can decode
the metadata, that is, convert it to Perl's internal format. (The metadata you
get from SVG::Metadata should already be in Perl's internal format, unless
SVG::Metadata is doing something wrong.) When you write to the file, you do so
in UTF-8 mode:

  open(OUT, ">:encoding(utf-8)", $file) or die;
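
The same decode-then-write-as-UTF-8 flow, as a hedged Python sketch rather than Perl. The function names, the sample file name and the UTF-8 fallback for a missing charset are assumptions made for the example.

```python
import os
import tempfile
from typing import Optional


def decode_field(raw: bytes, declared_charset: Optional[str]) -> str:
    """Turn submitted octets into a text string using the declared charset."""
    # Assumption: fall back to UTF-8 when the request declares nothing.
    return raw.decode(declared_charset or "utf-8")


def write_utf8(path: str, text: str) -> None:
    """Write text to disk, letting the I/O layer do the UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


# A form field as a browser might send it in Latin-1.
author = decode_field(b"Mois\xe9s Rinc\xf3n Maza", "iso-8859-1")

path = os.path.join(tempfile.mkdtemp(), "author.txt")
write_utf8(path, author)
with open(path, "rb") as f:
    written = f.read()
```

The point of the design is that the program only ever handles decoded text internally; octets appear only at the input and output boundaries.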

> I don't have a way, as far as I am aware, to type non-ASCII characters.

Try typing vowels with the AltGr key held down. Works for me, at least, although
it's a rather limited set of non-ASCII characters.
Comment 9 Stephen Silver 2005-08-23 10:08:48 UTC
I wrote:

> When you write to the file, you do so in UTF-8 mode:
>
>   open(OUT, ">:encoding(utf-8)", $file) or die;

I forgot that you write via XML::Twig now, so ignore this bit.
Comment 10 Jonadab the Unsightly One 2005-08-23 17:10:21 UTC
> > >    &amp;eacute;toile
> > Does SVGscan find these?
> No, because there isn't really anything wrong with it

Oh, yeah.  I knew that...

> I suppose I could add a
> looks_suspiciously_like_an_html_entity_with_the_ampersand_escaped test.

How did you discover this in the first place?

> When you get metadata from the form, the encoding should be specified 
> in the HTTP header. 

Actually, I think it's specified in the separators, but anyway, it's
specified.  I'd gotten that far...

> Using this information with the decode() function you can decode
> the metadata, that is, convert it to Perl's internal format. 
> (The metadata you get from SVG::Metadata should already be in 
> Perl's internal format, unless SVG::Metadata is doing something 
> wrong.) When you write to the file, you do so in UTF-8 mode

Okay, that's starting to make sense to me now.  I think.

> Try typing vowels with the AltGr key held down

I don't have any such key.  I have left and right Alt, but that's not 
what they do.

<rant relevance="10%">
I'm telling you, I live in Ohio.  I've got a standard US keyboard,
for practical purposes 104 keys.  (Technically it has rather more than
104 keys, because it's a high-end model with function keys duplicated
on the left side, as well as across the top, and it's remappable, but
it still only generates the same characters as a 104-key keyboard.)

I cannot type any character on this keyboard that you can't type on an
IBM Model M keyboard from the seventies.  I cannot buy a PC keyboard 
around here that can type other characters.  (I could get a used DEC 
keyboard from a VT510, the sort of terminal used with a Vax, but that 
will not do me a large amount of good with a PC.  It does use a PS/2
connector, but it lacks too many keys that are important on the PC,
so it is not in practice usable with a PC.)

I do not have an AltGr key, a Compose key, or a cokebottle key. I can type 
ASCII characters from 32 (decimal) through 126, as well as 9 (horizontal 
tab) and 13 (carriage return; in some applications this comes out as 10
(linefeed) instead, or both).  That's it.  If I boot into DOS, I can type 
IBM Extended ASCII characters with the Alt+nnn-on-the-numpad trick, but 
those won't display correctly on any modern system.

This is pretty much the only kind of keyboard sold within several hundred
miles of here, in nearly any direction.  (Except for USB and laptop 
keyboards, which have even fewer keys, and Mac keyboards.)  It is the 
only kind of keyboard anyone living around here needs, because at *least* 
four nines (probably five) of the population is either literate in English, 
or else it's the only language they know at all (the latter being 
substantially the more common).  All words are spelled with only ASCII
characters here:  "naive", "resume" (that's a three-syllable noun), 
"Otterbein", "El Nino", "logos", "adelphos", "anthropos", "phileo".
They are spelled that way whether hand-written or typed, and frequently
even when they are professionally typeset.

People here say they know two languages if they had three years of 
high-school Spanish and can just about manage to come up with "Hola, yo 
soy de Americano, e tu?  Hasta la vista, baby!".  I live in a city of 
ten thousand people, and I can say with a fair degree of confidence that 
there are 0.00 people living in this city who do not speak some kind of 
English.  This is the reality of living in the middle of a large and 
essentially mono-lingual nation.  Other characters are a completely
foreign concept.  There is no market for a keyboard that can type them.
</rant>
Comment 11 Nicu Buculei 2005-08-23 22:42:18 UTC
A standard US keyboard with 104 keys is just fine, mine is just like that (only
with some multimedia keys) and I can type like this:
- æ«€¶&#359;&#8592;&#8595;&#8594;øþ@ßð&#273;&#331;&#295;j&#312;&#322;&#322;»¢“”n
- &#1081;&#1094;&#1091;&#1082;&#1077;&#1085;&#1075;&#1096;&#1097;&#1079;&#1092;&#1099;&#1074;&#1072;&#1087;&#1088;&#1086;&#1083;&#1076;&#1103;&#1095;&#1089;&#1084;&#1080;&#1090;&#1100;
- &#1590;&#1589;&#1579;&#1602;&#1601;&#1594;&#1593;&#1607;&#1582;&#1581;&#1588;&#1587;&#1610;&#1576;&#1604;&#1575;&#1578;&#1606;&#1605;&#1574;&#1569;&#1572;&#1585;&#65275;&#1609;&#1577;
- &#3654;&#3652;&#3635;&#3614;&#3632;&#3633;&#3637;&#3619;&#3618;&#3609;&#3615;&#3627;&#3585;&#3604;&#3648;&#3657;&#3656;&#3634;&#3626;&#3612;&#3611;&#3649;&#3629;&#3636;&#3639;&#3607;

I only had to change the keyboard layout from software (GNOME). AltGr (the right
Alt on *any* keyboard) was used only on the first set, which is written with a
French keyboard layout.
Comment 12 Nicu Buculei 2005-08-23 22:45:08 UTC
Created attachment 3011 [details]
this is what I really typed

It seems Bugzilla does not handle international text well; this is what I
really submitted.
Comment 13 Stephen Silver 2005-08-24 00:50:28 UTC
Jonadab writes:

> How did you discover this in the first place?

I was examining your fixed version of index.xml, and happened to search for &amp;.
 
> > When you get metadata from the form, the encoding should be 
> > specified in the HTTP header. 
> 
> Actually, I think it's specified in the separators,

Yes, you're right.

> If I boot into DOS, I can type IBM Extended ASCII characters with
> the Alt+nnn-on-the-numpad trick, but those won't display correctly
> on any modern system.

But you could use GNU recode to convert them to UTF-8:

  recode IBM437/CR-LF..u8 filename.txt

or to Latin-1 (when possible):

  recode IBM437/CR-LF..lat1 filename.txt
 
or to some other appropriate modern encoding.
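
For comparison, roughly the same conversions in Python. This is only a sketch of what the recode commands above are expected to do (IBM437 text with CR-LF line endings in, UTF-8 or Latin-1 with LF line endings out); the function names are made up.

```python
def ibm437_crlf_to_utf8(data: bytes) -> bytes:
    """Decode code page 437 text, normalise CR-LF to LF, re-encode as UTF-8."""
    text = data.decode("cp437").replace("\r\n", "\n")
    return text.encode("utf-8")


def ibm437_crlf_to_latin1(data: bytes) -> bytes:
    """Same, but to Latin-1; raises UnicodeEncodeError if a character has no
    Latin-1 equivalent (the "when possible" case)."""
    text = data.decode("cp437").replace("\r\n", "\n")
    return text.encode("latin-1")


# 0x82 is e-acute in code page 437.
dos_bytes = b"\x82toile\r\n"
as_utf8 = ibm437_crlf_to_utf8(dos_bytes)
as_latin1 = ibm437_crlf_to_latin1(dos_bytes)
```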
Comment 14 Jonadab the Unsightly One 2005-08-24 19:37:35 UTC
> I was examining your fixed version of index.xml, and happened to search
> for &amp;.

Makes sense...  I could have done that...

> I only had to change the keyboard layout from software (GNOME)

That actually *works*?  I'll have to look into that.  I had always assumed you
had to tell the _truth_ when you told the setup what keyboard layout you had, or
it wouldn't work correctly (like with the mouse; if you tell X11 you have an
IMPS/2 scrollmouse, and you don't, it's bad news).

And it looks like you changed it on-the-fly, too, without restarting X.
I definitely did not know that was possible.

> AltGr (the right Alt)

Oh.  If you meant right alt, why didn't you just _say_ right alt?  When you
called it AltGr, I assumed it was a completely different key.

I really have got to learn more about this character set stuff.  I'm making
myself look dumb, repeatedly.
Comment 15 Jonadab the Unsightly One 2005-08-25 02:52:18 UTC
Oh, incidentally, I found another way this issue manifests itself.  If a
zipfile or tarball is uploaded, and the metadata experience this issue,
propagate-metadata.pl cannot propagate them, because XML::Twig cannot
parse metadata.rdf, due to the invalid character.  Here is an example of
such a file:
http://www.openclipart.org/incoming-pre-0.17/chemical_accessories_mar_02.zip
Comment 16 Jon Phillips 2007-02-05 15:48:21 UTC
This should work fine now with ccHost, but we still have to fix the broken ones eventually...
Comment 17 Tollef Fog Heen 2010-08-18 03:24:00 UTC
Closing all openclipart bugs as openclipart is now on launchpad, as per request from Jon Phillips.