I was under the impression this bug had somehow been addressed, but here it is again: the July submissions include some files (a dinosaur collection) by the author "Moisés Rincón Maza". Both Inkscape and librsvg (EOG, Nautilus) refuse to open the files because the character encoding they declare does not match the encoding actually used.
Created attachment 3138 [details] xc/lib/X11/XInteractive.c (client side of extension protocol)
Attachment 3138 [details] appears to have gone missing, and I can't find anything by anyone called Maza in release 0.16, so I'm not exactly sure what Nicu is referring to here. But an example of an apparently different type of corruption of non-ASCII characters is unsorted/starwalker_wilc_.svg, which contains

  &amp;eacute;toile

In this case a non-ASCII character has been converted to an HTML entity (instead of UTF-8), and the & in the HTML entity has then been converted to &amp;.
OK, I found the files that Nicu was referring to. They didn't make it into release 0.16, presumably because XML::Twig couldn't parse them either. Here's one: http://www.openclipart.org/incoming-pre-0.16/corythosaurus_mois_s_rin_01.svg The problem with these files is that the characters have been left in Latin-1 instead of being converted to UTF-8.
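A quick way to tell files like this apart from correctly converted ones is to check whether their bytes are well-formed UTF-8. The following is only a minimal sketch using the Encode module; the helper name is_valid_utf8 is made up for illustration and is not part of any existing script.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the octets are well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# the caller's data untouched.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# "Moisés" as UTF-8 octets versus the same name left in Latin-1.
for my $sample ("Mois\xC3\xA9s", "Mois\xE9s") {
    printf "%d bytes: %s\n", length($sample),
        is_valid_utf8($sample) ? "valid UTF-8" : "not valid UTF-8";
}
```

Pure ASCII passes this check too, so it only flags files that contain bytes which cannot be UTF-8, which is what a stray Latin-1 character usually produces.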
Yeah, f.d.o. has a problem with the server and the attachment is lost, but you correctly identified the file, and indeed it didn't make it into the release precisely because of this bug.
> I was under the impression this bug was somehow addressed

No, we back-burnered it until we got the hash bug fixed.

> But an example of an apparently different type of corruption
> of non-ASCII characters is unsorted/starwalker_wilc_.svg, which contains
> &amp;eacute;toile

*That* should no longer be happening, since Bryce's round of changes to Metadata.pm circa the OCAL 0.14 era or thereabouts. The good thing about it is, at least it's clear what the character is supposed to be. Does SVGscan find these?

> OK, I found the files that Nicu was referring to. They didn't make it
> into release 0.16, presumably because XML::Twig couldn't parse them
> either. Here's one:
>
> http://www.openclipart.org/incoming-pre-0.16/corythosaurus_mois_s_rin_01.svg
>
> The problem with these files is that the characters have been left in
> Latin-1 instead of being converted to UTF-8.

Yes, that is the encoding bug that we back-burnered because it was not impacting as many files as the HASH bug. But the HASH bug is now fixed, so it's time to look at this one again.

I suspect most or all of the files with this bug have probably ended up in the failed-files archive for the 0.16 release. There will also be some in the 0.17 failed files archive, no doubt, and for every release until we fix the problem. However, the 0.16 failed files archive should include all the ones from 0.13 through 0.16, and I think the ones from before 0.13 have all been repaired and reprocessed by this point, so there's no need to go back any further than 0.16 I think. The failed files from 0.16 can be found here:

http://openclipart.org/downloads/0.16/openclipart-0.16-failed.zip

I believe I know, too, what causes this bug: the web browser sends the form contents in an encoding that is neither US-ASCII nor UTF-8, but some other one (Latin-1 being a good example), but when XML::Twig parses the submitted file, it makes the encoding UTF-8.
What we need to do, probably in getforminput, is convert the added metadata also to UTF-8, before it is inserted into the metadata object. Someone told me that I should look at the Encode module, the POD for which can be viewed here:

http://search.cpan.org/~dankogai/Encode-2.11/Encode.pm

However, this documentation is hairy and scary and contains stuff like this:

> CAVEAT: When you run $octets = encode("utf8", $string), then $octets may
> not be equal to $string. Though they both contain the same data, the utf8
> flag for $octets is always off. When you encode anything, utf8 flag of
> the result is always off, even when it contains completely valid utf8
> string. See "The UTF-8 flag" below.

So it is obvious to me that anything I do with this module will need to be tested thoroughly before we deploy it on the site. However, I do not have the ability to test it here, because I don't have a unicode keyboard; I don't have a way, as far as I am aware, to type non-ASCII characters. And if I did, I wouldn't know what I was doing.

So it has occurred to me now that what we really need to do is put up a separate, "testing" version of the upload script, on the site, set up to write its "uploads" to a separate directory from the main one, so we don't get them mixed up, and to use a different file for the upload input log. This I can do. Then we can play with encoding stuff using the testing script until we figure out how to make it do what we want, and at that point all we have to do is migrate the changes over to the main upload script and Bob will be our uncle.

I'll call the testing script upload_test.cgi, and I'll try to get it up this week, and we can go from there.
Created attachment 3004 [details] Blank SVG file to use for testing. Once we get the testing script in place, we can use this blank SVG file to test it, placing the relevant characters into the form's metadata inputs.
Testing script is in place (see URI). Currently it doesn't do anything better than the regular upload facility, though. But it places its results in a different directory and keeps a separate log.
Jonadab writes:

> > &amp;eacute;toile
> Does SVGscan find these?

No, because there isn't really anything wrong with it, it's just not what was intended. I suppose I could add a looks_suspiciously_like_an_html_entity_with_the_ampersand_escaped test.

> Someone told me that I should look at the Encode module,

That was me.

> However, this documentation is hairy and scary and contains stuff like this:
> > CAVEAT: When you run $octets = encode("utf8", $string), then $octets may
> > not be equal to $string. Though they both contain the same data, the utf8
> > flag for $octets is always off. When you encode anything, utf8 flag of
> > the result is always off, even when it contains completely valid utf8
> > string. See "The UTF-8 flag" below.

Yes, but you shouldn't use encode(), so this isn't a problem. The way I see it working is like this: When you get metadata from the form, the encoding should be specified in the HTTP header. Using this information with the decode() function you can decode the metadata, that is, convert it to Perl's internal format. (The metadata you get from SVG::Metadata should already be in Perl's internal format, unless SVG::Metadata is doing something wrong.) When you write to the file, you do so in UTF-8 mode:

  open(OUT, ">:encoding(utf-8)", $file) or die;

> I don't have a way, as far as I am aware, to type non-ASCII characters.

Try typing vowels with the AltGr key held down. Works for me, at least, although it's a rather limited set of non-ASCII characters.
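For what it's worth, the escaped-entity check mentioned above could be little more than a regular expression. This is only a rough sketch, not code taken from SVGscan; the function name looks_like_escaped_entity is invented here, and the pattern is a heuristic that can give false positives on text such as "R&amp;D;".

```perl
use strict;
use warnings;

# Hypothetical check: does the text contain "&amp;" immediately followed by
# something shaped like an entity name or numeric reference plus ";"?
sub looks_like_escaped_entity {
    my ($text) = @_;
    return $text =~ /&amp;(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);/ ? 1 : 0;
}

for my $sample ('&amp;eacute;toile', '&amp;#233;toile', 'Tom &amp; Jerry') {
    printf "%-20s %s\n", $sample,
        looks_like_escaped_entity($sample) ? "suspicious" : "ok";
}
```

An ordinary escaped ampersand followed by a space is left alone; only the "&amp;name;" shape is reported.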
I wrote:

> When you write to the file, you do so in UTF-8 mode:
>
> open(OUT, ">:encoding(utf-8)", $file) or die;

I forgot that you write via XML::Twig now, so ignore this bit.
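To make the decode() step a bit more concrete, here is a minimal sketch. It assumes the upload script already knows which charset the browser used for the form data; the helper name decode_form_field and the sample values are invented for illustration and are not taken from the real upload script.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn the raw octets of one form field into a Perl
# character string, given the charset declared for the submission.
# FB_CROAK makes decode() die on malformed input instead of guessing.
sub decode_form_field {
    my ($octets, $charset) = @_;
    return Encode::decode($charset, $octets,
                          Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# The same author name as sent by a Latin-1 form and by a UTF-8 form.
my $from_latin1 = decode_form_field("Mois\xE9s Rinc\xF3n Maza",         'ISO-8859-1');
my $from_utf8   = decode_form_field("Mois\xC3\xA9s Rinc\xC3\xB3n Maza", 'UTF-8');

# After decoding, both are the same sequence of characters.
print $from_latin1 eq $from_utf8 ? "same string\n" : "different strings\n";
printf "%d characters\n", length($from_utf8);
```

The point is that after decoding, the rest of the script no longer needs to care which encoding the browser used; only the code that writes the file has to pick an output encoding.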
> > > &amp;eacute;toile
> > Does SVGscan find these?
> No, because there isn't really anything wrong with it

Oh, yeah. I knew that...

> I suppose I could add a
> looks_suspiciously_like_an_html_entity_with_the_ampersand_escaped test.

How did you discover this in the first place?

> When you get metadata from the form, the encoding should be specified
> in the HTTP header.

Actually, I think it's specified in the separators, but anyway, it's specified. I'd gotten that far...

> Using this information with the decode() function you can decode
> the metadata, that is, convert it to Perl's internal format.
> (The metadata you get from SVG::Metadata should already be in
> Perl's internal format, unless SVG::Metadata is doing something
> wrong.) When you write to the file, you do so in UTF-8 mode

Okay, that's starting to make sense to me now. I think.

> Try typing vowels with the AltGr key held down

I don't have any such key. I have left and right Alt, but that's not what they do.

<rant relevance="10%">
I'm telling you, I live in Ohio. I've got a standard US keyboard, for practical purposes 104 keys. (Technically it has rather more than 104 keys, because it's a high-end model with function keys duplicated on the left side, as well as across the top, and it's remappable, but it still only generates the same characters as a 104-key keyboard.) I cannot type any character on this keyboard that you can't type on an IBM Model M keyboard from the seventies. I cannot buy a PC keyboard around here that can type other characters. (I could get a used DEC keyboard from a VT510, the sort of terminal used with a Vax, but that will not do me a large amount of good with a PC. It does use a PS/2 connector, but it lacks too many keys that are important on the PC, so it is not in practice usable with a PC.) I do not have an AltGr key, a Compose key, or a cokebottle key.
I can type ASCII characters from 32 (decimal) through 126, as well as 9 (horizontal tab) and 13 (carriage return; in some applications this comes out as 10 (linefeed) instead, or both). That's it. If I boot into DOS, I can type IBM Extended ASCII characters with the Alt+nnn-on-the-numpad trick, but those won't display correctly on any modern system. This is pretty much the only kind of keyboard sold within several hundred miles of here, in nearly any direction. (Except for USB and laptop keyboards, which have even fewer keys, and Mac keyboards.) It is the only kind of keyboard anyone living around here needs, because at *least* four nines (probably five) of the population is either literate in English, or else it's the only language they know at all (the latter being substantially the more common). All words are spelled with only ASCII characters here: "naive", "resume" (that's a three-syllable noun), "Otterbein", "El Nino", "logos", "adelphos", "anthropos", "phileo". They are spelled that way whether hand-written or typed, and frequently even when they are professionally typeset. People here say they know two languages if they had three years of high-school Spanish and can just about manage to come up with "Hola, yo soy de Americano, e tu? Hasta la vista, baby!". I live in a city of ten thousand people, and I can say with a fair degree of confidence that there are 0.00 people living in this city who do not speak some kind of English. This is the reality of living in the middle of a large and essentially mono-lingual nation. Other characters are a completely foreign concept. There is no market for a keyboard that can type them. </rant>
A standard US keyboard with 104 keys is just fine. Mine is just like that (only with some multimedia keys) and I can type like this:

- æ«€¶ŧ←↓→øþ@ßðđŋħjĸłł»¢“”n
- йцукенгшщзфывапролдячсмить
- ضصثقفغعهخحشسيبلاتنمئءؤرﻻىة
- ๆไำพะัีรยนฟหกดเ้่าสผปแอิืท

I only had to change the keyboard layout in software (GNOME). AltGr (the right Alt on *any* keyboard) was used only for the first set, which is written with a French keyboard layout.
Created attachment 3011 [details] this is what I really typed

It seems Bugzilla does not handle international text well; this is what I really submitted.
Jonadab writes:

> How did you discover this in the first place?

I was examining your fixed version of index.xml, and happened to search for &amp;.

> > When you get metadata from the form, the encoding should be
> > specified in the HTTP header.
>
> Actually, I think it's specified in the separators,

Yes, you're right.

> If I boot into DOS, I can type IBM Extended ASCII characters with
> the Alt+nnn-on-the-numpad trick, but those won't display correctly
> on any modern system.

But you could use GNU recode to convert them to UTF-8:

  recode IBM437/CR-LF..u8 filename.txt

or to Latin-1 (when possible):

  recode IBM437/CR-LF..lat1 filename.txt

or to some other appropriate modern encoding.
> I was examining your fixed version of index.xml, and happened to search
> for &amp;.

Makes sense... I could have done that...

> I only had to change the keyboard layout from software (GNOME)

That actually *works*? I'll have to look into that. I had always assumed you had to tell the _truth_ when you told the setup what keyboard layout you had, or it wouldn't work correctly (like with the mouse; if you tell X11 you have an IMPS/2 scrollmouse, and you don't, it's bad news). And it looks like you changed it on-the-fly, too, without restarting X. I definitely did not know that was possible.

> AltGr (the right Alt)

Oh. If you meant right alt, why didn't you just _say_ right alt? When you called it AltGr, I assumed it was a completely different key.

I really have got to learn more about this character set stuff. I'm making myself look dumb, repeatedly.
Oh, incidentally, I found another way this issue manifests itself. If a zipfile or tarball is uploaded, and the metadata experience this issue, propagate-metadata.pl cannot propagate them, because XML::Twig cannot parse metadata.rdf, due to the invalid character. Here is an example of such a file: http://www.openclipart.org/incoming-pre-0.17/chemical_accessories_mar_02.zip
This should work fine now with ccHost, but we still have to fix the broken ones eventually...
Closing all openclipart bugs as openclipart is now on launchpad, as per request from Jon Philips.