Summary: | Works spreadsheet: wrong encoding | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | Urmas <davian818> |
Component: | filters and storage | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | NEW --- | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | alonso, caolanm, serval2412, timar74 |
Version: | 4.5.0.0.alpha0+ Master | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
Attachments: |
DOS codepage stored
Screenshot from Works 3.0 for DOS
result if we apply comment 12 translation page
result of LICS -> CP437 -> Unicode |
Could you tell more about LICS encoding? I didn't find anything on Google, in particular here: http://www.iana.org/assignments/character-sets/character-sets.xhtml

I wasn't able to find any info on it either. But the attached document shows how all the 128 bytes of a DOS codepage are stored. From the look of it, it seems to be based on ISO 8859-1 with some proprietary modifications.

Also, LICS codeset is indicated by a 0-length record 0x5425.

Thank you for your feedback Urmas. Just out of curiosity, where did you find the name "LICS"? I mean, why do you call this encoding "LICS"?

Lotus International Character Set was a proprietary alternative to ASCII for use in Lotus spreadsheets; evidently the Works spreadsheet was built around interoperability with Lotus 1-2-3.

Thank you again Urmas, I found a ref here: http://www.bps-sberbank.by/help/help85_designer.nsf/2e73cbb2141acefa85256b8700688cea/04d08b98c1fe61008525760700600188?OpenDocument
Caolan: any idea how to deal with this encoding? (perhaps here http://opengrok.libreoffice.org/xref/core/sal/textenc/tencinfo.cxx ?)

Created attachment 110834 [details]
Screenshot from Works 3.0 for DOS
This is how it looks in Works 3.0 in DOS.
Please note that Works files are currently handled by an external library, libwks.
(In reply to Andras Timar from comment #7)
> Please note that Works files are currently handled by an external library, libwks.

In fact, it is handled by libwps.

(In reply to Urmas from comment #3)
> Also, LICS codeset is indicated by a 0-length record 0x5425.

Ok. Do you know where to find a table to convert the LICS codeset to Unicode? I just looked at http://unicode.org/Public/MAPPINGS/VENDORS/ and they do not seem to have one for LICS, and without such a table ... :-~

The table on the screenshot looks like CP437. But if I set the codepage to 852 in DOS, then I see CP852 in Works. So it looks codepage dependent.
@Urmas, where did you read that "LICS codeset is indicated by a 0-length record 0x5425."? The internal Lotus 1-2-3 filter in LibreOffice asks for the codepage before loading, which indicates that codepage information is not present in DOS WK1/WKS files.

If you remove that record, the file will be opened in the plain DOS encoding. The DOS codepage is user-selectable indeed, but it is stored as if it were 850.
@osnola: you have a file which provides mappings for all 128 codepoints. You can generate the reverse mapping from it.

(In reply to Urmas from comment #10)
> @osnola: you have a file which provides mappings for all 128 codepoints. You
> can generate the reverse mapping from it.

I only see a screenshot, so of course it is possible to reconstruct a conversion table from this screenshot (by trying to find Unicode symbols which look similar), but this is a dull and error-prone process... It would be much better to find another way to reconstruct this table and then to integrate it into libwps. This could even be an EasyHack, as it means:
- to insert the LICS encoding in libwps_tools_win.{h,cpp}
- and then to modify WKS4.cpp to use the LICS encoding as the DOS default encoding when the code 5425 is found; i.e. the code to find the LICS code set already exists at line 648 of https://sourceforge.net/p/libwps/code/ci/master/tree/src/lib/WKS4.cpp, so we only need to store in m_state that the default DOS encoding is now LICS and use this type when we create a new font ...

Huh? 20 minutes and Python yield this table:
"\xB0\xEF\xB1\xF9\xB2\x9F\xB4\xB9\xBA\xBB\xBC\xBF\xC0\xC1\xC2\xC3"
"\xC5\xC8\xC9\xCB\xCC\xD5\xC4\xD9\xDA\xDB\xDC\xDF\xF2\xB3\xFE\xFF"
"\xCA\xAD\xBD\x9C\xCD\xBE\xDD\xF5\xCF\xB8\xA6\xAE\xAA\xF0\xA9\xEE"
"\xF8\xF1\xFD\xFC\xCE\xE6\xF4\xFA\xF7\xFB\xA7\xAF\xAC\xAB\xF3\xA8"
"\xB7\xB5\xB6\xC7\x8E\x8F\x92\x80\xD4\x90\xD2\xD3\xDE\xD6\xD7\xD8"
"\xD1\xA5\xE3\xE0\xE2\xE5\x99\x9E\x9D\xEB\xE9\xEA\x9A\xED\xE7\xE1"
"\x85\xA0\x83\xC6\x84\x86\x91\x87\x8A\x82\x88\x89\x8D\xA1\x8C\x8B"
"\xD0\xA4\x95\xA2\x93\xE4\x94\xF6\x9B\x97\xA3\x96\x81\x98\xE8\xEC"

(In reply to Urmas from comment #12)
What is the "reference rendering"? You did not send a screenshot, I did. And on my screenshot the first character is \xC7 (Ç), not \xB0. Why do you think that \xB0 is correct? As I wrote in comment 9, if I change the codepage in DOS to something else, then the rendering in Works 3.0 changes as well.

First character in A2 cell is 0xC7 (Ã). The table yields 0x80 (Ç), and it does not matter if you are going to interpret it as Ç, or А, or whatever.

Created attachment 110855 [details]
result if we apply comment 12 translation page
With the conversion table given in comment 12, I obtain this file; but after reading this thread, I am no longer sure what the result should be :-~
Note:
- I suppose that in the near future we will need to modify the libwps API to ask for the DOS file encoding; this will be better....

The bytes are right, but they are interpreted as Windows encoding, not DOS.

(In reply to Urmas from comment #14)
> First character in A2 cell is 0xC7 (Ã). The table yields 0x80 (Ç), and it
> does not matter if you are going to interpret it as Ç, or А, or whatever.

Finally I got it. So for opening MS Works for DOS files correctly, we need to know the original DOS codepage that was used at the time of creation, and we need a double conversion at import: LICS -> DOS codepage -> Unicode.

Created attachment 110861 [details]
result of LICS -> CP437 -> Unicode
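
For illustration, the double conversion discussed above (LICS -> DOS codepage -> Unicode) can be sketched in Python using the table from comment 12. This is only a sketch, not the libwps code: the helper name `lics_to_unicode` is hypothetical, and it assumes that bytes below 0x80 are plain ASCII and that the table covers LICS bytes 0x80 to 0xFF in order.

```python
# Table from comment 12: LICS byte (0x80..0xFF) -> byte in the DOS codepage.
LICS_TO_DOS = (
    b"\xB0\xEF\xB1\xF9\xB2\x9F\xB4\xB9\xBA\xBB\xBC\xBF\xC0\xC1\xC2\xC3"
    b"\xC5\xC8\xC9\xCB\xCC\xD5\xC4\xD9\xDA\xDB\xDC\xDF\xF2\xB3\xFE\xFF"
    b"\xCA\xAD\xBD\x9C\xCD\xBE\xDD\xF5\xCF\xB8\xA6\xAE\xAA\xF0\xA9\xEE"
    b"\xF8\xF1\xFD\xFC\xCE\xE6\xF4\xFA\xF7\xFB\xA7\xAF\xAC\xAB\xF3\xA8"
    b"\xB7\xB5\xB6\xC7\x8E\x8F\x92\x80\xD4\x90\xD2\xD3\xDE\xD6\xD7\xD8"
    b"\xD1\xA5\xE3\xE0\xE2\xE5\x99\x9E\x9D\xEB\xE9\xEA\x9A\xED\xE7\xE1"
    b"\x85\xA0\x83\xC6\x84\x86\x91\x87\x8A\x82\x88\x89\x8D\xA1\x8C\x8B"
    b"\xD0\xA4\x95\xA2\x93\xE4\x94\xF6\x9B\x97\xA3\x96\x81\x98\xE8\xEC"
)

# Assumption: the lower half (0x00..0x7F) passes through unchanged.
_TRANSLATE = bytes(range(0x80)) + LICS_TO_DOS


def lics_to_unicode(data: bytes, dos_codepage: str = "cp437") -> str:
    """Hypothetical helper: LICS bytes -> DOS codepage bytes -> str."""
    dos_bytes = data.translate(_TRANSLATE)  # step 1: LICS -> DOS codepage
    return dos_bytes.decode(dos_codepage)   # step 2: DOS codepage -> Unicode


if __name__ == "__main__":
    # LICS 0xC7 maps to DOS 0x80; the glyph then depends on the codepage.
    print(lics_to_unicode(b"\xC7", "cp437"))
    print(lics_to_unicode(b"\xC7", "cp866"))
```

The second step is where the original DOS codepage matters: the same DOS byte 0x80 decodes to a different character under CP437 than under CP866, which matches the observation that the rendering in Works follows the codepage set in DOS.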
Ok,
indeed the proposed screenshot looks a lot like LICS -> CP437 -> Unicode.
I will try this week-end to modify the libwps code so that it does the translation:
LICS -> DOS850 -> Unicode
(when the LICS code set is set and the file is a DOS file).
This will fix some conversions while waiting for a future change of the libwps API (to add the possibility to ask for the encoding)...
(In reply to osnola from comment #18)
> I will try this week-end to modify the libwps's code so that it does the
> translation :
> LICS -> DOS850 -> Unicode
> ( when the LICS code set is set and the file is a DOS file ).

I just committed the LICS conversion to the libwps git repository: https://sourceforge.net/p/libwps/code/ci/41dfedf3d25025fe1dc91e7a01d83a54d3259476

Of course, this does not fix the DOS codepage problem (which will require a change to the libwps API). |
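
As a rough illustration of the detection step mentioned in this thread (the 0-length record 0x5425 marking the LICS code set), a record scan could look like the sketch below. It is not the libwps implementation: the function name `has_lics_marker` is hypothetical, and it assumes the usual Lotus-style record stream where each record is a 2-byte little-endian type, a 2-byte little-endian length, and then that many bytes of data.

```python
import struct

LICS_MARKER = 0x5425  # record type reported in this thread


def has_lics_marker(data: bytes) -> bool:
    """Hypothetical helper: True if a 0-length 0x5425 record is present."""
    pos = 0
    while pos + 4 <= len(data):
        # Assumed layout: <type:u16 LE> <length:u16 LE> <length bytes of data>
        rec_type, rec_len = struct.unpack_from("<HH", data, pos)
        if rec_type == LICS_MARKER and rec_len == 0:
            return True
        pos += 4 + rec_len  # skip header and record body
    return False
```

A filter could use such a check to choose between the LICS path and the plain DOS codepage path, in line with the observation that removing the record makes the file open in the plain DOS encoding.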
Created attachment 110717 [details]
DOS codepage stored
It seems that European versions of MS Works for DOS store their spreadsheets in LICS encoding. Here's an example of a Works 3 spreadsheet (that version can store the entire DOS codepage without data loss). Right now the LO filter assumes codepage 850 when reading them.