Bug 87222

Summary: Works spreadsheet: wrong encoding
Product: LibreOffice
Reporter: Urmas <davian818>
Component: filters and storage
Assignee: Not Assigned <libreoffice-bugs>
Status: NEW ---
QA Contact:
Severity: normal
Priority: medium
CC: alonso, caolanm, serval2412, timar74
Version: 4.5.0.0.alpha0+ Master   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: DOS codepage stored
Screenshot from Works 3.0 for DOS
result if we apply comment 12 translation page
result of LICS -> CP437 -> Unicode

Description Urmas 2014-12-11 07:17:06 UTC
Created attachment 110717 [details]
DOS codepage stored

It seems that European versions of MS Works for DOS store their spreadsheets in LICS encoding.
Here's an example of a Works 3 spreadsheet (that version can store the entire DOS codepage without data loss).
Right now the LO filter assumes codepage 850 when reading them.
Comment 1 Julien Nabet 2014-12-11 20:33:07 UTC
Could you tell us more about LICS encoding? I didn't find anything on Google, in particular here: http://www.iana.org/assignments/character-sets/character-sets.xhtml
Comment 2 Urmas 2014-12-11 22:24:23 UTC
I wasn't able to find any info on it either. But the attached document shows how all 128 bytes of a DOS codepage are stored.

From the look of it, it seems to be based on ISO 8859-1 with some proprietary modifications.
Comment 3 Urmas 2014-12-12 07:32:07 UTC
Also, LICS codeset is indicated by a 0-length record 0x5425.
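
A minimal sketch of how such a marker could be detected. It assumes the Lotus-style record layout (a 16-bit little-endian record type, a 16-bit little-endian length, then the payload); the layout and the helper names `iter_records` and `uses_lics` are illustrative assumptions, not taken from libwps.

```python
import struct

LICS_MARKER = 0x5425  # record type named in this report


def iter_records(data: bytes):
    """Yield (record_type, payload) pairs from a record stream.

    Assumption: each record is <type:u16le><length:u16le><payload>.
    """
    pos = 0
    while pos + 4 <= len(data):
        rec_type, rec_len = struct.unpack_from("<HH", data, pos)
        pos += 4
        yield rec_type, data[pos:pos + rec_len]
        pos += rec_len


def uses_lics(data: bytes) -> bool:
    """Return True if a zero-length record of type 0x5425 is present."""
    return any(
        rec_type == LICS_MARKER and len(payload) == 0
        for rec_type, payload in iter_records(data)
    )
```

The scan stops at the first truncated header, so a damaged file simply yields fewer records rather than raising.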
Comment 4 Julien Nabet 2014-12-12 08:31:44 UTC
Thank you for your feedback Urmas.
Just out of curiosity, where did you find the name "LICS"? I mean, why do you call this encoding "LICS"?
Comment 5 Urmas 2014-12-12 08:45:13 UTC
Lotus International Character Set was a proprietary alternative to ASCII for use in Lotus spreadsheets; evidently the Works spreadsheet was built around interoperability with Lotus 1-2-3.
Comment 6 Julien Nabet 2014-12-12 08:53:03 UTC
Thank you again Urmas, I found a ref here:
http://www.bps-sberbank.by/help/help85_designer.nsf/2e73cbb2141acefa85256b8700688cea/04d08b98c1fe61008525760700600188?OpenDocument

Caolan: any idea how to deal with this encoding? (perhaps here http://opengrok.libreoffice.org/xref/core/sal/textenc/tencinfo.cxx ?)
Comment 7 Andras Timar 2014-12-14 12:22:01 UTC
Created attachment 110834 [details]
Screenshot from Works 3.0 for DOS

This is what it looks like in Works 3.0 for DOS.

Please note that Works files are currently handled by an external library, libwks.
Comment 8 osnola 2014-12-14 13:15:18 UTC
(In reply to Andras Timar  from comment #7)
> Please note that Works files are currently handled by an external library, libwks.

In fact, it is handled by libwps.

(In reply to Urmas from comment #3)
> Also, LICS codeset is indicated by a 0-length record 0x5425.

Ok.

Do you know where to find a table to convert the LICS codeset to Unicode? I just looked at http://unicode.org/Public/MAPPINGS/VENDORS/ and they do not seem to have one for LICS, and without such a table ... :-~
Comment 9 Andras Timar 2014-12-14 14:02:11 UTC
The table on the screenshot looks like CP437. But if I set the codepage to 852 in DOS, then I see CP852 in Works. So it looks codepage-dependent.
@Urmas, where did you read that "LICS codeset is indicated by a 0-length record 0x5425."? The internal Lotus 1-2-3 filter in LibreOffice asks for the codepage before loading, which indicates that codepage information is not present in DOS WK1/WKS files.
Comment 10 Urmas 2014-12-14 22:54:19 UTC
If you remove that record, the file will be opened in the plain DOS encoding.
The DOS codepage is user-selectable indeed, but it is stored as if it were 850.

@osnola: you have a file which provides mappings for all 128 codepoints. You can generate the reverse mapping from it.
Comment 11 osnola 2014-12-15 08:27:15 UTC
(In reply to Urmas from comment #10)
> 
> @osnola: you have a file which provides mappings for all 128 codepoints. You
> can generate the reverse mapping from it.

I only see a screenshot, so of course it is possible to reconstruct a conversion table from it (by trying to find Unicode symbols which look similar), but this is a dull and error-prone process...

It would be much better to find another way to reconstruct this table and then to integrate it in libwps. This could even be an EasyHack, as this means:
- to insert the LICS encoding in libwps_tools_win.{h,cpp}
- and then to modify WKS4.cpp to use the LICS encoding as the DOS default encoding when the
  code 5425 is found; i.e. the code to find the LICS code set already exists
  at line 648 of https://sourceforge.net/p/libwps/code/ci/master/tree/src/lib/WKS4.cpp,
  so we only need to store in m_state that the default DOS encoding is now LICS and
  use this type when we create a new font ...
Comment 12 Urmas 2014-12-15 10:16:37 UTC
Huh? 20 minutes with Python yields this table:

"\xB0\xEF\xB1\xF9\xB2\x9F\xB4\xB9\xBA\xBB\xBC\xBF\xC0\xC1\xC2\xC3"
"\xC5\xC8\xC9\xCB\xCC\xD5\xC4\xD9\xDA\xDB\xDC\xDF\xF2\xB3\xFE\xFF"
"\xCA\xAD\xBD\x9C\xCD\xBE\xDD\xF5\xCF\xB8\xA6\xAE\xAA\xF0\xA9\xEE"
"\xF8\xF1\xFD\xFC\xCE\xE6\xF4\xFA\xF7\xFB\xA7\xAF\xAC\xAB\xF3\xA8"
"\xB7\xB5\xB6\xC7\x8E\x8F\x92\x80\xD4\x90\xD2\xD3\xDE\xD6\xD7\xD8"
"\xD1\xA5\xE3\xE0\xE2\xE5\x99\x9E\x9D\xEB\xE9\xEA\x9A\xED\xE7\xE1"
"\x85\xA0\x83\xC6\x84\x86\x91\x87\x8A\x82\x88\x89\x8D\xA1\x8C\x8B"
"\xD0\xA4\x95\xA2\x93\xE4\x94\xF6\x9B\x97\xA3\x96\x81\x98\xE8\xEC"
Comment 13 Andras Timar 2014-12-15 10:32:33 UTC
(In reply to Urmas from comment #12)
What is the "reference rendering"? You did not send a screenshot, I did. And on my screenshot the first character is \xC7 (Ç), not \xB0. Why do you think that \xB0 is correct? As I wrote in comment 9, if I change the codepage in DOS to something else, then the rendering in Works 3.0 changes as well.
Comment 14 Urmas 2014-12-15 10:53:29 UTC
First character in A2 cell is 0xC7 (Ã). The table yields 0x80 (Ç), and it does not matter if you are going to interpret it as Ç, or А, or whatever.
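
The lookup described here can be sketched in Python using the table from comment 12. The sketch assumes the table is indexed by (LICS byte - 0x80) and that bytes below 0x80 pass through unchanged; the names `LICS_HIGH`, `LICS_TO_DOS` and `lics_to_dos` are illustrative.

```python
# High half of the LICS -> DOS byte table, copied from comment 12.
LICS_HIGH = (
    b"\xB0\xEF\xB1\xF9\xB2\x9F\xB4\xB9\xBA\xBB\xBC\xBF\xC0\xC1\xC2\xC3"
    b"\xC5\xC8\xC9\xCB\xCC\xD5\xC4\xD9\xDA\xDB\xDC\xDF\xF2\xB3\xFE\xFF"
    b"\xCA\xAD\xBD\x9C\xCD\xBE\xDD\xF5\xCF\xB8\xA6\xAE\xAA\xF0\xA9\xEE"
    b"\xF8\xF1\xFD\xFC\xCE\xE6\xF4\xFA\xF7\xFB\xA7\xAF\xAC\xAB\xF3\xA8"
    b"\xB7\xB5\xB6\xC7\x8E\x8F\x92\x80\xD4\x90\xD2\xD3\xDE\xD6\xD7\xD8"
    b"\xD1\xA5\xE3\xE0\xE2\xE5\x99\x9E\x9D\xEB\xE9\xEA\x9A\xED\xE7\xE1"
    b"\x85\xA0\x83\xC6\x84\x86\x91\x87\x8A\x82\x88\x89\x8D\xA1\x8C\x8B"
    b"\xD0\xA4\x95\xA2\x93\xE4\x94\xF6\x9B\x97\xA3\x96\x81\x98\xE8\xEC"
)
assert len(LICS_HIGH) == 128

# bytes.translate() needs a 256-entry table; the low half is assumed
# to be the identity mapping (plain ASCII).
LICS_TO_DOS = bytes(range(128)) + LICS_HIGH


def lics_to_dos(data: bytes) -> bytes:
    """Map each stored (LICS) byte to the corresponding DOS codepage byte."""
    return data.translate(LICS_TO_DOS)
```

With this table, the stored byte 0xC7 becomes the DOS byte 0x80; which glyph that byte shows as is then a matter of the DOS codepage used to decode it.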
Comment 15 osnola 2014-12-15 11:44:01 UTC
Created attachment 110855 [details]
result if we apply comment 12 translation page

With the conversion table given in comment 12, I obtain this file; but after reading this thread, I am no longer sure what the result should be :-~

Note:
- I suppose that in the near future we will need to modify the libwps API to ask for the DOS file encoding; that would be better...
Comment 16 Urmas 2014-12-15 11:55:19 UTC
The bytes are right, but they are interpreted as Windows encoding, not DOS.
Comment 17 Andras Timar 2014-12-15 11:58:47 UTC
(In reply to Urmas from comment #14)
> First character in A2 cell is 0xC7 (Ã). The table yields 0x80 (Ç), and it
> does not matter if you are going to interpret it as Ç, or А, or whatever.

Finally I got it. So to open MS Works for DOS files correctly, we need to know the original DOS codepage that was used at the time of creation, and we need a double conversion at import: LICS -> DOS codepage -> Unicode.
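
A small sketch of that double conversion, assuming a user-selectable DOS codepage. It uses only a three-entry excerpt of the comment 12 table (the entries for stored bytes 0xC7, 0xE9 and 0xFC), so it is an illustration rather than a full converter; `LICS_TO_DOS_EXCERPT` and `lics_to_unicode` are hypothetical names.

```python
# Excerpt of the comment 12 table (stored LICS byte -> DOS byte).
# Only three entries are included here; the real table has 128.
LICS_TO_DOS_EXCERPT = {0xC7: 0x80, 0xE9: 0x82, 0xFC: 0x81}


def lics_to_unicode(data: bytes, dos_codepage: str = "cp850") -> str:
    """Step 1: LICS byte -> DOS byte. Step 2: DOS byte -> Unicode."""
    dos = bytearray()
    for b in data:
        if b < 0x80:
            dos.append(b)  # ASCII range assumed to pass through unchanged
        else:
            # Raises KeyError for bytes outside this excerpt.
            dos.append(LICS_TO_DOS_EXCERPT[b])
    return bytes(dos).decode(dos_codepage)
```

The same stored byte decodes to different characters depending on the codepage passed in, which is why the original DOS codepage has to be known or asked for at import time.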
Comment 18 osnola 2014-12-15 12:27:55 UTC
Created attachment 110861 [details]
result of LICS -> CP437 -> Unicode

Ok,
indeed the proposed screenshot looks a lot like LICS -> CP437 -> Unicode.

I will try this week-end to modify the libwps's code so that it does the translation :
  LICS -> DOS850 -> Unicode
( when the LICS code set is set and the file is a DOS file ). 

This will fix some conversions while waiting for a future change of the libwps API (to add the possibility to ask for the encoding)...
Comment 19 osnola 2014-12-16 07:55:53 UTC
(In reply to osnola from comment #18)
> I will try this week-end to modify the libwps's code so that it does the
> translation :
>   LICS -> DOS850 -> Unicode
> ( when the LICS code set is set and the file is a DOS file ). 
>
I just committed a change to the libwps git repository concerning the LICS conversion: https://sourceforge.net/p/libwps/code/ci/41dfedf3d25025fe1dc91e7a01d83a54d3259476

Of course, this does not fix the DOS codepage problem (which will require a
change of the libwps API).
