Bug 80149 - FILESAVE: Control characters in imported CSV files cause problems once saved as xlsx
Summary: FILESAVE: Control characters in imported CSV files cause problems once saved ...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
4.2.4.2 release
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: BSA target:5.4.0 target:7.2.0
Keywords:
: 103828 (view as bug list)
Depends on:
Blocks:
 
Reported: 2014-06-17 16:33 UTC by TomH
Modified: 2021-05-03 15:21 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
minimal csv file containing a 0x16 character (235 bytes, application/vnd.ms-excel)
2014-06-17 16:33 UTC, TomH
Details

Note You need to log in before you can comment on or make changes to this bug.
Description TomH 2014-06-17 16:33:00 UTC
Created attachment 101245 [details]
minimal csv file containing a 0x16 character

Problem description: 

Steps to reproduce:
1. Import a CSV file that includes a low-value control character (e.g. 0x16)
2. Save resulting spreadsheet in xlsx format
3. Attempt to re-open spreadsheet in LibreOffice or Excel

Current behavior:
The resulting .xlsx file is treated as corrupt (invalid UTF8) by Excel. LibreOffice truncates columns after the corrupted cell when the xlsx file is reloaded.

Expected behavior:
CSV import filter could reject the file; could strip out control characters or XLSX export could use an encoding that coped with the corrupt characters.

Although the initial csv file is obviously malformed, such corrupt data exists in the wild and is hard to detect.  LibreOffice's silent truncation of columns after the corrupt cell is problematic as it can easily be overlooked.
(in my original example rows after the corrupt cell stayed but with right-hand columns missing - no error was reported by LibreOffice) 
              
Operating System: Windows 7
Version: 4.2.4.2 release
Comment 1 afonit 2014-06-23 17:52:15 UTC
I just tried this in both:
4.2.4.2
4.2.5.2
and calc was able to import this file fine, the 2nd and 3rd row came in fine.
I am on Fedora 20.


Is there a specific manner in which you brought the file into calc?
The method I used was file > open, then accepted the defaults
(
character set unicode(utf-8)
language Default - English(USA)
from row 1
separator:
    tab,comma,semicolon
)
Comment 2 QA Administrators 2015-01-10 18:06:56 UTC Comment hidden (obsolete)
Comment 3 Urmas 2015-01-10 20:45:11 UTC
Confirming. The Excel file contains the literal byte 0x16, instead of the required representation as "_x0016_".

The side issue is that LO drops that character entirely when saving to ODS.
Comment 4 QA Administrators 2016-01-17 20:03:28 UTC Comment hidden (obsolete)
Comment 5 Eike Rathke 2016-05-31 17:18:48 UTC
Characters below 0x20, except tab, carriage return and linefeed, are illegal characters in XML, see https://www.w3.org/TR/xml/#charsets
EVEN the &#xhhhh; entity reference representation, see https://www.w3.org/TR/xml/#sec-references "Well-formedness constraint: Legal Character".
This is the reason why they are not saved to .ods
Actually they should also be dropped when saving to .xlsx as you see Excel otherwise stumbles.

Excel apparently came up with their own unspecified invention to save such as _x0016_. As you have to be able to distinguish a literal "_x0016_" they write that as "_x005F_x0016_". Oh glory. That's "encoding illegal XML characters in SQL" http://dcx.sap.com/1200/en/dbusage/xmldraftchapter-b-3488944b.html where "_x" is encoded as "_x005F_x", so they borrowed that from their SQL-Server guys.
Comment 6 Eike Rathke 2016-11-10 18:31:51 UTC
So this is in fact specified somewhere..
Citing from "ECMA-376 Part 1" (OOXML), page 3732:

22.4.2.4  bstr (Basic String)

This element defines a binary basic string variant type, which can store any valid Unicode character. Unicode characters that cannot be directly represented in XML as defined by the XML 1.0 specification, shall be escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be escaped as _x0008_. end example] To store the literal form of an escape sequence, the initial underscore shall itself be escaped (i.e. stored as _x005F_). [Example: The string literal _x0008_ would be stored as _x005F_x0008_. end example]

The possible values for this element are defined by the W3C XML Schema string datatype.
Comment 7 Eike Rathke 2016-11-10 18:39:42 UTC
*** Bug 103828 has been marked as a duplicate of this bug. ***
Comment 8 Eike Rathke 2017-06-16 21:09:49 UTC
Should be fixed with commit 8b25b67d5268abbb260da968cc23b6f6c8dd31af for 5.4
Comment 9 Commit Notification 2021-05-03 15:21:04 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9f89ee7c5076f700589d3b07f3d6a50f9af7d13a

tdf#80149: sc_subsequent_export: Add unittest

It will be available in 7.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.