Bug 44291 - Functionality request: option for removing BOM from beginning of saved text files
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.0 Beta2
Hardware: x86 (IA32) Linux (All)
Importance: medium enhancement
Assignee: Not Assigned
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-12-29 13:29 UTC by Bruce Fowler
Modified: 2015-01-03 21:59 UTC
CC List: 4 users


Description Bruce Fowler 2011-12-29 13:29:58 UTC
Extra hex bytes are being inserted into text files saved
from LibreOffice database queries.  To show this, do the following:

1) Open up a simple database and run a query
2) Open a new text (.odt) document
3) Drag the query by the upper-left corner onto the text document
[ A window titled "Insert Database Columns" will open ]
4) Choose "Insert data as: text" on the top line
5) pick a database column or two, and then click OK
[ The data will be inserted into the text document ]
6) Save the document as ".txt", i.e., plain ascii text
7) View the document with the linux "less" command (or with
   any program that will show the hex-byte content of the file)
8) Note that the ASCII data from the database is preceded by three
   extra bytes, 0xEF 0xBB 0xBF (shown by "less" as "U+FEFF")
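
The check in steps 7-8 can also be done without a hex viewer. The following is a minimal Python sketch, not part of LibreOffice; the function name `has_utf8_bom` is hypothetical and the file path is whatever the ".txt" file was saved as.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        # codecs.BOM_UTF8 is the three-byte sequence b"\xef\xbb\xbf"
        return f.read(3) == codecs.BOM_UTF8
```

Calling `has_utf8_bom("labels.txt")` on a file saved as in step 6 would be expected to return True if the three bytes are present ("labels.txt" is only an example name).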

These three extra bytes cause me grief when I use this general
scheme to create address labels.  I didn't ask for them and they
don't belong at the beginning of the output file.  It works this
way on all versions of LObase, up through 3.5.

Thanks for listening...
Comment 1 Bruce Fowler 2012-01-21 18:03:37 UTC
Further experimentation revealed that this problem is not related to "base" but shows up simply by saving a "writer" file as "plain text".  So I am changing the component from base to writer.  To show it, one need only start with a short ".odt" file and follow steps 6-8 in the original bug report.
Comment 2 sasha.libreoffice 2012-03-14 08:00:07 UTC
Thanks for the bug report.
An explanation of these 3 bytes is here:
http://en.wikipedia.org/wiki/Byte_order_mark

Please tell us: which program has a problem with it?
Comment 3 Bruce Fowler 2012-03-29 19:29:24 UTC
Thanks for the reference.  I have read the Wikipedia article.  It appears to relate entirely to Unicode encoding.  In relation to UTF-8 it says, "The Unicode Standard does permit the BOM in UTF-8, but does not require or recommend its use."  It further states, "the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment"

In any case, I don't want my data saved in UTF-8 for this particular application, but rather in plain ASCII.  I tried setting the Tools/Options/Load save->HTML compatibility/Character set to Western Europe (ASCII/US), but the BOM is still there.  I can appreciate the utility of the BOM for information interchange, but not for local work with Postscript programs and shell scripts.  Perhaps the appropriate fix is to have an option in "load/save" that says, "I really want plain ASCII."
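
To illustrate the distinction drawn above, here is a minimal Python sketch (not LibreOffice code; `encode_three_ways` is a hypothetical helper). It assumes the text contains only ASCII characters, in which case UTF-8 without a BOM is byte-for-byte identical to plain ASCII, and the BOM is the only difference.

```python
def encode_three_ways(text):
    """Encode ASCII-only text as ASCII, UTF-8, and UTF-8 with a BOM."""
    as_ascii = text.encode("ascii")          # raises UnicodeEncodeError on non-ASCII input
    as_utf8 = text.encode("utf-8")           # no BOM
    as_utf8_bom = text.encode("utf-8-sig")   # Python's codec that prepends EF BB BF
    return as_ascii, as_utf8, as_utf8_bom


as_ascii, as_utf8, as_utf8_bom = encode_three_ways("Jane Doe\n123 Main St\n")
# For ASCII-only text the first two are identical; the third differs only by the BOM.
print(as_ascii == as_utf8)
print(as_utf8_bom == b"\xef\xbb\xbf" + as_ascii)
```

So for data like address labels that is pure ASCII, dropping the BOM is enough to obtain a plain ASCII file.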

I wish I were knowledgeable enough to send you a patch, but the LibreOffice code is a bit formidable!  Thanks for your interest and help.
Comment 4 sasha.libreoffice 2012-03-29 23:05:14 UTC
> Perhaps the appropriate fix is to have an option in "load/save" that says, "I
> really want plain ASCII."
I agree with this. But we currently have very few developers, so this may take several years. Sorry about this situation.

> but not for local work with Postscript programs and shell scripts.
But it may be faster to add a step to your script that removes this BOM, and to ask the authors of the PostScript programs to fix their programs.
Comment 5 leighman 2012-09-10 20:10:15 UTC
It's easy enough to stop the BOM being written but I presume we want to preserve it in existing documents.
Comment 6 Alex Thurgood 2015-01-03 17:39:34 UTC
Adding self to CC if not already on
Comment 7 Bruce Fowler 2015-01-03 21:59:17 UTC
Glad to see that this bug is still alive.  I fixed my immediate problem with a simple "tr" command in my shell script, but I am still not happy with extraneous stuff being inserted in my text data.  The easy fix would seem to be to have "Save Text as UTF-8" and "Save Text as ASCII" options available as a preference I can set.  Thanks for your continued interest.
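
The exact "tr" invocation is not given in this report. As an alternative for anyone needing the same workaround, here is a minimal Python sketch that removes a leading UTF-8 BOM and leaves everything else untouched; the names `strip_utf8_bom` and `strip_bom_in_file` are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path):
    """Rewrite the file at path in place with any leading BOM removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned != data:
        # Only rewrite when a BOM was actually found
        with open(path, "wb") as f:
            f.write(cleaned)
```

Only a BOM at the very start of the file is removed; the same three bytes appearing later in the data are left alone.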