Summary: | FILESAVE: LibreOffice corrupting complex documents involving html tags | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | Doug <dougt901-2012> |
Component: | Writer | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | NEW --- | QA Contact: | |
Severity: | critical | ||
Priority: | high | CC: | atigmail-bugzilla, david, dougt901-2012 |
Version: | 3.6.1.2 release | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | BSA | ||
Attachments: |
This is a re-enactment of the document contents that resulted in corruption of .docx.
This is the same file, but saved by MS Word into .docx format
steps for reproduce
Faulty document.xml with orphaned w:hyperlink tags
Repackaged with orphaned w:hyperlink tags removed (see above), works. |
Description
Doug
2012-09-20 12:53:37 UTC
Created attachment 67436 [details]
This is a re-enactment of the document contents that resulted in corruption of .docx.

This file was created on Windows/LibreOffice 3.6.0.4. The original text was:

This is the beginning of the document. Rule 8.4 of the Rules of Professional Conduct. This is the rest of the document.

The text "Rule 8.4 of the Rules of Professional Conduct" carries the following link:

http://www.mass.gov/obcbbo/rpc8.htm#Rule 8.4

When the document is saved as Office Open XML (.docx), the file breaks without warning: it becomes unusable and the text after the link is corrupted. The same sequence saved correctly in .doc and .odt.

Here is the link with the space properly percent-encoded:

http://www.mass.gov/obcbbo/rpc8.htm#Rule%208.4

The bug is also present in 3.6.1.2.

The second half of the file is not missing, just malformed. It was accessible by changing the extension to .zip and opening document.xml in a text editor.

The problem here is that LibreOffice saves html tags in an inartful way in .docx files. LibreOffice tries to do everything in the document.xml file. In the example I posted, the link was represented in the document as (forgive me if I mis-crop leading or trailing instructions):

HYPERLINK "http://www.mass.gov/obcbbo/rpc8.htm" \l "Rule 8.4"</w:instrText></w:r><w:r><w:fldChar w:fldCharType="separate"/></w:r><w:r><w:rPr><w:rStyle w:val="style15"/></w:rPr><w:t>Rule 8.4 of the Rules of Professional Conduct</w:t></w:r><w:r><w:fldChar w:fldCharType="end"/></w:r></w:hyperlink>

Something in that could not be parsed either by Word or by LibreOffice on reopen. Word itself does not try to do this in the document.xml file. Instead, it stores the bookmark as an anchor on the hyperlink, with a reference (r:id) to a different file in the compressed .docx structure:

Word /document.xml :

><w:hyperlink r:id="rId4" w:anchor="Rule 8.4" w:history="1"><w:r><w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr><w:t>Rule 8.4 of the Rules of Professional Conduct</w:t></w:r></w:hyperlink>

Word /_rels/document.xml.rels :

Target="fontTable.xml"/><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://www.mass.gov/obcbbo/rpc8.htm" TargetMode="External"/></Relationships>

LibreOffice does not attempt to use the "rels" folder/functionality in the .docx structure in connection with the hyperlinks. As a result, using html links and bookmarks in LibreOffice with .docx files is a problem waiting to happen.

Created attachment 67463 [details]
This is the same file, but saved by MS Word into .docx format
Compare the treatment of the html tags in this file with the malformed file above. Word put the html tag into a separate "rels" file inside the .docx structure, which avoids whatever problem LibreOffice encountered by putting the entire tag directly into the document.xml.
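
The relationship lookup described above can be inspected with a short script. This is only a minimal sketch, assuming the usual package layout in which hyperlink targets are listed in word/_rels/document.xml.rels; the function name hyperlink_relationships is made up for illustration and is not part of Word or LibreOffice.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the package relationships part and the hyperlink relationship
# type, both as they appear in the Word-saved file quoted above.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def hyperlink_relationships(docx):
    """Return {relationship Id: Target} for the hyperlinks in a .docx.

    `docx` may be a path or a binary file-like object. Assumes the part
    word/_rels/document.xml.rels exists in the archive.
    """
    with zipfile.ZipFile(docx) as archive:
        rels_xml = archive.read("word/_rels/document.xml.rels")
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{{{REL_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }
```

Running this on both attachments should make the difference visible: every r:id used by a w:hyperlink element in document.xml is expected to have a matching entry in the returned mapping.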
It happened to me on libreoffice-4.1.2.3-3.fc19.

Created attachment 89499 [details]
steps for reproduce
I think I have the same issue. My steps are:

1. Create several lines of text.
2. In one of the lines, add a hyperlink, e.g. “www.link.com ” (with a trailing space, so that the text becomes a hyperlink).
3. Save the document with the .docx extension.
4. Open the document again.

Result: all lines after the hyperlink disappear.
Reproduced: always.
Video with steps attached.

LibreOffice Writer Version: 4.1.2.3 Build ID: 410m0(Build:3)
OS: Ubuntu 13.10

I just encountered this serious bug when re-opening a docx document I was working on. All text after the hyperlinks was mysteriously deleted. I noticed that the file size was still very large despite most of the text being missing, so I looked for a way to decode the docx format and recover the data inside the file. I then learned that a docx is just a renamed zip archive in OpenXML format. After renaming the .docx to .zip, I was able to open the word/document.xml file and see that all the missing text was still there as plain XML. I deleted the XML tags related to the hyperlinks, zipped the files again, and renamed the .zip to .docx. It worked! I was able to restore the hours of lost work. I hope this helps someone fix the bug and saves others from losing their work. Perhaps the software writes invalid OpenXML syntax, or it fails to read the file back correctly.

Just fixed a similarly corrupted .docx file saved by LibreOffice (Version: 4.1.3.2 Build ID: 410m0(Build:2)). I'm not sure what happened during editing (I wasn't there at the time), but somehow the hyperlinking added extraneous <w:hyperlink ...> tags just after a </w:hyperlink>, without an actual URL and without the corresponding closing tag. See the quote:

"""
<w:hyperlink r:id="rId4"><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:sz w:val="22"/><w:szCs w:val="22"/></w:rPr><w:t xml:space="preserve"> Käyttäjälähtöiset innovaatiot toimivat arjessa</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="style31"/><w:tabs><w:tab w:leader="none" w:pos="0" w:val="left"/></w:tabs><w:ind w:hanging="0" w:left="0" w:right="0"/><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:sz w:val="22"/><w:szCs w:val="22"/></w:rPr></w:pPr><w:hyperlink r:id="rId5">
"""

Removing the orphan tags seems to make everything visible again in LibreOffice. I'll attach the fixed file and the original faulty document.xml, for diffing.

Created attachment 95316 [details]
Faulty document.xml with orphaned w:hyperlink tags
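
A quick way to check whether a document.xml is affected is to compare the number of opening and closing w:hyperlink tags. The sketch below is an assumption-laden illustration, not a validator: it uses regular expressions on the raw text because an XML parser would reject the malformed file, and the name hyperlink_tag_counts is hypothetical.

```python
import re

# Opening tags such as <w:hyperlink r:id="rId5">; the lookbehind skips
# self-closing forms like <w:hyperlink r:id="rId5"/>.
OPEN_TAG = re.compile(r"<w:hyperlink\b[^>]*(?<!/)>")
CLOSE_TAG = re.compile(r"</w:hyperlink\s*>")


def hyperlink_tag_counts(document_xml):
    """Return (opening, closing) counts of w:hyperlink tags in raw XML text."""
    opens = len(OPEN_TAG.findall(document_xml))
    closes = len(CLOSE_TAG.findall(document_xml))
    return opens, closes
```

If the two counts differ, the file probably contains orphaned tags like the ones quoted above. Equal counts do not prove the file is well-formed, since tags can still be wrongly nested.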
Created attachment 95317 [details]
Repackaged with orphaned w:hyperlink tags removed (see above), works.
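
The manual repair (unzip, delete the hyperlink tags, zip again) can be approximated in a script. This is a rough sketch under several assumptions: the main text lives in word/document.xml, the w: prefix is used as in the quoted fragments, and losing every link in the document is acceptable. The function name strip_hyperlink_tags is invented for this example. It removes all w:hyperlink tags, not only the orphaned ones, and leaves the runs inside them in place so the text survives.

```python
import re
import zipfile

# Matches opening, closing and self-closing w:hyperlink tags in the raw bytes.
HYPERLINK_TAG = re.compile(rb"</?w:hyperlink\b[^>]*>")


def strip_hyperlink_tags(src_docx, dst_docx):
    """Copy src_docx to dst_docx, dropping every w:hyperlink tag.

    Both arguments may be paths or binary file-like objects and must not
    refer to the same file. All other archive members are copied unchanged.
    """
    with zipfile.ZipFile(src_docx) as src, zipfile.ZipFile(
        dst_docx, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = HYPERLINK_TAG.sub(b"", data)
            dst.writestr(item, data)
```

Work on a copy of the damaged file and write the result under a new name, for example strip_hyperlink_tags("broken.docx", "repaired.docx"), where both file names are placeholders.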
(In reply to comment #11)
> Created attachment 95316 [details]
> Faulty document.xml with orphaned w:hyperlink tags

I've had this problem only for the last few months. I tried the solution as you have it and found that it worked. That is an amazing piece of detective work; I knew about the structure of docx and the existence of document.xml, but I don't think I would ever have been able to figure out what the issue was. Fantastic work; well done.

This bug is still there in LibreOffice 4.2.7.2. It was a huge shock to find that all my text had vanished. The workaround posted by Bruce Kirkbatrick and Elmo worked (thanks a lot for that!), but most non-technical users would not be able to follow the steps required to recover their data. Any chance that this severe bug will be fixed soon? If not, could LibreOffice at least issue a warning when the user adds hyperlinks to a docx file? |