Bug 51726 - RTF: Export RTF to xhtml produces duplicate text in output
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.5.4 release
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: BSA (target:4.2.5)
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-04 11:02 UTC by Scott Derrick
Modified: 2014-07-14 19:22 UTC

See Also:
Crash report or crash signature:


Attachments
the rtf file to load in writer and then export. (172.85 KB, application/rtf)
2012-07-04 11:02 UTC, Scott Derrick

Description Scott Derrick 2012-07-04 11:02:31 UTC
Created attachment 63817
the rtf file to load in writer and then export.

Problem description: 

I'm exporting an RTF document to xhtml. 
I load the RTF, select export, select xhtml

The resulting xhtml document contains the text of the document twice.

I tried this in Word 2007, and it exported the document correctly.


Steps to reproduce:
1. load attached rtf file
2. export to xhtml
3. see duplicate body sections in html file

Current behavior:

Expected behavior: I expect it to export the text as it is in the rtf

Platform (if different from the browser): 
              
Browser: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11
Comment 1 Julien Nabet 2012-07-04 13:22:41 UTC
On a Debian x86-64 PC, with master sources updated today, I reproduced the problem.

If I create an odt file from the rtf, the text is not duplicated in the odt.
If I then export the odt to xhtml, the whole text is present twice as well.
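
A quick way to check an exported file for this symptom is to count paragraph texts that occur more than once. The following is only a minimal sketch, assuming the export is well-formed XHTML in the standard XHTML namespace; the function name `repeated_paragraphs` and the sample markup are illustrative and not taken from the attached file. Real exports that use named entities such as `&nbsp;` may need a more tolerant parser, and legitimately repeated paragraphs will also be reported.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def repeated_paragraphs(xhtml_text):
    """Return {paragraph text: count} for every <p> text seen more than once."""
    root = ET.fromstring(xhtml_text)
    texts = []
    for p in root.iter(XHTML_NS + "p"):
        # Join nested text nodes and collapse runs of whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            texts.append(text)
    counts = Counter(texts)
    return {text: n for text, n in counts.items() if n > 1}

# Hypothetical sample that mimics the reported symptom: the body text
# appears once directly and once more inside a wrapping div.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>First paragraph.</p><p>Second paragraph.</p>
<div><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

dupes = repeated_paragraphs(SAMPLE)
print(dupes)
```

An empty result means no paragraph text was repeated; a result covering nearly every paragraph suggests the whole body was written twice.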
Comment 2 Julien Nabet 2012-07-04 13:29:51 UTC
I created a simple rtf file with LO containing just 1 word without any formatting.
I exported it to XHTML, everything was ok.
Something in the file seems to trigger the problem.

Scott Derrick: can you reproduce this problem with other rtf files?
Comment 3 Scott Derrick 2012-07-04 14:41:59 UTC
Yes I have.

As I said in the initial posting, Word 2007 doesn't seem to have a problem exporting the file to html.

I don't know if it's significant or not, but the RTFs I'm converting were created 10-20 years ago. They are part of an archive we are converting to xml (TEI), using xhtml as an interim format in the conversion process.

Neither LibreOffice nor MSOffice complains when loading them.

I have more RTFs that exhibit the problem if needed.

Scott
Comment 4 Julien Nabet 2012-07-04 23:27:12 UTC
I installed unrtf (included in the Debian repo) to test your file, and it was ok.
So I am putting this back to New status.
Comment 5 David Tardon 2012-07-13 11:00:08 UTC
For anyone interested: the code is in filter/source/xslt/odf2xhtml/export/xhtml/body.xsl and I think the problem is in the template matching draw:frame at line 889. It calls template createDrawFrame, which for some reason prints all the following siblings of the frame wrapped by a div. But after the draw:frame template is finished, the processing continues so the following siblings are processed (and printed) again.
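
The double-processing pattern described above can be illustrated outside XSLT. This is only a sketch in Python, not the actual stylesheet logic: the element names `frame` and `p`, the helper `emit_frame_with_siblings` (a stand-in for what createDrawFrame is described as doing), and the sample input are all hypothetical simplifications.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the document content.
SAMPLE = "<body><p>one</p><frame>pic</frame><p>two</p><p>three</p></body>"

def emit_frame_with_siblings(children, index, out):
    """Emit the frame and then every sibling that follows it."""
    out.append(children[index].text)
    for sibling in children[index + 1:]:
        out.append(sibling.text)

def render(xml_text, stop_after_frame):
    """Walk the children in order and collect the emitted text."""
    children = list(ET.fromstring(xml_text))
    out = []
    for index, child in enumerate(children):
        if child.tag == "frame":
            emit_frame_with_siblings(children, index, out)
            if stop_after_frame:
                # The helper already emitted the following siblings,
                # so the outer walk must not visit them again.
                break
        else:
            out.append(child.text)
    return out

buggy = render(SAMPLE, stop_after_frame=False)
fixed = render(SAMPLE, stop_after_frame=True)
print(buggy)  # siblings after the frame appear twice
print(fixed)  # each element appears once
```

With `stop_after_frame=False` the outer walk continues after the helper returns, so everything after the frame is emitted a second time, which matches the duplicated text seen in the export.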

IMHO the best course of action is to abandon the crazy idea that XSLT is a suitable tool for processing ODF and rewrite the filter in C++.
Comment 6 Scott Derrick 2012-07-13 16:40:32 UTC
XSLT is a great tool for processing xml formatted content, as long as the content is well formed. 

If it's not, it can get very confused.

I have worked around the problem by using the "save as" instead of "export" option and selecting html as the target format.  

Interestingly, the problem doesn't occur when using "save as"...
Comment 7 David Tardon 2012-07-16 07:51:25 UTC
(In reply to comment #6)
> XSLT is a great tool for processing xml formatted content, as long as the
> content is well formed. 

If it is not well-formed, it is not XML .-) Anyway, that is not what I meant. The problem is in the complexity of the ODF format and impedance mismatch with HTML. XSLT is just not suitable for the heavy processing that is necessary to do the transformation (and no, XSLT 2.0 is not a solution. XSLT 2.0 is another problem.) Any attempt to do it anyway just leads to the WORN (Write Once, Read Never) type of code we have today, where any fix creates two new bugs.

> I have worked around the problem by using the "save as" instead of "export"
> option and selecting html as the target format.  
> 
> Interesting the problem doesn't occur when using "save as"...

AFAIK "save as" uses the old HTML export code from sw/source/filter/html.

There is also the writer2latex extension, which (despite its name) contains a filter for XHTML export too. It is written in Java and hopefully is in better shape than the XSLT filter.
Comment 8 Alexandr 2014-07-14 19:22:16 UTC
Hello.
I can reproduce the bug with LibreOffice 3.5.4 from Debian Wheezy. I cannot reproduce it with LibreOffice 4.2.5 from Debian Wheezy backports or with 4.3.0.2.
I do not know which patch solved the problem, hence I am setting the bug status to RESOLVED WORKSFORME. Feel free to reopen it if you can reproduce the issue with LibreOffice 4.2.5 or later.