Bug 83227 - FILESAVE: SolidConverter DOCX - On resave margins and images lost and file size doubled
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.7.2 release (earliest affected)
Hardware: All All
Importance: low minor
Assignee: Miklos Vajna
URL:
Whiteboard: target:5.1.0 target:5.0.3 target:4.4.6
Keywords: bibisected, bisected, dataLoss, filter:docx, regression
Depends on:
Blocks:
 
Reported: 2014-08-29 07:29 UTC by André van den Berg
Modified: 2016-10-25 19:20 UTC (History)

See Also:
Crash report or crash signature:


Attachments
ZIP-File with the files mentioned in the text (1.82 MB, application/zip)
2014-08-29 07:29 UTC, André van den Berg
Open VS Save Reopen in Master (369.17 KB, image/png)
2014-08-31 13:21 UTC, Yousuf Philips (jay) (retired)
Original File Opened and Saved in Word 2013 (895.20 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2014-10-04 08:32 UTC, Joey Reid

Description André van den Berg 2014-08-29 07:29:38 UTC
Created attachment 105410 [details]
ZIP-File with the files mentioned in the text

I converted a PDF to a DOCX file using SolidConverter. Writer could open it successfully, in spite of some formatting issues. (attachment: GB_2013.docx)

I did not change anything, but saved the file under a different name (here: GB_2013_b.docx), closed it, and reopened the saved version.

Many of the graphic elements on the front page were lost!
In fact, they are still present in word\media.

Even worse, just by saving the file, LO duplicated most of the images; some of them even got three additional copies. Why?
As a consequence, the file size of GB_2013_b.docx has also doubled compared to the original GB_2013.docx.

(attachment: LO-4-3-0-4_doubles-media-entry.PNG) 

Last but not least, why does LO list every media file individually in [Content_Types].xml, while the original version only declares a default per extension?

GB_2013.DOCX:
<Default ContentType="image/jpeg" Extension="jpeg"/>
<Default ContentType="image/png" Extension="png"/> 

GB_2013_b.DOCX saved by WRITER:

<Override ContentType="image/jpeg" PartName="/word/media/image24.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image25.jpeg"/>
<Override ContentType="image/png" PartName="/word/media/image22.png"/>
<Override ContentType="image/png" PartName="/word/media/image21.png"/>
<Override ContentType="image/png" PartName="/word/media/image19.png"/>
<Override ContentType="image/png" PartName="/word/media/image20.png"/>
<Override ContentType="image/png" PartName="/word/media/image18.png"/>
<Override ContentType="image/png" PartName="/word/media/image15.png"/>
<Override ContentType="image/png" PartName="/word/media/image14.png"/>
<Override ContentType="image/png" PartName="/word/media/image16.png"/>
<Override ContentType="image/png" PartName="/word/media/image23.png"/>
<Override ContentType="image/jpeg" PartName="/word/media/image13.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image12.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image9.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image11.jpeg"/>
<Override ContentType="image/png" PartName="/word/media/image28.png"/>
<Override ContentType="image/png" PartName="/word/media/image5.png"/>
<Override ContentType="image/jpeg" PartName="/word/media/image8.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image10.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image26.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image1.jpeg"/>
<Override ContentType="image/png" PartName="/word/media/image7.png"/>
<Override ContentType="image/png" PartName="/word/media/image29.png"/>
<Override ContentType="image/png" PartName="/word/media/image6.png"/>
<Override ContentType="image/png" PartName="/word/media/image4.png"/>
<Override ContentType="image/png" PartName="/word/media/image17.png"/>
<Override ContentType="image/png" PartName="/word/media/image3.png"/>
<Override ContentType="image/jpeg" PartName="/word/media/image27.jpeg"/>
<Override ContentType="image/jpeg" PartName="/word/media/image2.jpeg"/>
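
For anyone who wants to compare the two files, here is a minimal Python sketch that lists the Default and Override entries of a [Content_Types].xml and flags Overrides that merely repeat what a Default for the same extension already declares. As far as I understand the packaging rules, both forms are valid and an Override takes precedence over a Default, so the longer list is verbose rather than wrong. The function names are hypothetical, and whether the saved file still contains matching Default entries is an assumption to check against the attachment.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by [Content_Types].xml in OOXML packages.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def load_content_types(docx_path):
    """Read the raw [Content_Types].xml text out of a DOCX (ZIP) file."""
    with zipfile.ZipFile(docx_path) as zf:
        return zf.read("[Content_Types].xml").decode("utf-8")


def summarize_content_types(xml_text):
    """Return (defaults, overrides) dicts parsed from the XML text."""
    root = ET.fromstring(xml_text)
    defaults = {el.get("Extension"): el.get("ContentType")
                for el in root.findall(f"{{{CT_NS}}}Default")}
    overrides = {el.get("PartName"): el.get("ContentType")
                 for el in root.findall(f"{{{CT_NS}}}Override")}
    return defaults, overrides


def redundant_overrides(defaults, overrides):
    """Part names whose Override repeats the Default for their extension."""
    lowered = {ext.lower(): ctype for ext, ctype in defaults.items()}
    redundant = []
    for part, ctype in overrides.items():
        ext = part.rsplit(".", 1)[-1].lower() if "." in part else None
        if ext is not None and lowered.get(ext) == ctype:
            redundant.append(part)
    return sorted(redundant)
```

Usage would be along the lines of `summarize_content_types(load_content_types("GB_2013_b.docx"))`, followed by `redundant_overrides(...)` on the result.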
Comment 1 Yousuf Philips (jay) (retired) 2014-08-31 12:43:26 UTC
Hello Andre,

Thank you for submitting the bug. I can confirm that the bug is in master.

Version: 4.4.0.0.alpha0+
Build ID: fcc6e8ae56d539ef92bfb917a52ac0638b3db25f
TinderBox: Linux-rpm_deb-x86@45-TDF, Branch:master, Time: 2014-08-30_01:50:29
Comment 2 Yousuf Philips (jay) (retired) 2014-08-31 13:18:39 UTC
The docx file wasn't openable in 3.3.0, but opened in 3.5.7.

When looking at the internals of the original docx file, the /word/media folder has 13 images (6 jpgs and 7 pngs). In 3.6.7 to 4.1.6, 0-byte files without extensions were being saved instead of the jpg files. In 4.2.6, two png files were duplicated when saving to the docx. In master, most images are duplicated and 2 images have four instances, resulting in 29 total images. So duplicate images likely started somewhere in the 4.2.x releases.
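
The image counts above can be reproduced without unzipping by hand. The following is a rough Python sketch, not part of any LibreOffice tooling, that groups the parts under word/media by SHA-256 digest and reports the groups with more than one member. The function name and the `prefix` parameter are made up for illustration.

```python
import hashlib
import zipfile
from collections import defaultdict


def find_duplicate_media(docx, prefix="word/media/"):
    """Group media parts by content hash; return only groups with duplicates.

    `docx` may be a file path or a file-like object, as accepted by ZipFile.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(docx) as zf:
        for name in zf.namelist():
            # Skip directory entries and anything outside the media folder.
            if name.startswith(prefix) and not name.endswith("/"):
                digest = hashlib.sha256(zf.read(name)).hexdigest()
                groups[digest].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

Running it on the original and the resaved file should show which images were written more than once, assuming the duplicates are byte-identical copies.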

When opening the original docx file in master, only 1 image was listed in the Navigator, and when clicking images not listed in the Navigator, the graphics toolbar wouldn't appear. The document has top and bottom margins of 0 cm.

When reopening the saved docx in master, 9 images are shown in the Navigator, with only 1 of them having a label. This document has top and bottom margins of 2.54 cm, which might be the reason why the top and bottom images of the page aren't being displayed.
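
The margin difference can also be checked directly in word/document.xml. Below is a small Python sketch, assuming the margins are stored as integer twips (1/1440 inch) in the `w:pgMar` element of each section, which is the usual case; the function name is hypothetical. A value of 1440 twips corresponds to the 2.54 cm seen after the resave.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_INCH = 1440
CM_PER_INCH = 2.54


def page_margins_cm(document_xml):
    """Return one dict of margins (in cm) per w:pgMar element found."""
    root = ET.fromstring(document_xml)
    result = []
    for pg_mar in root.iter(f"{{{W_NS}}}pgMar"):
        margins = {}
        for side in ("top", "bottom", "left", "right"):
            raw = pg_mar.get(f"{{{W_NS}}}{side}")
            if raw is not None:
                # Assumes a plain integer in twips; values with units would fail here.
                margins[side] = round(int(raw) / TWIPS_PER_INCH * CM_PER_INCH, 2)
        result.append(margins)
    return result
```

The XML text can be obtained with `zipfile.ZipFile(path).read("word/document.xml")` for both the original and the resaved file.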
Comment 3 Yousuf Philips (jay) (retired) 2014-08-31 13:21:48 UTC
Created attachment 105485 [details]
Open VS Save Reopen in Master
Comment 4 Joey Reid 2014-10-04 08:32:24 UTC
Created attachment 107314 [details]
Original File Opened and Saved in Word 2013

If you first open and save GB_2013.docx in MS Word, Writer can open and save the new file without any major problems. It is likely that GB_2013.docx is not a valid OOXML file.
Comment 5 Yousuf Philips (jay) (retired) 2014-10-04 23:31:02 UTC
The docx file was created with MS Office 2007 Outlook, so opening and resaving it in Word 2013 saves it in a different version of the docx format.
Comment 6 Charles 2014-12-07 22:16:26 UTC
This bug report probably explains the similar behavior I am seeing. When saving a LibreOffice Writer document in .docx format, the margin and indent settings are lost when the file is re-opened. This occurs with LibreOffice version 4.2.7.2.
Comment 7 Charles 2014-12-07 22:22:26 UTC
The bug appears limited to the .docx format: when I correct the formatting, save the file in .doc format, close it, and re-open it, the margin and bullet indent formatting is retained.
Comment 8 Matthew Francis 2015-02-13 10:35:51 UTC
The images in GB_2013.docx started being duplicated on save from the below commit. It's not clear how many of the problems round-tripping this file are directly related to this, so I'm going to leave off splitting this bug up for now, but more bugs may need to be opened once this has been dealt with.

Adding Cc: to vmiklos@collabora.co.uk; Could you possibly take a look at this? Thanks


commit cfb5b20cdc230320ff9f864d1cfd81aaea221da0
Author: Miklos Vajna <vmiklos@collabora.co.uk>
Date:   Wed Dec 18 11:03:57 2013 +0100

    DocxAttributeOutput::OutputFlyFrame_Impl: enable DML export by default
    
    This was only available in experimental mode previously. Also note that
    export of Writer TextFrames are handled separately, there DML export is
    still off by default.
    
    Change-Id: Ie8eaa1670610d92a363a8558b68064e7d7de2cdd
Comment 9 Miklos Vajna 2015-09-04 16:05:10 UTC
There are indeed more images in the saved document than in the original one. Interestingly, not all images are duplicated -- and that matches my memory that for Writer images we already de-duplicate them on export when we write both the drawingML and VML markup. We need to do the same for drawinglayer images, too.
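
To illustrate the idea only (this is not the actual LibreOffice code, which lives in the C++ export filters): when the same graphic is written twice, once for the drawingML markup and once for the VML fallback, the exporter can remember the relationship id it handed out the first time and return it again instead of writing a second media part. A minimal Python sketch, with a hypothetical class name, hypothetical part naming, and a content hash standing in for whatever key the real exporter uses:

```python
import hashlib


class GraphicRelIdCache:
    """Hand out one relationship id per distinct graphic."""

    def __init__(self):
        self._by_digest = {}   # content digest -> relationship id
        self.media = {}        # relationship id -> (part name, bytes)

    def rel_id_for(self, data, extension):
        """Return the existing rel id for `data`, or register a new media part."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_digest:
            # Same graphic seen before (e.g. DML first, VML second): reuse it.
            return self._by_digest[digest]
        index = len(self._by_digest) + 1
        rel_id = f"rId{index}"
        self.media[rel_id] = (f"media/image{index}.{extension}", data)
        self._by_digest[digest] = rel_id
        return rel_id
```

With such a cache, exporting the same image for both markup variants yields a single entry in the media folder and a single relationship.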
Comment 10 Commit Notification 2015-09-07 06:46:27 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=b484e9814c66d8d51cea974390963a6944bc9d73

tdf#83227 oox: reuse RelId in DML/VML export for the same graphic

It will be available in 5.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 11 Miklos Vajna 2015-09-07 06:50:24 UTC
The above fixes the duplicated images, please open a separate bug for the margins problem if that's still a problem.
Comment 12 Yousuf Philips (jay) (retired) 2015-09-08 07:10:46 UTC
(In reply to Miklos Vajna from comment #11)
> The above fixes the duplicated images

Thanks for the fix. Will you be able to backport it into 5.0?

> , please open a separate bug for the
> margins problem if that's still a problem.

Submitted as bug 94009 and it is a regression.
Comment 13 Miklos Vajna 2015-09-11 08:33:30 UTC
libreoffice-5-0 backport: https://gerrit.libreoffice.org/18491
Comment 14 Commit Notification 2015-09-11 09:37:05 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "libreoffice-5-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=1b381370b026f62397dc2d41ddcecf9d6523e044&h=libreoffice-5-0

tdf#83227 oox: reuse RelId in DML/VML export for the same graphic

It will be available in 5.0.3.

Comment 15 Commit Notification 2015-09-15 14:17:11 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "libreoffice-4-4":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c9a290c2a87e9af3b0cd4ccbdd751dddab3532da&h=libreoffice-4-4

tdf#83227 oox: reuse RelId in DML/VML export for the same graphic

It will be available in 4.4.6.
