Summary: | FILEOPEN of particular .docx takes 5 to 20 minutes | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | Dan Essin <essin> |
Component: | Writer | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | NEEDINFO --- | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | barta, iplaw67, jmadero.dev, jorendc, LibreOffice, mst.fdo, perra, the_letter_j |
Version: | 4.0.1.2 release | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | BSA | ||
Description
Dan Essin
2013-03-23 04:02:37 UTC
Please attach a document in which you see the behavior. For all of us who don't have Acrobat this will help us triage your bug. Marking as NEEDINFO; once you attach the document, mark as UNCONFIRMED and we will take a look. Thanks!

The file that caused the crash did not attach because it is too big. It is available here: https://www.dropbox.com/s/acpmgu8bmsvadjb/Understanding_Tuberculosis_-_Analyzing_the_Origin_of_Mycobacterium_Tuberculosis_Pathogenicity.docx

[Reproducible] with server installation of "LibO 4.0.2.2 rc - German UI / German Locale [Build ID: 4c82dcdd6efcd48b1d8bba66bfe1989deee49c3)]" {tinderbox: @6, pull time 2013-03-26 12:00(?)} on German WIN7 Home Premium (64bit) with newly created own user profile.

When I try to open the reporter's sample, LibO stops responding with maximum CPU load and slowly increasing memory consumption (200MB ...). I will do some more research later.

@Dan Essin: Please do not copy such crash reports into bugzilla comments, but attach them as text documents; that will keep bug reports readable. Did that ever work for you with other LibO versions?

Well, my WIN 4.0.2.2 did NOT crash; after 20 minutes the document was loaded. My MS WORD VIEWER does not open the document at all (Error when opening document), so LibO seems not to do a bad job ;-)

Results with LibO 3.3.3 and AOOo4 are similar: it takes a lot of time, but they will get the document opened. LibO 3.3.3 crashed when I tried to save the document as .docx.

@Dan Essin: Maybe you simply have a damaged document? Of course, LibO should not crash ...

@Joren: Can you please check whether you can reproduce the crash?

(In reply to comment #4)
> @Joren:
> Can you please check whether you can reproduce the crash?

I can NOT reproduce a crash either, tested using Mac OSX 10.8.3 with LibreOffice 4.0.2.2. It took me 5+ minutes to open it. After that I can edit it only _very_ slowly (almost impossible to edit it as it should be).
Every time I try to do something, the cursor changes to "busy/loading" (the beachball in Mac OSX).

@Rainer: let's just change the summary of the bug and accept it as "tremendous slow editing with particular bug"?

Kind regards,
Joren

(In reply to comment #5)
> "tremendous slow editing with particular bug"?

s/bug/.docx document

A 27Mb document containing 699 sections, and 278 pages replete with images. Even in Word 2011 for OSX it takes more than 5 minutes to load completely.

Alex

In LO 4.1.1.2 on OSX, the file opens after a much longer load time (approx. 15 mins). The page layout (possibly the sections as well) appears to be misinterpreted and converts to portrait pages instead of landscape, thereby increasing the page count of the document to 614.

Alex

So, no crash here.

Alex

Datafile is 585 pages (a collection-of-biomed-papers-into-book-form)
(all tests performed w/ 3.5.7-0ubuntu4 running on ubuntu 12.04.3 from USB)

takes 6.5 minutes to open original docx at 27mb
takes 2.5 mins when saved as docx2doc, down to 22mb
takes 1.1 mins when saved as docx2odt, down to 14mb
(which is still too long... ...but more importantly the document looked somewhat corrupted now)

Also tried converting from docx2odt2docx, which stayed at 14mb final size, and opened in 4.5 minutes (speed gained)
Also tried converting from docx2odt2doc, which ballooned to 34mb final size, and opened in 2.5 mins (same as above)

For contrast, this is an 1800 page odt -- http://wiki.openoffice.org/w/images/3/34/DevelopersGuide_OOo3.0.0.odt

Opens in under 0.3 minutes on the same system (which is *still* too slow. I can pick up an 1800-page textbook and open to the first page in two *seconds*, and I would expect the same of libreoffice... if not really of msftOffice).

One big difference is that this 1800-pager has very few images, so it is just 1.5mb in total size, 10x or 20x smaller. This is reflected in memory usage -- about 160mb for this 1800-page odt, versus about 400mb for the 600-page odt above.
Analysis: this particular docx is a pathological case, sure. But that does not mean we should take 69 seconds to open it! Especially if, by speeding up LibreOffice's ability to quickly open large complex docs, we can improve our market share. One of the other commenters noted that their version of LibreOffice and their version of MsftOffice both took many minutes to open the original docx file... that spells competitive advantage for us, if we can do the job, and microsoft cannot.

Now, the 'job' we are trying to do needs some thought. Do we actually care about being able to quickly open this particular document? Well, yes, because we have at least one known enduser who cannot escape the clutches of their proprietary adobe software, unless they can export the data into LibreOffice. The proprietary tool in question *claims* it will export the enduser's data from native format (pdf in this case) to docx, but in fact the docx datafile generated is ludicrously slow to open in most actual office-suite implementations. There might be other endusers with similar problems, either right now or someday in the future, so we should fix the painfully slow loading-speed of this particular broken document.

As an alternative, or even better as a simultaneous two-pronged push to victory, we should improve how LibreOffice handles pdf datafiles. Right now, it can import them into LibreDraw. Not the same as being able to import them into LibreWriter!

There are some longer-term goals involved in these subtasks, obviously. First of all, we want to improve LibreOffice's ability to deal w/ hairy docx files, so that we remain fully drop-in compatible with Microsoft Office (or Apache Office). However, beyond just compatibility, we want to be the tool of *choice* for dealing with docx files. In the past, OOo2 was always useful as a repair-tool, for when your xls files got corrupted by some lesser software; LibreOffice4 ought to try and fill the same role.
This particular hairy docx is just an in-the-wild example of a corrupted docx -- one which LibreOffice takes many minutes to load. Once it *is* finally loaded, LibreOffice can convert the file into odt, and then load it six *times* faster than the datafile could be loaded before the conversion. That's definitely a competitive advantage for us... if the conversion actually worked. Which it did not -- there was corruption of the datafile along the way.

But really and truly, this bug is an example of LibreOffice going into a 'not responding' state, which for some users ends up as a full crash. Having a nice subtle status-bar is fine... loading document... little dusky red progress bar... sure. But when something is taking more than about 15 seconds, these subtle hints are not really good enough. I'd rather transition to a popup-dialog, which explains that the ETA is several minutes from now, and shows numeric completion data. If we solve some of the speed-difficulties with large complex documents (including pathological ones), then we can fall back to the subtle indicators again... and if we *really* speed things up, so that LibreOffice snaps open in two seconds datafiles for which Microsoft and Apache require ten times as long, then we can fall all the way back to the hourglass/beachball/petals/whatever.

Is there a document somewhere that covers setting up a speed-profiler for measuring LibreOffice performance? Are there unit-tests that open complex documents, and alert developers of new features when they are killing performance?

@Dan Essin, it looks like ubuntu and windows are working (albeit at a glacial pace), and somebody tried LibreOffice 4.1.x on their OSX; can you confirm that 4.1.x opens docx without crashing on your system? If you can open the docx on another OS (such as an Ubuntu LiveUSB), and then SaveAs odt, do you still experience a crash under OSX?

p.s.
Once the docx/doc/odt datafiles were finally open, I did not have trouble navigating around them -- the docx actually seemed faster, with odt being a bit more pokey when I hit pagedown (but not painfully bad). JorenC, if you open the files in 3.5.x on your box, is it still slow when performing navigation and editing actions?

I have changed the title from 'crash' to instead refer to 'takes 5 to 20 mins'. I've also changed from Other/OSX to x86-64/All (confirmed on osx/win/lin). We never got a second machine to crash, but we've got at least three machines confirmed to take 5+ minutes to load this pathological datafile; I'm changing status to NEW because the freshly-modified title *is* solidly confirmed. Anybody that *does* experience a crash when opening the datafile (see comment#2 with dropbox link), feel free to change the title back to the original.

Hm - but as Alex has said this is the same in Microsoft Office - this is an incredibly complex file, why are you thinking this is particularly slow for the file? I will test as well on Microsoft Office to see what results I get, but unless there is a real reason why you think it should be faster, it may be not a bug and might just be completely expected given the complexity of the file. Will report back shortly.

Okay, most definitely something going on. Test results:

Windows 7 x64, i3 processor, opened the 27 meg test file provided:
Microsoft Office 2010: 19.9 seconds
LibreOffice 4.1.1.2 release: 10 minutes 30 seconds

Marking it as:
Major - despite no data loss this kind of performance is enough to make the product virtually unusable for these kinds of complex files
Normal - while major, probably not affecting many people, but the complexity of the file may help us deal with several docx performance issues (maybe?)

Michael: thoughts on this one? Anything that QA can do to help pinpoint why this file is taking 90x longer to open with LibreOffice than MS Office?
We can do one of two things:
a) close this bug as invalid but keep the file as a performance tester going forward (usually what we do if this issue represents many issues in one that combined are causing the slow performance, so this bug isn't one bug but many bugs wrapped up into one)
b) if this is a single bug we can leave this as NEW and . . . hopefully find someone to take it

(I wrote the stuff below before seeing Joel's second comment, that in fact his version of Office opens the file in 20 seconds. I note that somebody else, with a different unspecified version of Office, said it took five minutes. Therefore, my suspicion below that adobe devs were generating *uniformly* crappy docx output was wrong -- instead, it turns out that they were generating docx output that works fine in every office suite they tested -- consisting solely and entirely of a recent version of microsoft word. Sigh. However, besides that useful correction, the rest of my commentary stands: LibreOffice should open that document in two seconds, not twenty seconds. We should not aim to be as good as Microsoft, but dramatically better. @Alex -- what version of office were you using, on what windows flavor, that took five minutes?)

First of all, I would encourage you not to be satisfied with being almost as good as Microsoft Office -- that is not how we beat the pants off them, if you catch my drift. They have the advantage of getting their binaries pre-installed (as trialware) on the vast majority of desktops nowadays. We need to be better than them, not just equivalent.

As for the meat of the question, my position is that the data itself is not that complex, inherently. It is 500 pages. It has some indentation, some footnotes, some images. It takes in the neighborhood of twenty times longer to load than a roughly similar 1800-page document, on the same hardware. Converting to ODT fileformat cuts that 20x factor down to a 3x. Therefore, it makes sense that

1.
LibreOffice *can* load similar data much quicker than it currently does

Speeding up the load-time of this particular complex document will undoubtedly also help speed up the load-times of non-pathological large & complex documents (my alternative sample still takes about 17 seconds to load -- why not aim for 2 seconds? while we're on that topic, please load the DeveloperGuide.odt into your msftOffice, so we can know how many seconds it takes)

There is the question of why this particular datafile is so poorly encoded into docx form... and the answer is, because Adobe is doing the encoding. They don't want you to export from PDF to DOCX, which permits editing with LibreOffice; they want you to keep needing licenses for Acrobat Pro. Almost certainly, LibreOffice can be taught to clean up their pathological DOCX, and if so, that gives us a competitive advantage over other DOCX suites -- we can work with the crappy output of Adobe's pdf2docx converter, while the lesser office suites cannot. Making LibreOffice capable of re-encoding a crappy DOCX into a cleaner-and-quicker DOCX is also, again, helpful with other users, not just this particular pathological docx. (We should also see if LibreOffice can handle the original PDF, if it is possible to obtain it -- perhaps the blame is not adobe's pdf2docx, but rather the state of the original pdf.) So:

2. We ought to fix our docx2odt conversion process, for instant speed-up
3. We ought to work on a libre pdf2docx conversion process, maybe
4. We ought to investigate docx2docx cleanup, with speed & integrity in mind

"unless there is a real reason why you think it should be faster"

Umm... because LibreOffice is too damn slow? :-) There, that feels better! Seriously, though, the real reason that I think it should be faster is simply first principles. I have a document. It is five pages long. I load it up. Sub-second time. Beautiful. Replace 5 with 1800. Replace sub-second with 17 seconds. Replace beautiful with... pause...
drumming fingers... checking gkrellm... pause.... finally! That's not even talking about pathological cases, like the 390 seconds you have to wait for the 600-page docx this bug-report discusses.

What do I actually *see* after loading, whether it is a 5-page document, or an 1800-page document? Page 1. And maybe, page 2. If I have a *really* big dose of screen-real-estate, perhaps up to five pages might be displayed. Even ten! But not more than that. How long does LibreOffice need, to display five pages of a document? Sub-second times. How long should it need, to display the first few pages of an 1800-page document, and let me get to work while it loads the rest in the background? *That* is my point here.

5. LibreOffice ought to load the user-visible pages (1 and 2 by default) in less than a second, regardless of how large the document happens to be.

p.s. This is valuable when working with large documents on a local drive, like the 1800-pager mentioned above... but it is also useful when working with a medium-sized 50-pager that is being downloaded across the network, say.

"Word 2011 for OSX"

Ahhh, sorry. Alex had three comments, he already said what version of Word he tried. Not too surprisingly, Microsoft did not give their best effort on the OSX port... but it is somewhat surprising that Adobe didn't test on OSX, given their historical focus on integrating well with Apple.

Here from 2011 is a possible-probable duplicate:
https://bugs.freedesktop.org/show_bug.cgi?id=39179

A guy named Pierre was working on that bug here:
http://www.lanedo.com/2013/quest-for-libreoffice-speed/

His conclusion was that footnotes are incredibly computationally expensive for openoffice, which makes sense -- the 600-page docx which is the subject of this article is some kind of biomed stuff about tuberculosis, with hundreds or even thousands of footnotes.
Pierre suggests the trouble might be in this file:
http://opengrok.libreoffice.org/xref/core/writerfilter/source/ooxml/OOXMLDocumentImpl.cxx#161

p.s. What other tools are useful for performance-profiling the FILEOPEN sequence, besides the stuff Pierre mentioned?

@the_letter_j - well, we do see the clear performance issue. Your comment though needs to be addressed. "It should open in 2 seconds" is not a useful comment; it's like me saying "my Honda Civic should go 200mph". Unless there is a programming reason why you think it should open in 2 seconds, the # is irrelevant. I think the clear point is that it's 90x slower than MS Word and there lies a problem; saying it "should open in 2 seconds" is probably very unrealistic. This is Microsoft's format, we have too little control over how they handle their format, and the best we can hope for is the same speed (and even this is a bit unrealistic as we are constantly having to reverse engineer their specifications).

Presumably you refer to the 1973 Honda Civic with the 1.1L engine and curb weight of 1500 pounds.
https://en.wikipedia.org/wiki/Honda_Civic_%28first_generation%29

There is also, albeit only sold in Japan, the 2007 FD2 Civic Type-R with a 225-horse engine, and a stock curb weight of 2800 pounds, top speed 150 mph right from the manufacturer.
http://hondatyper.com/Civic-Type-R.html

There is even the special-edition only-300-sold Mugen-racing-subsidiary Honda Civic Double-R at 237 HP and 2767 pounds, which means ~160 mph max.
http://www.modified.com/features/sportculture-feature-010208/viewall.html

Now, that's measured on the flat, with no tail-wind, and a nominal driver weighing 250 pounds or something standardized like that. I'd be willing to bet that, by shaving off some of the 1250-pound mass-differential between the 1973 economy model and the 2007 racing model, you could quite easily have a Civic that hit the 200mph mark... especially if you were willing to run the test with a tailwind and a steep downhill grade.
But, that would be cheating, right? Because the point of the top-speed measurements is to give a performance benchmark, under standardized conditions. But the engineering reason that I am suggesting we can boost the *perceived* and also the *effective* performance of LiO, for opening large complex documents, is also cheating....

I'm trying to say that we should be opening the 1800-page docx by first running as fast as we can from point A to point B, which is to render the first few pages of the document on-screen. That is all the enduser cares about, usually, in a wide variety of use-cases. Then, second of all, while they are reading page#0 and page#1, which are already onscreen, LiO can keep chugging along in the background, rendering (in RAM rather than in the GPU's visible-onscreen-right-now-framebuffer) the remainder of the complex document in question.

That is "2 seconds". This is not a case of us needing to reverse engineer Microsoft's format. That's already been accomplished, for the most part. This is not a case of us needing to render correctly. This is a case of our app being CPU bound, because we are trying to render for layout all two-thousand-odd footnotes, before we even display the first few pages to the enduser.

Is Microsoft already using my trick? Well then, maybe 2 seconds is in fact unreasonable. But if they aren't, then 2 seconds may well be achievable. Quite frankly, I had to really strive to be fair and specify 2000 milliseconds, an interminably long time for fiddling with a local file pulled from the hard drive at 80MB/sec plus a few milliseconds of seek. Even the 27mb docx is going to be fully in ram within about 400ms, and we know that *some* small test docx can render the first couple of pages in sub-second times. Which suggests that 2 seconds is very reasonable, indeed. I'd like to be faster, actually.

But first things first. And by that I mean: let us agree to stop thinking of Microsoft as the leader, and us as the followers.
Look at the LiMa project, which is reverse-engineering the Mali GPU drivers for tablets and smartphones -- their binaries are in many cases *faster* than the stock ones from ARM... and Luc claims he might be able to be another 60% faster in a foreseeably short timeframe. Why not us? Let us strive to shave the excess from the curb-weight, and to increase the size of the carbon-fiber airbox, and to cheat by measuring speed on a downhill grade with a tail-wind. That is all the quickstarter is, after all: cheating. Very effective cheating, and totally ethical, because it really does speed things up for actual endusers. If the fine folks at Honda really wanted to help actual endusers, rather than making their Civic go 200mph, they would bump it to 140 or 150 or 160, and then concentrate on quality. (Which is what they *did* now isn't it?)

I'm not going to complain if LibreOffice never makes it to the equivalent of a 200mph sedan. But right now, we can assume that Microsoft is the equivalent of the Chevy Tahoe, 320 HP, 5400 pounds, and top speed 139mph.
https://en.wikipedia.org/wiki/Chevrolet_Tahoe#2007.E2.80.932014_.28GMT900.29
http://www.automobile-catalog.com/car/2013/1769015/chevrolet_tahoe_ppv.html

LibreOffice, when dealing with complex files, simply cannot keep up. We're 20x slower than msftWord on docx files with lots of footnotes. That means we're going about 7mph -- the tranny is stuck in 1st gear. But if we can get out of first, and shift all the way to sixth gear, then even if our engine is only 200 HP, we can still beat Microsoft, as long as our curb-weight (aka feature-bloat) is 3500 pounds or less. If we really want speed, we can build a motorcycle with that same engine, and call it LibreAbi or LibreMeric or something... but that's optional.

I'm *not* here to complain. I'm here to help. But this is not the first time I've lurked on performance-related bug reports. LibreOffice has been slow for a long time.
This is not anyone's fault really -- first we had to reverse-engineer the fileformats, then we had to overcome the inertia of Sun and Oracle with their differing agendas, and finally we had to keep something like feature-parity. But now is the time for LiO to start moving faster.

The core of LiO is weak, at the moment, in terms of raw performance. But there is nothing holding us back from making fixes, anymore. The core of winword is *not* very strong now. They are pre-occupied with The Ribbon, and The Metro, and all sorts of other stuff. The computerized equivalent of big fins flanking the trunk. LibreOffice has a chance to be like Honda, and build an efficient set of vehicles that get great gas mileage, cost less, run better, and flat out win the engineering contest.

My assertion is that people care about fast, effective editing of complex documents. If we make those people happy, LibreOffice will prosper, the folks over at ApacheOpenOffice will cheer, and poor Microsoft will fall by the wayside. GoogleDocs is too simplistic. AbiWord is too simplistic. LibreOffice is the right level of power... but we need to tune that power towards a useful purpose.

Anyways, apologies for filling up your inbox, for the third time today. I do not mean to make you unhappy; I get the feeling, when you say that my comment is not useful, and needs to be addressed, that you see it as whining for the impossible. But as I've tried to show, there are solid engineering reasons to believe I'm not whining. Okay, to be fair: not *just* whining, but also have a point. :-)

That does not mean I'm 100% correct in all particulars -- maybe it is impossible to render the first couple pages quickly, because the fileformat is so poorly designed that we cannot be sure how the first page will render, until we have in fact rendered every subsequent page. (I've seen HTML+CSS monstrosities just like that in fact.) But let's find out.
Let's not settle for LibreOffice being no-more-than-twice-as-slow-as-winword, because that will not bring us tons of converts. Let's aim to win, not aim for second.

Since this is getting somewhat off-topic for this particular bug, I am happy to continue it via email, if anybody cares to. In particular, I would like to know what the recommended setup is for performance profiling. Michael Meeks mentioned cachegrind, which I've used once or twice, and Pierre mentioned doing some performance graphs using chrome, and I've done something similar. LibreOffice is pretty huge, though, so I'm not sure cachegrind output is going to make sense to me.

Viva la Libre Office.

(In reply to comment #14)
> It is 500 pages. It has some indentation, some footnotes, some images.

(In reply to comment #15)
> ... the 600-page docx which is the subject of this article is some kind of
> biomed stuff about tuberculosis, with hundreds or even thousands of footnotes.

(In reply to comment #17)
> ... we are trying to render for layout all two-thousand-odd footnotes ...

For clarity, the provided example DOCX contains no true footnotes. Footnotes are encoded in OOXML using the <w:footnoteReference> (anchor) and <w:footnote> (text) elements. I downloaded the DOCX and renamed it:

$ ls -l fdo6*
-rw-r--r-- 1 oweng users 28564066 Feb 6 20:38 fdo62656.docx

A simple grep of the XML reveals:

$ unzip -p fdo62656.docx word/document.xml | xmllint --format - | grep -c "footnote"
0

There are no endnotes either:

$ unzip -p fdo62656.docx word/document.xml | xmllint --format - | grep -c "endnote"
0

There is however one footnote on p.253 (1st page of ch.10), one on p.425 (1st page of ch.18), and one on p.511 (1st page of ch.23), but all are manually set, rather than being true footnotes. I checked every page in Word 2007 to be certain.

As Alex indicates (comment 7), the performance issues are more likely related to either the 218 PNGs, 274 different page headers, or 699 sections in use.
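
The structural counts discussed in the comment above (sections, footnote and endnote references, PNGs, page headers) can also be gathered in one pass. The following is only a minimal sketch in Python, not anything from LibreOffice itself: the function name `count_docx_features` is hypothetical, and it assumes the common DOCX layout where the body lives in `word/document.xml`, images sit under `word/media/`, and header parts are named `word/headerN.xml`. It counts opening tags with a regular expression rather than parsing the XML, so treat the numbers as rough indicators (for example, tracked section changes would inflate the section count).

```python
import re
import zipfile


def count_docx_features(docx_path):
    """Roughly count structural features of a DOCX (hypothetical helper).

    docx_path may be a filesystem path or a file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        names = archive.namelist()
        body = archive.read("word/document.xml").decode("utf-8")

    def count_tag(tag):
        # Count opening or self-closing tags only, e.g. <w:sectPr> or <w:sectPr/>;
        # the trailing character class keeps <w:sectPrChange> from matching.
        return len(re.findall(r"<%s[\s/>]" % re.escape(tag), body))

    return {
        # Each section normally carries one w:sectPr (section properties) element.
        "sections": count_tag("w:sectPr"),
        "footnote_refs": count_tag("w:footnoteReference"),
        "endnote_refs": count_tag("w:endnoteReference"),
        # Text boxes, which converter-generated files tend to use heavily.
        "text_boxes": count_tag("w:txbxContent"),
        # Assumed naming conventions for embedded images and header parts.
        "png_images": sum(
            1 for n in names
            if n.startswith("word/media/") and n.lower().endswith(".png")
        ),
        "header_parts": sum(
            1 for n in names if re.fullmatch(r"word/header\d+\.xml", n)
        ),
    }
```

Calling `count_docx_features("fdo62656.docx")` on the renamed sample would return a dictionary of these counts, which makes it easier to compare a slow-loading file against a fast-loading one before reaching for a profiler.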
@Dan Essin: The link you've provided has expired. Can you update the link or attach the file to this bug, please? You might also want to try loading it in the latest Fresh release (http://www.libreoffice.org/download/libreoffice-fresh/?type=rpm-x86_64&version=4.3.0&lang=en-GB) and see if anything has changed.

Set status to NEEDINFO since the test file is not available anymore, so we cannot retest under newer LibO releases (4.3.2.2 and 4.4.x master).

@Dan Essin: please upload the file again and provide a link to download it, then revert the status to NEW if you still see the bug with current releases.

I have a copy of the document, but it needs to be uploaded onto:
https://owncloud.documentfoundation.org/common/
under QA/Bugzilla/Bugs

I do not have an OwnCloud account yet (just sent a request). Original file is 27.2MB. Once I have access I will attempt to re-upload and provide a link.

As I indicated in comment 18, these types of Acrobat-generated DOCX are rather horrible, containing lots of sections and text boxes / frames (rather than true footnotes).