67538 – XTypeDetection::queryTypeByDescriptor poor performance


Summary: XTypeDetection::queryTypeByDescriptor poor performance

Status:	UNCONFIRMED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	framework
Version:	4.0.3.3 release
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Not Assigned
QA Contact:

URL:
Whiteboard:	NeedAdvice
Keywords:

Depends on:
Blocks:

Reported:	2013-07-30 14:10 UTC by Grigory
Modified:	2015-01-19 22:03 UTC
CC List:	4 users

See Also:

Attachments
C++ source reproducing the problem (12.74 KB, text/plain) 2013-07-30 14:10 UTC, Grigory

Description Grigory 2013-07-30 14:10:04 UTC

Created attachment 83303
C++ source reproducing the problem

I am using Debian and upgraded to LibreOffice from wheezy-backports (version 1:4.0.3-2~bpo70+1 in Debian notation).

I can successfully connect to LibreOffice using the API, but when I try to get the type of any document from an InputStream using XTypeDetection::queryTypeByDescriptor, the call appears to hang. For very short inputs (about 1000 bytes) the function returns the type (generic_Text or encoded in my tests) correctly and quite fast, but if I load even a 50000-byte file, the wait to get the type is more than a minute. For that whole time I can see one thread of my app consuming 50% of CPU and one thread of the LibreOffice process consuming 50% of CPU.

I tried attaching gdb to the process, but all I could determine is that execution stays inside cppu_threadpool::JobQueue::enter and does not leave it until the call completes.

I used OpenOffice from Debian squeeze before and my code worked fine, so I suppose this is a bug.

I attach a simple test application to reproduce the problem.

Comment 1 Stephan Bergmann 2013-07-31 09:32:00 UTC

LibreOffice's document type detection code is notoriously slow, trying lots of filters one by one until it finds a good match, and seeking and reading the same data over and over again from the given input stream.  So, if that stream is only made available across URP, this can easily cause lots of delay, esp. if the stream's data is in a "bogus" format for which no matching filter can be found (so the search needs to go through all of them).
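To make the cost described above concrete, here is a minimal C++ sketch of a detection loop that rewinds and re-reads the stream once per filter. It is not LibreOffice's actual detection code: `CountingStream`, `Filter`, and `detectType` are hypothetical names invented for this illustration, and matching on a leading magic string is an assumed simplification. The call counter stands in for round trips that would occur if the stream lived in another process.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a seekable input stream. Every call is counted,
// because each one would be a round trip if the stream were remote.
class CountingStream {
public:
    explicit CountingStream(std::string data) : data_(std::move(data)) {}

    void seek(std::size_t pos) {
        ++calls_;
        pos_ = pos < data_.size() ? pos : data_.size();
    }

    std::string read(std::size_t n) {
        ++calls_;
        std::string out = data_.substr(pos_, n);
        pos_ += out.size();
        return out;
    }

    std::size_t calls() const { return calls_; }

private:
    std::string data_;
    std::size_t pos_ = 0;
    std::size_t calls_ = 0;
};

// Hypothetical filter: it matches when the stream starts with `magic`.
struct Filter {
    std::string name;
    std::string magic;
};

// Try every filter in order. Each attempt rewinds and re-reads the header,
// so an unrecognized format costs two stream calls per filter.
std::string detectType(CountingStream& stream, const std::vector<Filter>& filters) {
    for (const Filter& f : filters) {
        assert(!f.magic.empty());  // an empty magic would match everything
        stream.seek(0);
        if (stream.read(f.magic.size()) == f.magic) {
            return f.name;
        }
    }
    return std::string();  // no filter matched
}
```

In this sketch an input that matches no filter is the worst case: every filter is tried, so the number of stream calls grows linearly with the number of filters.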

That said, if you say things were considerably faster with an old OpenOffice.org version (which version exactly?), it might be a performance regression in the rewritten binary URP bridge, as hinted at in <https://issues.apache.org/ooo/show_bug.cgi?id=116038#c14> "rewrite binary URP bridge."

Comment 2 Grigory 2013-07-31 10:35:54 UTC

I did some additional research: the main problem is that data is read from the input stream byte by byte. If I pass the file by URL, the performance is great. InputStream::readBytes is called with size parameter num=1, and each such byte is then transmitted across processes, which is very slow. I added debug output to my stream implementation, and the result is:
Successfully connected to LibreOffice
Changed location to: 0
Readed: 30
Changed location to: 0
Changed location to: 0
Changed location to: 0
Readed: 1024
Changed location to: 0
Changed location to: 0
Changed location to: 0
Changed location to: 0
Changed location to: 0
Readed: 4096
Changed location to: 0
Changed location to: 0
Changed location to: 0
Changed location to: 0
Changed location to: 0
Readed: 4096
Changed location to: 0
Changed location to: 0
Readed: 26
Changed location to: 0
Changed location to: 0
Readed: 7
Changed location to: 0
Changed location to: 0
Changed location to: 0
Readed: 512
Changed location to: 0
Changed location to: 0
Readed: 1
Changed location to: 0
Changed location to: 0
Readed: 4
Changed location to: 0
Changed location to: 0
Changed location to: 0
Changed location to: 0
Readed: 4096
Changed location to: 0
Changed location to: 1
Readed: 1
Readed: 1
Readed: 1
Changed location to: 0
Readed: 1
Changed location to: 0
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
Readed: 1
...... 
byte-by-byte read

So it seems that there is no buffering in the communication. This problem makes InputStreams useless: it is not just two times slower, as in <https://issues.apache.org/ooo/show_bug.cgi?id=116038#c14> "rewrite binary URP bridge."
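As a rough illustration of why buffering matters here, the following C++ sketch compares one-byte reads issued directly against a simulated remote stream with the same reads issued through a read-ahead buffer. It does not use the UNO API: `RemoteStream` and `BufferedReader` are hypothetical classes, the `readBytes` signature is a simplified assumption rather than the real XInputStream one, and the 4096-byte chunk size is an arbitrary choice. Seeking is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a stream that lives in another process.
// Each readBytes call counts as one inter-process round trip.
class RemoteStream {
public:
    explicit RemoteStream(std::string data) : data_(std::move(data)) {}

    std::string readBytes(std::size_t n) {
        ++roundTrips_;
        std::string out = data_.substr(pos_, n);
        pos_ += out.size();
        return out;
    }

    std::size_t roundTrips() const { return roundTrips_; }

private:
    std::string data_;
    std::size_t pos_ = 0;
    std::size_t roundTrips_ = 0;
};

// Read-ahead wrapper: serves small reads from a local buffer and only
// goes back to the remote stream, in chunk-sized requests, when it is empty.
class BufferedReader {
public:
    BufferedReader(RemoteStream& source, std::size_t chunk)
        : source_(source), chunk_(chunk) {
        assert(chunk > 0);
    }

    // Returns up to n bytes; an empty result means end of stream.
    std::string readBytes(std::size_t n) {
        std::string out;
        while (out.size() < n) {
            if (offset_ == buffer_.size()) {
                buffer_ = source_.readBytes(chunk_);  // one round trip
                offset_ = 0;
                if (buffer_.empty()) break;  // end of stream
            }
            std::size_t take = std::min(n - out.size(), buffer_.size() - offset_);
            out.append(buffer_, offset_, take);
            offset_ += take;
        }
        return out;
    }

private:
    RemoteStream& source_;
    std::size_t chunk_;
    std::string buffer_;
    std::size_t offset_ = 0;
};
```

With a 50000-byte input, draining the stream one byte at a time costs one round trip per byte when done directly, but only one round trip per 4096-byte chunk when done through the wrapper. In the scenario reported here the one-byte reads originate on the LibreOffice side of the bridge, so a buffer of this kind would presumably have to sit on that side to help.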

Comment 3 Robinson Tryon (qubit) 2015-01-15 19:41:24 UTC

(In reply to Grigory from comment #2)
> I did some additional research: the main problem is that data is read from
> the input stream byte by byte. If I pass the file by URL, the performance is great.

Hi Grigory,
How's the performance of our latest builds?

Status -> NEEDINFO

Comment 4 Grigory 2015-01-19 19:44:49 UTC

(In reply to Robinson Tryon (qubit) from comment #3)
> (In reply to Grigory from comment #2)
> > I did some additional research: the main problem is that data is read from
> > the input stream byte by byte. If I pass the file by URL, the performance is great.
> 
> Hi Grigory,
> How's the performance of our latest builds?
> 
> Status -> NEEDINFO

Sorry, I don't have a computer with the necessary environment to check the error. But the code that I attached to the ticket should reproduce the problem if it still exists.

Comment 5 Robinson Tryon (qubit) 2015-01-19 22:03:27 UTC

(In reply to Grigory from comment #4)
> Sorry, I don't have a computer with the necessary environment to check the
> error. But the code that I attached to the ticket should reproduce the
> problem if it still exists.

Stephan: Is testing the code straightforward, or is this something a dev will need to handle?
