Bug 44681 - EasyHack: port to CLucene from java/Lucene ...
Status: CLOSED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Extensions
Version: unspecified (earliest affected)
Hardware: Other All
Importance: medium major
Assignee: Gert van Valkenhoef
URL:
Whiteboard:
Keywords: difficultyInteresting, easyHack, skillCpp, topicCleanup
Depends on:
Blocks:
 
Reported: 2012-01-11 07:44 UTC by Michael Meeks
Modified: 2015-12-16 00:25 UTC

See Also:
Crash report or crash signature:


Attachments
Partial implementation (indexing) (4.68 KB, text/x-c++src)
2012-02-09 13:01 UTC, Gert van Valkenhoef

Description Michael Meeks 2012-01-11 07:44:30 UTC
When we do our first-ever startup, we index help files from our internal extensions. This causes us to start up Java, which can easily take 5-10 seconds and makes our first-ever start rather horribly slow. This is also the last Java usage (I think) during first-start. We also pay this price when searching in the help for the first time, so ... it'd be nice to fix.

        The help indexer lives inside l10ntools/source/help - which builds the
indexes, and is re-used at first start for any installed extensions (I
think). This is mostly C++ anyway - with a horrible blob of < 500 total
lines of Java (188 ';' lines) to link to the Java lucene stuff - it
should be fairly trivial to re-do.

        Then the run-time uses of that should be in xmlhelp: see the "Access
Lucene via XInvocation" line in
xmlhelp/source/cxxhelp/provider/resultsetforquery.cxx
("com.sun.star.help.HelpSearch"). This has another 600 lines (155 ';'s) in
xmlhelp/source/com/sun/star/help/ that badly need to die in similar
fashion.

        Hopefully the file format is the same between clucene and lucene,
so we can do one bit (e.g. replacing xmlhelp/source/com/sun/star/help/) and
then the other, but it shouldn't be a huge bother to do both at once.
Comment 1 Gert van Valkenhoef 2012-02-09 13:01:26 UTC
Created attachment 56833 [details]
Partial implementation (indexing)

I wrote a C++ port of the Java-based HelpIndexerTool (or, actually, only the part that does the indexing using Lucene, with a minimal main() that just indexes a hard-coded test directory). Before I go forward, there are a number of questions about how this will be used and what are the constraints etc.:

 * Lucene 2.3 is used, but the CLucene stable version is Lucene 1.9.1
   compatible. The developers recommend using the Git version, which *is*
   compatible with 2.3. Is that OK?

 * How exactly is HelpIndexerTool used? I believe both as a command-line tool
   (as part of a ??? -> HelpLinker -> HelpIndexer -> ???) chain in the build
   process, and as a run-time component to index help for extensions. Is it
   desired to keep HelpIndexer as a stand-alone command-line tool, or is that
   just because it is a Java component currently?

 * HelpIndexerTool does two things:

    - Index help files using Lucene, producing intermediate files

    - Bundle the intermediate files into a ZIP archive

   Currently, I ported most of the first part (for Japanese there is a special
   Analyzer, which I don't know how to test, and there are a bunch of options
   to check certain things, of which I'm not sure whether they're ever used).

   Question: does creating the ZIP need to be part of this? If so, what is the
   best way to create the archive?

 * HelpFileDocument is a simple support class that produces a Lucene Document
   for a given help file. This is fully ported (as the helpDocument method).

 * I'm assuming that CLucene will *always* be compiled with TCHAR defined as
   wchar_t. This is because of my ignorance of how one does portable wide 
   strings in LibreOffice. Please enlighten me.

 * How to incorporate the CLucene dependency in the build process?

The attached code is contributed under the LGPLv3+ / MPL.
Comment 2 Caolán McNamara 2012-02-10 07:28:53 UTC
oh shiny, that's *very* encouraging. With a bit of luck this could take hours off my multi-language build times :-)

Well, first off I reckon it's best to mail your code-to-date and your questions again to the general development list libreoffice@lists.freedesktop.org to get better and wider feedback,

but here are my guesses.

 * Lucene 2.3 is used, but the CLucene stable version is Lucene 1.9.1
   compatible. The developers recommend using the Git version, which *is*
   compatible with 2.3. Is that OK?

probably yeah

 * How exactly is HelpIndexerTool used? I believe both as a command-line tool
   (as part of a ??? -> HelpLinker -> HelpIndexer -> ???) chain in the build
   process, and as a run-time component to index help for extensions. Is it
   desired to keep HelpIndexer as a stand-alone command-line tool, or is that
   just because it is a Java component currently?

I think it's desirable for it to be a standalone command-line tool. It gets used when building the "helpcontent2" module, which is really slow for lots of enabled languages.

   Currently, I ported most of the first part (for Japanese there is a special
   Analyzer, which I don't know how to test, and there are a bunch of options
   to check certain things, of which I'm not sure whether they're ever used)

The CJKAnalyzer comes with the java lucene, so it's a special one, but not a custom one belonging to us. *Presumably* this means that the lucene world knows any potential gotchas with trying to convert uses of it to clucene. I'm not exactly sure what it does over the generic one, but we've got some Japanese readers who should be able to read the final output of a conversion to see if the quality is sufficient.

   Question: does creating the ZIP need to be part of this? If so, what is the
   best way to create the archive?

Back in the day, the last time I converted the original java HelpLinker to c++, I *cough* just spawned off perl to do the zipping, e.g. see JarOutputStream::JarOutputStream in http://people.redhat.com/caolanm/ooocvs/workspace.helplinker01.patch; you could grab and re-use that. In the longer run we might expose some more stuff from package/inc to export a simple zip api, but using the (silly) JarOutputStream would do for now.


 * I'm assuming that CLucene will *always* be compiled with TCHAR defined as
   wchar_t. This is because of my ignorance of how one does portable wide 
   strings in LibreOffice. Please enlighten me.

Presumably this will "just work". FWIW we have an 8-bit code unit "rtl::OString" class and a UTF-16 "rtl::OUString" class in LibreOffice. I'm not sure if we need to bridge from these to whatever CLucene uses at any point, but I'm sure it's doable if necessary.
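
For illustration only, here is a minimal sketch of what such bridging might look like, assuming (as in comment 1) that CLucene's TCHAR is wchar_t. It uses std::u16string as a stand-in for the UTF-16 buffer of an rtl::OUString so that it compiles without LibreOffice or CLucene headers; utf16ToWide is a hypothetical helper name, not an existing function in either code base.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-16 code units to a std::wstring.
// Where wchar_t is 32 bits wide (e.g. Linux), surrogate pairs are combined
// into a single code point; where it is 16 bits wide (e.g. Windows), the
// code units are copied through unchanged.
std::wstring utf16ToWide(const std::u16string& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t c = in[i];
        const bool isHighSurrogate = c >= 0xD800 && c <= 0xDBFF;
        if (sizeof(wchar_t) >= 4 && isHighSurrogate && i + 1 < in.size())
        {
            const char32_t low = in[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                // Combine the pair into one code point above U+FFFF.
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
        }
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

Unpaired surrogates are passed through unchanged here; a real conversion would have to decide how to handle them.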

 * How to incorporate the CLucene dependency in the build process?

It's sort of tricky to do this, but there are plenty of examples, e.g. see the libwpd or libcdr or hunspell dirs, which are special modules that build extra dependencies. Basically don't worry about this bit; get it converted to clucene and with some luck someone else will handle figuring out how to build clucene itself as part of our build.
Comment 3 Gert van Valkenhoef 2012-02-10 09:31:05 UTC
Thanks for the comments. I'll look into your suggestions and then send the next version to the general list.
Comment 4 Michael Meeks 2012-02-17 13:20:54 UTC
so marking a dup of the remove stdlibs bug.

*** This bug has been marked as a duplicate of bug 46246 ***
Comment 5 Michael Meeks 2012-02-17 13:21:44 UTC
urk - terribly sorry, wrong bug ...
Comment 6 Caolán McNamara 2012-02-20 13:12:39 UTC
set this as assigned as it's in-progress
Comment 7 Caolán McNamara 2012-02-23 02:35:42 UTC
reassigning for glory attribution.

Yup, all works. The only interesting problem was that clucene's defaults overflowed for our "ja" help and threw an exception on creating the help indexes. Apparently, according to the documentation, lucene has the same limit except that it silently drops what doesn't fit, so now that I've doubled the limit I presume this means we get better Japanese indexing as well, as a freebie.
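
To make the difference between the two behaviours concrete, here is a small self-contained sketch. It assumes the limit in question is a cap on the number of tokens indexed per field, which the comment above does not state explicitly. OverflowPolicy and limitTokens are hypothetical names made up for this illustration; they are not CLucene or Lucene API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical policies modelling the two behaviours described above.
enum class OverflowPolicy
{
    Throw,   // fail loudly when a field is too long (as clucene did here)
    Truncate // silently drop what doesn't fit (as lucene reportedly does)
};

// Return at most maxTokens tokens of a field, following the chosen policy.
std::vector<std::string> limitTokens(std::vector<std::string> tokens,
                                     std::size_t maxTokens,
                                     OverflowPolicy policy)
{
    if (tokens.size() <= maxTokens)
        return tokens; // within the limit: nothing to do
    if (policy == OverflowPolicy::Throw)
        throw std::length_error("field has more tokens than the limit");
    tokens.resize(maxTokens); // keep only the first maxTokens tokens
    return tokens;
}
```

Under the truncating policy an over-long help page still indexes, but its tail is not searchable, which is why raising the limit could improve search results and not just avoid the exception.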
Comment 8 Robinson Tryon (qubit) 2015-12-16 00:25:05 UTC
Migrating Whiteboard tags to Keywords: (EasyHack,DifficultyInteresting,SkillCpp,TopicCleanup)
[NinjaEdit]