Bug 48729

Summary: autocorrect limit. acor.dat with entry 65535: Loop and/or loss of acor data
Product: LibreOffice Reporter: tommy27 <barta>
Component: LinguisticAssignee: Not Assigned <libreoffice-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: jmadero.dev, quikee, shantanu.oak, timar74
Version: Inherited From OOo   
Hardware: All   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=44580
https://bugs.freedesktop.org/show_bug.cgi?id=46805
Whiteboard: EasyHack,DifficultyBeginner,SkillCpp,TopicCleanup target:3.7.0
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 37361    

Description tommy27 2012-04-15 04:19:50 UTC
------------------
INTRODUCTION
------------------
This bug report is the twin of OOo issue 87672 ( https://issues.apache.org/ooo/show_bug.cgi?id=87672 ) which I reported in March 2008 in OOo 2.4 (gosh!!! 5 years!!! times runs fast!!!). for this reason I suggest not to change the bug title which is the same.

------------------
SUMMARY
------------------
autocorrect databases can store only 65534 per language
if you reach 65535 limit the file crashes and the database is erased.

so there are 2 problems

1- autocorrect entries number is limited

2- loss of data when that limit is reached

------------------
BACKGROUND
------------------
Some informations to get into it:

1- OOo/LibO autocorrect function is based on database of couple of entries (bad spelling --> correct spelling) collected in an .xml file called DocumentList.xml which is stored inside an acor_.dat file (basically it's a .zip with .dat extension) under this Windows path ...User\LibreOffice 3\user\autocorr

2- there's one global .dat  file (acor_.dat) which applies autocorrect items in any languge and a single .dat file for each single language and variant (i.e. acor_en-GB.dat for british english,  acor_en-US.dat foramerican english, acor_it-IT.dat for italian etc. etc.) which applies autocorrect items only if you are writing in that particular language

3- the DocumentList.xml file containing all the entries is based on a 16-bit structure so it has an upper limit of 2^16 = 65 536

4- for some reasons if you get close to that limit, with entry number 65 535, the .xml file “implodes” and all the database is erased with loss of data.

------------------
POSSIBLE SOLUTIONS
------------------
1- rise the upper limit from 2^16 to higher value... if I remember correctly once Calc had the same 2^16 limit for cell numbers which was upgraded to 2^20 = 1 048 576. Could something similar be done for autocorrect as well?

2- allow use of multiple .dat files for the same language (i.e. acor_en-GB1.dat, acor_en-GB2.dat, acor_en-GB3.dat etc. etc.) each one with the 65K limit. Once one is full, you can move to another just as it already happens with custom dictionaries... you can have more than one of those indeed.

------------------
SOURCE CODE
------------------
the autocorrect function is coded into the svxcorr.cxx file

http://docs.libreoffice.org/editeng/html/svxacorr_8cxx_source.html

------------------
TEST FILES
------------------
the .zip contains the acor_it-IT.dat and documentlist.xml files that crash in OOo/LibO once you add another autocorrect entry

https://issues.apache.org/ooo/attachment.cgi?id=52416
Comment 1 tommy27 2012-04-30 12:51:51 UTC
setting this as NEW since this issue was already confirmed in the OOo era and it's still reproducible in LibO (tested either on 3.5.2 and master~2012-04-26_21.37.10_LibO-Dev_3.6.0alpha0_Win_x86_install_en-US)
Comment 2 Michael Meeks 2012-09-13 11:29:53 UTC
Hopefully reasonably easy to fix; needs some testing though; the code to poke at is in:

cui/source/tabpages/autocdlg.cxx - for the editing UI
  + quite possibly there are problems with the list widget here wrt. longer lists.

editeng/inc/editeng/svxacorr.hxx
    - is a good place to start reading for the underlying structures

Of course bumping this list up will potentially create some nasty performance issues around the list code & so on, but - we'll have to deal with those too I imagine.
Comment 3 tommy27 2012-09-13 19:45:50 UTC
I'm not a developer, and I have no C++ knowledge, so I could have misunderstood some things, however as far as I see the issue is due to the fact that the documentlist.xml containg all entries is a 16-bit structure, hence the limit is 65535.

if it could be adapted to 32-bit the limit would be much higher...

looking at LibO 4 development page ( http://wiki.documentfoundation.org/Development/LibreOffice4 ) I see this:

>>> consider moving all 16-Bit integer types to 32-Bit at least 

>>> this cant be done in bulk as the are still a lot of SAL_MAX_INT16-2 style 
>>> magic numbers in the applications, so needs a very careful evaluation 
>>> -Bjoern-michaelsen 2012-07-29T17:09:34 (CEST)

and looking inside svxacorr.cxx I see a lot of "sal_uInt16 nLang"

does this make any sense to you?
Comment 4 tommy27 2012-09-19 03:55:19 UTC
Since I'm no developer but I'm very interest in a fix for this bug, I decided to donate money to support the devs. I have just sent a 50 euros wire bank transfer to the TDF bank account (IBAN: DE12666900000003497390, BIC: VBPFDE66)
 
check your account and you will see a 50 euros transfer with this infos: FUNDING FOR LIBO BUG 48729 - HTTPS://BUGS.FREEDESKTOP.ORG/SHOW_BUG.CGI?ID=48729
 
I will also donate another 50 euros to the personal bank account of the hacker who will finally fix this.
 
The bug has 2 main part:
1- autocorrect entries is limit to 16^2 --> 65535
2- once the limit is reached the database gets erased
 
the most important thing for me is actually to raise the number of entries (from 16^2 to 20^2 or 32^2) since I need to add more autocorrect terms in my LibreOffice.
 
just fixing the database erase without raising that number of available entries would not help me, since my priority is to expand the autocorrect database capacity
 
thank you, TOMMY
Comment 5 Michael Meeks 2012-09-19 08:37:41 UTC
Hi Tommy; thanks for your donation. TDF does not itself do development work for money, although many individuals and companies in our ecosystem do work that way. We appreciate your support & interest, and hopefully someone will pick up this easy hack at some stage ! Thanks you for your generosity.
Comment 6 Tomaz Vajngerl 2012-09-19 09:19:42 UTC
Hi,

I like the fact that you made a donation and offered money to get this resolved instead of bitching about it. This problem doesn't sound like it is hard to solve  and it sound interesting so I will take a look at it today. I don't need the money tough.

Regards, Tomaž
Comment 7 tommy27 2012-09-19 12:30:57 UTC
@Tomaz

thanks for taking care of it.
as I said in the first comment, froma a non-developer point of view I actually see 2 possible ways to fix it.

1- rise the upper limit from 2^16 to higher value.
if I remember correctly once Calc had the same 2^16 limit for cell numbers which was upgraded to 2^20 = 1 048 576. 

2- allow use of multiple .dat files for the same language 
(i.e. acor_en-GB1.dat, acor_en-GB2.dat, acor_en-GB3.dat etc. etc.) each one with the 65K limit. Once one is full, you can move to another just as it already happens with custom dictionaries... you can have more than one of those indeed.

so, I expect that from a developer point of view you could find out which is the easiest way to hack it up, and which one will have less impact on performance
Comment 8 Not Assigned 2012-09-19 18:31:28 UTC
Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=78a39502de36c32d7aaf6e5bbde1f8df80fdd21f

fdo#48729 Change int16->int32 in auto replace dialog.



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 9 Tomaz Vajngerl 2012-09-19 18:53:50 UTC
Hi,

I think I found the problem. Everything was already using dynamic lists/sets and all the nice things but the code in the dialog still used sal_uInt16 variables for position and iteration. I have commited a fix into the master so it will be available in the next daily build. Tommy please test if it is working now.

Another problem is that it is slow as hell to add/edit entries in the autocorrect table - this could still be optimized as I don't see a reason it is so slow when there are only 65k entries.

Best regards, Tomaž
Comment 10 tommy27 2012-09-19 19:40:24 UTC
(In reply to comment #9)
> Hi,
> 
> I think I found the problem. 
> ... snip ... 

>I have commited a fix into the master so it will be available 
> in the next daily build. Tommy please test if it is working now.

WOW!!! how fast have you been!!!

so if I get it right, being a Windows user, I shall download a new master build
in the next couple of days from here: 
http://dev-builds.libreoffice.org/daily/Win-x86@6/

I'm eager to test the new 32-bit enabled autocorrect function and report if it
works with larger entries database and if such higher capacity has any effect
on performance.

> Another problem is that it is slow as hell to add/edit entries in the
> autocorrect table - this could still be optimized as I don't see a reason it is
> so slow when there are only 65k entries.

this is a know problem, already reported by me as Bug 49350.
a backtrace for this bug is available and Micheal Meeks labeled it as Easy Hacks suggesting a way to obtain a 10x speedup for it

> Best regards, Tomaž

thank you again very much,
Tommy
Comment 11 Andras Timar 2012-09-19 20:19:55 UTC
(In reply to comment #9)
> I think I found the problem. Everything was already using dynamic lists/sets
> and all the nice things but the code in the dialog still used sal_uInt16
> variables for position and iteration.

Because the underlying classes use xub_StrLen everywhere, which is in fact sal_uInt16. IMHO you need to change those, too.
Comment 12 Joel Madero 2012-09-19 21:45:00 UTC
@Tomaz - first thanks for doing this quickly :)

Now my concern....I think this sets a terrible precedent, someone offers cash and even though it was declined, a patch is made right away. Might something we need to discuss going forward :) 

@reporter - thanks for reporting, donating, and being a champion for this issue, I'm glad that it looks resolved for you :)
Comment 13 Tomaz Vajngerl 2012-09-20 05:01:17 UTC
(In reply to comment #11)
> Because the underlying classes use xub_StrLen everywhere, which is in fact
> sal_uInt16. IMHO you need to change those, too.

Hm... OK I will have to go through the code again today.
Comment 14 Michael Meeks 2012-09-20 08:52:21 UTC
Hi Joel,

> Now my concern....I think this sets a terrible precedent, someone offers cash
> and even though it was declined, a patch is made right away. Might something
> we need to discuss going forward :) 

Not sure I think it's a terrible precedent; it's just a micro-payments version of what larger consultancies are doing I think; of course bugzilla is not a terribly elegant tool for that - but lets discuss at the ESC later.

ATB,

Michael.
Comment 15 Tomaz Vajngerl 2012-09-20 14:12:44 UTC
> so if I get it right, being a Windows user, I shall download a new master build
> in the next couple of days from here: 
> http://dev-builds.libreoffice.org/daily/Win-x86@6/
> 
> I'm eager to test the new 32-bit enabled autocorrect function and report if it
> works with larger entries database and if such higher capacity has any effect
> on performance.

The fix is included in: http://dev-builds.libreoffice.org/daily/Win-x86@6/master/2012-09-19_20.37.35/master~2012-09-19_20.37.35_LibO-Dev_3.7.0.0.alpha0_Win_x86_install_en-US.msi

(and any later daily build)
Comment 16 tommy27 2012-09-21 03:44:01 UTC
@Tomaz

positively tested your fix on master-2012-09-19_20.37.35/

I merged two 65K autocorrect database to create a 130K database.
LibO can now handle this enlarged acor.dat files without crashing and loosing all data as happened before with the 16bit limit.

adding new entries by "right click autocorrect suggestions" is still fast and unaffected, while adding new entries in the "autocorrect replacement table" is even slower than before because of the higher number of entries.

as I said before that is Bug 49350, which I happily see you have already tried to fix with a new patch. Once a new master build is available I'll test it as well and give my feedback.

great job!!! thank you very much!!!
Comment 17 tommy27 2012-09-29 08:04:57 UTC
setting this as RESOLVED FIXED in master (future LibO 3.7).

I'll take the time to do some more testing about performance and then I'll ask cherry-pick in LibO 3.6.x

thanks again Tomaz for your wonderful work.
Comment 18 tommy27 2012-10-04 03:38:49 UTC
ok, my test say that Tomaz's patch has no serious side effects.

so I'm asking him to undergo review on the dev-list for cherry-picking in LibO 3.6.3
Comment 19 tommy27 2012-12-09 07:54:39 UTC
LibO 4.0 is imminent (february 2013) however I think this easy-hack fix is worth backporting to the 3.6.x branch which still has 2 programmed micro-releases (.5 and .6 which are scheduled in february and april 2013).

the expanded capacity of autocorrect database (from 65K to 4M) IMHO is a great added value in LibO towards OOo/AOO which still suffer the 16bit limit.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.