Bug 66597

Summary: Other: Text-copy problems with Hindi text copied from PDF
Product: LibreOffice
Reporter: Steve White <stevan.white>
Component: Printing and PDF export
Assignee: Not Assigned <libreoffice-bugs>
Status: NEW
QA Contact:
Severity: normal
Priority: medium
CC: dr.khaled.hosny, fitojb, samjnaa
Version: 4.0.2.2 release   
Hardware: Other   
OS: All   
Whiteboard: BSA
Bug Depends on: 62846    
Bug Blocks:    
Attachments: More thorough description of the problem.
LOWriter doc as described in report
PDF as exported on my system

Description Steve White 2013-07-04 19:23:51 UTC
Created attachment 82038 [details]
More thorough description of the problem.

Problem description: 

Steps to reproduce:
1. In a LOWriter doc, put several copies of the lines (Article 1 of the UDHR)
सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है । 
उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए ।
Format each copy with a different font supporting Hindi.  I used
the distro versions of Lohit Hindi and Gargi, as well
as GNU FreeSerif and GNU FreeSans (latest versions from SVN).
2. Export as PDF
3. Open the resulting file with Adobe Reader.
Select and copy the text from the PDF file,
and paste it into a text editor.

Current behavior:
Lohit Hindi
सभी मनुष्यों को गौरव और अधिधिकारों के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुि औद्धि और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतार्ताव करना चाि औहए ।
FreeSerif
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतारव करना चािहए ।
FreeSans
सभी मनुष्यों को गौरव और अधिधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्ाप्त है ।
उन्हें बुिद्धि और अधन्तरात्मा की देन प्ाप्त है और परस्पर उन्हें भाईचारे के भाव से बताव करना चािहए ।
Gargi
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बताव करना चािहए ।

Expected behavior:
Should get something more like the original text back.

Operating System: All
Version: 4.0.2.2 release
Comment 1 Steve White 2013-07-04 19:27:30 UTC
Created attachment 82039 [details]
LOWriter doc as described in report
Comment 2 Steve White 2013-07-04 19:31:02 UTC
Created attachment 82040 [details]
PDF as exported on my system
Comment 3 Khaled Hosny 2013-07-04 21:05:29 UTC
Text extraction from PDF is a very unreliable process. Glyph names play an important role: using proper glyph names in accordance with the Adobe Glyph Naming convention (http://www.adobe.com/devnet/opentype/archives/glyph.html) should help extractability of text set in GNU FreeFont, which currently contains glyph names that are useless for text extraction, like dev_rakaar and aasigndeva. Glyph names do not help with re-ordering, and there are probably some LibreOffice bugs in setting ToUnicode values in PDF, but proper glyph names are the start.
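
To illustrate why a name such as dev_rakaar carries no usable information, here is a minimal sketch of how an extractor might turn a glyph name into text under the "uniXXXX" and "uXXXX" naming rules. It is a simplification, not a full implementation of the convention: the function name `glyph_name_to_text` and the optional `agl` lookup table are hypothetical, and the real Adobe Glyph List is not included.

```python
import re


def _is_valid_scalar(cp):
    # Accept only Unicode scalar values (no surrogates, nothing past U+10FFFF).
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def glyph_name_to_text(name, agl=None):
    """Derive a character string from a glyph name (simplified sketch).

    `agl` is a hypothetical dict mapping plain glyph names to strings;
    it stands in for the real Adobe Glyph List, which is omitted here.
    """
    agl = agl or {}
    # Drop any suffix after the first period, e.g. ".half" or ".alt".
    base = name.split(".", 1)[0]
    out = []
    # Underscores separate the components of a ligature name.
    for comp in base.split("_"):
        if comp in agl:
            out.append(agl[comp])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", comp):
            digits = comp[3:]
            for i in range(0, len(digits), 4):
                cp = int(digits[i:i + 4], 16)
                if _is_valid_scalar(cp):
                    out.append(chr(cp))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", comp):
            cp = int(comp[1:], 16)
            if _is_valid_scalar(cp):
                out.append(chr(cp))
        # Any other component contributes nothing.
    return "".join(out)
```

With this sketch, a name like uni0915094D0937 yields the three characters of the conjunct, while dev_rakaar yields an empty string.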
Comment 4 Khaled Hosny 2013-07-04 21:15:42 UTC
Gargi and Lohit Hindi (at least my version of them) have some wrong glyph names as well.
Comment 5 Steve White 2013-07-04 21:25:48 UTC
Hi Khaled.

Of course we're aware that copying text from PDF is unreliable.
In fact, with the current technology, based on ToUnicode, it is impossible to reproduce the original text.

I am sure however, in the case of Indic scripts, it could be done in such a way that results in mostly readable text.

The reason I submitted this report to LibreOffice is that this product does the best job of the several approaches I tested.  I think it could be improved with the least effort, and serve as a model for other systems.

Regarding the AGLFN, as I said, it could be used to break a tie, but otherwise, you should reconsider your statements.  The AGLFN cannot carry more information than the ToUnicode stream does, and OpenType feature tables carry more information than either can.  The best approach would be to judiciously use the OpenType features to populate the ToUnicode stream.

As I said, the AGLFN could be used to break a tie in OpenType feature tables.  But if it conflicts with the feature tables, it cannot be right.  (And in fact, that's what my tests showed: technologies that relied on AGLFN often showed mistakes because of failure to code a glyph name...which is a pity because correct info was available.) It would be better to drop the technology.

Cheers!
Comment 6 Steve White 2013-07-05 08:31:50 UTC
Khaled,

Several of the bugs pointed out are logic errors in the generation code (for sure the duplicated characters, and I think also the disappearing/reappearing one).  These have nothing to do with glyph naming.

I also pointed out that although Gargi and Lohit attempt (different) AGLFN schemes, each has bugs in that regard.  This is part of my complaint with the AGLFN.  In each case, there was sufficient information in the font's feature tables to produce ToUnicode entries which would have correctly decomposed the glyphs. Although often LibreOffice PDF generation algorithms use OpenType tables to populate ToUnicode, here the algorithms instead fell back to AGLFN, and failed.

It would be best to prefer the OpenType features in building ToUnicode, and fall back to AGLFN only to break a tie, in case those features would specify more than one character string for a given glyph.
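
A minimal sketch of that preference order follows. It assumes the candidate strings for a glyph have already been derived from the font's feature tables by some other step; the function name `choose_tounicode` and both parameters are hypothetical and do not come from LibreOffice. Falling back to the name-derived string when the feature tables give nothing at all is an extra assumption of the sketch.

```python
def choose_tounicode(feature_candidates, name_guess=None):
    """Pick the ToUnicode string for one glyph (illustrative sketch).

    feature_candidates: strings derived from OpenType feature tables
        (hypothetical input, produced elsewhere).
    name_guess: string derived from the glyph name, or None.
    """
    # Remove duplicates while keeping the original order.
    unique = list(dict.fromkeys(feature_candidates))
    if len(unique) == 1:
        # The feature tables are unambiguous: ignore the glyph name.
        return unique[0]
    if len(unique) > 1:
        # Tie: let the glyph name decide only if it agrees with a candidate.
        if name_guess in unique:
            return name_guess
        # Otherwise keep a deterministic choice from the feature tables.
        return unique[0]
    # Assumption: with no feature-table information, use the name if any.
    return name_guess
```

The point of the ordering is that a glyph name which conflicts with the feature tables is never allowed to override them.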

Another thought:

How to tackle the re-ordering of glyphs (especially, the 'i' and 'ii' vowel signs) using ToUnicode?  (I don't know if LibreOffice attempts something like this, I just see it's mostly wrong.)  The idea is based on making compound glyphs in the internal representation of the PDF file  -- they need not correspond to slots in the original font.

When a glyph that needs re-ordering (as 'i' and 'ii') is detected, it should be possible to identify the following consonant cluster.  The entire group, including the vowel and cluster, could be made a single glyph.  Then the fake entry for that glyph in the ToUnicode stream would specify characters for the decomposed cluster, with the vowel re-ordered to the end of the cluster.

Of course, identifying the cluster could be tricky in some cases, but in modern Devanagari at least, it usually consists of a few half-form consonants followed by a consonant, or else a single consonant ligature.  (That may be all--need to consult Unicode ch. 9)
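
As a rough sketch of the string such a fake ToUnicode entry would hold, the following moves the vowel sign to the end of the cluster that follows it. It is deliberately simplified: it handles only the vowel sign 'i' (U+093F), treats a cluster as consonants joined by virama, and ignores nukta and other marks. All names (`cluster_end`, `visual_to_logical`) are hypothetical, and this is not how LibreOffice works.

```python
VIRAMA = "\u094D"
VOWEL_SIGN_I = "\u093F"


def is_consonant(ch):
    # Devanagari consonants KA..HA plus the precomposed nukta forms.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def cluster_end(text, start):
    """Return the index just past the consonant cluster beginning at start."""
    i = start
    while i < len(text) and is_consonant(text[i]):
        i += 1
        # A virama followed by another consonant continues the cluster.
        if (i + 1 < len(text) and text[i] == VIRAMA
                and is_consonant(text[i + 1])):
            i += 1
        else:
            break
    return i


def visual_to_logical(text):
    """Move each vowel sign 'i' from before its cluster to after it."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == VOWEL_SIGN_I:
            end = cluster_end(text, i + 1)
            if end > i + 1:
                out.append(text[i + 1:end])  # the consonant cluster
                out.append(VOWEL_SIGN_I)     # vowel sign re-ordered to the end
                i = end
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

Applied to the character order seen in the report for the last word of the second line (चािहए), this returns the original चाहिए.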

And of course, there are other ways to do it!
Comment 7 Shriramana Sharma 2014-03-28 16:19:50 UTC
Khaled, is this perhaps related to bug 62728, since adding support for PDF/A-2U will/should fix the problem? I also find that any Indic text does not get copied correctly from PDFs exported by LibO. Using latest release LibO 4.2.2 on Kubuntu Saucy.
Comment 8 Khaled Hosny 2014-03-28 21:20:16 UTC
I can’t find a complete specification of PDF/A-2 level U, but it seems to require preserving the Unicode representation of the text, which is indeed a goal shared with this bug as well.
