Hi,

Of all the systems I've tried on Linux, LOWriter does the best
job of converting text to PDF, in the sense of keeping the text
copyable from the PDF file.

In particular, in copying Hindi text, I know there are 
a lot of hoops to jump through, not the least being that the
letters are often re-ordered by the font layout layer.

However, in producing a PDF file containing Hindi text,
there remain some problems.
 ----------------------------------------------------
The tests I ran were the following.
In a LOWriter doc, put several copies of these lines (Article 1 of the UDHR):
सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है । 
उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए ।
Format each copy with a different font supporting Hindi.

Next, "Export as PDF".
Opened the resulting file with Adobe Reader.

Now select and copy the text from the PDF file,
and paste it into a text editor.
(Oh, let's hope the system's default encoding is UTF-8!)

It should appear like the original text above.
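
To make the comparison less eyeball-dependent, something like the
following could be used. It is only a sketch of how I'd check the
pasted text against the original, code point by code point; the
function names are my own, and the sample strings are the word from
the duplication problem described further down.

```python
import difflib
import unicodedata


def describe(ch):
    """Return a readable label such as 'U+0927 DEVANAGARI LETTER DHA'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def codepoint_diff(original, pasted):
    """List (operation, original_part, pasted_part) for every difference."""
    # Normalize both sides the same way so only real differences show up.
    a = unicodedata.normalize("NFC", original)
    b = unicodedata.normalize("NFC", pasted)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


original = "अधिकारों"
pasted = "अधिधिकारों"
for op, old, new in codepoint_diff(original, pasted):
    print(op, [describe(c) for c in old], "->", [describe(c) for c in new])
```

An empty result means the copy survived the round trip intact.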
 ----------------------------------------------------
Note that I've done analogous tests starting from Firefox
(using the CUPS PDF printer), and with XeLaTeX.
 ----------------------------------------------------
Here I compared the distro's Lohit Hindi and Gargi, as well
as GNU FreeSerif and GNU FreeSans (latest versions from SVN).
This is LOWriter 4.0.2.2 on Ubuntu.

Lohit Hindi
सभी मनुष्यों को गौरव और अधिधिकारों के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुि औद्धि और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतार्ताव करना चाि औहए ।
FreeSerif
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतारव करना चािहए ।
FreeSans
सभी मनुष्यों को गौरव और अधिधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्ाप्त है ।
उन्हें बुिद्धि और अधन्तरात्मा की देन प्ाप्त है और परस्पर उन्हें भाईचारे के भाव से बताव करना चािहए ।
Gargi
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बताव करना चािहए ।

They're *close*.  I'm very impressed that it re-orders the vowel signs!!!
Nobody else does that!!!


1) All tests show a duplication of letters in the word
	अधिकारों
producing
	अधिधिकारों

2) The anusvara--the dot on top in में--is lost in most cases.  
But for this one word, in FreeSans, it isn't lost, and in another, 
it is lost in FreeSerif and Gargi, but not in Lohit or FreeSans!  
What's the pattern???
Gargi has a ligature of 094B+0902; it's named uni0972 and is not in the PUA, which is just a pity.
FreeSerif also has a ligature dev_o_anusvara.abvs.
FreeSans does not use this... relies on mark placement.
Lohit names the same glyph u094B_u0902.abvs.
I modified the FreeSerif font, placing the anusvara using GPOS positioning.
Then the anusvara reappeared in the test.
But something is wrong here.
	Guess: maybe for these 'abvs' replacements, the AGLFN is being used?
	Otherwise, for unknown reasons, the anusvara isn't being extracted
	properly from the ligature glyphs.

3) The word बुद्धि is screwed up in every case but in different ways.

	The last glyph is a ligature, da-dha.  The ligature is decomposed
	into consonants, but the dha is lost.  Then it has trouble re-ordering 
	the vowel sign.
	(It should be repositioned after the decomposed consonants.)

4) Another strange duplication in Lohit only: बतार्ताव

5) FreeSerif, Sans, Gargi: the reph बर्ताव should be transformed back to ra-virama,
but instead it's lost entirely: बताव
	This information should be gotten from the font's 'rphf' table.
	All of the fonts have it.  
	The only difference I see is: Lohit uses an AGLFN name.

6) Weird: अन्तरात्मा gets a ध (dha) in every case! अधन्तरात्मा
	Could this be the dha lost in (3)?
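
To illustrate the repositioning I mean in (3): the vowel sign i
(U+093F) is drawn to the left of its consonant cluster, so in glyph
order it comes first, but in the text it belongs after the whole
cluster. Below is a minimal sketch, assuming the input is already a
string in visual (glyph) order with ligatures decomposed. It is not
how LO does it (I haven't seen that code), it handles only this one
vowel sign, and it must not be applied to text that is already in
logical order.

```python
VIRAMA = "\u094D"
NUKTA = "\u093C"
VOWEL_SIGN_I = "\u093F"  # drawn to the left of its consonant cluster


def is_consonant(ch):
    """True for the Devanagari consonants KA..HA and QA..YYA."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def visual_to_logical(text):
    """Move each vowel sign i to the end of the cluster that follows it."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i] != VOWEL_SIGN_I:
            out.append(text[i])
            i += 1
            continue
        # Scan the cluster: consonant [nukta] (virama consonant [nukta])*
        j = i + 1
        while j < n and is_consonant(text[j]):
            j += 1
            if j < n and text[j] == NUKTA:
                j += 1
            if j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 1  # step over the virama, continue with next consonant
            else:
                break
        out.append(text[i + 1:j])  # the cluster first...
        out.append(VOWEL_SIGN_I)   # ...then the vowel sign
        i = j
    return "".join(out)


# Visual order: ba, u, i, da, virama, dha
print(visual_to_logical("\u092C\u0941\u093F\u0926\u094D\u0927"))
```

The same routine would also turn the extracted चािहए back into चाहिए.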
 ----------------------------------------------------
My take on this:

Clearly your code uses the font's feature tables to construct the PDF file's ToUnicode.
This is good and right, although hard to do right (and in fact, since the tables
constitute a many-to-many mapping of glyphs to character strings and ToUnicode
is many-to-one, it's impossible to do perfectly).
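
A toy illustration of that mismatch, in case it helps. The rule list
and the glyph names dev_da_dha and dev_ra_below are made up for the
example (dev_o_anusvara.abvs is the FreeSerif name mentioned above);
a real implementation would read the rules from the font's GSUB
table. Inverting the rules gives a usable ToUnicode entry only for
glyphs that have exactly one source string; the others are ties that
need some other way of deciding.

```python
from collections import defaultdict

# Hypothetical substitution rules: (input character string, glyph name).
RULES = [
    ("\u0926\u094D\u0927", "dev_da_dha"),        # da + virama + dha
    ("\u094B\u0902", "dev_o_anusvara.abvs"),     # vowel sign o + anusvara
    ("\u094D\u0930", "dev_ra_below"),            # virama + ra
    ("\u0930\u094D", "dev_ra_below"),            # ra + virama, same glyph
]


def invert_rules(rules):
    """Split glyphs into unambiguous ToUnicode entries and ties."""
    candidates = defaultdict(list)
    for text, glyph in rules:
        if text not in candidates[glyph]:
            candidates[glyph].append(text)
    to_unicode = {g: c[0] for g, c in candidates.items() if len(c) == 1}
    ties = {g: c for g, c in candidates.items() if len(c) > 1}
    return to_unicode, ties


to_unicode, ties = invert_rules(RULES)
print("unambiguous:", sorted(to_unicode))
print("ties:", {g: len(c) for g, c in ties.items()})
```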

I suspect you are using the AGLFN.  If that's so, please re-consider.
In the presence of OpenType tables, glyph names should be ignored.
The glyph names can at best duplicate what's in the tables. Otherwise
following the AGLFN *only* causes problems.
Oh, hell, maybe you could use it to break a tie, when the feature tables
map the same glyph to two different character strings.
But please, usually, ignore the AGLFN.  It is not a good thing.
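
For reference, here is roughly what following the glyph names would
give for the names mentioned above. This is only a sketch of the
"uniXXXX" and "uXXXX" parts of Adobe's glyph-naming convention as I
understand it; the lookup of named glyphs in the glyph list itself is
left out, and I am not claiming this is what your code does.

```python
import re

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)$")  # uni + groups of 4 hex digits
_U = re.compile(r"u([0-9A-F]{4,6})$")         # u + 4 to 6 hex digits


def glyph_name_to_text(name):
    """Derive a character string from a glyph name; '' if nothing parses."""
    base = name.split(".", 1)[0]  # drop any suffix such as ".abvs"
    chars = []
    for part in base.split("_"):  # ligature components
        m = _UNI.match(part)
        if m:
            digits = m.group(1)
            chars.extend(
                chr(int(digits[k:k + 4], 16)) for k in range(0, len(digits), 4)
            )
            continue
        m = _U.match(part)
        if m:
            chars.append(chr(int(m.group(1), 16)))
        # Anything else would need the glyph list lookup (omitted here).
    return "".join(chars)


for name in ("u094B_u0902.abvs", "uni0972", "dev_o_anusvara.abvs"):
    text = glyph_name_to_text(name)
    print(name, "->", [f"U+{ord(c):04X}" for c in text])
```

So the Lohit name yields 094B+0902, the Gargi name yields the single
code point 0972 rather than the two characters the ligature stands
for, and the FreeSerif name yields nothing under these two rules.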

I think this is only a few bugs away from working adequately.

Further things to consider:
1) the rearrangement of vowels really is necessary.
2) the OpenType standards for Indic scripts changed in 2005, 
   introducing new script tags such as 'dev2' to replace 'deva', 
   under which several features behave differently; notably, 
   the order of inputs to 'akhn', 'abvf', 'blwf' is altered.