Description
Jason Crain
2012-02-25 01:13:44 UTC
Created attachment 57623 [details]
incorrect display in evince
Created attachment 57624 [details]
correct display in evince
Created attachment 57625 [details] [review] Fixed display for selected glyph in ActualText span This patch seems to correct the issue. It sets the correct CharCode. I don't think this patch is correct, you are only setting the charcode once in ActualText::addChar so if multiple ActualText::addChar calls happen before ActualText::end the other charcodes are lost, no? The patch doesn't work when the ActualText span contains more than one glyph. There is a test case in the test suite that demonstrates the problem: http://cgit.freedesktop.org/poppler/test/tree/unittestcases/WithActualText.pdf Created attachment 57989 [details] [review] Enable displayed chars to map to any number of text chars It's tricky when the length of the ActualText does not match the number of displayed glyphs. This first patch modifies the TextWord, TextLine, TextLineFrag, TextBlock, and TextPage classes to suport displayed characters that can be mapped to any number of text characters. Created attachment 57990 [details] [review] Fixes display for selected glyphs in ActualText span This sets the correct CharCode for each glyph in an ActualText span. Attempts to match one text character to each glyph. If there are more glyphs, they are added without matching text. If there is more text, the remaining is added to the last glyph. Created attachment 57993 [details] [review] fix selection of glyphs in actualtext Thanks for these patches. I have started looking at some of the text related bugs and the inability of TextOuputDev to understand the difference between glyphs and characters is the cause of some of these bugs. The first patch is very similar to the solution I had in mind. The first patch also fixes bug 9001. Some comments on the first patch: The following code in TextBlock::coalesce() needs fixing: if (word2->len == word0->len && !memcmp(word2->text, word0->text, word0->len * sizeof(Unicode))) { len need to be replaced with textLen. I don't think addChar should be renamed to addChars. My understanding of the code is that 'Char' is referring to the CharCodes and only one CharCode is added per call. I would move the surrogate decoding to TextWord:addChar() and do the decoding as the unicode values are copied into the text array. This avoids having to make a copy of the unicode array in TextPage::addChar(). Some comments on the second patch: I don't like the way the second patch is mapping the replacement text to the charcodes. I am attaching an alternative patch that distributes the replacement text evenly across the charcodes. Comment on attachment 57993 [details] [review] fix selection of glyphs in actualtext Review of attachment 57993 [details] [review]: ----------------------------------------------------------------- ::: poppler/TextOutputDev.cc @@ +5331,5 @@ > + // If this is the last glyph ensure all remaining text is included > + // as pos may be < length due to rounding errors. > + if (i == lenGlyphs - 1) > + count = length - first; > + text->addChar(state, glyphs[i].x, glyphs[i].y, glyphs[i].dx, glyphs[i].dy, This needs to make sure that a surrogate pair is not split Created attachment 58013 [details] [review] Enable displayed chars to map to any number of text chars updated patch Guys, i'm a bit lost, sorry, are both patches supposed to fix the same issue in a different way? Both patches are required. The "fix selection of glyphs in actualtext" ensures all charcodes are passed through to TextPage. I need to update this to correctly handle surrogates. The "enable displayed chars to map to any number of text chars" is a prerequisite for the "fix selection of glyphs in actualtext" patch. It also fixes text selection of ligatures (bug 9001). When using Jason's patch on a pdf that i will be attaching in a moment, pdftotext changes from extracting Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing through 𝑥. We want to define an order on the set ℒ 𝑥 ∖ {𝐿} in the same way as we did in the proof of Proposition 5. In the disk model ̃ 𝑃 with boundary circle 𝐿, every line 𝐻 ∈ ℒ 𝑥 ∖{𝐿} separates ̃ 𝑃 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may be disconnected). Since we know from Propositions 6 and 7 that lines always intersect transversally, it follows that for two such lines, one of the respective lower parts is entirely to Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing through 𝑥. We want to define an order on the set ℒ𝑥 ∖ {𝐿} in the same way as we with boundary circle 𝐿, every did in the proof of Proposition 5. In the disk model 𝑃 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may line 𝐻 ∈ ℒ𝑥 ∖{𝐿} separates 𝑃 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect transversally, it follows that for two such lines, one of the respective lower parts is entirely You can see that in the second case some weird/unwanted reordering happened Created attachment 58045 [details]
The said pdf
Created attachment 58178 [details] [review] convert utf-16 to ucs-4 when reading ToUnicode The next two patches fix the problem of "fix selection of glyphs in actualtext" not handling surrogates. The "Unicode" type is meant to be UCS-4 so the solution is to convert UTF-16 to UCS-4 when it the ToUnicode cmap is parsed. This patch does the UTF-16 conversion in CharCodeToUnicode.cc. As a result the special surrogate handling in TextOutputDev and HtmlOutputDev can be removed. Created attachment 58179 [details] [review] move text string to unicode conversion into a separate function This patch adds a new function for converting PDF text strings to UCS-4. As a result the duplicated code in TextOutputDev and pdfinfo can be replaced by a call to this function. This patch is to ensure that my updated "fix selection of glyphs in actualtext" does not have to care about surrogates. Created attachment 58180 [details] [review] fix selection of glyphs in actualtext The updated version of "fix selection of glyphs in actualtext". The patch order is: 1 - convert utf-16 to ucs-4 when reading ToUnicode 2 - move text string to unicode conversion into a separate function 3 - Enable displayed chars to map to any number of text chars 4 - fix selection of glyphs in actualtext Patch 3 needs to be updated to remove the surrogate handling and fix the regression. Created attachment 58643 [details] [review] fix regressions This patch fixes the regressions in "Enable displayed chars to map to any number of text chars". The problem is the changes now allow glyphs that map to zero length unicode strings to be added to TextWords. Often these glyphs have overlapping bounding boxes or are not on the same baseline. This confuses TextOutputDev when trying to determine the layout of the text. This patch does two things: - it avoids breaking words when one of these glyphs with an empty mapping is encountered - it increases the tolerance for overlapping bounding boxes. With the attached PDF the result the text output is still different but checking the differences it is actually an improvement. However I suspect the changes could potentially break other PDFs. If this patch causes problems, plan B is to change TextOutputDev to ignore the glyphs with zero mapping when determining the text layout (but still add these glyphs to the words to make text selection work correctly). This should emulate the old behavior as closely as possible. Created attachment 58644 [details] [review] Enable displayed chars to map to any number of text chars This is Jason's patch rebased so it applies on top of my first two patches. The patch order is: 1 - convert utf-16 to ucs-4 when reading ToUnicode 2 - move text string to unicode conversion into a separate function 3 - Enable displayed chars to map to any number of text chars 4 - fix selection of glyphs in actualtext 5 - fix regressions With all the patches applied the pdftotext extraction of https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459 changes from • Patches may be grouped with other patches to test the whole of a to • P atches may be grouped with other patches to test the whole of a The correct extraction would be having no newline, but if we are going to have a newline i very much prefer the old way than the new one that breaks the word after the P Created attachment 58860 [details] [review] Don't reverse order of words with same xMin This fixes the regression. Output is now: • Patches may be grouped with other patches to test the whole of a Tehre's still some problems, in the file that i will be attaching • white X in patch b: gets changed to •w hite X in patch b: Created attachment 58892 [details]
The pdf with the "white X in patch" problem
Created attachment 59003 [details] [review] don't start a new word if the previous char is a control char Problem is caused by a ^G that overlaps the first letter of the word. This patch avoids started a new word if a control characters overlaps other character. Although it would probably be better to strip out control characters from the extracted text. Acroread does not include the ^G in extracted text. With this last patch now we get P atch creation date instead of Patch creation date again in https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459 *** Bug 87401 has been marked as a duplicate of this bug. *** -- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/350. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.