106312 – Spurious whitespace added after an "ActualText" segment

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	All Linux (All)

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2018-04-29 16:37 UTC by Michaël Meyer
Modified:	2018-08-20 22:02 UTC
CC List:	0 users

Attachments
Sample PDF (5.48 KB, application/pdf) 2018-04-29 16:37 UTC, Michaël Meyer

Description Michaël Meyer 2018-04-29 16:37:39 UTC

Created attachment 139219
Sample PDF

The attached PDF file contains the string "aṭa" twice, first in a regular font and then in an italic font. In both cases, the dot below "t" is rendered with an IPA font, and the resulting character is overlaid with the corresponding code point (U+1E6D) as "ActualText".

Extracting the PDF text with "pdftotext" (or copy-pasting the text from a PDF viewer that uses Poppler) yields the string "aṭa aṭ a" instead of the expected "aṭa aṭa": a spurious space is inserted after the "ActualText" segment in the italic run. Both Acrobat Reader and Google Chrome's built-in PDF viewer correctly produce the string "aṭa aṭa".

In Poppler's code, the culprit appears to be the following check in "poppler/TextOutputDev.cc":

    if (overlap || lastCharOverlap ||
        sp < -minDupBreakOverlap * curWord->fontSize ||
        sp > minWordBreakSpace * curWord->fontSize || // PROBLEM HERE
        fabs(base - curWord->base) > 0.5 ||
        curFontSize != curWord->fontSize ||
        wMode != curWord->wMode) {
      endWord();
    }

Slightly increasing the value of "minWordBreakSpace" produces the expected result. This makes me think that "curWord->fontSize" is not computed properly for the italic font.
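
To make the effect of that threshold concrete, here is a minimal standalone sketch of the same condition. It is not Poppler's code: the "WordState" struct, the "breaksWord" function, and the two threshold constants are hypothetical stand-ins, and the gap value used in the note below is an illustrative number, not a measurement from the attached PDF.

```cpp
#include <cmath>

// Hypothetical stand-in for the fields of the current word that the check reads.
struct WordState {
    double fontSize; // font size recorded for the current word
    double base;     // baseline position of the current word
    int wMode;       // writing mode (0 = horizontal)
};

// Assumed threshold values, chosen for illustration only.
constexpr double kMinDupBreakOverlap = 0.2;
constexpr double kMinWordBreakSpace = 0.1;

// Returns true when the condition would end the current word.
// `sp` is the gap between the end of the previous character and the start
// of the next one; a gap wider than minWordBreakSpace * fontSize is
// treated as a word break.
bool breaksWord(const WordState &cur, double sp, double base,
                double curFontSize, int wMode,
                bool overlap = false, bool lastCharOverlap = false,
                double minWordBreakSpace = kMinWordBreakSpace)
{
    return overlap || lastCharOverlap ||
           sp < -kMinDupBreakOverlap * cur.fontSize ||
           sp > minWordBreakSpace * cur.fontSize ||
           std::fabs(base - cur.base) > 0.5 ||
           curFontSize != cur.fontSize ||
           wMode != cur.wMode;
}
```

Under these assumptions, a 12pt word followed by a gap of 1.3 units is split (1.3 > 0.1 * 12), while raising the threshold factor to 0.15 keeps the word together (1.3 < 0.15 * 12). The same effect would follow if the font size fed into the product were smaller than the size actually used to place the glyphs.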

The attached PDF file was produced with the following LaTeX code (to be compiled with lualatex):

   \documentclass[12pt]{article}

   \usepackage{newunicodechar}
   \usepackage[luatex]{accsupp}
   \usepackage{tipa}

   \newunicodechar{ṭ}{%
      \BeginAccSupp{%
         method=hex,%
         unicode=true,%
         ActualText=1e6d,%
      }%
      \textsubdot{t}%
      \EndAccSupp{}%
   }

   \begin{document}
   \thispagestyle{empty}
   aṭa \textit{aṭa}
   \end{document}

Comment 1 GitLab Migration User 2018-08-20 22:02:53 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue on our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/173.