Bug 99824 - pdftotext breaks a line mid-word when text overflows its box, whereas pdftohtml captures the full line

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	utils
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2017-02-15 12:19 UTC by Gaurav Arora
Modified:	2018-08-21 10:33 UTC
CC List:	0 users

Attachments
Sample PDF that exhibits this issue (6.51 MB, application/pdf) 2017-02-15 12:19 UTC, Gaurav Arora
Image showing how the text looks in the PDF (558.09 KB, image/png) 2017-02-15 12:20 UTC, Gaurav Arora

Description Gaurav Arora 2017-02-15 12:19:08 UTC

Created attachment 129623
Sample PDF that exhibits this issue

While analyzing a specific set of files, we noticed that pdftohtml and pdftotext produce different lines where text overflows the boundary of its box.

With pdftohtml, the line is captured normally, with the full text of the line in a single text element. With pdftotext, the line is broken in the middle of a word and the rest of the line is emitted as a separate line.

An example follows.

Line as it appears in pdftohtml output:

<text top="412" left="79" width="1021" height="17" font="0">To JOSEPH E. BLUTH for research and development in the field of electronic photography and transfer of video tape to motion picture film. [Laboratory]</text>



Line as it appears in pdftotext output:

To JOSEPH E. BLUTH for research and development in the field of electronic photography and transfer of video tape to mo

.
.
.
.
.
sfer of video tape to motion picture film. [Laboratory]


Line as it appears in the PDF file:


http://i67.tinypic.com/i6w66e.png


Even though the PDF file doesn't display this line correctly, pdftohtml is able to get the full line, so pdftotext should be able to get the full line as well.
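
For anyone who wants to check other files for the same discrepancy, here is a minimal Python sketch of the comparison described above. It is only an illustration under some assumptions: the helper names (`xml_text_lines`, `lines_missing_from_plain`, `run_tools`) are made up for this example, `run_tools` assumes `pdftohtml -xml -i -stdout` and `pdftotext <file> -` write their results to stdout, and the XML parsing assumes the pdftohtml output is well-formed, which may not hold for every document.

```python
import subprocess
import xml.etree.ElementTree as ET


def normalize(s):
    """Collapse runs of whitespace so spacing differences don't matter."""
    return " ".join(s.split())


def xml_text_lines(xml_doc):
    """Return the text of every <text> element in pdftohtml -xml output.

    Accepts str or bytes. Assumes the XML is well-formed.
    """
    root = ET.fromstring(xml_doc)
    return [normalize("".join(el.itertext())) for el in root.iter("text")]


def lines_missing_from_plain(xml_lines, plain_text):
    """Return the XML lines that do not appear whole on any single
    line of the pdftotext output (i.e. lines that were split)."""
    plain = [normalize(line) for line in plain_text.splitlines()]
    return [x for x in xml_lines if x and not any(x in p for p in plain)]


def run_tools(pdf_path):
    """Hypothetical helper: run both utilities and capture their stdout.

    The command lines are an assumption about a typical poppler install.
    """
    xml_doc = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        check=True, capture_output=True,
    ).stdout
    plain = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True, capture_output=True,
        encoding="utf-8", errors="replace",
    ).stdout
    return xml_doc, plain
```

With the attached file saved locally (the name `sample.pdf` is just a placeholder), something like `xml_doc, plain = run_tools("sample.pdf")` followed by `lines_missing_from_plain(xml_text_lines(xml_doc), plain)` should list the lines that pdftohtml keeps whole but pdftotext splits.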

It seems odd for the line to be broken like this. I have attached a sample PDF file that shows this bug. I have tested the file with poppler-0.51.0:

/poppler/tmp/poppler-0.51.0$ /usr/local/bin/pdftotext -v
pdftotext version 0.51.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

Comment 1 Gaurav Arora 2017-02-15 12:20:21 UTC

Created attachment 129624
Image showing how the text looks in the PDF

Comment 2 Jason Crain 2017-02-15 15:04:18 UTC

I think this is the correct behavior for pdftotext. You have a document with weird formatting, and pdftotext tries to respect that formatting. The fact that pdftohtml doesn't do this is more likely a bug in pdftohtml.

Comment 3 GitLab Migration User 2018-08-21 10:33:03 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/252.