62266 – [PATCH] try to detect line breaks in the PDF and insert them in raw mode for pdftotext

Status:	RESOLVED INVALID

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	utils
Version:	unspecified
Hardware:	All All

Importance:	medium enhancement
Assignee:	poppler-bugs

Reported:	2013-03-12 22:42 UTC by Andrew Gallant
Modified:	2013-04-06 21:39 UTC
CC List:	0 users

Attachments
Adds parabrk option to pdftotext (5.44 KB, text/plain) 2013-03-12 22:42 UTC, Andrew Gallant

Description Andrew Gallant 2013-03-12 22:42:05 UTC

Created attachment 76449
Adds parabrk option to pdftotext

Adds the parabrk option to `pdftotext`.

The parabrk option is only applicable to raw mode, and attempts to insert an additional new line character wherever one can be detected in the PDF. It is intended to separate paragraphs when they are separated by vertical whitespace in the PDF.

It isn't perfect; for instance, it doesn't handle page boundaries.

Comment 1 Albert Astals Cid 2013-03-19 22:42:02 UTC

Hi Andrew, I am not sure this patch makes any sense. Raw mode order is raw and knows nothing about paragraphs, so adding an option that says
   attempt to insert empty lines between paragraphs with -raw
implies that -raw knows about paragraphs, which as far as I know it doesn't.

Can you elaborate on the need for this patch?

Comment 2 Andrew Gallant 2013-03-19 23:23:43 UTC

Perhaps the option is ill-named. What it's really doing is trying to insert a single new line whenever one or more can be detected in the PDF (as defined by an amount of white space greater than the line spacing). I think this would fall under the category "raw" mode.

I chose the name because the intended use case of identifying vertical white space in the PDF is to translate that white space into the raw text generated. Usually this results in a separation of paragraphs that are also separated by vertical white space in the PDF.

The actual need is an attempt to output raw text with respect to the PDF as faithfully as possible. It's quite nice to get raw text that has line breaks wherever they were found in the PDF.

Comment 3 Albert Astals Cid 2013-03-19 23:41:38 UTC

You understand raw text is *not* the order text is in the page, right?

Comment 4 Andrew Gallant 2013-03-19 23:45:04 UTC

Hmm. I assumed it was in reading order. In particular, it gets the order of text in two-column PDFs seemingly correct. At least, it has done so in a limited number of test cases that I've checked. (Mostly journal articles.)

Comment 5 Andrew Gallant 2013-03-19 23:47:24 UTC

Right, in the usage info: "-raw  keep strings in content stream order". I am merely trying to add line breaks to that stream where I can find them in the PDF.

Comment 6 Albert Astals Cid 2013-03-19 23:52:25 UTC

Stream order means "as found in the stream". It may be reading order and it may not; you're just being lucky. Since the text is not guaranteed to be in reading order, it makes no sense to do any processing on it.
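
To illustrate the distinction with a toy model (this is not poppler code): suppose each text-drawing operation in a page's content stream is reduced to a hypothetical `(x, y, text)` tuple, with `y` growing upward as in PDF user space. Content stream order is simply the order in which the producer wrote the operations, while a naive top-to-bottom, left-to-right ordering has to be computed from the coordinates. The tuple layout and the sample data are assumptions made up for the illustration.

```python
# Toy model: each entry is one text-drawing operation, listed in the order it
# appears in the content stream. The (x, y, text) layout is hypothetical and
# is not a poppler data structure.
stream_ops = [
    (72.0, 100.0, "footer"),     # the producer happened to write this first
    (72.0, 700.0, "Title"),
    (72.0, 650.0, "Body text"),
]


def stream_order(ops):
    """Return the text exactly in the order it was written to the stream."""
    return [text for _x, _y, text in ops]


def naive_reading_order(ops):
    """Sort top-to-bottom, then left-to-right (PDF y grows upward)."""
    ordered = sorted(ops, key=lambda op: (-op[1], op[0]))
    return [text for _x, _y, text in ordered]


print(stream_order(stream_ops))
print(naive_reading_order(stream_ops))
```

The two results agree only when the producer happened to emit the text top to bottom, which is the "luck" referred to above.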

Comment 7 Andrew Gallant 2013-03-20 00:01:43 UTC

Ah, dang. I did not realize "stream" was jargon in the PDF world.

However, isn't there still some wiggle room for processing? For example, the current code inserts a new line whenever the next word is detected to not be in the same line as the current word (or if the next word is to the left of the current word). I see my change in the same light as this kind of processing, i.e., there actually *is* some assumption of reading order in "raw" mode.

Comment 8 Albert Astals Cid 2013-03-21 22:44:41 UTC

No, there is no assumption of reading order; it is just an assumption that if two characters are separated enough one from the other, there is a space in the middle. That's because in a PDF you don't need to put space characters if you don't want to, and extracting text in raw order is one thing, while extracting all the text in a single string with no spaces in between is another :D
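
The gap heuristic described here can be sketched as follows. This is a minimal illustration, not poppler's implementation: the `Glyph` type, the `WORD_GAP_FACTOR` threshold, and its value are all assumptions chosen for the example.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """Hypothetical glyph record: a character and its horizontal extent."""
    char: str
    x_min: float
    x_max: float


# Assumed threshold: a horizontal gap wider than this fraction of the font
# size is treated as a word space. The value is illustrative only.
WORD_GAP_FACTOR = 0.1


def join_glyphs(glyphs, font_size):
    """Concatenate glyphs, inserting a space wherever the gap is wide enough."""
    out = []
    prev = None
    for glyph in glyphs:
        if prev is not None and glyph.x_min - prev.x_max > WORD_GAP_FACTOR * font_size:
            out.append(" ")
        out.append(glyph.char)
        prev = glyph
    return "".join(out)


# No space characters are stored; the word break exists only as a gap
# between "w" and "t".
glyphs = [
    Glyph("r", 0.0, 4.0), Glyph("a", 4.2, 9.0), Glyph("w", 9.1, 16.0),
    Glyph("t", 20.0, 23.0), Glyph("e", 23.1, 28.0),
    Glyph("x", 28.2, 33.0), Glyph("t", 33.1, 36.0),
]
print(join_glyphs(glyphs, font_size=10.0))
```

Without the gap test the same glyphs would come out as one unbroken string, which is the failure case the comment describes.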

Comment 9 Andrew Gallant 2013-03-21 22:55:08 UTC

> it is just an assumption that if two characters are separated enough one from the other, there is a space in the middle

It is more than that. As I said:

> For example, the current code inserts a new line whenever the next word is detected to not be in the same line as the current word

The raw text isn't just having spaces added, but it is also getting new lines added whenever the vertical space between the current word and the next word exceeds the `maxIntraLineDelta` constant.

My patch is a very small extension of this sort of logic: add an additional new line when the vertical space between the current word and next word exceeds the `maxLineSpacingDelta` constant.

I don't think my patch makes any additional assumptions beyond the assumptions already made by the code.
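
A rough sketch of the two thresholds being discussed, written as a simplified model rather than as the actual poppler code or the attached patch. The constant names echo `maxIntraLineDelta` and `maxLineSpacingDelta` from this comment, but their values, the scaling by font size, the `Word` type, and the `para_break` flag are assumptions made for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word record; not poppler's TextWord."""
    text: str
    x_min: float
    x_max: float
    base: float       # baseline y coordinate
    font_size: float


# Assumed values, expressed as multiples of the font size.
MAX_INTRA_LINE_DELTA = 0.5     # larger baseline jump => start a new line
MAX_LINE_SPACING_DELTA = 1.5   # larger baseline jump => also emit a blank line


def dump_raw(words, para_break=False):
    """Emit words in the given order, inferring line breaks from geometry."""
    words = list(words)
    out = []
    for cur, nxt in zip(words, words[1:] + [None]):
        out.append(cur.text)
        if nxt is None:
            out.append("\n")
            continue
        dy = abs(nxt.base - cur.base)
        same_line = (dy < MAX_INTRA_LINE_DELTA * cur.font_size
                     and nxt.x_min >= cur.x_max)
        if same_line:
            out.append(" ")
        else:
            out.append("\n")
            # The proposed extension: one extra newline for a larger gap.
            if para_break and dy > MAX_LINE_SPACING_DELTA * cur.font_size:
                out.append("\n")
    return "".join(out)


words = [
    Word("First", 0.0, 30.0, 700.0, 10.0),
    Word("line", 33.0, 55.0, 700.0, 10.0),
    Word("second", 0.0, 40.0, 688.0, 10.0),
    Word("New", 0.0, 25.0, 660.0, 10.0),
    Word("para", 28.0, 55.0, 660.0, 10.0),
]
print(dump_raw(words))
print(dump_raw(words, para_break=True))
```

In this model the only difference between the two calls is the blank line before "New para"; the word order itself is never changed.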

Comment 10 Albert Astals Cid 2013-03-25 21:16:53 UTC

(In reply to comment #9)
> > it is just an assumption that if two characters are separated enough one from the other, there is a space in the middle
> 
> It is more than that. As I said:
> 
> > For example, the current code inserts a new line whenever the next word is detected to not be in the same line as the current word
> 
> The raw text isn't just having spaces added, but it is also getting new
> lines added whenever the vertical space between the current word and the
> next word exceeds the `maxIntraLineDelta` constant.
> 
> My patch is a very small extension of this sort of logic: add an additional
> new line when the vertical space between the current word and next word
> exceeds the `maxLineSpacingDelta` constant.
> 
> I don't think my patch makes any additional assumptions beyond the
> assumptions already made by the code.

It may not, but I don't see the need for your patch (you haven't made a case for it), and more code means more code I need to maintain for the rest of my life. In my opinion you are trying to use raworder for something that raworder is not supposed to do; why are you using raw order instead of the real physical order?

Comment 11 Andrew Gallant 2013-03-25 22:19:49 UTC

> It may not, but I don't see the need for your patch (you haven't made a case for it)

My patch is useful when one wants to capture groupings indicated by a particular amount of vertical white space in raw mode from the PDF. Raw mode is *already* capturing some kinds of vertical white space.

I've said this a couple of times now, but you don't seem to recognize it as me having made a case. Perhaps you could tell me what you would need to be convinced so that I can better make my case?

> In my opinion you are trying to use raworder for something that raworder is not supposed to do

I disagree. If that were so, then I'd be making assumptions about the text in raw order that the code hasn't already made. But I'm not. It's a tweak on existing logic that is already assuming some sort of reading order by looking at letter spacing and intra-line spacing and using that information to affect the output of raw mode. I propose to also look at inter-line spacing.

> why are you using raw order instead of the real physical order?

Because I want to attempt to extract a linear text stream from a PDF in reading order. Unless I am mistaken, raw mode seems best suited to do that. The new option in the patch makes that raw text easier to consume in some cases (just like adding new lines based on the intra-line spacing also makes it easier to consume).

Comment 12 Albert Astals Cid 2013-03-25 22:24:10 UTC

(In reply to comment #11)
> 
> > why are you using raw order instead of the real physical order?
> 
> Because I want to attempt to extract a linear text stream from a PDF in
> reading order. Unless I am mistaken, raw mode seems best suited to do that.

I already told you that raw order has nothing to do with reading order in comment #3

Comment 13 Andrew Gallant 2013-03-25 22:35:09 UTC

> I already told you that raw order has nothing to do with reading order in
> comment #3

I know you did. My response (several comments ago) was that your code
says otherwise. It's using an assumption of reading order to insert
line breaks. So why can't my patch use the same assumption?

- Andrew

Comment 14 Albert Astals Cid 2013-03-25 22:43:36 UTC

Because we will eventually remove the code.

From the man page
       -raw   Keep the text in content stream order.  This is a hack which often "undoes" column formatting, etc.  Use of raw mode is no longer recommended.

Comment 15 Andrew Gallant 2013-03-25 22:57:28 UTC

> Because we will eventually remove the code.
>
> From the man page
>        -raw   Keep the text in content stream order.  This is a hack which
> often "undoes" column formatting, etc.  Use of raw mode is no longer
> recommended.

From the description, it seems as if I am using raw mode exactly as it
was intended.

Do you just not want to add new features to a mode that is slated for slaughter?

- Andrew

Comment 16 Albert Astals Cid 2013-03-25 23:11:56 UTC

(In reply to comment #15)
> > Because we will eventually remove the code.
> >
> > From the man page
> >        -raw   Keep the text in content stream order.  This is a hack which
> > often "undoes" column formatting, etc.  Use of raw mode is no longer
> > recommended.
> 
> From the description, it seems as if I am using raw mode exactly as it
> was intended.

Ok, let's be clear, what is "reading order" for you?

Comment 17 Andrew Gallant 2013-03-26 23:22:30 UTC

>> > From the man page
>> >        -raw   Keep the text in content stream order.  This is a hack
>> > which
>> > often "undoes" column formatting, etc.  Use of raw mode is no longer
>> > recommended.
>>
>> From the description, it seems as if I am using raw mode exactly as it
>> was intended.
>
> Ok, let's be clear, what is "reading order" for you?

The order in which one reads the text in the PDF. This seems
consistent with the description of raw mode: it "often 'undoes' column
formatting."

- Andrew

Comment 18 Albert Astals Cid 2013-03-26 23:24:40 UTC

Not really. Undoing column formatting means that if you have three columns, it will treat the first lines of the three columns as the first line. That is "undoing" the column formatting, and that's not what I consider reading order.
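
As a toy illustration of the interpretation given in this comment (it says nothing about what pdftotext produces for any particular file), consider three columns of three lines each. The data and function names are made up for the example.

```python
# Three hypothetical columns, each listed top to bottom.
columns = [
    ["A1", "A2", "A3"],
    ["B1", "B2", "B3"],
    ["C1", "C2", "C3"],
]


def read_column_by_column(cols):
    """Reading order for a multi-column page: finish one column, then the next."""
    return [line for col in cols for line in col]


def undo_columns(cols):
    """Merge the n-th line of every column into a single output line."""
    return [" ".join(row) for row in zip(*cols)]


print(read_column_by_column(columns))
print(undo_columns(columns))
```

The first result is what a reader of a journal article would expect; the second is the row-wise merge described above.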

Comment 19 Andrew Gallant 2013-03-26 23:51:29 UTC

OK. That is contrary to every PDF I've tried with multiple columns (a considerable number, mostly journal articles), but I've made my case. I defer to your judgment.

Comment 20 Albert Astals Cid 2013-04-06 21:39:11 UTC

Well, if pdftotext does not correctly extract text, that is a bug; you should file one and attach a file showing the issue. Until then I'm closing this. We are not really interested in hacks but in proper fixes to extract text.
