Created attachment 76449: Adds parabrk option to pdftotext. Adds the parabrk option to `pdftotext`. The parabrk option is only applicable to raw mode, and attempts to insert an additional new line character wherever one can be detected in the PDF. It is intended to separate paragraphs when they are separated by vertical whitespace in the PDF. It isn't perfect; for instance, it doesn't handle page boundaries.
Hi Andrew, I am not sure this patch makes any sense. Raw mode order is raw and knows nothing about paragraphs, so adding an option that says "attempt to insert empty lines between paragraphs" with -raw implies that -raw knows about paragraphs, which as far as I know it doesn't. Can you elaborate on the need for this patch?
Perhaps the option is ill-named. What it's really doing is trying to insert a single new line whenever one or more can be detected in the PDF (as defined by an amount of white space greater than the line spacing). I think this would fall under the category "raw" mode. I chose the name because the intended use case of identifying vertical white space in the PDF is to translate that white space into the raw text generated. Usually this results in a separation of paragraphs that are also separated by vertical white space in the PDF. The actual need is an attempt to output raw text with respect to the PDF as faithfully as possible. It's quite nice to get raw text that has line breaks wherever they were found in the PDF.
You understand raw text is *not* the order text is in the page, right?
Hmm. I assumed it was in reading order. In particular, it gets the order of text in two-column PDFs seemingly correct. At least, it has done so in a limited number of test cases that I've checked. (Mostly journal articles.)
Right, in the usage info: "-raw keep strings in content stream order". I am merely trying to add line breaks to that stream where I can find them in the PDF.
Stream order means "as found in the stream". It may be reading order and it may not; you're just being lucky. Since it is not guaranteed to be reading order, it makes no sense to do any processing on it.
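To make the distinction concrete, here is a toy sketch in Python. It is not poppler code. The tuples, coordinates, and function names are all hypothetical, and the list order simply stands in for the order in which a PDF producer happened to write its text operations.

```python
# Each tuple is (text, x, y), with y growing down the page.
# The list order stands in for content stream order, which a
# PDF producer is free to choose however it likes.
stream = [
    ("Footer", 50, 700),
    ("Title", 50, 50),
    ("Body text", 50, 100),
]


def stream_order(items):
    """Return the text exactly as it was encountered in the stream."""
    return [text for text, _, _ in items]


def top_to_bottom(items):
    """Return the text sorted by position: top to bottom, then left to right."""
    return [text for text, _, _ in sorted(items, key=lambda it: (it[2], it[1]))]
```

In this made-up page the footer was written first, so `stream_order(stream)` starts with "Footer" while `top_to_bottom(stream)` starts with "Title". The two orders agree only when the producer happened to write the text top to bottom.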
Ah, dang. I did not realize "stream" was jargon in the PDF world. However, isn't there still some wiggle room for processing? For example, the current code inserts a new line whenever the next word is detected to not be in the same line as the current word (or if the next word is to the left of the current word). I understand my change to be in a similar light of this kind of processing. i.e., there actually *is* some assumption of reading order in "raw" mode.
No, there is no assumption of reading order; it is just an assumption that if two characters are separated enough one from the other, there is a space in the middle. That's because in a PDF you don't need to put space characters if you don't want to, and extracting text in raw order is one thing while extracting all the text as a single string with no spaces in between is another :D
> it is just an assumption that if two characters are separated enough one from the other, there is a space in the middle It is more than that. As I said: > For example, the current code inserts a new line whenever the next word is detected to not be in the same line as the current word The raw text isn't just having spaces added, but it is also getting new lines added whenever the vertical space between the current word and the next word exceeds the `maxIntraLineDelta` constant. My patch is a very small extension of this sort of logic: add an additional new line when the vertical space between the current word and next word exceeds the `maxLineSpacingDelta` constant. I don't think my patch makes any additional assumptions beyond the assumptions already made by the code.
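The logic described in this comment can be sketched roughly as follows. This is a simplified Python model, not the actual poppler source or the attached patch. The `Word` class, the `join_raw` function, the threshold values, and the exact comparisons are assumptions made for illustration; only the constant names `maxIntraLineDelta` and `maxLineSpacingDelta` come from the comment.

```python
from dataclasses import dataclass

# Illustrative values only; the real constants are defined in poppler's
# text extraction code and their values are not quoted in this thread.
MAX_INTRA_LINE_DELTA = 0.5    # as a fraction of the font size (assumed)
MAX_LINE_SPACING_DELTA = 1.5  # as a fraction of the font size (assumed)


@dataclass
class Word:
    text: str
    x_min: float
    x_max: float
    base: float       # baseline y coordinate, growing down the page
    font_size: float


def join_raw(words, parabrk=False):
    """Join words in the order given, choosing a separator for each gap."""
    out = []
    for cur, nxt in zip(words, words[1:] + [None]):
        out.append(cur.text)
        if nxt is None:
            out.append("\n")  # end the final line
            break
        delta = abs(nxt.base - cur.base)
        # Same line: small baseline difference and the next word is
        # not to the left of the current one.
        same_line = (delta < MAX_INTRA_LINE_DELTA * cur.font_size
                     and nxt.x_min >= cur.x_max)
        if same_line:
            out.append(" ")
        else:
            out.append("\n")
            # The proposed extension: one extra new line when the
            # vertical gap is larger than normal line spacing.
            if parabrk and delta > MAX_LINE_SPACING_DELTA * cur.font_size:
                out.append("\n")
    return "".join(out)
```

With `parabrk=False` the sketch only emits spaces and single line breaks. With `parabrk=True` it adds one blank line wherever the baseline jump exceeds the larger threshold, and it never reorders the words.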
(In reply to comment #9) > > it is just an assumption that if two characters are separated enough one from the other, there is a space in the middle > > It is more than that. As I said: > > > For example, the current code inserts a new line whenever the next word is detected to not be in the same line as the current word > > The raw text isn't just having spaces added, but it is also getting new > lines added whenever the vertical space between the current word and the > next word exceeds the `maxIntraLineDelta` constant. > > My patch is a very small extension of this sort of logic: add an additional > new line when the vertical space between the current word and next word > exceeds the `maxLineSpacingDelta` constant. > > I don't think my patch makes any additional assumptions beyond the > assumptions already made by the code. It may not, but I don't see the need for your patch (you haven't made a case for it), and more code means more code I need to maintain for the rest of my life. In my opinion you are trying to use raworder for something that raworder is not supposed to do. Why are you using raw order instead of the real physical order?
> It may not, but i don't see the need for your patch (you haven't made a case for it) My patch is useful when one wants to capture groupings indicated by a particular amount of vertical white space in raw mode from the PDF. Raw mode is *already* capturing some kinds of vertical white space. I've said this a couple of times now, but you don't seem to recognize it as me having made a case. Perhaps you could tell me what you would need to be convinced so that I can better make my case? > In my opinion you are trying to use raworder for something that raworder is not supposed to do I disagree. If that were so, then I'd be making assumptions about the text in raw order that the code hasn't already made. But I'm not. It's a tweak on existing logic that is already assuming some sort of reading order by looking at letter spacing and intra-line spacing and using that information to affect the output of raw mode. I propose to also look at inter-line spacing. > why are you using raw order instead of the real physical order? Because I want to attempt to extract a linear text stream from a PDF in reading order. Unless I am mistaken, raw mode seems best suited to do that. The new option in the patch makes that raw text easier to consume in some cases (just like adding new lines based on the intra-line spacing also makes it easier to consume).
(In reply to comment #11) > > > why are you using raw order instead of the real physical order? > > Because I want to attempt to extract a linear text stream from a PDF in > reading order. Unless I am mistaken, raw mode seems best suited to do that. I already told you in comment #3 that raw order has nothing to do with reading order.
> I already told you that raw order has nothing to do with reading order in > comment #3 I know you did. My response (several comments ago) was that your code says otherwise. It's using an assumption of reading order to insert line breaks. So why can't my patch use the same assumption? - Andrew
Because we will eventually remove the code. From the man page: -raw Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.
> Because we will eventually remove the code. > > From the man page > -raw Keep the text in content stream order. This is a hack which > often "undoes" column formatting, etc. Use of raw mode is no longer > recommended. From the description, it seems as if I am using raw mode exactly as it was intended. Do you just not want to add new features to a mode that is slated for slaughter? - Andrew
(In reply to comment #15) > > Because we will eventually remove the code. > > > > From the man page > > -raw Keep the text in content stream order. This is a hack which > > often "undoes" column formatting, etc. Use of raw mode is no longer > > recommended. > > From the description, it seems as if I am using raw mode exactly as it > was intended. Ok, let's be clear, what is "reading order" for you?
>> > From the man page >> > -raw Keep the text in content stream order. This is a hack >> > which >> > often "undoes" column formatting, etc. Use of raw mode is no longer >> > recommended. >> >> From the description, it seems as if I am using raw mode exactly as it >> was intended. > > Ok, let's be clear, what is "reading order" for you? The order in which one reads the text in the PDF. This seems consistent with the description of raw mode: it "often 'undoes' column formatting." - Andrew
Not really. Undoing column formatting means that if you have three columns, it will treat the first lines of the three columns as the first line. That is "undoing" the column formatting, and that's not what I consider reading order.
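As a toy illustration of the two readings of "undoes column formatting" discussed here, the following Python sketch contrasts them. It does not model what pdftotext actually emits. The `columns` data and both function names are hypothetical.

```python
from itertools import zip_longest

# Three columns of a made-up page, each a list of lines from top to bottom.
columns = [
    ["A1", "A2"],
    ["B1", "B2"],
    ["C1", "C2"],
]


def row_wise(cols):
    """Treat the n-th line of every column as a single output line."""
    return [" ".join(cell for cell in row if cell is not None)
            for row in zip_longest(*cols)]


def column_wise(cols):
    """Read each column to the bottom before moving on to the next one."""
    return [line for col in cols for line in col]
```

Here `row_wise(columns)` gives "A1 B1 C1" followed by "A2 B2 C2", which is the behaviour described in the comment above, while `column_wise(columns)` gives A1, A2, B1, B2, C1, C2, which is what a human reader of a journal article would usually call reading order.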
OK. That is contrary to every PDF I've tried with multiple columns (a considerable number, mostly journal articles), but I've made my case. I defer to your judgment.
Well, if pdftotext does not correctly extract text, that is a bug. You should file one and attach a file that shows the issue. Until then I'm closing this; we are not really interested in hacks but in proper fixes to extract text.