Summary: | libpoppler cannot recreate pdftotext output, because physical_layout is not handled correctly | ||
---|---|---|---|
Product: | poppler | Reporter: | dummydummy |
Component: | cpp frontend | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | dummydummy |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
new_poppler-page.cpp based on poppler 0.61.1 with my changes
new_poppler-page.h based on poppler 0.61.1 with my changes output of diff -u poppler-page.cpp (from poppler 0.61.1) new_poppler-page.cpp output of diff -u poppler-page.h (from poppler 0.61.1) new_poppler-page.h output of git diff > git-diff.txt |
Description
dummydummy
2017-11-17 17:02:49 UTC
as far as i see we need another enum, there's three modes posible raw not raw and physical layout not raw and not physical layout would you be able to work on a patch? Created attachment 135573 [details]
new_poppler-page.cpp based on poppler 0.61.1 with my changes
Created attachment 135574 [details]
new_poppler-page.h based on poppler 0.61.1 with my changes
Created attachment 135575 [details] [review] output of diff -u poppler-page.cpp (from poppler 0.61.1) new_poppler-page.cpp Created attachment 135576 [details] [review] output of diff -u poppler-page.h (from poppler 0.61.1) new_poppler-page.h Thank you very much for the speedy reply! My first ever attempt at using git failed: $ git clone https://anongit.freedesktop.org/git/poppler/poppler.git Cloning into 'poppler'... fatal: unable to access 'https://anongit.freedesktop.org/git/poppler/poppler.git/': gnutls_handshake() failed: Public key signature verification has failed. This is a bug from sept. 2016 "Bug#835342: curl or git clone commands throws "gnutls_handshake() failed" on https targets" https://groups.google.com/forum/#!topic/linux.debian.bugs.dist/nUrVMIG-r0o So I just downloaded poppler-0.61.1.tar.xz and used diff -u as follows to produce the attached files: $ diff -u poppler-page.cpp new_poppler-page.cpp >fix_cpp.patch $ diff -u poppler-page.h new_poppler-page.h >fix_h.patch The changes are fairly small: in poppler-page.h: added a further enum default_layout in line 50 and comments in the 2 lines above in poppler-page.cpp: added the following line after line 272 const GBool use_physical_layout = (physical_layout == layout_mode); used the new variable use_physical_layout in line 274 as second parameter to TextOutputDev and for cosmetic reasons, swapped the order of layout_mode and raw_order_layout in line 272 (if ever I mistype == as =, this order makes the compiler complain :-) Is there anything else I can do to resolve this problem? Unfortunately the preceding patch does not resolve the problem on its own. Comparing main() of pdftotext.cc with the page::text() in .../gcc/poppler-page.cpp there are 2 other differences: (1) doc->displayPage(&td, d->index + 1, 72, 72, 0, FALSE, TRUE, false); vs doc->displayPages(textOut, firstPage, lastPage, resolution, resolution, 0, TRUE, FALSE, gFalse); i.e. page::text () sets GBool useMediaBox = false, GBool crop = true and in pdftotext it is the other way round!? I have yet to investigate what this means by reading the source... (2) the page::text() in .../gcc/poppler-page.cpp subsequently calls td.getText() while pdftotext does not... Yet more source code to wade through... to be continued... When a TextOutputDev is created with a NULL first argument for output filename, then a call to doc->displayPage(with such a TextOutputDev as first argument) will not generate any output. Instead TextOutputDev->getText () will attempt to assemble the fragments resulting from parsing the top level PDF object during doc->displayPage(...) approximately matching the correct physical layout. This is what happens in libpoppler, but the results differ from those produced by pdftotext. pdftotext creates a TextOutputDev with a (non-NULL) first argument for the output filename. In this case, a call to doc->displayPage(with such a TextOutputDev as first argument) will generate output to the filename (possibly via Gfx->display () ?). The poppler code thus appears to have two routines which are not quite duplicates of one another for the same purpose of producing the text disposed according to the physical layout in a string variable!? Such a (historical?) architecture is just a recipe for problems. Is someone attempting to fix this? (This could be a major job) Please send a git diff, it's much easier to read. Use git://anongit.freedesktop.org/poppler/poppler where did you get the https line from? You lost me a bit in having two comments one after the other, could you please try to write/summarize it again in a single comment? The https line is from https://poppler.freedesktop.org/ " Poppler is developed using git. To clone the repository use the following command: git clone https://anongit.freedesktop.org/git/poppler/poppler.git " but it does not work with Debian 9 "Stretch" (stable). Thank you for the address which works. The command git diff seems to display the diff on the screen. I just redirected it into a file with git diff >git-diff.txt If there is a better way, please let me know (I have never used git before) The proposed modification ensures that the the function ustring page::text(const rectf &r, text_layout_enum layout_mode) const (in file .../gcc/poppler-page.cpp) when called with physical_layout as layout_mode correctly creates a TextOutputDev with second parameter set to true for physical_layout. HOWEVER, even with this change, it is NOT possible to obtain the same result as the output from pdftotext using libpoppler (The layout is different, for example there are no blank lines whereas pdftotext adds them when needed). The reason is that pdftotext creates a TextOutputDev with a filename (the output file name in fact) as first parameter and page::text(...) instead creates a TextOutputDev with a NULL as first parameter (as there is no output filename). When this TextOutputDev is subsequently passed into doc->displayPage(...), the PDF-page is apparently parsed into "fragments". When the filename was provided the text is simultaneously written into the output file (respecting the physical_layout when required). With the libpoppler function, only the parsing occurs (as there is no output filename). To obtain the text in physical layout a different function (TextOutputDev::getText (...)) is subsequently called to assemble the parsed fragments from doc->displayPage(...). However, it does so differently to doc->displayPage(...). This is why libpoppler cannot re-create the output of pdftotext. To summarise: the poppler code has currently two different functions for providing text in physical_layout (one in doc->displayPage(...) and a different, inferior one in TextOutputDev::getText (...)) - Is this really necessary? Created attachment 135731 [details]
output of git diff > git-diff.txt
Can I please have your name for for proper copyright attribution?
> To summarise: the poppler code has currently two different functions for
> providing text in physical_layout (one in doc->displayPage(...) and a
> different, inferior one in TextOutputDev::getText (...)) - Is this really
> necessary?
Probably not, we'll gladly accept a patch that makes it better :)
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/598. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.