See also: https://lists.freedesktop.org/archives/poppler/2016-March/011727.html

Extracting text with raw_order_layout gives malformed and random output (no text at all for most pages):

    ustring str = p->text(p->page_rect(), page::raw_order_layout);

An example:
 - source: http://arxiv.org/pdf/1403.2805.pdf
 - pdftotext default output: http://pastebin.com/raw/A93xPT4j
 - cpp with page::physical_layout: http://pastebin.com/raw/MZFpTRbD
 - cpp with page::raw_order_layout: http://pastebin.com/raw/n8dcsqkZ

Output misses most text, has no spaces, etc. Also, each time I run it I get different results, so it looks like there is a memory bug.
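To make the "different results each time" observation easier to check, here is a minimal sketch of a repeat-and-compare helper. The name `count_distinct_outputs` is hypothetical and not part of poppler. It takes any callable that returns the extracted text as a `std::string`. It is assumed that the callable would wrap the `p->text(p->page_rect(), page::raw_order_layout)` call above and convert the resulting `ustring` to bytes. The helper only compares repeated calls within one process; differences between separate runs of the program would still need to be checked by saving and diffing the outputs.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>

// Hypothetical helper (not part of poppler): calls `extract` `runs` times and
// returns how many distinct outputs were seen. A result of 1 means every call
// produced identical text; anything larger means the output is not stable.
std::size_t count_distinct_outputs(const std::function<std::string()>& extract,
                                   int runs) {
    std::set<std::string> seen;
    for (int i = 0; i < runs; ++i) {
        seen.insert(extract());  // duplicates are ignored by the set
    }
    return seen.size();
}
```

A result greater than 1 for the same page would be consistent with the suspected memory bug, although it would not prove it.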
This problem can easily be reproduced using the `poppler-dump` test utility. For example, take a sample PDF:

    curl -OL http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

Then this is the output we expect:

    ./utils/pdftotext pdf-sample.pdf /dev/stdout

However, the C++ API gives:

    ./cpp/tests/poppler-dump pdf-sample.pdf --show-text raw

The latter is mostly gibberish.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/35.