See also: https://lists.freedesktop.org/archives/poppler/2016-March/011727.html
Extracting text with raw_order_layout gives malformed and random output (no text at all for most pages):
ustring str = p->text(p->page_rect(), page::raw_order_layout);
- source: http://arxiv.org/pdf/1403.2805.pdf
- pdftotext default output: http://pastebin.com/raw/A93xPT4j
- cpp with page::physical_layout: http://pastebin.com/raw/MZFpTRbD
- cpp with page::raw_order_layout: http://pastebin.com/raw/n8dcsqkZ
The raw_order_layout output misses most of the text and has no spaces, among other problems. It also differs each time I run it, which suggests a memory bug.
This problem can easily be reproduced using the `poppler-dump` test utility. For example, take a sample PDF:
curl -OL http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf
Then this is the output we expect:
./utils/pdftotext pdf-sample.pdf /dev/stdout
However, the C++ API gives:
./cpp/tests/poppler-dump pdf-sample.pdf --show-text raw
The latter is mostly gibberish.
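To check the "different results on each run" observation mechanically, something like the following sketch could be used. It is not part of poppler: the helper names `capture_stdout` and `output_is_stable` are made up for illustration, and it assumes a POSIX system where `popen` is available.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: run a shell command and return everything it
// writes to stdout. Relies on POSIX popen/pclose.
std::string capture_stdout(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("popen failed: " + command);
    }
    std::string output;
    std::array<char, 4096> buffer;
    std::size_t n;
    // Read raw bytes so that non-text output is preserved as well.
    while ((n = fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), n);
    }
    pclose(pipe);
    return output;
}

// Hypothetical helper: return true if `runs` consecutive invocations
// of the same command all produce byte-identical output.
bool output_is_stable(const std::string& command, int runs) {
    const std::string first = capture_stdout(command);
    for (int i = 1; i < runs; ++i) {
        if (capture_stdout(command) != first) {
            return false;
        }
    }
    return true;
}
```

Calling `output_is_stable("./cpp/tests/poppler-dump pdf-sample.pdf --show-text raw", 5)` from the build directory would then be expected to return false if the behaviour described above is present, whereas the equivalent `./utils/pdftotext pdf-sample.pdf /dev/stdout` command should be stable.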
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/35.