|Summary:||raw_order_layout completely broken|
|Product:||poppler||Reporter:||Jeroen Ooms <jeroen>|
|Component:||cpp frontend||Assignee:||poppler-bugs <poppler-bugs>|
|Status:||RESOLVED MOVED||QA Contact:|
Description Jeroen Ooms 2016-03-12 20:47:45 UTC
See also: https://lists.freedesktop.org/archives/poppler/2016-March/011727.html

Extracting text with raw_order_layout gives malformed and random output (no text at all for most pages):

    ustring str = p->text(p->page_rect(), page::raw_order_layout);

An example:
- source: http://arxiv.org/pdf/1403.2805.pdf
- pdftotext default output: http://pastebin.com/raw/A93xPT4j
- cpp with page::physical_layout: http://pastebin.com/raw/MZFpTRbD
- cpp with page::raw_order_layout: http://pastebin.com/raw/n8dcsqkZ

Output misses most text, has no spaces, etc. Also, each time I run it I get different results, so it looks like there is a memory bug.
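The two symptoms reported above (output that changes from run to run, and output without any spaces) can be checked mechanically. The following is only a minimal sketch: `extraction_is_stable` and `has_whitespace` are hypothetical helper names, not part of poppler, and the extractor is passed in as a plain callable so the helpers themselves do not depend on the library.

```cpp
#include <functional>
#include <string>

// Hypothetical helper, not part of poppler: calls `extract` `runs` times and
// returns true only if every call produced exactly the same bytes.
bool extraction_is_stable(const std::function<std::string()>& extract, int runs) {
    if (runs < 1) {
        return true;  // nothing to compare
    }
    const std::string first = extract();
    for (int i = 1; i < runs; ++i) {
        if (extract() != first) {
            return false;  // output differs between runs
        }
    }
    return true;
}

// Hypothetical helper: true if the text contains at least one space,
// tab or newline character.
bool has_whitespace(const std::string& text) {
    return text.find_first_of(" \t\n") != std::string::npos;
}

// Usage sketch with poppler-cpp, assuming `p` is a valid poppler::page*:
//   auto extract = [&] {
//       poppler::byte_array b =
//           p->text(p->page_rect(), poppler::page::raw_order_layout).to_utf8();
//       return std::string(b.begin(), b.end());
//   };
//   bool stable = extraction_is_stable(extract, 5);
//   bool spaced = has_whitespace(extract());
```

If `extraction_is_stable` returns false for the same page of the same file, that would be consistent with the suspicion of a memory bug, although it does not prove one.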
Comment 1 Jeroen Ooms 2016-08-16 10:16:27 UTC
This problem can easily be reproduced using the `poppler-dump` test utility. For example, take a sample PDF:

    curl -OL http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

Then this is the output we expect:

    ./utils/pdftotext pdf-sample.pdf /dev/stdout

However, the C++ API gives:

    ./cpp/tests/poppler-dump pdf-sample.pdf --show-text raw

The latter is mostly gibberish.
Comment 2 GitLab Migration User 2018-08-20 21:35:50 UTC
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/35.