Bug 94518

Summary: raw_order_layout completely broken
Product: poppler
Reporter: Jeroen Ooms <jeroen>
Component: cpp frontend
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   

Description Jeroen Ooms 2016-03-12 20:47:45 UTC
See also: https://lists.freedesktop.org/archives/poppler/2016-March/011727.html

Extracting text with raw_order_layout gives malformed and random output (no text at all for most pages):

  ustring str = p->text(p->page_rect(), page::raw_order_layout);

An example:

 - source: http://arxiv.org/pdf/1403.2805.pdf
 - pdftotext default output: http://pastebin.com/raw/A93xPT4j
 - cpp with page::physical_layout: http://pastebin.com/raw/MZFpTRbD
 - cpp with page::raw_order_layout: http://pastebin.com/raw/n8dcsqkZ

The output is missing most of the text and has no spaces, among other problems. It also differs on every run, which suggests a memory bug.
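
The run-to-run variation can be checked mechanically rather than by eye. The following is only a minimal sketch: `is_deterministic` and its `extract` parameter are hypothetical names, not part of poppler. It assumes `extract` wraps the `p->text(p->page_rect(), page::raw_order_layout)` call and returns the result converted to a UTF-8 `std::string`.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical helper (not part of poppler): calls `extract` `runs` times
// and reports whether every call returned byte-identical text.
// `extract` is assumed to wrap the page->text(...) call from the report.
bool is_deterministic(const std::function<std::string()>& extract, int runs = 5) {
    assert(runs >= 1);  // at least one call is needed as the reference output
    const std::string first = extract();
    for (int i = 1; i < runs; ++i) {
        if (extract() != first) {
            return false;  // output changed between runs
        }
    }
    return true;
}
```

A `false` result for a fixed page and layout mode would be consistent with the memory bug suspected here, although it does not prove it. Running the same binary under a memory checker would be the natural next step.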
Comment 1 Jeroen Ooms 2016-08-16 10:16:27 UTC
This problem can easily be reproduced using the `poppler-dump` test utility. For example, take a sample PDF:

   curl -OL http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

Then this is the output we expect:

  ./utils/pdftotext pdf-sample.pdf  /dev/stdout

However, the C++ API gives:

  ./cpp/tests/poppler-dump pdf-sample.pdf --show-text raw

The latter is mostly gibberish.
Comment 2 GitLab Migration User 2018-08-20 21:35:50 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/35.
