Bug 94518 - raw_order_layout completely broken
Summary: raw_order_layout completely broken
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: cpp frontend
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2016-03-12 20:47 UTC by Jeroen Ooms
Modified: 2018-08-20 21:35 UTC

Description Jeroen Ooms 2016-03-12 20:47:45 UTC
See also: https://lists.freedesktop.org/archives/poppler/2016-March/011727.html

Extracting text with raw_order_layout gives malformed and random output (no text at all for most pages):

  ustring str = p->text(p->page_rect(), page::raw_order_layout);

An example:

 - source: http://arxiv.org/pdf/1403.2805.pdf
 - pdftotext default output: http://pastebin.com/raw/A93xPT4j
 - cpp with page::physical_layout: http://pastebin.com/raw/MZFpTRbD
 - cpp with page::raw_order_layout: http://pastebin.com/raw/n8dcsqkZ

The output is missing most of the text, has no spaces, etc. Also, each time I run it I get different results, so it looks like there is a memory bug.
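
One way to confirm the run-to-run variation is to call the extraction several times on the same page and count the distinct results. The following is a minimal sketch, not part of poppler: `count_distinct_outputs` is a hypothetical helper, and the extraction step is passed in as a callable so the check itself does not depend on any particular API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>

// Hypothetical helper (not a poppler function): calls `extract` `runs` times
// and returns how many distinct strings it produced. A deterministic
// extractor yields 1; anything larger means the output varies between calls.
std::size_t count_distinct_outputs(const std::function<std::string()>& extract,
                                   int runs) {
    assert(runs > 0);  // at least one call is needed for a meaningful answer
    std::set<std::string> seen;
    for (int i = 0; i < runs; ++i) {
        seen.insert(extract());
    }
    return seen.size();
}
```

In a real reproduction, the callable would wrap the `p->text(p->page_rect(), page::raw_order_layout)` call above and convert the returned ustring to a std::string (for example from its UTF-8 bytes). A count greater than 1 for the same page would support the suspicion of a memory bug.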
Comment 1 Jeroen Ooms 2016-08-16 10:16:27 UTC
This problem can easily be reproduced using the `poppler-dump` test utility. For example, take a sample PDF:

   curl -OL http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

Then this is the output we expect:

  ./utils/pdftotext pdf-sample.pdf /dev/stdout

However, the C++ API gives:

  ./cpp/tests/poppler-dump pdf-sample.pdf --show-text raw

The latter is mostly gibberish.
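
To put a rough number on "mostly gibberish", the two outputs can be compared word by word. This is a sketch under stated assumptions: `word_coverage` is a hypothetical helper, not a poppler function; it treats the pdftotext output as the reference and splits on whitespace only.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical helper: splits `text` on whitespace and returns the set of
// distinct words it contains.
static std::unordered_set<std::string> distinct_words(const std::string& text) {
    std::unordered_set<std::string> words;
    std::istringstream in(text);
    for (std::string w; in >> w;) {
        words.insert(w);
    }
    return words;
}

// Fraction (0.0 to 1.0) of the distinct words in `reference` that also occur
// as whole words in `candidate`. An empty reference counts as fully covered.
double word_coverage(const std::string& reference, const std::string& candidate) {
    const auto ref_words = distinct_words(reference);
    const auto cand_words = distinct_words(candidate);
    if (ref_words.empty()) {
        return 1.0;
    }
    std::size_t found = 0;
    for (const auto& w : ref_words) {
        if (cand_words.count(w) > 0) {
            ++found;
        }
    }
    return static_cast<double>(found) / static_cast<double>(ref_words.size());
}
```

Because the raw output reportedly has no spaces, its words run together and few whole words match, so a coverage close to 0 would be consistent with the symptoms described here.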
Comment 2 GitLab Migration User 2018-08-20 21:35:50 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/35.

