Bug 69608 - poppler_page_get_text() ordering does not agree with poppler_page_get_text_layout() as docs say it should
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: glib frontend
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
Reported: 2013-09-20 14:22 UTC by Peter Waller
Modified: 2013-09-22 09:36 UTC

Attachments
poppler-glib-demo screenshot (117.54 KB, image/png)
2013-09-22 07:39 UTC, Carlos Garcia Campos

Description Peter Waller 2013-09-20 14:22:04 UTC
Whilst trying to extract textual information from PDFs, it seems that the documentation for poppler_page_get_text_layout() is not correct, or there is a bug in poppler_page_get_text(). The documentation for poppler_page_get_text_layout says:

"The position in the array represents an offset in the text returned by poppler_page_get_text()".

(Note that the documentation says the same for poppler_page_get_text_attributes).
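
The documented contract can be sketched as follows. This is a minimal illustration rather than anything from the Poppler sources: `pair_chars_with_rects` is a hypothetical helper, and the text and rectangles are stand-in data instead of real output from poppler_page_get_text() and poppler_page_get_text_layout(). It assumes the text has already been decoded to Unicode, so that one array position corresponds to one character.

```python
def pair_chars_with_rects(text, rects):
    """Pair each character of `text` with the rectangle at the same index.

    Hypothetical helper: `text` must be a Unicode string (not raw UTF-8
    bytes) and `rects` a sequence with one entry per character.
    """
    if len(text) != len(rects):
        raise ValueError(
            "text has %d characters but layout has %d rectangles"
            % (len(text), len(rects))
        )
    return list(zip(text, rects))


# Stand-in data: three characters and three (x1, y1, x2, y2) rectangles.
text = u"Hi!"
rects = [
    (0.0, 0.0, 5.0, 10.0),
    (5.0, 0.0, 8.0, 10.0),
    (8.0, 0.0, 10.0, 10.0),
]

pairs = pair_chars_with_rects(text, rects)
for char, rect in pairs:
    print(char, rect)
```

If the two lengths differ, the helper raises instead of silently pairing the wrong character with a rectangle, which is the kind of mismatch this report is about.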

However, this doesn't seem to be the case. The problem is described succinctly here, complete with a short piece of code which reproduces the problem:

http://www.mail-archive.com/poppler@lists.freedesktop.org/msg06238.html

The linked PDF [1] gives 1541 glyphs from poppler_page_get_text and 1477 glyphs from poppler_page_get_text_layout. It does not appear to be related to unicode encoding.

In addition to the numbers of glyphs not agreeing, the order doesn't seem to match up either, from what I can tell.

[1] http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf
Comment 1 Carlos Garcia Campos 2013-09-20 16:11:49 UTC
I have fixed this recently, are you using poppler 0.24 or git master? See:

http://cgit.freedesktop.org/poppler/poppler/commit/?id=c55b577ce69ad4bb69f5261b3e120e92c9fdb3d0
Comment 2 Peter Waller 2013-09-21 09:44:02 UTC
I've reproduced the problem both on Ubuntu 12.04 with 0.24.0 and 13.10 with 0.24.1. I haven't tried git master; is that expected to be any different?
Comment 3 Peter Waller 2013-09-21 11:02:24 UTC
I did try git master but I encountered this:

Linking CXX shared library libpoppler.so
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/libfontconfig.a(fccfg.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/libfontconfig.a: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
make[2]: *** [libpoppler.so.43.0.0] Error 1

So I tried to build my own libfontconfig with -fPIC but got this:

Using stylesheet: /usr/share/docbook-utils/docbook-utils.dsl#print
Working on: /home/pwaller/.local/src/fontconfig/fc-cache/../fc-cache/fc-cache.sgml
nsgmls:/home/pwaller/.local/src/fontconfig/fc-cache/../fc-cache/fc-cache.sgml:1:59:W: cannot generate system identifier for public text "-//OASIS//DTD DocBook V4.1//EN"
nsgmls:/home/pwaller/.local/src/fontconfig/fc-cache/../fc-cache/fc-cache.sgml:35:0:E: reference to entity "REFENTRY" for which no system identifier could be generated
nsgmls:/home/pwaller/.local/src/fontconfig/fc-cache/../fc-cache/fc-cache.sgml:1:0: entity was defined here
nsgmls:/home/pwaller/.local/src/fontconfig/fc-cache/../fc-cache/fc-cache.sgml:35:0:E: DTD did not contain element declaration for document type name

which persists even after installing every docbook, sgml and dtd related package I can think of. Hints?
Comment 4 Carlos Garcia Campos 2013-09-22 07:39:14 UTC
Created attachment 86304
poppler-glib-demo screenshot

How are you testing this? This is a screenshot of the poppler-glib-demo application using 0.24.0. It shows that the last offset (as returned by poppler_page_get_text_layout) corresponds to the last character in the text (returned by poppler_page_get_text). So both layout and text return 1476 characters.
Comment 5 Peter Waller 2013-09-22 09:32:55 UTC
Here is a session showing what I observe:

pwaller@fractal:~$ aptitude show libpoppler-glib8
Package: libpoppler-glib8                
State: installed
Automatically installed: no
Multi-Arch: same
Version: 0.24.1-0ubuntu1
Priority: optional
Section: libs
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Uncompressed Size: 355 k
Depends: libc6 (>= 2.14), libcairo2 (>= 1.12.0), libfreetype6 (>= 2.2.1),
         libglib2.0-0 (>= 2.37.3), libpoppler43 (>= 0.24.1), libstdc++6 (>=
         4.1.1)
PreDepends: multiarch-support
Breaks: libpoppler-glib8 (!= 0.24.1-0ubuntu1)
Replaces: libpoppler-glib8 (< 0.24.1-0ubuntu1)
Description: PDF rendering library (GLib-based shared library)
 Poppler is a PDF rendering library based on Xpdf PDF viewer. 
 
 This package provides the GLib-based shared library for applications using the
 GLib interface to Poppler.
Homepage: http://poppler.freedesktop.org/

pwaller@fractal:~$ wget http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf
--2013-09-22 10:27:25--  http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf
Resolving ww1.microchip.com (ww1.microchip.com)... 77.67.21.35, 77.67.21.27
Connecting to ww1.microchip.com (ww1.microchip.com)|77.67.21.35|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2337393 (2.2M) [application/pdf]
Saving to: ‘22197B.pdf’

100%[======================================>] 2,337,393   1.42MB/s   in 1.6s   

2013-09-22 10:27:27 (1.42 MB/s) - ‘22197B.pdf’ saved [2337393/2337393]

pwaller@fractal:~$ python
Python 2.7.5+ (default, Sep 19 2013, 13:48:49) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from gi.repository import Poppler
>>> doc = Poppler.Document.new_from_file("file:///home/pwaller/22197B.pdf", "")>>> page = doc.get_page(0)
>>> ok, layout = page.get_text_layout()
>>> text = page.get_text()
>>> len(layout), len(text)
(1476, 1520)
Comment 6 Peter Waller 2013-09-22 09:35:25 UTC
Aha! In this case, it was UTF-8 encoding. The numbers were similar enough to the broken behaviour of 0.18 that I assumed the same thing was going on, but it turns out I just forgot to do `.decode("utf8")`!

Thanks!
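
For anyone hitting the same mismatch, here is a minimal sketch of the effect. It does not call Poppler: `raw` is a stand-in for the UTF-8 encoded byte string that `page.get_text()` appears to return under Python 2 in the session above, and the sample text is invented for illustration. Counting bytes overstates the number of characters whenever the text contains non-ASCII characters, while the layout array has one entry per character.

```python
# Stand-in for the undecoded result of page.get_text(): UTF-8 bytes
# containing three non-ASCII characters (subscript two, omega, plus-minus).
raw = u"R\u2082 = 10 k\u03a9 \u00b1 1%".encode("utf-8")

# Decoding yields one element per character, which is the unit that the
# layout array is indexed by.
decoded = raw.decode("utf-8")

byte_count = len(raw)      # what len() sees without .decode("utf8")
char_count = len(decoded)  # what should match len(layout)

print("bytes: %d, characters: %d, extra bytes: %d"
      % (byte_count, char_count, byte_count - char_count))
```

Under this reading, the gap of 44 between 1520 and 1476 in comment 5 would be the extra bytes of the multi-byte characters on that page. Under Python 3, the GObject introspection bindings normally return a decoded `str` already, so no explicit decode step should be needed there.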

