Bug 71160 - Differing number of items returned from get_text{,layout} for glyphs over page edge
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: glib frontend
Version: unspecified
Hardware: All All
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2013-11-02 12:26 UTC by Peter Waller
Modified: 2013-11-25 08:12 UTC (History)
CC List: 2 users


Attachments
PDF which returns the wrong number of glyphs/rectangles (6.10 KB, text/plain)
2013-11-02 12:26 UTC, Peter Waller
Patch (5.93 KB, patch)
2013-11-02 13:17 UTC, Carlos Garcia Campos

Description Peter Waller 2013-11-02 12:26:57 UTC
Created attachment 88530
PDF which returns the wrong number of glyphs/rectangles

As discussed on the mailing list, attached is a PDF containing one phrase where the last letter overlaps the page bounding box. Unless I'm mistaken, poppler_page_get_text_layout is returning 18 glyphs and poppler_page_get_text is returning 17.

> Yes, it's a bug, poppler_page_get_text_layout should always return the
> same number of glyphs as poppler_page_get_text. In this case the problem
> is that TextSelectionDumper::getWordList() returns the list of words
> inside the selection, but if a word is not completely selected (like in
> this case because part of the word is outside the bbox) it still returns
> the whole word.
> 
> So, we have at least two possibilities:
> 
>  - Discard characters that are off-page in
>    poppler_page_get_text_layout.
>  - Make TextWordSelection class public and return a list of
>    TextWordSelection instead of a list of words so that we know in
>    poppler_page_get_text_layout which chars of the word are selected.
> 
> The first option is probably easier, but the second one would also fix
> other cases using this API in the future, and would make
> poppler_page_get_text_layout easier, we would only need to iterate the
> words from begin_selection to end_selection instead of from 0 to len.

My own preference for my use case is to not discard information. It would be great if the solution could ensure that all glyphs are returned, even if they go over the edge of the page or are off the page.
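
The mismatch described in the quoted mail can be illustrated with a small, self-contained model. This is only a sketch: `WordSelection`, `text_glyphs` and `layout_glyphs` are hypothetical names invented for the illustration, not poppler classes or functions, and the two-word phrase is made up (it is not the content of the attached PDF). The text path emits only the selected slice of each word, while the layout path either walks every character from 0 to len (the reported behaviour) or honours the begin/end of the selection (the second option proposed above).

```python
from dataclasses import dataclass


@dataclass
class WordSelection:
    """Hypothetical stand-in for a word plus the part of it that is selected."""
    text: str
    begin: int  # index of the first selected character
    end: int    # one past the last selected character


def text_glyphs(words):
    # Text path: only the selected slice of each word is emitted,
    # with one separator glyph between consecutive words.
    out = []
    for i, w in enumerate(words):
        if i:
            out.append(" ")
        out.extend(w.text[w.begin:w.end])
    return out


def layout_glyphs(words, respect_selection):
    # Layout path: one entry per glyph. With respect_selection=False it walks
    # every character of the word (0..len), which models the reported bug.
    out = []
    for i, w in enumerate(words):
        if i:
            out.append(" ")
        lo, hi = (w.begin, w.end) if respect_selection else (0, len(w.text))
        out.extend(w.text[lo:hi])
    return out


# Made-up phrase: the last letter of the second word falls outside the page,
# so the selection stops one character early.
words = [WordSelection("overlapping", 0, 11), WordSelection("glyphs", 0, 5)]

print(len(text_glyphs(words)))           # glyphs in the text
print(len(layout_glyphs(words, False)))  # rectangles, whole words
print(len(layout_glyphs(words, True)))   # rectangles, selection only
```

In this model the counts only agree when the layout loop uses the same begin/end range as the text path, which is the point of returning word selections rather than whole words.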
Comment 1 Carlos Garcia Campos 2013-11-02 13:17:04 UTC
Created attachment 88531
Patch

This patch should fix the bug.
Comment 2 Carlos Garcia Campos 2013-11-02 13:19:01 UTC
(In reply to comment #0)
> My own preference for my use case is to not discard information. It would be
> great if the solution could ensure that all glyphs are returned, even if
> they go over the edge of the page or are off the page.

I don't think we should return characters that are not inside the page. What is your use case exactly?
Comment 3 Peter Waller 2013-11-02 13:21:50 UTC
(In reply to comment #2)

> I don't think we should return characters that are not inside the page. What
> is your use case exactly?

I'm trying to extract information out of PDFs which may be poorly formatted, for example having glyphs that unintentionally go out of bounds.
Comment 4 Carlos Garcia Campos 2013-11-25 08:12:17 UTC
I've pushed the patch to git master and also added poppler_page_get_text_for_area, poppler_page_get_text_layout_for_area and poppler_page_get_text_attributes_for_area, so that you can pass any rectangle to those methods.
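
To show why an area parameter helps with the use case from comment 3, here is a rough model rather than the poppler implementation. `Rect`, `Glyph`, `contains` and `text_and_layout_for_area` are hypothetical names, and the inclusion rule (a glyph counts only if its box lies fully inside the area) is an assumption made for the sketch; poppler's actual selection logic may differ. The text and the rectangles are derived from the same filtered list, so their counts cannot diverge, and passing an area wider than the page picks up a glyph that hangs over the edge.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


class Glyph(NamedTuple):
    char: str
    box: Rect


def contains(area, box):
    # Assumed inclusion rule: the glyph box must lie fully inside the area.
    return (area.x1 <= box.x1 and area.y1 <= box.y1
            and box.x2 <= area.x2 and box.y2 <= area.y2)


def text_and_layout_for_area(glyphs, area):
    # Filter once, then build both results from the same list so that the
    # number of characters always equals the number of rectangles.
    picked = [g for g in glyphs if contains(area, g.box)]
    text = "".join(g.char for g in picked)
    rects = [g.box for g in picked]
    return text, rects


# Made-up 100x100 page with three glyphs; the last one crosses the right edge.
page = Rect(0, 0, 100, 100)
glyphs = [
    Glyph("a", Rect(75, 10, 85, 20)),
    Glyph("b", Rect(85, 10, 95, 20)),
    Glyph("c", Rect(95, 10, 105, 20)),
]

on_page_text, on_page_rects = text_and_layout_for_area(glyphs, page)
wide_text, wide_rects = text_and_layout_for_area(glyphs, Rect(0, 0, 110, 100))
print(on_page_text, len(on_page_rects))
print(wide_text, len(wide_rects))
```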

