Bug 67710 - Tagged-PDF: LBody tag is not supported
Status: RESOLVED DUPLICATE of bug 64815
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Depends on: 64815 64821
Reported: 2013-08-03 12:46 UTC by Alejandro Piñeiro (freenode IRC: apinheiro)
Modified: 2013-09-26 18:41 UTC

Attachments
Dumb test that can be used to reproduce the bug (39.48 KB, text/plain)
2013-08-03 12:46 UTC, Alejandro Piñeiro (freenode IRC: apinheiro)
Fixes the bug (1.69 KB, patch)
2013-08-03 12:55 UTC, Alejandro Piñeiro (freenode IRC: apinheiro)
Add support to LBody to poppler (1.69 KB, patch)
2013-08-03 13:06 UTC, Alejandro Piñeiro (freenode IRC: apinheiro)
Add support to LBody tag at poppler-glib (2.16 KB, patch)
2013-08-04 07:37 UTC, Alejandro Piñeiro (freenode IRC: apinheiro)

Description Alejandro Piñeiro (freenode IRC: apinheiro) 2013-08-03 12:46:29 UTC
Created attachment 83578
Dumb test that can be used to reproduce the bug

STEPS TO REPRODUCE IT:

1. Apply the patches from bug 64816 to get a tool that can scan tagged PDFs (note: the support for scanning them is already on master).
2. Use one of those tools (e.g. pdfinfo -struct-text) to scan the document attached to this bug report.


EXPECTED OUTCOME:
The document is parsed without warnings, and its structure and content are printed properly.

ACTUAL OUTCOME:
Executing pdfinfo -struct-text (and, FWIW, pdfstructtohtml) prints the following warnings:
Syntax Error: StructElem object is wrong type (LBody)
Syntax Error: StructElem object is wrong type (LBody)
Syntax Error: StructElem object is wrong type (LBody)

The text of the list items is not properly extracted/printed.

EXTRA NOTES:
I already checked that the problem is not in the tools but in the core tagged-PDF code. Specifically, bug 64815 added StructElement, with a typeMap structure listing all the supported tags, and LBody was missing from it. LBody is a valid tag, defined on page 586 of the reference (PDF32000_2008.pdf).
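
For readers unfamiliar with the code, the failure described in the notes above can be pictured as a lookup in a name-to-type table. The following is only a simplified sketch, not the actual contents of StructElement.cc: the names ElemType, TypeMapEntry, kTypeMap and nameToType are hypothetical, and the table holds just a handful of tags. It assumes that a tag name with no table entry is what ends up reported as a wrong-type StructElem.

```cpp
#include <cstring>

// Hypothetical, reduced stand-in for the tag table; not poppler's real code.
enum class ElemType { Unknown, L, LI, Lbl, LBody, P };

struct TypeMapEntry {
    ElemType type;
    const char *name;
};

static const TypeMapEntry kTypeMap[] = {
    { ElemType::L, "L" },
    { ElemType::LI, "LI" },
    { ElemType::Lbl, "Lbl" },
    { ElemType::LBody, "LBody" }, // the kind of entry this bug reports as missing
    { ElemType::P, "P" },
};

// Linear search over the table; Unknown means the tag name is not supported.
// In this sketch, Unknown plays the role of the "wrong type" error case.
ElemType nameToType(const char *name)
{
    for (const TypeMapEntry &entry : kTypeMap) {
        if (std::strcmp(entry.name, name) == 0) {
            return entry.type;
        }
    }
    return ElemType::Unknown;
}
```

With the LBody row removed from the table, nameToType("LBody") would fall through to Unknown, which is the situation this report describes.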
Comment 1 Alejandro Piñeiro (freenode IRC: apinheiro) 2013-08-03 12:55:50 UTC
Created attachment 83579
Fixes the bug

This patch solves the bug by adding LBody as one of the tags defined in StructElement.[cc|h].

This solves the problem with the document attached to comment 0. However, it has not been tested with other, more complex PDFs. The reference * mentions that LBody can have nested lists as children, and I don't have enough experience with the to know if just adding it to the table is enough to also handle that case.

* "LBody (List body): The descriptive content of a list item. In a dictionary list, for example, it contains the definition of the term. It may either contain the content directly or have other BLSEs, perhaps including nested lists, as children."
Comment 2 Alejandro Piñeiro (freenode IRC: apinheiro) 2013-08-03 12:58:31 UTC
(In reply to comment #1)

> with the to know if just adding it to the table is enough to also handle
  *with the code to know
Comment 3 Alejandro Piñeiro (freenode IRC: apinheiro) 2013-08-03 13:06:57 UTC
Created attachment 83580
Add support to LBody to poppler

Fixes a typo in the previous patch.

Note: looking at the HTML from pdfstructtohtml, I realized that list items are exposed with an extra bullet point. With pdfinfo -struct-text we get output like this:

      LI (block)
        LBody
          P (block):
             /Placement /Block
             /StartIndent 36
            "•list item 1"

I am not sure whether the bullet point should be part of the text. Again, I hope that someone with more experience with the current code can answer that question.
Comment 4 Alejandro Piñeiro (freenode IRC: apinheiro) 2013-08-03 17:06:20 UTC
(In reply to comment #0)

> EXTRA NOTES:
> I already checked that the problem is not at the tools, but at the core
> tagged-pdf. Specifically, with bug 64815, StructElement was added, with a

I have just realized that not all the patches from bug 64815 have been merged yet, including the one that adds StructElement.[cc|h], so I am adding a dependency on that bug. Sorry for the noise.
Comment 5 Alejandro Piñeiro (freenode IRC: apinheiro) 2013-08-04 07:37:35 UTC
Created attachment 83604
Add support to LBody tag at poppler-glib

The document attached to comment 0 crashed poppler-glib-demo (the one updated in bug 64821) due to the lack of equivalent support in poppler-glib.
Comment 6 Alejandro Piñeiro (freenode IRC: apinheiro) 2013-08-04 07:39:27 UTC
Adding the dependency on bug 64821, as the second patch needs those changes. As both patches concern the same tag (LBody), I kept them on the same bug, but if someone thinks it is more sensible to have one bug per patch (one for poppler and another for poppler-glib), I can create another bug.
Comment 7 Adrian Perez de Castro 2013-09-26 14:37:07 UTC
(In reply to comment #3)
> Created attachment 83580
> Add support to LBody to poppler
> 
> Fixes a typo on the previous patch.
> 
> Note: looking at the html from pdfstructtohtml I realized that listitems are
> exposed with an extra bullet point. Using pdfinfo -struct-text we have
> things like this:
> 
>       LI (block)
>         LBody
>           P (block):
>              /Placement /Block
>              /StartIndent 36
>             "•list item 1"
> 
> Not sure if the bullet point should be part of the text. Again, I hope that
> someone with more experience with the current code could reply that question.

Yes, the bullet glyph is part of the contents of the PDF. Note that the
structure tree is purely informative, and does not affect how things are
rendered. Therefore if a bullet is to be shown, it must be part of the
page command stream.

The “pdfstructtohtml” tool could indeed be a bit smarter and do one of
the following (or a combination of both):

 - Checking the beginning of the text string and, if one of the usual
   bullet symbols is used (circle bullet, square bullet, etc), do not
   output the glyph inside the <li> elements and let the browser add
   the bullet.
 - Removing the bullet from the <li> elements using CSS, so the bullet
   glyph from the text is shown.
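
The first option above could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from pdfstructtohtml: the function name stripLeadingBullet and the particular set of bullet glyphs are hypothetical, and the text is assumed to be UTF-8.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Returns the text with one leading bullet glyph (and any spaces that
// follow it) removed; text without a leading bullet is returned unchanged.
// Hypothetical helper: the glyph list is an assumption, not an exhaustive set.
std::string stripLeadingBullet(const std::string &utf8)
{
    static const std::array<const char *, 4> kBullets = {
        "\xE2\x80\xA2", // U+2022 BULLET
        "\xE2\x97\xA6", // U+25E6 WHITE BULLET
        "\xE2\x96\xAA", // U+25AA BLACK SMALL SQUARE
        "\xE2\x96\xA0", // U+25A0 BLACK SQUARE
    };
    for (const char *b : kBullets) {
        const std::string bullet(b);
        if (utf8.compare(0, bullet.size(), bullet) == 0) {
            std::size_t pos = bullet.size();
            while (pos < utf8.size() && utf8[pos] == ' ') {
                ++pos;
            }
            return utf8.substr(pos);
        }
    }
    return utf8;
}
```

The second option needs no text processing at all: a CSS rule such as "li { list-style-type: none; }" in the generated HTML would suppress the browser's own bullet and leave the glyph from the text visible.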
Comment 8 Adrian Perez de Castro 2013-09-26 14:39:34 UTC
Comment on attachment 83580
Add support to LBody to poppler

Review of attachment 83580:
-----------------------------------------------------------------

Patch looks good to me.
Comment 9 Adrian Perez de Castro 2013-09-26 14:42:45 UTC
Comment on attachment 83604
Add support to LBody tag at poppler-glib

Review of attachment 83604:
-----------------------------------------------------------------

Apart from one part of the patch being unneeded, the rest looks good to me.

::: glib/poppler-structure.cc
@@ +104,5 @@
>          return StructElement::LI;
>        case POPPLER_STRUCTURE_ELEMENT_LIST_LABEL:
>          return StructElement::Lbl;
> +      case POPPLER_STRUCTURE_ELEMENT_LIST_BODY:
> +        return StructElement::LBody;

This hunk from the patch won't be needed in the end. After discussing the
Poppler-GLib API with Carlos García, there will not be a PopplerStructure
object any more — traversing the structure will be done by obtaining a
PopplerStructureElementIter directly from the PopplerDocument.
Comment 10 Adrian Perez de Castro 2013-09-26 18:41:58 UTC
For the record, I am including your fixes for LBody in the patch set
that is part of bug 64815 and bug 64821. Therefore I am closing this
as duplicate.

*** This bug has been marked as a duplicate of bug 64815 ***

