Created attachment 83578 [details]
Dumb test that can be used to reproduce the bug

STEPS TO REPRODUCE IT:
1. Apply the patches from bug 64816 to get a tool that can scan tagged PDFs (note: the support needed to scan them is already on master).
2. Use one of those tools (e.g. pdfinfo -struct-text) to scan the document attached to this bug report.

EXPECTED OUTCOME:
The document is parsed without warnings, and its structure and content are printed properly.

ACTUAL OUTCOME:
Executing pdfinfo -struct-text (and, fwiw, pdfstructtohtml) prints the following warnings:

Syntax Error: StructElem object is wrong type (LBody)
Syntax Error: StructElem object is wrong type (LBody)
Syntax Error: StructElem object is wrong type (LBody)

The text of the list items is not properly extracted/printed.

EXTRA NOTES:
I already checked that the problem is not in the tools but in the core tagged-PDF support. Specifically, bug 64815 added StructElement, with a typeMap structure listing all the supported tags. LBody was missing from it. LBody is a valid tag, defined on page 586 of the reference (PDF32000_2008.pdf).
Created attachment 83579 [details] [review]
Fixes the bug

This patch solves the bug by adding LBody as one of the tags defined in StructElement.[cc|h]. This solves the problem with the document attached at comment 0.

Anyway, it was not tested with other, more complex PDFs. The reference (*) mentions that LBody can have nested lists as children, and I don't have enough experience with the to know if just adding it to the table is enough to also handle that case.

(*) "LBody (List body): The descriptive content of a list item. In a dictionary list, for example, it contains the definition of the term. It may either contain the content directly or have other BLSEs, perhaps including nested lists, as children."
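To illustrate why a missing table entry produces the "wrong type" warning, here is a minimal sketch of a name-to-type lookup table. It is not poppler's actual code: ElemType, TypeMapEntry, the contents of typeMap and nameToType are all hypothetical stand-ins, and the real typeMap in StructElement.cc may be laid out differently.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for illustration only; not poppler's declarations.
enum class ElemType { Unknown, L, LI, Lbl, LBody, P };

struct TypeMapEntry {
    ElemType type;
    const char *name;
};

// A table of recognized structure tags. Any tag name that is not listed
// here cannot be mapped to a type.
static const TypeMapEntry typeMap[] = {
    { ElemType::L, "L" },
    { ElemType::LI, "LI" },
    { ElemType::Lbl, "Lbl" },
    { ElemType::LBody, "LBody" }, // without this row, "LBody" maps to Unknown
    { ElemType::P, "P" },
};

// Linear search of the table; returns Unknown for unlisted names, which a
// caller could then report as a "wrong type" syntax error.
ElemType nameToType(const char *name) {
    assert(name != nullptr);
    for (const TypeMapEntry &entry : typeMap) {
        if (std::strcmp(entry.name, name) == 0) {
            return entry.type;
        }
    }
    return ElemType::Unknown;
}
```

With a table like this, adding one row is enough for the tag to be recognized. Whether nested lists under LBody are then handled depends on how children are parsed, which this sketch does not cover.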
(In reply to comment #1)
> with the to know if just adding it to the table is enough to also handle

*with the code to know
Created attachment 83580 [details] [review]
Add support to LBody to poppler

Fixes a typo in the previous patch.

Note: looking at the HTML from pdfstructtohtml I realized that list items are exposed with an extra bullet point. Using pdfinfo -struct-text we get things like this:

LI (block)
  LBody
    P (block):
      /Placement /Block
      /StartIndent 36
      "•list item 1"

Not sure if the bullet point should be part of the text. Again, I hope that someone with more experience with the current code can answer that question.
(In reply to comment #0)
> EXTRA NOTES:
> I already checked that the problem is not at the tools, but at the core
> tagged-pdf. Specifically, with bug 64815, StructElement was added, with a

I have just realized that not all the patches from bug 64815 have been committed yet, including the one that adds StructElement.[cc|h]. So I am adding a dependency on that bug. Sorry for the noise.
Created attachment 83604 [details] [review]
Add support to LBody tag at poppler-glib

The document attached at comment 0 crashed poppler-glib-demo (the one updated in bug 64821) due to the lack of equivalent support in poppler-glib.
Adding the dependency on bug 64821, as the second patch needs those changes. As both patches are about the same tag (LBody), I kept them on the same bug, but if someone thinks it is more sensible to have one bug per patch (one bug for poppler and another for poppler-glib), I could create another bug.
(In reply to comment #3)
> Created attachment 83580 [details] [review] [review]
> Add support to LBody to poppler
>
> Fixes a typo on the previous patch.
>
> Note: looking at the html from pdfstructtohtml I realized that listitems are
> exposed with an extra bullet point. Using pdfinfo -struct-text we have
> things like this:
>
> LI (block)
> LBody
> P (block):
> /Placement /Block
> /StartIndent 36
> "•list item 1"
>
> Not sure if the bullet point should be part of the text. Again, I hope that
> someone with more experience with the current code could reply that question.

Yes, the bullet glyph is part of the contents of the PDF. Note that the structure tree is purely informative and does not affect how things are rendered. Therefore, if a bullet is to be shown, it must be part of the page command stream.

The “pdfstructtohtml” tool could indeed be a bit smarter and do one of the following (or a combination of both):

- Checking the beginning of the text string and, if one of the usual bullet symbols is used (circle bullet, square bullet, etc.), not outputting the glyph inside the <li> elements and letting the browser add the bullet.

- Removing the browser's bullet from the <li> elements using CSS, so that only the bullet glyph from the text is shown.
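The first option could be sketched roughly as follows. This is only an illustration of the heuristic, assuming the extracted text is UTF-8; stripLeadingBullet and its list of glyphs are hypothetical and not part of poppler or pdfstructtohtml.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not part of poppler: if `text` starts with one of a
// few common bullet glyphs (UTF-8 encoded), return the text without that
// glyph and without the spaces that follow it. Otherwise return the text
// unchanged.
std::string stripLeadingBullet(std::string_view text) {
    static constexpr std::string_view bullets[] = {
        "\xE2\x80\xA2", // U+2022 BULLET
        "\xE2\x97\xA6", // U+25E6 WHITE BULLET
        "\xE2\x96\xAA", // U+25AA BLACK SMALL SQUARE
        "\xE2\x97\x8F", // U+25CF BLACK CIRCLE
    };
    for (std::string_view bullet : bullets) {
        if (text.substr(0, bullet.size()) == bullet) {
            text.remove_prefix(bullet.size());
            while (!text.empty() && text.front() == ' ') {
                text.remove_prefix(1);
            }
            break;
        }
    }
    return std::string(text);
}
```

A tool could call this on the text of each list item before writing it inside the <li> element. The glyph list is deliberately short; real documents may use other symbols, or a Lbl element for the label, so this heuristic would not catch every case.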
Comment on attachment 83580 [details] [review]
Add support to LBody to poppler

Review of attachment 83580 [details] [review]:
-----------------------------------------------------------------

Patch looks good to me.
Comment on attachment 83604 [details] [review]
Add support to LBody tag at poppler-glib

Review of attachment 83604 [details] [review]:
-----------------------------------------------------------------

Apart from one part of the patch being unneeded, the rest looks good to me.

::: glib/poppler-structure.cc
@@ +104,5 @@
> return StructElement::LI;
> case POPPLER_STRUCTURE_ELEMENT_LIST_LABEL:
> return StructElement::Lbl;
> + case POPPLER_STRUCTURE_ELEMENT_LIST_BODY:
> + return StructElement::LBody;

This hunk of the patch won't be needed in the end. After discussing the Poppler-GLib API with Carlos García, there will not be a PopplerStructure object any more; traversing the structure will be done by obtaining a PopplerStructureElementIter directly from the PopplerDocument.
For the record, I am including your fixes for LBody in the patch set that is part of bug 64815 and bug 64821. Therefore I am closing this as a duplicate.

*** This bug has been marked as a duplicate of bug 64815 ***