Bug 64816 - [TAGGEDPDF] Provide some way of dumping the document structure
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: Other All
Importance: medium enhancement
Assignee: Adrian Perez de Castro
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: tagged-pdf
Reported: 2013-05-21 07:54 UTC by Adrian Perez de Castro
Modified: 2016-03-01 12:58 UTC


Attachments
[PATCH 3/6] Tagged-PDF: Modify pdfinfo to show the document structure (5.83 KB, patch)
2013-05-29 23:51 UTC, Adrian Perez de Castro
[PATCH v3 3/7] Tagged-PDF: Modify pdfinfo to show the document (6.00 KB, patch)
2013-06-06 14:22 UTC, Adrian Perez de Castro
[PATCH v3 4/6] Tagged-PDF: Implement the utils/pdfstructtohtml tool (16.35 KB, patch)
2013-06-06 14:59 UTC, Adrian Perez de Castro
[PATCH v5 06/10] Tagged-PDF: Modify pdfinfo to show the document structure (6.01 KB, patch)
2013-06-18 16:05 UTC, Adrian Perez de Castro
[PATCH v5 06/10] Tagged-PDF: Modify pdfinfo to show the document structure (16.36 KB, patch)
2013-06-18 16:06 UTC, Adrian Perez de Castro
[PATCH v5 07/10] Tagged-PDF: Implement the utils/pdfstructtohtml tool (16.36 KB, patch)
2013-06-18 16:07 UTC, Adrian Perez de Castro
[PATCH v7 5/9] Tagged-PDF: Modify pdfinfo to show the document structure (16.35 KB, patch)
2013-08-01 13:46 UTC, Adrian Perez de Castro
[PATCH v7 6/9] Tagged-PDF: Implement the utils/pdfstructtohtml tool (16.35 KB, patch)
2013-08-01 13:47 UTC, Adrian Perez de Castro
[PATCH v7 5/9] Tagged-PDF: Modify pdfinfo to show the document structure (5.34 KB, patch)
2013-08-01 16:21 UTC, Adrian Perez de Castro
[PATCH v8 06/15] Tagged-PDF: Modify pdfinfo to show the document structure (5.46 KB, patch)
2013-09-26 18:46 UTC, Adrian Perez de Castro
[PATCH v8 07/15] Tagged-PDF: Implement the utils/pdfstructtohtml tool (16.35 KB, patch)
2013-09-26 18:47 UTC, Adrian Perez de Castro
[PATCH v10 03/12] Tagged-PDF: Modify pdfinfo to show the document structure (5.46 KB, patch)
2013-11-11 20:43 UTC, Adrian Perez de Castro
[PATCH v10 04/12] Tagged-PDF: Implement the utils/pdfstructtohtml tool (16.31 KB, patch)
2013-11-11 20:44 UTC, Adrian Perez de Castro
[PATCH v11 03/11] Tagged-PDF: Implement the utils/pdfstructtohtml tool (15.25 KB, patch)
2013-11-27 20:40 UTC, Adrian Perez de Castro
[PATCH v20] Tagged-PDF: Modify pdfinfo to show the document structure (4.97 KB, patch)
2014-03-06 09:41 UTC, Adrian Perez de Castro
[PATCH v21] Tagged-PDF: Modify pdfinfo to show the document structure (5.03 KB, patch)
2015-12-29 15:15 UTC, Adrian Perez de Castro

Description Adrian Perez de Castro 2013-05-21 07:54:29 UTC
For documents which have a structure tree (see bug #64815) it would be
interesting to be able to dump it in some format that can be easily
visualized. This could be done as a new tool, or maybe adding the
functionality to "pdfinfo".
Comment 1 Adrian Perez de Castro 2013-05-29 23:51:55 UTC
Created attachment 79992
[PATCH 3/6] Tagged-PDF: Modify pdfinfo to show the document structure

The attached patch needs the other patches attached to bug #64815.
Comment 2 Adrian Perez de Castro 2013-06-06 14:22:32 UTC
Created attachment 80408
[PATCH v3 3/7] Tagged-PDF: Modify pdfinfo to show the document

Uploaded new version of the patch, which includes:

* Object references are now displayed in the output when the
  “-struct-text” flag is used.
Comment 3 Adrian Perez de Castro 2013-06-06 14:59:17 UTC
Created attachment 80421
[PATCH v3 4/6] Tagged-PDF: Implement the utils/pdfstructtohtml tool

This new patch adds a "utils/pdfstructtohtml" tool that uses the document
structure from tagged PDFs to generate an HTML document from it, trying
to preserve as much of the structure and some of the styling in the
output. This also serves as a demonstration of how to use the Attribute
and StructElement classes.

Note that this is not intended (for the moment) to be a full-fledged tool
and will produce basic HTML output where the structure is preserved, but
it does not take much care of preserving the rendered look of the input
documents.
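
As a rough illustration of the idea described above (not the code from the attached patch), a structure tree can be turned into nested HTML with a simple recursive walk. The sketch below is Python rather than Poppler's C++; the `Elem` class and the `TAG_MAP` table are hypothetical stand-ins and do not reflect the actual StructElement API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html import escape

# Assumed mapping from PDF structure types to HTML tags (illustrative only).
TAG_MAP = {
    "Document": "body",
    "Sect": "section",
    "H1": "h1",
    "P": "p",
    "L": "ul",
    "LI": "li",
    "Span": "span",
}


@dataclass
class Elem:
    """Hypothetical structure element: a type name, optional text, children."""
    tag: str
    text: str = ""
    children: list[Elem] = field(default_factory=list)


def to_html(elem: Elem) -> str:
    """Emit the element and its subtree as nested HTML, escaping text."""
    html_tag = TAG_MAP.get(elem.tag, "div")  # unknown types fall back to <div>
    inner = escape(elem.text) + "".join(to_html(c) for c in elem.children)
    return f"<{html_tag}>{inner}</{html_tag}>"


tree = Elem("Sect", children=[Elem("H1", "A & B"), Elem("P", "x")])
print(to_html(tree))
```

This only preserves the nesting and the text; styling and layout are ignored entirely, which matches the limitation noted above.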
Comment 4 Adrian Perez de Castro 2013-06-18 16:05:53 UTC
Created attachment 81013
[PATCH v5 06/10] Tagged-PDF: Modify pdfinfo to show the document structure
Comment 5 Adrian Perez de Castro 2013-06-18 16:06:21 UTC
Created attachment 81014
[PATCH v5 06/10] Tagged-PDF: Modify pdfinfo to show the document structure
Comment 6 Adrian Perez de Castro 2013-06-18 16:07:01 UTC
Created attachment 81015
[PATCH v5 07/10] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Comment 7 Adrian Perez de Castro 2013-08-01 13:46:31 UTC
Created attachment 83434
[PATCH v7 5/9] Tagged-PDF: Modify pdfinfo to show the document structure
Comment 8 Adrian Perez de Castro 2013-08-01 13:47:03 UTC
Created attachment 83435
[PATCH v7 6/9] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Comment 9 Adrian Perez de Castro 2013-08-01 16:21:07 UTC
Created attachment 83452
[PATCH v7 5/9] Tagged-PDF: Modify pdfinfo to show the document structure
Comment 10 Adrian Perez de Castro 2013-09-26 18:46:37 UTC
Created attachment 86679
[PATCH v8 06/15] Tagged-PDF: Modify pdfinfo to show the document structure
Comment 11 Adrian Perez de Castro 2013-09-26 18:47:00 UTC
Created attachment 86680
[PATCH v8 07/15] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Comment 12 Adrian Perez de Castro 2013-11-11 20:43:53 UTC
Created attachment 89056
[PATCH v10 03/12] Tagged-PDF: Modify pdfinfo to show the document structure
Comment 13 Adrian Perez de Castro 2013-11-11 20:44:27 UTC
Created attachment 89057
[PATCH v10 04/12] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Comment 14 Adrian Perez de Castro 2013-11-27 20:40:35 UTC
Created attachment 89918
[PATCH v11 03/11] Tagged-PDF: Implement the utils/pdfstructtohtml tool

The implementation could be much smarter, and instead of generating a
“style=""” attribute for each span, it could use a CSS class and spit
out a separate CSS file with only the unique styles.
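
A minimal sketch of that deduplication, in Python for brevity rather than in the tool's C++: each distinct style string gets one class name, and the stylesheet lists every unique style once. The function name `assign_css_classes` and the `s0`, `s1`, ... naming scheme are made up for illustration and are not part of the patch.

```python
def assign_css_classes(span_styles: list[str]) -> tuple[list[str], str]:
    """Map each distinct style string to a class name, in first-seen order.

    Returns the class name to use for every span, plus the stylesheet text
    containing one rule per unique style.
    """
    class_for_style: dict[str, str] = {}
    classes: list[str] = []
    for style in span_styles:
        if style not in class_for_style:
            # First time this style is seen: allocate the next class name.
            class_for_style[style] = f"s{len(class_for_style)}"
        classes.append(class_for_style[style])
    stylesheet = "\n".join(
        f".{name} {{ {style} }}" for style, name in class_for_style.items()
    )
    return classes, stylesheet


classes, css = assign_css_classes(
    ["color: red", "font-weight: bold", "color: red"]
)
print(classes)
print(css)
```

Spans would then carry `class="s0"` instead of a repeated `style` attribute, and `css` would be written to the separate file.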
Comment 15 Adrian Perez de Castro 2014-02-23 20:20:25 UTC
For the record, the “pdfstructtohtml” tool was made to exercise the
Tagged-PDF implementation in Poppler, and I do not consider it
production-quality code. I think it would be interesting to pick some
of the parts and move them into “pdftohtml”, for example the text
extraction could use the Tagged-PDF information when available, and
use the existing method as a fall-back. I can try to do that when
I have some time.
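
The fall-back idea could look roughly like the sketch below, written in Python rather than Poppler's C++. Both extractor callables are hypothetical placeholders: one stands for extraction driven by the structure tree and returns `None` when the document is not tagged, the other stands for the existing method.

```python
from typing import Callable, Optional


def extract_text(
    from_structure: Callable[[], Optional[str]],
    from_layout: Callable[[], str],
) -> str:
    """Prefer text derived from the structure tree; otherwise fall back.

    `from_structure` is assumed to return None when the document has no
    usable Tagged-PDF information.
    """
    text = from_structure()
    if text is not None:
        return text
    return from_layout()


# Untagged document: the structure-based extractor reports nothing.
print(extract_text(lambda: None, lambda: "text from existing method"))
```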

On the other hand, the patch to enable dumping the structure in
“pdfinfo” is quite small, and I think it would be a good thing to
have the patch applied.
Comment 16 Carlos Garcia Campos 2014-02-24 18:51:56 UTC
(In reply to comment #15)
> For the record, the “pdfstructtohtml” tool was made to exercise the
> Tagged-PDF implementation in Poppler, and I do not consider it
> production-quality code. I think it would be interesting to pick some
> of the parts and move them into “pdftohtml”, for example the text
> extraction could use the Tagged-PDF information when available, and
> use the existing method as a fall-back. I can try to do that when
> I have some time.
> 
> On the other hand, the patch to enable dumping the structure in
> “pdfinfo” is quite small, and I think it would be a good thing to
> have the patch applied.

Use different bugs then :-)
Comment 17 Adrian Perez de Castro 2014-03-06 09:41:04 UTC
Created attachment 95216
[PATCH v20] Tagged-PDF: Modify pdfinfo to show the document structure

Rebased, up-to-date patch, ready to apply onto “master”.
Comment 18 Adrian Perez de Castro 2014-06-16 18:19:11 UTC
Is there any reason why this patch cannot be applied to the repository? Personally
I do not see any trouble in adding a flag to make “pdfinfo” dump the structure
from the Tagged-PDF data. Can we please land this?
Comment 19 Adrian Perez de Castro 2015-12-29 15:15:57 UTC
Created attachment 120726
[PATCH v21] Tagged-PDF: Modify pdfinfo to show the document structure

Rebased and updated to follow changes in the StructElement API.
Comment 20 Adrian Perez de Castro 2016-02-26 10:31:52 UTC
Ping? It would be nice if someone could review the updated patch
and get it landed before it bitrots again. Thanks in advance!
Comment 21 Mike Gerber 2016-03-01 12:35:09 UTC
I'm working with tagged/accessible PDFs and I would also like to see this merged into pdfinfo.
Comment 22 Carlos Garcia Campos 2016-03-01 12:58:58 UTC
Ok, pushed. Thanks!

