For documents that have a structure tree (see bug #64815), it would be useful to be able to dump it in some format that can be easily visualized. This could be done as a new tool, or perhaps by adding the functionality to "pdfinfo".
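As a rough illustration of what an easily visualized dump could look like, here is a minimal sketch that prints a tree as indented text, one element per line. The `Node` class is a hypothetical stand-in for a structure element; it is not Poppler's API, and the output format is only an assumption about what such a dump might look like.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a structure element (not Poppler's API)."""
    type_name: str
    children: list = field(default_factory=list)


def dump(node, depth=0):
    """Return the tree as a list of lines, indenting two spaces per level."""
    lines = ["  " * depth + node.type_name]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


if __name__ == "__main__":
    tree = Node("Document", [Node("H1"), Node("P", [Node("Span")])])
    print("\n".join(dump(tree)))
```

An indented listing like this keeps the nesting visible without requiring any viewer beyond a terminal.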
Created attachment 79992 [PATCH 3/6] Tagged-PDF: Modify pdfinfo to show the document structure
The attached patch needs the other patches attached to bug #64815.
Created attachment 80408 [PATCH v3 3/7] Tagged-PDF: Modify pdfinfo to show the document
Uploaded a new version of the patch, which includes:
* Object references are now displayed in the output when the “-struct-text” flag is used.
Created attachment 80421 [PATCH v3 4/6] Tagged-PDF: Implement the utils/pdfstructtohtml tool
This new patch adds a "utils/pdfstructtohtml" tool that uses the document structure of tagged PDFs to generate an HTML document, trying to preserve as much of the structure and some of the styling as possible in the output. It also serves as a demonstration of how to use the Attribute and StructElement classes. Note that this is not intended (for the moment) to be a full-fledged tool: it produces basic HTML output in which the structure is preserved, but it does not take much care to preserve the rendered look of the input documents.
Created attachment 81013 [PATCH v5 06/10] Tagged-PDF: Modify pdfinfo to show the document structure
Created attachment 81014 [PATCH v5 06/10] Tagged-PDF: Modify pdfinfo to show the document structure
Created attachment 81015 [PATCH v5 07/10] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Created attachment 83434 [PATCH v7 5/9] Tagged-PDF: Modify pdfinfo to show the document structure
Created attachment 83435 [PATCH v7 6/9] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Created attachment 83452 [PATCH v7 5/9] Tagged-PDF: Modify pdfinfo to show the document structure
Created attachment 86679 [PATCH v8 06/15] Tagged-PDF: Modify pdfinfo to show the document structure
Created attachment 86680 [PATCH v8 07/15] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Created attachment 89056 [PATCH v10 03/12] Tagged-PDF: Modify pdfinfo to show the document structure
Created attachment 89057 [PATCH v10 04/12] Tagged-PDF: Implement the utils/pdfstructtohtml tool
Created attachment 89918 [PATCH v11 03/11] Tagged-PDF: Implement the utils/pdfstructtohtml tool
The implementation could be much smarter: instead of generating a “style=""” attribute for each span, it could use a CSS class and write out a separate CSS file containing only the unique styles.
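The style deduplication idea mentioned for pdfstructtohtml could be sketched roughly as follows. This is not code from the patch: the function name `dedupe_styles`, the `(text, style)` input shape, and the generated class names (`s0`, `s1`, ...) are all assumptions made for illustration. Each distinct style string is assigned one class, spans refer to the class, and the stylesheet lists every unique style once.

```python
from html import escape


def dedupe_styles(spans):
    """Map each distinct style string to one CSS class.

    `spans` is a list of (text, style) pairs, where `style` is the CSS
    declaration text that would otherwise go into a style="" attribute.
    Returns (html_fragments, css_text). Hypothetical helper, not Poppler code.
    """
    class_for_style = {}  # style string -> generated class name
    fragments = []
    for text, style in spans:
        if not style:
            # Unstyled spans need no class at all.
            fragments.append(f"<span>{escape(text)}</span>")
            continue
        # Reuse the existing class if this exact style was seen before;
        # otherwise allocate the next name in sequence.
        cls = class_for_style.setdefault(style, f"s{len(class_for_style)}")
        fragments.append(f'<span class="{cls}">{escape(text)}</span>')
    # One rule per unique style, in first-seen order.
    css = "\n".join(
        f".{cls} {{ {style} }}" for style, cls in class_for_style.items()
    )
    return fragments, css


if __name__ == "__main__":
    html, css = dedupe_styles([
        ("Title", "font-weight: bold"),
        ("body", "color: red"),
        ("Again", "font-weight: bold"),
    ])
    print("\n".join(html))
    print(css)
```

Keying on the exact style string is the simplest choice; two styles that differ only in whitespace or declaration order would still get separate classes unless they are normalized first.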
For the record, the “pdfstructtohtml” tool was made to exercise the Tagged-PDF implementation in Poppler, and I do not consider it production-quality code. I think it would be interesting to pick some of the parts and move them into “pdftohtml”, for example the text extraction could use the Tagged-PDF information when available, and use the existing method as a fall-back. I can try to do that when I have some time. On the other hand, the patch to enable dumping the structure in “pdfinfo” is quite small, and I think it would be a good thing to have the patch applied.
(In reply to comment #15)
> For the record, the “pdfstructtohtml” tool was made to exercise the
> Tagged-PDF implementation in Poppler, and I do not consider it
> production-quality code. I think it would be interesting to pick some
> of the parts and move them into “pdftohtml”, for example the text
> extraction could use the Tagged-PDF information when available, and
> use the existing method as a fall-back. I can try to do that when
> I have some time.
>
> On the other hand, the patch to enable dumping the structure in
> “pdfinfo” is quite small, and I think it would be a good thing to
> have the patch applied.

Use different bugs then :-)
Created attachment 95216 [PATCH v20] Tagged-PDF: Modify pdfinfo to show the document structure
Rebased, up-to-date patch, ready to apply onto “master”.
Is there any reason why this patch cannot be applied to the repository? Personally I do not see any trouble in adding a flag to make “pdfinfo” dump the structure from the Tagged-PDF data. Can we please land this?
Created attachment 120726 [PATCH v21] Tagged-PDF: Modify pdfinfo to show the document structure
Rebased and updated to follow changes in the StructElement API.
Ping? It would be nice if someone could review the updated patch and get it landed before it bitrots again. Thanks in advance!
I'm working with tagged/accessible PDFs and I would also like to see this merged into pdfinfo.
Ok, pushed. Thanks!