12449 – Feature request: pdftodocbook

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	All All

Importance:	medium enhancement
Assignee:	poppler-bugs

Reported:	2007-09-16 15:26 UTC by Loui
Modified:	2018-08-20 21:54 UTC
CC List:	0 users

Description Loui 2007-09-16 15:26:29 UTC

It looks like the XML output of pdftohtml is some weird custom format. I was thinking that it would be really nice to be able to convert to DocBook XML instead. It would probably be more useful for most people.

I'll try to look into it when I get some time. I'm just putting this up to keep it in mind, or in case anyone else wants to try it. Cheers.

Comment 1 Mathieu Malaterre 2012-09-27 11:15:09 UTC

I believe this would be an extremely hard task. PDF is all about rendering, while DocBook is all about content...
I am not sure how PDF represents a table, for instance.

Comment 2 kurt.pfeifle 2017-10-30 01:06:39 UTC

Such a feature would only have a chance of working for "tagged" PDF.

PDF started its life as a digital document format with (almost) only one feature: to be a true visual replacement for printed paper, but on screen (and to convert the on-screen page images reliably to paper images without misrendering).

For this task, PDF did not need to know the "meaning" of the strokes and pixels on the screen. After all, deciphering these was meant to be the task of the human looking at them. It did not need to know about the "semantics" of the different parts, only about how these parts should render on screen or on paper.

Later this simple conception of PDF was extended: the ambition was (and is) to include internal "markup" for the various parts of the PDF's visual content, and to declare their respective meanings as well: "this is a headline"; "this is a subtitle"; "this is a textbox"; "this text string is the author's name".

PDFs which are equipped with such internal markup are called "tagged" PDFs.
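As a rough illustration of how one might check for this before attempting a conversion: poppler's `pdfinfo` tool prints a `Tagged:` line in its output. The Python sketch below parses that line. The helper names `parse_tagged` and `pdf_is_tagged` are hypothetical, and the sketch assumes `pdfinfo` is installed and on the PATH.

```python
import subprocess
from typing import Optional


def parse_tagged(pdfinfo_output: str) -> Optional[bool]:
    """Return True/False from the 'Tagged:' line, or None if the line is absent."""
    for line in pdfinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Tagged":
            return value.strip().lower() == "yes"
    return None


def pdf_is_tagged(path: str) -> Optional[bool]:
    """Run pdfinfo on the given file (assumed to be installed) and parse its output."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_tagged(result.stdout)
```

A `yes` here only says that the file declares itself tagged; it says nothing about whether the tagging is complete or correct.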

Very few real-world PDFs nowadays are "tagged". And if they are, the tagging is very frequently incomplete and often plain wrong.

Converting a PDF document to DocBook is only feasible if you have very well tagged input.
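To make the idea concrete, here is a minimal Python sketch of what such a conversion could look like once a well-tagged structure tree is available. It is not poppler code: `StructElem` is a hypothetical in-memory stand-in for a structure tree node, the tag names (`H1`, `P`, `L`, `LI`) are standard PDF structure types, and the mapping to DocBook elements is an assumed, simplified one (flat sections, no tables, list items reduced to plain text). The output is not validated against the DocBook schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructElem:
    """Hypothetical stand-in for one node of a tagged PDF's structure tree."""
    tag: str                      # structure type, e.g. "H1", "P", "L", "LI"
    text: str = ""
    children: List["StructElem"] = field(default_factory=list)


HEADINGS = {"H", "H1", "H2", "H3", "H4", "H5", "H6"}


def _convert(children: List[StructElem], parent: ET.Element) -> None:
    current = parent
    for node in children:
        if node.tag in HEADINGS:
            # Each heading opens a new (flat) section; later siblings go inside it.
            current = ET.SubElement(parent, "section")
            ET.SubElement(current, "title").text = node.text
        elif node.tag == "P":
            ET.SubElement(current, "para").text = node.text
        elif node.tag == "L":
            lst = ET.SubElement(current, "itemizedlist")
            for item in node.children:
                if item.tag == "LI":
                    li = ET.SubElement(lst, "listitem")
                    ET.SubElement(li, "para").text = item.text
        else:
            # Grouping or unknown element (e.g. Sect, Div): descend without a wrapper.
            _convert(node.children, current)


def to_docbook(root: StructElem) -> ET.Element:
    """Map a structure tree to a DocBook-like <article> element."""
    article = ET.Element("article")
    _convert(root.children, article)
    return article


doc = StructElem("Document", children=[
    StructElem("H1", "Intro"),
    StructElem("P", "Hello."),
    StructElem("L", children=[StructElem("LI", "one"), StructElem("LI", "two")]),
])
print(ET.tostring(to_docbook(doc), encoding="unicode"))
```

Everything here depends on the tags being present and right; with missing or wrong tags, the same code would produce a structurally valid but meaningless document.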

Comment 3 GitLab Migration User 2018-08-20 21:54:30 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/121.
