103530 – Support XMP metadata for title, author etc.

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	All All

Importance:	medium enhancement
Assignee:	poppler-bugs

Reported:	2017-10-31 21:28 UTC by Reuben Thomas
Modified:	2018-08-20 21:33 UTC
CC List:	1 user

Description Reuben Thomas 2017-10-31 21:28:37 UTC

At present, poppler doesn't use the XMP metadata. This is a shame, not just because it means that in Evince, Okular etc. some documents appear to lack metadata, but because this particularly disadvantages PDF/A-compliant documents, for example. As an author, I'd like to produce documents compliant with 10-year-old standards, and not have my users complain they're broken!

The most obvious fix would seem to be to modify the various getDocInfo* methods (ignoring for the moment the fact that this will make the names somewhat misleading, as they will no longer look only in the DOCINFO dictionary), to look in the XMP metadata, according to the relevant PDF specs (I haven't yet determined what these are and what conditions they impose).

Since the XMP is XML, it seems this will need libxml2 (or equivalent). The poppler maintainers might desire this to be an optional dependency, at least at first.

Also worth considering: might clients of poppler want to know the source of a particular piece of metadata (i.e. whether it comes from DOCINFO or XMP)? If so, is that best achieved by adding new "finer-grained" APIs, or by leaving the existing APIs unaltered, and adding new ones?

Is there any need to provide low-level access to the XMP metadata through public APIs, or, since it's all XML with well-known schemas, is that redundant?

Comment 1 Albert Astals Cid 2017-11-01 22:45:16 UTC

> If so, is that best achieved by adding new "finer-grained" APIs, or by leaving the existing APIs unaltered, and adding new ones?

Adding new ones (at least at the core level) is the way to go.

At the public level I'm unsure what is best. Using the existing ones has the immediate benefit of "fixing" the applications without them needing to do anything, but I think there's also value in applications being able to access both values.

> Is there any need to provide low-level access to the XMP metadata through public APIs, or, since it's all XML with well-known schemas, is that redundant?

The XML is already accessible, so i think that's ok

Comment 2 Reuben Thomas 2017-11-02 00:06:25 UTC

Thanks very much for the comments.

My feeling is that it would be best to enhance existing APIs unless there's an obvious reason not to, and then add new APIs for more fine-grained information if necessary (most obviously, to read *only* the DocInfo dictionary, since, as you say, the XMP XML can already be read and parsed separately with existing APIs).

Comment 3 Jose Aliste 2017-11-02 13:25:35 UTC

We have discussed this on the list but never reached a conclusion. For instance, for author there is the info in the catalog and the info in the XMP metadata. My understanding of the PDF spec is that if the XMP metadata is present, then the catalog data should be ignored. Because of that, I tend to believe that we need a "High Level API" to get the author that returns the correct value according to the PDF spec. Of course, it also makes sense to still have "Catalog->getAuthor" even if it's supposed to be ignored, and we also have a low level API to get the whole XMP metadata, so we are covered there. This high level API would belong in poppler IMO since it's part of the PDF spec. The second question is whether we should provide implementations in each frontend or in core.

Albert, how do you feel about introducing a dependency on an XML library (libxml2, for instance), so that I can try to add a private high level API that could be shared by both the glib and qt frontends?

Comment 4 Albert Astals Cid 2017-11-02 22:48:49 UTC

(In reply to Jose Aliste from comment #3)
> We have discussed this in the list but never got to a conclusion. 

Do you have the email subject at hand? I do remember a discussion, but just couldn't find it.

> For
> instance, for author there is the info in the catalog and the info in the
> XMP metadata. My understanding of the PDF spec says that if the XMP metadata
> is present, then the catalog data should be ignored.

That is not correct, the rules are more complex as far as i remember.

> Because of that, I tend
> to believe that we need a "High Level API" to get the author that get the
> correct value that agrees with the pdf spec. Of course, it also makes sense
> to still have "Catalog->getAuthor" even if it's supposed to be ignored and
> we also have low level API to get the whole XMP metadata so we are covered
> there. This high level API would belong to poppler IMO since it's part of
> the PDF spec. The second question is whether we should provide
> implementations in each frontend or in core. 
> 
> Albert, how do you feel about introducing a dependency on a xml library
> (libxml2) for instance? so I can try to add a private high level api that
> could be shared by both glib and qt frontends?

I guess that'd be ok

Comment 5 Jose Aliste 2017-11-03 01:18:09 UTC

(In reply to Albert Astals Cid from comment #4)
> (In reply to Jose Aliste from comment #3)
> > We have discussed this in the list but never got to a conclusion. 
> 
> > Do you have the email subject at hand? I do remember a discussion, but just
> couldn't find it.

Actually, it's more or less the discussion we are having now. 

https://lists.freedesktop.org/archives/poppler/2017-April/012171.html
https://lists.freedesktop.org/archives/poppler/2017-September/012596.html
> 
> > For
> > instance, for author there is the info in the catalog and the info in the
> > XMP metadata. My understanding of the PDF spec says that if the XMP metadata
> > is present, then the catalog data should be ignored.
> 
> That is not correct, the rules are more complex as far as i remember.
> 
Yeah, I just wanted to point out that since the rules are complex and belong to the pdf spec, we should probably implement them in poppler core. 

> > Because of that, I tend
> > to believe that we need a "High Level API" to get the author that get the
> > correct value that agrees with the pdf spec. Of course, it also makes sense
> > to still have "Catalog->getAuthor" even if it's supposed to be ignored and
> > we also have low level API to get the whole XMP metadata so we are covered
> > there. This high level API would belong to poppler IMO since it's part of
> > the PDF spec. The second question is whether we should provide
> > implementations in each frontend or in core. 
> > 
> > Albert, how do you feel about introducing a dependency on a xml library
> > (libxml2) for instance? so I can try to add a private high level api that
> > could be shared by both glib and qt frontends?
> 
> I guess that'd be ok
thanks

Comment 6 Evangelos Rigas 2018-07-10 22:03:35 UTC

Hello all,

First, some background information.
I have recently added support for XMP metadata in Evince.
I added a couple of helper functions to extract basic objects from the xml
and then wrapper methods to actually extract the information.

The implemented methods cover the following attributes:
Title, Subject, Keywords, Author, PDF/A or PDF/X format, License info, Creator,
Producer, and Dates (created, modified).

A day ago, I looked into extracting the PDF/A and PDF/X version from the information dictionary, as some files don't have embedded XMP metadata. At present, Evince does not recognise PDF files as PDF/A or PDF/X if they lack XMP metadata.

However, extracting dictionary keys from the DOCINFO is trivial in poppler, as the function to read the dictionary is already implemented and used for the extraction of the title, author, etc.
Thus, I decided to add the support in poppler.
The result can be seen at https://gitlab.gnome.org/erigas/poppler/tree/pdf_subtype.
I plan to send a patch shortly.

Second, as I was looking through the codebase and the bugs of poppler for hints, I stumbled upon this bug.
I read all your comments and decided to have a look at porting my existing code to poppler.
The initial porting tests have worked. I haven't pushed the branch yet.
I plan to do so in the following days.

So, here is what I have done up to now.
Please note the changes are only applied to the Glib backend.

Added libxml2 as a dependency.
Added glib/poppler-metadata.cc, glib/poppler-metadata.h
The first part of poppler-metadata.cc contains the necessary logic to read the xml metadata.
The second part contains wrapper functions around the first ones.

As an example, consider the function to extract the author.
static char * xmp_metadata_get_author (xmlXPathContextPtr xpathCtx);
It requires an XPath context to read the XML tree and extract the author.
Then there is gchar * poppler_metadata_get_author (const gchar *metadata).
It takes the metadata string that contains the XML.
It opens an XML context and passes it down to xmp_metadata_get_author to retrieve the author.

Only the poppler_metadata_get_* functions are exposed through poppler-metadata.h.

Furthermore, I added helper methods in poppler-document.cc similar to the ones already defined for the info dict.
Following the example above, here is the definition of the author method:

gchar *
poppler_document_get_author_from_xmp (PopplerDocument *document)
{
  /* poppler_document_get_metadata returns a newly allocated string */
  gchar *metadata = poppler_document_get_metadata (document);
  gchar *author = poppler_metadata_get_author (metadata);

  g_free (metadata);

  return author;
}

Finally, here is my proposition on how to progress.
First, to address the concerns about the info dict and xmp metadata.

> 
> > For
> > instance, for author there is the info in the catalog and the info in the
> > XMP metadata. My understanding of the PDF spec says that if the XMP metadata
> > is present, then the catalog data should be ignored.
> 
> That is not correct, the rules are more complex as far as i remember.
> 

From the PDF reference (https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf), section H.3 in Appendix H.

I quote:
> For backward compatibility, applications that create PDF 1.4 documents
> should include the metadata for a document in the document information
> dictionary as well as in the document’s metadata stream. Applications that
> support PDF 1.4 should check for the existence of a metadata stream and
> synchronize the information in it with that in the document information
> dictionary. The Adobe metadata framework provides a date stamp for
> metadata expressed in the framework. If this date stamp is equal to or later
> than the document modification date recorded in the document information dictionary,
> the metadata stream can be taken as authoritative. If, however,
> the document modification date recorded in the document
> information dictionary is later than the metadata stream’s date stamp, the
> document has likely been saved by an application that is not aware of PDF
> 1.4 metadata streams. In this case, information stored in the document
> information dictionary should be taken to override any semantically
> equivalent items in the metadata stream.
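Comparing the two date stamps means bringing them to a common form first, since the information dictionary stores dates in the PDF date format (D:YYYYMMDDHHmmSSOHH'mm') while XMP uses ISO 8601 style dates. Below is a minimal sketch of the first half only. The helper name pdf_date_to_epoch is hypothetical and not part of poppler, and the validation is deliberately light.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Days since 1970-01-01 for a proleptic Gregorian date.
static long long days_from_civil(long long y, unsigned m, unsigned d) {
    y -= m <= 2;
    const long long era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<long long>(doe) - 719468;
}

// Reads exactly `width` digits starting at `pos`; advances `pos` only on success.
static std::optional<int> read_digits(const std::string& s, std::size_t& pos,
                                      std::size_t width) {
    if (pos + width > s.size()) return std::nullopt;
    int value = 0;
    for (std::size_t i = 0; i < width; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if (!std::isdigit(c)) return std::nullopt;
        value = value * 10 + (c - '0');
    }
    pos += width;
    return value;
}

// Hypothetical helper (an assumption, not poppler API): converts a PDF date
// string such as "D:20180710220335+01'00'" to seconds since the Unix epoch.
std::optional<long long> pdf_date_to_epoch(const std::string& s) {
    std::size_t pos = (s.compare(0, 2, "D:") == 0) ? 2 : 0;
    const auto year = read_digits(s, pos, 4);
    if (!year) return std::nullopt;

    // Fields after the year are optional: month, day, hour, minute, second.
    int field[5] = {1, 1, 0, 0, 0};
    for (int i = 0; i < 5; ++i) {
        const auto v = read_digits(s, pos, 2);
        if (!v) break;
        field[i] = *v;
    }
    if (field[0] < 1 || field[0] > 12 || field[1] < 1 || field[1] > 31)
        return std::nullopt;

    // Optional UTC offset: 'Z', or +HH'mm' / -HH'mm'.
    long long offset = 0;
    if (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) {
        const int sign = (s[pos] == '+') ? 1 : -1;
        ++pos;
        const int hh = read_digits(s, pos, 2).value_or(0);
        if (pos < s.size() && s[pos] == '\'') ++pos;
        const int mm = read_digits(s, pos, 2).value_or(0);
        offset = sign * (hh * 3600LL + mm * 60LL);
    }

    const long long days = days_from_civil(*year, static_cast<unsigned>(field[0]),
                                           static_cast<unsigned>(field[1]));
    return days * 86400 + field[2] * 3600LL + field[3] * 60LL + field[4] - offset;
}
```

The XMP date stamp would need an equivalent ISO 8601 parser before the "equal to or later than" comparison from the quoted passage can be made on plain integers.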

So I believe that the functions as implemented now should remain as they are for backwards compatibility; however, I propose some changes.
These can be implemented in a couple of stages.

First, add the functions to read the XMP metadata.
Probably at the same time, add a state variable (determined by the modification dates) to indicate which information source is considered valid, and use an if statement to switch the current callbacks accordingly.
Thus, clients will work out of the box with an update.
Furthermore, I suggest using the XMP values if the value from the information dict is null.
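The selection logic described here could look roughly like the following sketch. All type and function names are hypothetical and not existing poppler API. It assumes both date stamps have already been normalized to seconds since the epoch, it treats a missing XMP date stamp as "info dict wins" (an assumption the quoted passage does not cover), and it falls back to the other source in both directions, which goes slightly beyond the fallback suggested above.

```cpp
#include <optional>
#include <string>

// Where a returned metadata value came from (hypothetical type).
enum class MetadataSource { None, InfoDict, Xmp };

struct MetadataValue {
    std::optional<std::string> value;
    MetadataSource source = MetadataSource::None;
};

// Both dates are assumed to be normalized to seconds since the epoch.
struct MetadataDates {
    std::optional<long long> info_mod_date;   // ModDate from the info dict
    std::optional<long long> xmp_date_stamp;  // date stamp from the XMP stream
};

// The "state variable": XMP is authoritative when its date stamp is equal to
// or later than the info dict's modification date.
bool xmp_is_authoritative(const MetadataDates& d) {
    if (!d.xmp_date_stamp) return false;  // assumption: no stamp, trust info dict
    if (!d.info_mod_date) return true;    // assumption: nothing to compare against
    return *d.xmp_date_stamp >= *d.info_mod_date;
}

// Picks one field (e.g. the author) from the preferred source and falls back
// to the other source when the preferred one has no value.
MetadataValue pick_value(const std::optional<std::string>& from_info,
                         const std::optional<std::string>& from_xmp,
                         const MetadataDates& dates) {
    const bool prefer_xmp = xmp_is_authoritative(dates);
    if (prefer_xmp && from_xmp) return {from_xmp, MetadataSource::Xmp};
    if (!prefer_xmp && from_info) return {from_info, MetadataSource::InfoDict};
    if (from_xmp) return {from_xmp, MetadataSource::Xmp};
    if (from_info) return {from_info, MetadataSource::InfoDict};
    return {};
}
```

Returning the source alongside the value would also let a client tell whether a given field came from DOCINFO or XMP, the question raised in the original description.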

At this first stage, only the most basic information will be extracted.
XMP supports a plethora of additional information, such as contact address (Iptc4xmpCore:CreatorContactInfo), contact email address(es) (CiEmailWork), etc.
This can be added at a later date to poppler-metadata.cc.

For the next stage, the ability to update the xmp information, based on the info dict as stated in the specification, must be added.

This is all I had to say.
I am open to suggestions.

My plan is to first add the PDF/A, PDF/X support.
Then I will finish the porting of the xmp metadata to conclude the first stage of the XMP support.

Kind Regards,

Evangelos

Comment 7 GitLab Migration User 2018-08-20 21:33:53 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/14.