Bug 107182

Summary: Add support for reading PDF/A, PDF/X version from the information dictionary (glib backend)
Product: poppler
Component: glib frontend
Version: unspecified
Hardware: Other
OS: All
Status: RESOLVED MOVED
Severity: enhancement
Priority: medium
Reporter: Evangelos Rigas <e.rigas>
Assignee: poppler-bugs <poppler-bugs>
QA Contact:
CC: e.rigas
Whiteboard:
Attachments: Patch for PDF subtype information (PDF/A)
Display PDF subtype in the output of pdfinfo if subtype key is present
Read GTS information from info dict
Display PDF subtype info in pdfinfo utility
Read GTS information from info dict using C++11 regexp
Display PDF subtype info in pdfinfo utility
PDFSubtype documentation in glib
PDFSubtype test documents
Read GTS information from info dict using C++11 regexp
Display PDF subtype info in pdfinfo utility
PDFSubtype documentation in glib

Description Evangelos Rigas 2018-07-10 22:21:31 UTC
Created attachment 140547 [details] [review]
Patch for PDF subtype information (PDF/A)

Hi, 

I have added a poppler document property for the subtype of the PDF format, i.e. PDF/A or PDF/X.

This information is read from the PDF info dictionary using two keys: GTS_PDFA1Version and GTS_PDFXVersion.

Their values look like, for example, PDF/A-2u:2010 or PDF/X-3:2003.
This PDF format subtype attribute indicates whether the PDF claims to conform to the PDF/A or PDF/X ISO standards.
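
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patch: the info dictionary is modelled as a plain std::map, and both the `InfoDict` alias and the `claimedSubtype` function are hypothetical names. Checking GTS_PDFA1Version before GTS_PDFXVersion is an arbitrary choice made for the sketch.

```cpp
#include <initializer_list>
#include <map>
#include <string>

// Hypothetical stand-in for a PDF information dictionary (key -> string value).
using InfoDict = std::map<std::string, std::string>;

// Return the subtype string the document claims, or an empty string when
// neither GTS key is present. The key order is an assumption of this sketch.
std::string claimedSubtype(const InfoDict &info)
{
    for (const char *key : {"GTS_PDFA1Version", "GTS_PDFXVersion"}) {
        auto it = info.find(key);
        if (it != info.end() && !it->second.empty()) {
            return it->second;
        }
    }
    return std::string();
}
```

The returned value is only what the file claims about itself; nothing here verifies actual conformance.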

This patch is mentioned in bug 103530.

Kind Regards,

Evangelos Rigas
Comment 1 Albert Astals Cid 2018-07-10 22:32:26 UTC
Where is GTS_PDFA1Version mentioned? I can't seem to find it in the PDF spec.
Comment 2 Evangelos Rigas 2018-07-10 22:55:36 UTC
(In reply to Albert Astals Cid from comment #1)
> Where is GTS_PDFA1Version mentioned? I can't seem to find it in the PDF spec.

If you looked in the PDF spec from the other bug, you will not find it, because it is not there.
These keys are defined in the ISO standards; see https://www.loc.gov/preservation/digital/formats/fdd/fdd000125.shtml for example.
That page links to the ISO standard, https://www.iso.org/standard/38920.html but you have to buy it.

However, you can find GTS_PDFA1Version and GTS_PDFXVersion in the code of pdfx (a pdflatex package that adds support for writing PDF/A and PDF/X compliant documents using LaTeX).

See http://ctan.math.illinois.edu/macros/latex/contrib/pdfx/pdfx.pdf (pages 68 and 70), where it exports GTS_PDFA1Version or GTS_PDFXVersion to the information dictionary.

Additionally, you can find the possible values of GTS_PDFX here http://www.npes.org/Portals/0/standards/pdf/GTS Registry-March09.pdf

Hope it helps!
Comment 3 Evangelos Rigas 2018-07-10 22:58:00 UTC
> Additionally, you can find the possible values of GTS_PDFX here http://www.npes.org/Portals/0/standards/pdf/GTS Registry-March09.pdf

Here is the right link http://www.npes.org/Portals/0/standards/pdf/GTS%20Registry-March09.pdf
Comment 4 Evangelos Rigas 2018-07-11 17:09:56 UTC
Created attachment 140562 [details] [review]
Display PDF subtype in the output of pdfinfo if subtype key is present

Hi,

I added the PDF subtype to the pdfinfo utility, so if the GTS_* keys exist, pdfinfo prints "PDF subtype: PDF/?-conformance:date" below the PDF version.

Example output:

user@pc ~$ pdfinfo Document-1.pdf

Title:          Document-1
Subject:        
Keywords:       
Author:         
Creator:        Scribus 1.5.0.svn
Producer:       Scribus PDF Library 1.5.0.svn
CreationDate:   Fri Oct  2 14:59:47 2015 BST
ModDate:        Fri Oct  2 14:59:47 2015 BST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           AcroForm
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      155620 bytes
Optimized:      no
PDF version:    1.3
PDF subtype:    PDF/X-3:2002
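
The "print only if the key is present" behaviour shown in this output could be sketched as below. `subtypeLine` is a hypothetical helper, not a function from pdfinfo; the label padding simply copies the column alignment of the sample output above.

```cpp
#include <string>

// Build the extra pdfinfo line, or nothing at all when no subtype was found.
// The four spaces after the label match the alignment of "PDF version:".
std::string subtypeLine(const std::string &subtype)
{
    if (subtype.empty()) {
        return std::string();
    }
    return "PDF subtype:    " + subtype + "\n";
}
```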
Comment 5 Albert Astals Cid 2018-07-20 21:53:37 UTC
Is there any chance we can enum that instead of it being a string? i.e. is the set of possible values fixed?
Comment 6 Evangelos Rigas 2018-08-06 12:50:28 UTC
I spent the last few weeks trying to find information on the ISO standards.
From what I gathered, there are 5 ISO standards based on PDF.
These are:

ISO 19005 - Document management -- Electronic document file format for long-term preservation (PDF/A)
ISO 24517 - Document management -- Engineering document format using PDF (PDF/E)
ISO 14289 - Document management applications -- Electronic document file format enhancement for accessibility (PDF/UA)
ISO 16612 - Graphic technology -- Variable data exchange (PDF/VT)
ISO 15930 - Graphic technology -- Prepress digital data exchange (PDF/X)

Each standard has multiple parts (i.e. revisions) and different conformance levels.
To trim down the enum, I decided to split it into three enums: subtype, part, and conformance.

The subtype enum represents the 5 standards (A, E, UA, VT, X), the part enum the part number (1-5), and the conformance enum the 7 levels of document conformance (A, B, G, N, P, PG, U).

These enums are extracted using a regular expression on the GTS version string.

I have attached two patches. The first is the implementation in both the core and the glib backend, while the second patch adds support for the subtype in the pdfinfo utility.
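
A rough sketch of the three-enum split described in this comment is shown below. The enum and function names are illustrative assumptions and are not taken from the attached patches; the sketch only shows how already-extracted tokens (such as "A", "2", "u") might be mapped to enum values, with a None value for anything unrecognised.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Illustrative names only; they are not taken from the attached patches.
enum class Subtype { None, A, E, UA, VT, X };
enum class Part { None, P1, P2, P3, P4, P5 };
enum class Conformance { None, A, B, G, N, P, PG, U };

// Upper-case a token so that "pg" and "PG" compare equal.
std::string upperCopy(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

Subtype subtypeFromToken(const std::string &token)
{
    const std::string t = upperCopy(token);
    if (t == "A") return Subtype::A;
    if (t == "E") return Subtype::E;
    if (t == "UA") return Subtype::UA;
    if (t == "VT") return Subtype::VT;
    if (t == "X") return Subtype::X;
    return Subtype::None;
}

// Accept a single digit '1'..'5'; the enumerators are declared in that order.
Part partFromToken(const std::string &token)
{
    if (token.size() != 1 || token[0] < '1' || token[0] > '5') {
        return Part::None;
    }
    return static_cast<Part>(token[0] - '0');
}

Conformance conformanceFromToken(const std::string &token)
{
    const std::string t = upperCopy(token);
    if (t == "A") return Conformance::A;
    if (t == "B") return Conformance::B;
    if (t == "G") return Conformance::G;
    if (t == "N") return Conformance::N;
    if (t == "P") return Conformance::P;
    if (t == "PG") return Conformance::PG;
    if (t == "U") return Conformance::U;
    return Conformance::None;
}
```

Keeping three small enums avoids one enum with an entry for every subtype, part and level combination, which is the trimming the comment refers to.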
Comment 7 Evangelos Rigas 2018-08-06 12:53:58 UTC
Created attachment 140978 [details] [review]
Read GTS information from info dict
Comment 8 Evangelos Rigas 2018-08-06 12:55:11 UTC
Created attachment 140979 [details] [review]
Display PDF subtype info in pdfinfo utility
Comment 9 Albert Astals Cid 2018-08-09 15:45:19 UTC
I see you're using regexec, which is not available on Windows since it's a POSIX thing.

I don't really care much for Windows personally, but people get annoyed when we break the build too much.

Can you try using the C++11 regex support, which should be more widely supported?
https://en.cppreference.com/w/cpp/regex

Sorry about that :/
Comment 10 Evangelos Rigas 2018-08-11 09:23:47 UTC
(In reply to Albert Astals Cid from comment #9)
> I see you're using regexec which is not available on windows since it's a
> posix thing.
> 
> I don't really care much for windows personally, but people get annoyed when
> we break the build too much.
> 
Makes total sense!

> Can you try using the C++11 regexp support that should be more widely
> supported?
> https://en.cppreference.com/w/cpp/regex
> 
> Sorry about that :/
Done! Changed regex to C++11 regexp and added documentation reference in glib.
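
For illustration, splitting a GTS version string with the C++11 <regex> header might look like the sketch below. The pattern, the `GtsParts` struct and the `splitGtsVersion` function are assumptions made for this example and are not the pattern used in the attached patch; the order of the first three capture groups (subtype, part, conformance) is likewise assumed.

```cpp
#include <regex>
#include <string>

// Hypothetical result type: the raw tokens cut out of the version string.
struct GtsParts {
    bool ok = false;
    std::string subtype;      // "A", "E", "UA", "VT" or "X"
    std::string part;         // "1".."5", empty if absent
    std::string conformance;  // e.g. "b", "u", "pg", empty if absent
    std::string year;         // e.g. "2010", empty if absent
};

GtsParts splitGtsVersion(const std::string &version)
{
    // Assumed pattern (default ECMAScript grammar): "PDF/" + subtype,
    // then an optional "-<part><conformance>" and an optional ":<year>".
    static const std::regex pattern(
        R"(PDF/(A|E|UA|VT|X)(?:-([1-5])([A-Za-z]{1,2})?)?(?::([0-9]{4}))?)");

    GtsParts out;
    std::smatch m;
    if (std::regex_match(version, m, pattern)) {
        out.ok = true;
        out.subtype = m.str(1);
        out.part = m.str(2);         // str() is empty for unmatched groups
        out.conformance = m.str(3);
        out.year = m.str(4);
    }
    return out;
}
```

Because std::regex is part of the standard library, this avoids the dependency on POSIX regexec raised in comment 9.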


P.S. Weirdly enough, I couldn't convert directly between std::string and GooString.
In line 522, instead of: std::string pdfsubver = pdfSubtypeVersion->toStr();
I had to go with:

std::string pdfsubver(pdfSubtypeVersion->getCString(),  // Which imitates the
                      pdfSubtypeVersion->getLength());  // toStr() declaration.

During compilation, it threw an error that toStr is not a member of class GooString.

And in line 555, instead of: GooString *conf = new GooString(match.str(3));
I went with: GooString *conf = new GooString(match.str(3).c_str());

However, performance-wise the two versions are the same.
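
The pointer-plus-length workaround above can be sketched in isolation as follows. `LegacyString` is a hypothetical stand-in that only mimics the two accessors named in the comment (getCString and getLength); it is not the poppler class. One difference worth noting between the two construction styles: building from a pointer and a length keeps any embedded NUL bytes, whereas building from a bare C string stops at the first one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in exposing only a C-string pointer and a length.
class LegacyString {
public:
    explicit LegacyString(const std::string &s) : data_(s) {}
    const char *getCString() const { return data_.c_str(); }
    int getLength() const { return static_cast<int>(data_.size()); }

private:
    std::string data_;
};

// Copy exactly getLength() bytes, so embedded NUL bytes survive.
std::string toStdString(const LegacyString &s)
{
    return std::string(s.getCString(), static_cast<std::size_t>(s.getLength()));
}
```

For plain GTS version strings, which contain no NUL bytes, both styles yield the same result.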
Comment 11 Evangelos Rigas 2018-08-11 09:25:29 UTC
Created attachment 141040 [details] [review]
Read GTS information from info dict using C++11 regexp
Comment 12 Evangelos Rigas 2018-08-11 09:27:39 UTC
Created attachment 141042 [details] [review]
Display PDF subtype info in pdfinfo utility
Comment 13 Evangelos Rigas 2018-08-11 09:28:21 UTC
Created attachment 141043 [details] [review]
PDFSubtype documentation in glib
Comment 14 Evangelos Rigas 2018-08-11 09:34:36 UTC
Created attachment 141044 [details]
PDFSubtype test documents

PDF documents (PDF/A-1b, PDF/A-2u, PDF/E, PDF/VT, PDF/X-5pg) for testing the functionality.
Comment 15 Albert Astals Cid 2018-08-14 12:15:34 UTC
You mean that 
  std::string pdfsubver = pdfSubtypeVersion->toStr();
doesn't work for you?

Also, there are a few memory leaks in the code: you never free pdfSubtypeVersion or conf.

You should compile poppler with debug mode enabled and then run pdfinfo under valgrind --leak-check=full, and you'll see there are a few leaks.

Tell me if you need help understanding/running valgrind/debug.
Comment 16 Evangelos Rigas 2018-08-16 10:06:39 UTC
(In reply to Albert Astals Cid from comment #15)
> You mean that 
>   std::string pdfsubver = pdfSubtypeVersion->toStr();
> doesn't work for you?
The problem was on my computer; I managed to make it work.

> Also there's a few memory leaks in the code, you never free
> pdfSubtypeVersion nor conf.
Ooops! Sorry about that. I totally forgot that the string returned from getDocInfoStringEntry has to be freed.

I have added an extra function to check if a string entry exists in the document's info dictionary, thus solving the issue with the returned strings.
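
The two ideas in this reply, an existence check that allocates nothing and prompt release of a caller-owned string, can be sketched as below. All names are hypothetical: `newStringEntry` only imitates a getter whose result the caller must free, and a std::map stands in for the info dictionary. Using std::unique_ptr is one way to avoid such a leak and is not necessarily what the updated patch does.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a PDF information dictionary.
using InfoDict = std::map<std::string, std::string>;

// Hypothetical getter: returns a heap-allocated copy that the caller must
// delete, or nullptr when the key is missing.
std::string *newStringEntry(const InfoDict &info, const std::string &key)
{
    auto it = info.find(key);
    return it == info.end() ? nullptr : new std::string(it->second);
}

// Existence check that allocates nothing, so there is nothing to free.
bool hasStringEntry(const InfoDict &info, const std::string &key)
{
    return info.find(key) != info.end();
}

// Take ownership immediately so every return path releases the copy.
std::string entryOrEmpty(const InfoDict &info, const std::string &key)
{
    std::unique_ptr<std::string> owned(newStringEntry(info, key));
    return owned ? *owned : std::string();
}
```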
Comment 17 Evangelos Rigas 2018-08-16 10:09:32 UTC
Created attachment 141134 [details] [review]
Read GTS information from info dict using C++11 regexp
Comment 18 Evangelos Rigas 2018-08-16 10:11:25 UTC
Created attachment 141135 [details] [review]
Display PDF subtype info in pdfinfo utility
Comment 19 Evangelos Rigas 2018-08-16 10:13:38 UTC
Created attachment 141136 [details] [review]
PDFSubtype documentation in glib
Comment 20 Evangelos Rigas 2018-08-21 07:46:21 UTC
Hi,

As 0.68 has been released, I changed the `Since` tag in glib from 0.68 to 0.69.

I saw on the mailing list that there is now a GitLab instance, so I have opened a merge request.

Hope everything is good for merging.
Comment 21 GitLab Migration User 2018-08-21 10:46:28 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/363.
