Bug 103873 - pdfunite introduces syntax error / illegal characters (as per pdfinfo)
Summary: pdfunite introduces syntax error / illegal characters (as per pdfinfo)
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-23 22:29 UTC by Tom Yan
Modified: 2017-12-18 23:38 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Source file (2.78 MB, application/pdf)
2017-11-23 22:29 UTC, Tom Yan
Details
Patch to use %c instead of \x to output binary (621 bytes, application/pdf)
2017-12-05 17:41 UTC, Fredrik Fornwall
Details
Patch marked with the correct mime type (621 bytes, patch)
2017-12-05 17:44 UTC, Fredrik Fornwall
Details | Splinter Review

Description Tom Yan 2017-11-23 22:29:55 UTC
Created attachment 135691 [details]
Source file

I split the attached pdf with `pdfseparate`. If I then merge some of the resulted files again with `pdfunite`, "Syntax Error" / "Illegal character"(s) are found by `pdfinfo` on the resulted files. The problem could be triggered by page 2 and 3, other pages seem fine.

Version: 0.61.1
Steps to reproduce:
$ pdfinfo pst-book-gog.pdf >/dev/null
$ mkdir sep uni
$ cd sep/
$ pdfseparate -f 2 -l 3 ../pst-book-gog.pdf %03d.pdf
$ pdfinfo 002.pdf >/dev/null
$ pdfinfo 003.pdf >/dev/null
$ cd ../uni/
$ pdfunite ../sep/002.pdf 002.pdf
$ pdfunite ../sep/003.pdf 003.pdf
$ pdfinfo 002.pdf >/dev/null
Syntax Error (69): Illegal character '}'
Syntax Error (70): Illegal character '}'
Syntax Error (79): Illegal character '>'
$ pdfinfo 003.pdf >/dev/null
Syntax Error (120): Illegal character '{'
Comment 1 Thomas Freitag 2017-11-27 13:01:55 UTC
I can't reproduce that with this PDF and git master, fetched with

git clone https://anongit.freedesktop.org/git/poppler/poppler.git
Comment 2 Albert Astals Cid 2017-12-02 09:13:18 UTC
Can't reproduce either, did you self compile? Which options did you use? Otherwise which distribution are you in?
Comment 4 Albert Astals Cid 2017-12-05 16:11:23 UTC
not going to install that, FWIW i also tried the packaged 0.61.1 in ArchLinux (besides my own self compiled one) and can't reproduce this either so I guess unless you can help us reproduce the problem you're on your own
Comment 5 Fredrik Fornwall 2017-12-05 17:41:08 UTC
Created attachment 135987 [details]
Patch to use %c instead of \x to output binary
Comment 6 Fredrik Fornwall 2017-12-05 17:41:44 UTC
From https://github.com/termux/termux-packages/issues/1900:

"prior to Android P (i.e. AOSP master), Android's printf didn't accept invalid UTF8 sequences in the format string, and so can't be used to print arbitrary bytes like this. you'd need to use %c%c%c%c -- as your other example does -- or (hilariously) %s (because only the format string is checked). this is from BSD code, so it's not just Android that has/had this limitation, so you may be able to get upstream to switch to the same %c%c%c%c workaround as your other example."

Would the above patch work for you?
Comment 7 Fredrik Fornwall 2017-12-05 17:44:47 UTC
Created attachment 135988 [details] [review]
Patch marked with the correct mime type
Comment 8 Tom Yan 2017-12-05 21:41:12 UTC
(In reply to Fredrik Fornwall from comment #6)
> From https://github.com/termux/termux-packages/issues/1900:
> 
> "prior to Android P (i.e. AOSP master), Android's printf didn't accept
> invalid UTF8 sequences in the format string, and so can't be used to print
> arbitrary bytes like this. you'd need to use %c%c%c%c -- as your other
> example does -- or (hilariously) %s (because only the format string is
> checked). this is from BSD code, so it's not just Android that has/had this
> limitation, so you may be able to get upstream to switch to the same
> %c%c%c%c workaround as your other example."
> 
> Would the above patch work for you?

Yup I am on Android N and can confirm that the \x is the problem:

$ cat test.c
#include <stdio.h>
int main() {
  printf("%%\xE2\xE3\xCF\xD3\n");
  printf("%%%s\n", "\xE2\xE3\xCF\xD3");
  printf("%%%c%c%c%c\n", 0xe2, 0xe3, 0xcf, 0xd3);
}
$ clang test.c
$ ./a.out
%%����
%����
$ ./arch/startarch
[21:37 home ]$ clang test.c
[21:38 home ]$ ./a.out
%����
%����
%����

And pdfinfo no longer complains about any file produced by pdfunite in poppler 0.61.1-2 on Termux (which has the patch applied).
Comment 9 Albert Astals Cid 2017-12-12 23:30:35 UTC
Before commiting this patch, anyone can explain the 

"printf didn't accept invalid UTF8 sequences in the format string"

part a bit more? What's invalid in that code?
Comment 10 Adrian Johnson 2017-12-13 08:35:03 UTC
(In reply to Albert Astals Cid from comment #9)
> Before commiting this patch, anyone can explain the 
> 
> "printf didn't accept invalid UTF8 sequences in the format string"
> 
> part a bit more? What's invalid in that code?

Only the first byte in a UTF-8 sequence can start with '11' (in binary). The second and subsequent bytes must start with '10'.

0xE2, 0xE3 is 11100010, 11100011

I assume the printf implementation is trying to parse the format string one character at a time in the current locale in order to pick off the conversion specifiers. I'm not sure what the standards say about this but it is probably best to avoid putting arbitrary binary sequences in the format string.
Comment 11 Albert Astals Cid 2017-12-18 23:38:16 UTC
Pushed


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.