Bug 103446 - syntax errors reported on PDFs created by xsane
Summary: syntax errors reported on PDFs created by xsane
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 104453
Depends on:
Blocks:
 
Reported: 2017-10-25 09:49 UTC by Larry Myerscough
Modified: 2018-01-03 22:16 UTC (History)


Attachments
two-page music score as PDF (315.90 KB, application/pdf)
2017-10-25 09:49 UTC, Larry Myerscough
fix inline images in pdfimages (4.60 KB, patch)
2017-10-30 09:00 UTC, Adrian Johnson
attachment-29513-0.html (2.30 KB, text/html)
2017-11-14 14:01 UTC, Larry Myerscough
the file that crashes with the proposed patch (5.43 MB, application/pdf)
2017-12-28 23:37 UTC, Albert Astals Cid
Fix crash (721 bytes, patch)
2018-01-02 22:09 UTC, Adrian Johnson

Description Larry Myerscough 2017-10-25 09:49:45 UTC
Created attachment 135032 [details]
two-page music score as PDF

Using poppler-0.60.1 with libpoppler.so.71 on PDFs created by xsane v0.99, I get errors from the pdfimages -list <filename> command such as:

gill@happy ~/MEW_Archive/A/African Symphony/African Symphony Bari-Euph $ pdfimages -list African-Symphony-Baritone-C.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Error (174623): Unexpected end of file in flate stream
   1     0 image    4800  6959  gray    1   1  image  no   [inline]     600   600 4078K 100%
Syntax Error (175257): Unknown compression method in flate stream
Syntax Error (147236): Unexpected end of file in flate stream
   2     1 image    4800  6959  gray    1   1  image  no   [inline]     600   600 4078K 100%
Syntax Error (322879): Unknown compression method in flate stream
gill@happy ~/MEW_Archive/A/African Symphony/African Symphony Bari-Euph $ 

...
We have created a few thousand such files using xsane in the past months and distributed these to about 50 people (mainly Windows users using Adobe Acrobat Reader) who have had no problems viewing and printing these files.

As "one last check", however, I decided to run them through pdfimages, hoping to pick up whether, e.g., any had been scanned in colour by mistake.

The stdout data looks ok (comp=1 means zlib compression?), so I suppose I could just ignore the stderr output... but I'm concerned in case there is a problem in there which will come back to bite me later!

I am attaching the file to which the above error messages relate - rather large alas!

I am willing to dig deeper myself if necessary to track down what's happening, but I would like some advice on how to proceed.

Thanks.
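
The colour check described above could be sketched as a small filter over the listing. This is only a sketch: the function name non_gray_images is made up for illustration, and it assumes the column layout shown in the listing above (type in column 3, color in column 6). As far as the pdfimages man page goes, "comp" is the number of colour components rather than a compression method, so comp=1 with color=gray is what a monochrome scan should show.

```shell
# Hypothetical helper: read a `pdfimages -list` listing on stdin and print
# only the rows for ordinary images whose color column is not "gray".
# Assumes the column layout shown above: $3 = type, $6 = color.
non_gray_images() {
    # NR > 2 skips the two header lines of the listing.
    awk 'NR > 2 && $3 == "image" && $6 != "gray"'
}

# Intended use (the "Syntax Error" messages go to stderr, so they are dropped):
#   pdfimages -list some-file.pdf 2>/dev/null | non_gray_images
```

An empty result would mean that no image row in that file reports a colour space other than gray.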
Comment 1 Adrian Johnson 2017-10-25 10:41:12 UTC
A 4800x6959 inline image! The PDF standard says

 "Because the inline format gives the reader less flexibility in managing the
  image data, it shall be used only for small images (4 KB or less)."

But it should work. It is just less efficient than an image stream.

It is a bug in pdfimages. I've found the cause and can make it work. I'll post a patch when I write a proper fix.
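
For anyone wanting to find out which files in a large batch use inline images, a rough sketch along these lines may help. The function name inline_images is made up for illustration, and the sketch assumes the listing layout shown in the description, where an inline image shows "[inline]" as the 11th whitespace-separated field instead of an object number.

```shell
# Hypothetical helper: read a `pdfimages -list` listing on stdin and print
# "page width height" for every image that is stored inline.
# Assumes the layout shown in the description: $1 = page, $4 = width,
# $5 = height, and $11 = "[inline]" for inline images.
inline_images() {
    # NR > 2 skips the two header lines of the listing.
    awk 'NR > 2 && $11 == "[inline]" { print $1, $4, $5 }'
}

# Intended use:
#   pdfimages -list some-file.pdf 2>/dev/null | inline_images
```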
Comment 2 Larry Myerscough 2017-10-25 12:28:57 UTC
Thanks for the prompt response. From my point of view, it's great to hear that it's a bug, since I have so many would-be done-and-dusted PDFs exhibiting this phenomenon!
I guess I ought also to have a quiet word with the xsane team about their dubious use of (strictly too) big in-line images.

Thanks!

[Perhaps off-topic so don't feel compelled to reply ...]

Is there an easy way (with poppler tooling?) to re-style my PDFs to use a more standard construction without changing the actual image part of the data? (I would prefer our official archive to contain unarguably valid PDFs with no bending of the standard.)
Comment 3 Adrian Johnson 2017-10-25 19:58:45 UTC
(In reply to Larry Myerscough from comment #2)
> Is there an easy way (with poppler tooling?) to re-style my PDFs to use a
> more standard construction without changing the actual image part of the
> data. (I would prefer our official archive to contain unarguably valid PDFs
> with no bending of the standard.)

I tried running it through ghostscript:

$ pdf2ps African-Symphony-Baritone-C.pdf out.ps
$ ps2pdf out.ps out.pdf
$ pdfimages -list out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4800  6959  gray    1   1  ccitt  no         8  0   600   600  105K 2.6%
   2     1 image    4800  6959  gray    1   1  ccitt  no        14  0   600   600 86.4K 2.1%

Not only has it converted it to a standard image, it has also encoded the images with CCITT which gives better compression for 1 bpc images compared with Flate.
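
For a large batch, the two commands above could be wrapped in a function roughly like the following. It is a sketch only: the name roundtrip_pdf is made up for illustration, and it assumes pdf2ps and ps2pdf (the Ghostscript wrapper scripts used above) are on the PATH. A round trip through PostScript may drop PDF-only features, so it is worth checking the output before replacing any originals.

```shell
# Hypothetical helper: round-trip one PDF through PostScript using the same
# pdf2ps / ps2pdf commands as above.
# Usage: roundtrip_pdf INPUT.pdf OUTPUT.pdf
roundtrip_pdf() {
    # Temporary file for the intermediate PostScript.
    rt_tmp=$(mktemp) || return 1
    if pdf2ps "$1" "$rt_tmp" && ps2pdf "$rt_tmp" "$2"; then
        rm -f "$rt_tmp"
    else
        # Clean up and report failure if either conversion step failed.
        rm -f "$rt_tmp"
        return 1
    fi
}

# Intended use, writing converted copies into an existing directory "converted":
#   for f in *.pdf; do roundtrip_pdf "$f" "converted/$f" || echo "failed: $f" >&2; done
```

Writing to a separate directory keeps the original scans untouched until the converted files have been checked with pdfimages -list.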
Comment 4 Larry Myerscough 2017-10-26 10:01:56 UTC
Thanks! I'd never thought of converting via '.ps' - even though I'd used ps2pdf a lot in the past. If the space saving is typical for the whole bunch, this will also require much less space in the cloud for the official archive.
Comment 5 Adrian Johnson 2017-10-30 09:00:59 UTC
Created attachment 135162 [details] [review]
fix inline images in pdfimages
Comment 6 Larry Myerscough 2017-11-14 14:01:08 UTC
Created attachment 135449 [details]
attachment-29513-0.html

Hi Adrian & Co.

Please confirm whether any action is required of me. I wasn't able to apply
the patch, probably because I had the wrong base version (my git knowledge
is sketchy!). I don't urgently need a fixed version so, unless advised
otherwise, I'll wait for it to make it into the main release.

Thanks,
Larry


Comment 7 Albert Astals Cid 2017-12-28 23:35:59 UTC
Adrian, sorry it took me so long to come back to this, but this patch makes pdfimages crash with 104418018297-AttenInSuspensionsIrregularlyShapedSedimentParticles.pdf

tsdgeos@xps:~/okularfiles/pdf/scripts:$ valgrind ~/devel/poppler/build-new/utils/pdfimages -png ../104418018297-AttenInSuspensionsIrregularlyShapedSedimentParticles.pdf old-pdfimages/bla
==18590== Memcheck, a memory error detector
==18590== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==18590== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==18590== Command: /home/tsdgeos/devel/poppler/build-new/utils/pdfimages -png ../104418018297-AttenInSuspensionsIrregularlyShapedSedimentParticles.pdf old-pdfimages/bla
==18590== 
==18590== Invalid write of size 8
==18590==    at 0x4C320C3: memcpy@GLIBC_2.2.5 (vg_replace_strmem.c:1017)
==18590==    by 0x4FB14A9: memcpy (string3.h:53)
==18590==    by 0x4FB14A9: EmbedStream::getChars(int, unsigned char*) (Stream.cc:1140)
==18590==    by 0x4FB2391: doGetChars (Stream.h:120)
==18590==    by 0x4FB2391: ImageStream::getLine() (Stream.cc:512)
==18590==    by 0x10F449: ImageOutputDev::writeImageFile(ImgWriter*, ImageOutputDev::ImageFormat, char const*, Stream*, int, int, GfxImageColorMap*) (ImageOutputDev.cc:476)
==18590==    by 0x10FB5C: ImageOutputDev::writeImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, bool) (ImageOutputDev.cc:671)
==18590==    by 0x4F610EB: Gfx::doImage(Object*, Stream*, bool) (Gfx.cc:4592)
==18590==    by 0x4F618F9: Gfx::opBeginImage(Object*, int) (Gfx.cc:4895)
==18590==    by 0x4F59F30: Gfx::go(bool) (Gfx.cc:738)
==18590==    by 0x4F5A47E: Gfx::display(Object*, bool) (Gfx.cc:700)
==18590==    by 0x4FA613A: Page::displaySlice(OutputDev*, double, double, int, bool, bool, int, int, int, int, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*, bool) (Page.cc:560)
==18590==    by 0x4FA63C7: Page::display(OutputDev*, double, double, int, bool, bool, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*, bool) (Page.cc:483)
==18590==    by 0x4FAAB68: PDFDoc::displayPages(OutputDev*, int, int, double, double, int, bool, bool, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*) (PDFDoc.cc:516)
==18590==  Address 0xe2281a8 is 984 bytes inside a block of size 989 alloc'd
==18590==    at 0x4C2DB2F: malloc (vg_replace_malloc.c:299)
==18590==    by 0x4F004E3: gmalloc (gmem.cc:110)
==18590==    by 0x4F004E3: gmallocn (gmem.cc:192)
==18590==    by 0x4F004E3: gmallocn_checkoverflow (gmem.cc:200)
==18590==    by 0x4FB20BC: ImageStream::ImageStream(Stream*, int, int, int) (Stream.cc:454)
==18590==    by 0x10F21A: ImageOutputDev::writeImageFile(ImgWriter*, ImageOutputDev::ImageFormat, char const*, Stream*, int, int, GfxImageColorMap*) (ImageOutputDev.cc:384)
==18590==    by 0x10FB5C: ImageOutputDev::writeImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, bool) (ImageOutputDev.cc:671)
==18590==    by 0x4F610EB: Gfx::doImage(Object*, Stream*, bool) (Gfx.cc:4592)
==18590==    by 0x4F618F9: Gfx::opBeginImage(Object*, int) (Gfx.cc:4895)
==18590==    by 0x4F59F30: Gfx::go(bool) (Gfx.cc:738)
==18590==    by 0x4F5A47E: Gfx::display(Object*, bool) (Gfx.cc:700)
==18590==    by 0x4FA613A: Page::displaySlice(OutputDev*, double, double, int, bool, bool, int, int, int, int, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*, bool) (Page.cc:560)
==18590==    by 0x4FA63C7: Page::display(OutputDev*, double, double, int, bool, bool, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*, bool) (Page.cc:483)
==18590==    by 0x4FAAB68: PDFDoc::displayPages(OutputDev*, int, int, double, double, int, bool, bool, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*) (PDFDoc.cc:516)
==18590==
Comment 8 Albert Astals Cid 2017-12-28 23:37:02 UTC
Created attachment 136438 [details]
the file that crashes with the proposed patch
Comment 9 Adrian Johnson 2018-01-02 22:09:27 UTC
Created attachment 136509 [details] [review]
Fix crash
Comment 10 Adrian Johnson 2018-01-02 22:12:56 UTC
*** Bug 104453 has been marked as a duplicate of this bug. ***
Comment 11 Albert Astals Cid 2018-01-03 22:16:04 UTC
Pushed.

