Bug 44085

Summary: pdfinfo/pdffonts cannot process >2GB files
Product: poppler
Component: general
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Version: unspecified
Hardware: Other
OS: All
Reporter: Hin-Tak Leung <htl10>
Assignee: poppler-bugs <poppler-bugs>
Attachments: use getNum instead of getReal
Add Int64 object
Large file support
windows large file support
Large file support
6GB PDF test case
Add Int64 object
Large file support
Add Int64 object
Add Int64 object
fix uninitialised memory
fix uninitialized memory
Add Int64 object
Large file support
windows large file support
Add Int64 object
fix build on ancient systems
using sys/types.h instead of stdio.h for off_t
a few more missing Goffsets

Description Hin-Tak Leung 2011-12-22 17:25:27 UTC
$ utils/pdffonts file.pdf
Error: Document base stream is not seekable
$ utils/pdfinfo file.pdf
Error: Document base stream is not seekable

Problem with evince as well:

$ evince file.pdf
Error: Document base stream is not seekable
Error: Document base stream is not seekable

(evince:26805): EvinceView-CRITICAL **: ev_document_model_set_document: assertion `EV_IS_DOCUMENT (document)' failed

(evince:26805): EvinceDocument-CRITICAL **: ev_document_get_n_pages: assertion `EV_IS_DOCUMENT (document)' failed

(evince:26805): EvinceDocument-CRITICAL **: ev_document_get_max_page_size: assertion `EV_IS_DOCUMENT (document)' failed

but xpdf works on the file just fine.

The file itself is 3.5GB.

There is a corresponding bug for mupdf at http://bugs.ghostscript.com/show_bug.cgi?id=692757
Comment 1 Albert Astals Cid 2011-12-23 00:11:56 UTC
What's your poppler version?
Comment 2 Hin-Tak Leung 2011-12-23 02:24:20 UTC
(In reply to comment #1)
> What's your poppler version?

Git head as of yesterday (poppler-0.18.0-40-gebfab83). xpdf is Fedora 16's package (0.18.0-2.fc16.x86_64).

I'd say the problem is fairly obvious - grep 'Document base stream is not seekable' gives:

poppler/PDFDoc.cc :
---------------------
GBool PDFDoc::setup(GooString *ownerPassword, GooString *userPassword) {
  str->setPos(0, -1);
  if (str->getPos() < 0)
  {
    error(-1, "Document base stream is not seekable");
    return gFalse;
---------------------

and str->getPos() returns an int. While setPos() may succeed, whatever it sets as the current position is too large to fit in an int.

The curious thing is why xpdf works, and pdfinfo/pdffonts don't.
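For illustration, here is a standalone sketch (not poppler code) of the truncation failure described above, assuming a typical platform where int is 32 bits:

    #include <cstdio>

    int main() {
      long long filePos = 3500000000LL; // a position inside a 3.5GB file
      int pos = (int)filePos;           // what a getPos() returning int hands back
      // On common two's-complement systems this prints a negative number,
      // which is exactly what the "getPos() < 0" check then rejects.
      printf("%lld truncated to int: %d\n", filePos, pos);
      return 0;
    }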
Comment 3 Albert Astals Cid 2011-12-23 10:28:21 UTC
Easy, because xpdf does not use poppler.

Now, I'll need you to tell me how to get such a big file if you want me to work on a fix.
Comment 4 Hin-Tak Leung 2011-12-24 08:02:56 UTC
(In reply to comment #3)
> Easy, because xpdf does not use poppler.

Argh, I thought somebody might have made the effort to retrofit it... you are right. Maybe I should give xpdf's pdfinfo/pdffonts a go and see whether they do the job.

> Now, i'll need you to tell me how to get such a big file if you want me to work
> on a fix.

It is probably not practical/legal for me to upload it. According to the XML metadata it was made by 'FreePic2Pdf - 1.26'. It is a scan of a book; the file is big because it consists of a lot of scanned images.

I suppose you can make one by joining many small ones with pdftk or ghostscript (assuming they don't have 32-bit limits on file writing; AFAIK ghostscript can read/write files over 2GB, but please feel free to correct me and file a bug), or just with pdflatex. A 24-bit color scan at 600 dpi with very poor compression should be about 120MB, so repeating that 20 times would do...

I could try making one (or supply a recipe of making such) if you haven't got anything handy...
Comment 5 Albert Astals Cid 2011-12-24 08:37:54 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Easy, because xpdf does not use poppler.
> 
> Argh, I thought somebody might have made the effort to retrofit it... you are
> right. Maybe I should give xpdf's pdfinfo/pdffonts a go and see whether they do
> the job.

If the xpdf author were open to collaboration, poppler would not exist.
 
> I could try making one (or supply a recipe of making such) if you haven't got
> anything handy...

It would be cool if you could provide the recipe :-)
Comment 6 Hin-Tak Leung 2011-12-24 14:56:52 UTC
(In reply to comment #5)
> If the xpdf author was open to collaboration poppler would not exist.

There is a glass-half-empty/half-full difference in viewpoint... maybe one should be grateful to the xpdf author for providing the *starting point* of poppler? It would have been much more difficult to start a PDF-parsing project from scratch. Look at the state (or lack thereof) of GNUPDF on the GNU web site...

> It would be cool if you could provide the recipe :-)

Here is a recipe. I don't know exactly what the minimum is that you need, but pdflatex and movie15 (both included in TeX Live 2011) are necessary (and hyperref is a requirement of movie15). I have 7 AVI files of approx. 370MB each. I don't think they need to be AVI files, since movie15 complains that it can't detect the MIME type of those files (hence the mimetype= parameter to \includemovie). Presumably you can just stuff a Linux distro DVD image in there and lie about the MIME type.

I tried repeating the same file 7 times, but apparently movie15 is clever enough not to embed the same thing 7 times, so you *do* need over 2GB of distinct junk to make this work, rather than repeating the same file.

Process this with pdflatex:
====================
\documentclass[12pt]{article}
\usepackage{movie15}
\usepackage{hyperref}
\begin{document}
movie15 can detect duplicate - use different files!
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{1.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{2.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{3.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{4.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{5.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{6.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{7.avi}
\end{figure}
\end{document}
============================

The result is a 2.5GB PDF with 3 pages. xpdf 3.03's pdfinfo/pdffonts did work correctly (I downloaded the source and built it), so poppler's pdfinfo/pdffonts behaviour can probably be seen as a regression... Here is what they say.


===============================================
#distro, poppler
$ pdfinfo a.pdf
Error: Document base stream is not seekable
$ pdffonts a.pdf
Error: Document base stream is not seekable

$ ~/bin/xpdf-3.03/pdffonts a.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
EVLLHA+CMR12                         Type 1            yes yes no      78  0
$ ~/bin/xpdf-3.03/pdfinfo a.pdf
Title:          
Subject:        
Keywords:       
Author:         
Creator:        LaTeX with hyperref package
Producer:       pdfTeX-1.40.12
CreationDate:   Sat Dec 24 22:22:11 2011
ModDate:        Sat Dec 24 22:22:11 2011
Tagged:         no
Form:           none
Pages:          3
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      2543173309 bytes
Optimized:      no
PDF version:    1.5
======================================================
Comment 7 Albert Astals Cid 2011-12-25 12:52:09 UTC
(In reply to comment #6)
> The result is a 2.5GB pdf, with 3 pages. xpdf 3.03's pdfinfo/pdffonts did work
> correctly (I downloaded the source and built), so poppler's pdfinfo/pdffonts is
> probably seem as a regression... Here is what it says.

Stream::getPos returns an int in xpdf too; it's just that they don't use it much, so they get lucky in this case.
Comment 8 Adrian Johnson 2012-12-17 02:56:32 UTC
I started working on some patches to fix this. I used the method in comment 6 to create a test file, using dd if=/dev/urandom bs=1M count=400 of=x.avi to create the AVI files. This gave me a 2.8GB PDF file.

I also tried using a single 2.5GB AVI file, but there seems to be a bug in pdflatex or movie15: it uses a signed 32-bit integer for the stream length, so it wrote a PDF with a negative stream length for the AVI file.

With the patches, pdfinfo, pdffonts, pdftoppm, and pdftocairo (image output only) are working with the 2.8GB test file. I need to do more testing of the patches before they are ready. Some of the things I have not yet looked at are:
 - pdftops, pdfimages, pdfunite, and pdfseparate
 - testing on a 32-bit machine
 - try to create more interesting >2GB files for testing
 - regression testing of <2GB PDFs

I'm attaching the current version of my patches. It would be great to get some feedback on the patches before I spend a lot of time on testing.
Comment 9 Adrian Johnson 2012-12-17 02:57:29 UTC
Created attachment 71620 [details] [review]
use getNum instead of getReal

This needs to be fixed before adding an Int64 object.
Comment 10 Adrian Johnson 2012-12-17 03:01:09 UTC
Created attachment 71621 [details] [review]
Add Int64 object

We need a 64-bit integer object type to handle 64-bit offsets and lengths in pdf files.

I'm not sure if we can use long long unconditionally. If not, I can wrap it in an #ifdef.
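As a rough sketch of that fallback (the type and macro names here are illustrative only, not poppler's actual ones):

    #if defined(_MSC_VER)
    typedef __int64 PopplerInt64;    // MSVC's spelling of a 64-bit integer
    #else
    typedef long long PopplerInt64;  // long long where C99/C++11 is available
    #endif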
Comment 11 Adrian Johnson 2012-12-17 03:04:02 UTC
Created attachment 71622 [details] [review]
Large file support

This defines a Goffset type that matches the type of the offset parameter in the fseek/ftell variant we are using. I've used this type for every variable that holds a file offset or a file size.
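A minimal sketch of the idea, assuming an autoconf-style HAVE_FSEEKO check (the macro name is illustrative, not necessarily the one the patch uses):

    #include <cstdio>       // fseek/ftell
    #include <sys/types.h>  // off_t, used by fseeko/ftello

    #if defined(HAVE_FSEEKO)
    typedef off_t Goffset;  // fseeko/ftello take an off_t offset
    #else
    typedef long Goffset;   // plain fseek/ftell take a long offset
    #endif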
Comment 12 Adrian Johnson 2012-12-17 03:07:21 UTC
Created attachment 71623 [details]
windows large file support

Windows has these functions for large file support:

  int _fseeki64(FILE *stream, __int64 offset, int origin);
  __int64 _ftelli64(FILE *stream);

This patch adds Windows support for large files. I have not tested this patch.

It may be better to create poppler_fseek/poppler_ftell functions instead of repeating the #ifdefs for every use of fseek/ftell.
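A sketch of what such wrappers might look like (HAVE_FSEEKO is an assumed configure macro; _fseeki64/_ftelli64 are the Windows functions quoted above):

    #include <cstdio>

    // Sketch only: centralize the platform #ifdefs in one place.
    static int poppler_fseek(FILE *file, Goffset offset, int whence) {
    #if defined(_WIN32)
      return _fseeki64(file, offset, whence);
    #elif defined(HAVE_FSEEKO)
      return fseeko(file, offset, whence);
    #else
      return fseek(file, offset, whence);
    #endif
    }

    static Goffset poppler_ftell(FILE *file) {
    #if defined(_WIN32)
      return _ftelli64(file);
    #elif defined(HAVE_FSEEKO)
      return ftello(file);
    #else
      return ftell(file);
    #endif
    }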
Comment 13 Albert Astals Cid 2012-12-17 22:12:12 UTC
The patches look OK-ish as a start (i.e. I haven't read them thoroughly, but they kind of make sense).

But now that I think about it, the PDF spec has an "Implementation Limits" section, and it mentions all the time that 32-bit is the expected platform and thus stuff is not expected to be bigger than that.

Does Adobe actually work with such big files? Should we strive to support them if Adobe does not?
Comment 14 Adrian Johnson 2012-12-19 10:51:30 UTC
(In reply to comment #13)
> But now that i think the pdf spec has a "Implementation Limits" section and
> it mentions all the time that 32bit is the expected platorm and thus stuff
> is not expected to be bigger than that.

That was not the way I interpreted the Implementation Limits section. The limits are the minimums for a 32-bit platform. 64-bit machines are now becoming common and the limits would be expected to be higher there. I also did not see anything suggesting that the file size should be limited to the Integer size, particularly as it mentions that the maximum file size is 10GB (for standard XRef tables) or unlimited (for XRef streams).

> Does Adobe actually work with such big files? Should we strive to support
> them if Adobe does not?

I tested with a 3.5GB and a 5GB PDF. Adobe Reader 9.5.1 on Linux will open the 3.5GB file but fails to open the 5GB one. Adobe Reader 11.0.0 on Windows will open both.

However, since files >2GB are rare and a lot of changes are required to make poppler handle large files, it may be better to hold off on supporting them until we see more demand for this feature.

We should add a check for >2GB files and print a more user-friendly error message than "stream not seekable".
Comment 15 Albert Astals Cid 2012-12-19 19:05:51 UTC
Nah, if Adobe supports >2GB files fine, we should go for it, and you already seem to have something, so let's do it :-)

Patch 1 seems a bit weird though, why do we need it?
Comment 16 Adrian Johnson 2012-12-19 20:43:01 UTC
(In reply to comment #15)
> Patch 1 seems a bit weird though, why do we need it?

When I added the Int64 object I was checking that all the code used getNum when it needed a numeric value, and noticed this one location that was using getReal. Obviously getReal won't work for numbers parsed as an integer.
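A standalone toy model (not poppler's actual Object class) of that distinction:

    #include <cstdio>

    // Toy model of a parsed PDF number: it holds either an integer or a real.
    struct Number {
      bool isInt;
      long long i;
      double r;
      double getReal() const { return r; }                     // valid only for reals
      double getNum() const { return isInt ? (double)i : r; }  // valid for both kinds
    };

    int main() {
      Number n = { true, 612, 0.0 };         // parsed from "612", an integer
      printf("getNum:  %f\n", n.getNum());   // 612.000000
      printf("getReal: %f\n", n.getReal());  // 0.000000 -- the kind of bug being fixed
      return 0;
    }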
Comment 17 Adrian Johnson 2012-12-23 11:33:35 UTC
Created attachment 72021 [details] [review]
Large file support

Updated patch to fix some bugs.
Comment 18 Adrian Johnson 2012-12-23 11:45:00 UTC
Created attachment 72023 [details]
6GB PDF test case

I've used a modified version of cairo to create a more interesting test case. It consists of 2000 pages. Each page contains an uncompressed 3MB image (1000x1000 pixels, RGB). The images are all zeros (i.e. a black square). The end result is a 6GB file that can be compressed down to 1MB. I used xz to compress it since it supports sparse files. Uncompressing with unxz will produce a 6GB file that takes up 18MB of disk space.
Comment 19 Adrian Johnson 2013-01-03 07:14:45 UTC
Created attachment 72428 [details] [review]
Add Int64 object
Comment 20 Adrian Johnson 2013-01-03 07:15:11 UTC
Created attachment 72429 [details] [review]
Large file support
Comment 21 Adrian Johnson 2013-01-03 07:16:31 UTC
Created attachment 72430 [details] [review]
Add Int64 object
Comment 22 Adrian Johnson 2013-01-03 07:26:17 UTC
The current status of the patches after regression testing with splash:

1323 tests passed (99.47%)
7 tests failed (0.53%)

Of the 7 failures, 5 are false negatives, i.e. there is no visual difference. The remaining 2 failures are:

bug-poppler26533.pdf - some lines have shifted a pixel when rendered at 72ppi
tauya.f8.pdf - some glyphs have shifted slightly.

I have not yet been able to determine why these two tests are failing.
Comment 23 Adrian Johnson 2013-01-03 13:26:13 UTC
Created attachment 72448 [details] [review]
Add Int64 object

small update
Comment 24 Adrian Johnson 2013-01-03 13:29:32 UTC
Created attachment 72449 [details] [review]
fix uninitialised memory

I've found the rendering differences for bug-poppler26533.pdf to be caused by uninitialised memory in splash. This patch fixes the problem.

I don't have any idea why tauya.f8.pdf is showing differences; however, I suspect it is an existing problem that just fails in a different way with the large file patches. The tauya.f8 PDF is very broken: it has a broken font (corrupted Flate stream). Poppler does not seem to have error detection in the Flate decompression and renders mostly garbage. Ghostscript detects the broken stream and substitutes a different font, resulting in what appears to be correct output. If I remove the font descriptor from the PDF, forcing poppler to substitute the font, the output is the same as ghostscript's, with no regression when using the large file patches. Acroread complains that the font is broken and displays garbage.

So at this point I think the patches are ready for review.
Comment 25 Thomas Freitag 2013-01-05 12:22:37 UTC
Created attachment 72539 [details] [review]
fix uninitialized memory

(In reply to comment #24)
> Created attachment 72449 [details] [review] [review]
> fix uninitialised memory
> 
> I've found the rendering differences for bug-poppler26533.pdf to be caused
> by uninitialised memory in splash. This patch fixes the problem.

Thanks for figuring out why almost all tests failed if a page uses tiling patterns. But I think your patch is slightly incomplete:
1. You remove the initialization of the bitmap if paintType = 2 is used. This paintType is seldom used; one example is page 745 of the spec PDF32000_2008.pdf.
2. I think the better way to initialize the memory is to use the paperColor (which in pdftoppm is the same one you're using: white) together with the existing Splash::clear function.
This patch changes it accordingly and therefore should replace your patch.
Comment 26 Adrian Johnson 2013-01-08 12:38:47 UTC
(In reply to comment #25)
> 2. I think the better way to initialize the memory is to use the paperColor
> (which in pdftoppm is the same You're using: white) and use the existing
> Splash::clear function.
> This patch changes it and therefore should replace Your patch for it.

I'm fairly certain that the initial color should make no difference when the alpha is all zero. I tried changing the color in splash->clear in SplashOutputDev::startPage and as long as the alpha is 0 there is no change in the output. So why does changing the initial color in tilingPatternFill affect the output?
Comment 27 Thomas Freitag 2013-01-08 12:56:04 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > 2. I think the better way to initialize the memory is to use the paperColor
> > (which in pdftoppm is the same You're using: white) and use the existing
> > Splash::clear function.
> > This patch changes it and therefore should replace Your patch for it.
> 
> I'm fairly certain that the initial color should make no difference when the
> alpha is all zero. I tried changing the color in splash->clear in
> SplashOutputDev::startPage and as long as the alpha is 0 there is no change
> in the output. So why does changing the initial color in tilingPatternFill
> affect the output?

I think there is a misunderstanding: my main point was to use the Splash::clear function, which is also used everywhere else to initialize a newly created bitmap. But if that had been the only point, I wouldn't have uploaded a new patch. What was definitely wrong in your patch was that you simply replaced paintType 2 by paintType 1, which means we would get problems with paintType 2 tiling patterns.
Comment 28 Adrian Johnson 2013-01-08 13:02:45 UTC
I don't have any issue with your patch. I was making the observation that changing the initial color, whether explicitly or as a result of uninitialized memory, was altering the output. I don't think this should happen when the alpha is all zero.
Comment 29 Thomas Freitag 2013-01-08 13:21:34 UTC
Oh, sorry. This is one of the profound mysteries of the pipeRun function: I never completely understood how the combination of source alpha, destination alpha, source color, destination color, blend mode, and so on works in just one function for all cases. But the result seems to be almost always okay, so I avoid touching it whenever possible.
Comment 30 Albert Astals Cid 2013-01-22 19:05:02 UTC
Adrian, could you rebase the patches against current master? I tried applying them after Thomas' threading patches went in, and there are a lot of patch "failures" that I'm not sure I'm fixing correctly, so it'd be better if you can provide updated patches.
Comment 31 Adrian Johnson 2013-01-23 10:46:32 UTC
Created attachment 73505 [details] [review]
Add Int64 object
Comment 32 Adrian Johnson 2013-01-23 10:47:01 UTC
Created attachment 73506 [details] [review]
Large file support
Comment 33 Adrian Johnson 2013-01-23 10:50:32 UTC
Created attachment 73507 [details] [review]
windows large file support
Comment 34 Albert Astals Cid 2013-01-23 22:18:21 UTC
It seems that this breaks the lexer somehow. The ./check_lexer test in qt4 does not compile and needs this patch:

diff --git a/qt4/tests/check_lexer.cpp b/qt4/tests/check_lexer.cpp
index 1ae849f..1835d93 100644
--- a/qt4/tests/check_lexer.cpp
+++ b/qt4/tests/check_lexer.cpp
@@ -46,8 +46,8 @@ void TestLexer::testNumbers()
     obj.free();
     
     lexer->getObj(&obj);
-    QCOMPARE(obj.getType(), objUint);
-    QCOMPARE(obj.getUint(), 2147483648u);
+    QCOMPARE(obj.getType(), objInt64);
+    QCOMPARE(obj.getInt64(), 2147483648ll);
     obj.free();
       
     lexer->getObj(&obj);

The test fails. It seems that 
-2147483647-1
is not detected as an objInt anymore but as an objInt64.

Could you have a look?
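For reference, a standalone sketch of the boundary rule being discussed (assuming the intent is to promote to a 64-bit object only when the value does not fit in a 32-bit int):

    #include <climits>
    #include <cstdio>

    // Report the object type a lexer following that rule would produce.
    static const char *objTypeFor(long long v) {
      return (v < INT_MIN || v > INT_MAX) ? "objInt64" : "objInt";
    }

    int main() {
      printf("2147483648  -> %s\n", objTypeFor(2147483648LL));   // objInt64
      printf("-2147483648 -> %s\n", objTypeFor(-2147483648LL));  // objInt: INT_MIN still fits
      return 0;
    }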
Comment 35 Adrian Johnson 2013-01-24 10:46:37 UTC
Created attachment 73557 [details] [review]
Add Int64 object

Updated to fix check_lexer failures. I've also added a couple of tests to check_lexer to test the Int64 object.
Comment 36 Albert Astals Cid 2013-01-24 12:46:18 UTC
Works well :-)

There's only one thing I'm not so sure about: isNum and getNum covering the Int64 values. That is, you can't really represent all int64 values in a double, so by using getNum we might lose data.

Proof:
    #include <cstdio>
    int main() {
        long long myLong = 9223372036854774700ll;
        double myDouble = myLong;  // conversion to double loses precision
        printf("%lld\n", myLong);
        printf("%f\n", myDouble);
    }

Result:
9223372036854774700
9223372036854774784.000000

On the other hand, this only seems to happen for ridiculously high values; i.e., a double is perfectly fine for handling something like an offset of 160GB, so maybe just add a comment in getNum like "// Warning: this may break for ridiculously big values of long long".

What do you think?
Comment 37 Adrian Johnson 2013-01-24 13:25:34 UTC
I could add a comment but it seemed obvious to me that since getNum returns a double it is not intended for getting exact integer values. The patch does not make the accuracy of getNum any worse than it was before. Anything needing exact values should be using getInt/getInt64. 

As you say, this only affects huge values. A double has 53 bits of significand, so only values with a magnitude greater than 2^53 - 1 can fail to be represented exactly.
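A quick standalone check of that boundary:

    #include <cstdio>

    int main() {
      long long exact = 1LL << 53;  // 2^53, the edge of exact integer doubles
      printf("%lld -> %.1f\n", exact, (double)exact);            // represented exactly
      printf("%lld -> %.1f\n", exact + 1, (double)(exact + 1));  // rounds back to 2^53
      return 0;
    }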
Comment 38 Albert Astals Cid 2013-01-24 18:01:58 UTC
I don't think it hurts to add the comment mentioning 2^53, as you said. Better safe than sorry.

Besides that, feel free to commit from my side. Given this has been around for a while, I guess we can assume no one else wants to review it :D
Comment 39 Adrian Johnson 2013-01-25 11:11:53 UTC
I've added a comment to getNum() and pushed all the patches.
Comment 40 Lu Wang 2013-01-26 13:24:11 UTC
Build is broken on my machine, CentOS 5.9

fseeko is available, but in /usr/include/stdio.h, I see that the 2nd parameter is of type __off_t, although `man fseeko` shows that it should be off_t 

off_t is defined in <sys/types.h> and <unistd.h>, so a few header files need to include one of these two files to get off_t.
Comment 41 Adrian Johnson 2013-01-26 14:14:26 UTC
Created attachment 73688 [details] [review]
fix build on ancient systems

Try this patch.
Comment 42 Lu Wang 2013-01-26 18:32:57 UTC
It works. Thanks a lot!
Comment 43 Even Rouault 2013-01-26 19:11:12 UTC
I confirm that the patch in attachment 73688 [details] [review] is needed too on Ubuntu 10.04 with GCC 4.4.3
Comment 44 Adrian Johnson 2013-01-27 01:11:10 UTC
Created attachment 73705 [details] [review]
using sys/types.h instead of stdio.h for off_t

I checked the POSIX specs at opengroup.org.

In the Single UNIX Specification version 2 (UNIX98), off_t is defined in sys/types.h.

In the Single UNIX Specification version 4 (POSIX:2008), stdio.h is also required to expose off_t.

Glibc bug http://sourceware.org/bugzilla/show_bug.cgi?id=11125 added off_t to stdio.h. This fix was included in glibc 2.12 released in 2010.

As there are likely to be many systems not fully compliant with POSIX:2008, using sys/types.h to import the off_t definition would be the most portable solution. The attached patch switches from stdio.h to sys/types.h.
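The pattern boils down to something like this (a sketch; the actual header changed is in the attached patch):

    #include <sys/types.h>  // off_t: provided here since SUSv2
    #include <cstdio>       // FILE, fseeko, ftello; only POSIX:2008 requires off_t here

    static off_t fileSize(FILE *f) {  // trivial usage example
      fseeko(f, 0, SEEK_END);
      return ftello(f);
    }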
Comment 45 Adrian Johnson 2013-01-27 05:47:49 UTC
Created attachment 73711 [details] [review]
a few more missing Goffsets
Comment 46 Albert Astals Cid 2013-01-27 09:34:46 UTC
Makes sense to me :-)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.