$ utils/pdffonts file.pdf
Error: Document base stream is not seekable
$ utils/pdfinfo file.pdf
Error: Document base stream is not seekable

Problem with evince as well:

$ evince file.pdf
Error: Document base stream is not seekable
Error: Document base stream is not seekable

(evince:26805): EvinceView-CRITICAL **: ev_document_model_set_document: assertion `EV_IS_DOCUMENT (document)' failed

(evince:26805): EvinceDocument-CRITICAL **: ev_document_get_n_pages: assertion `EV_IS_DOCUMENT (document)' failed

(evince:26805): EvinceDocument-CRITICAL **: ev_document_get_max_page_size: assertion `EV_IS_DOCUMENT (document)' failed

but xpdf works on the file just fine. The file itself is 3.5GB.

There is a corresponding bug for mupdf at http://bugs.ghostscript.com/show_bug.cgi?id=692757
What's your poppler version?
(In reply to comment #1)
> What's your poppler version?

git head yesterday (poppler-0.18.0-40-gebfab83). xpdf is from fedora 16 (0.18.0-2.fc16.x86_64).

I'd say the problem is fairly obvious - grep 'Document base stream is not seekable' gives:

poppler/PDFDoc.cc:
---------------------
GBool PDFDoc::setup(GooString *ownerPassword, GooString *userPassword) {
  str->setPos(0, -1);
  if (str->getPos() < 0) {
    error(-1, "Document base stream is not seekable");
    return gFalse;
---------------------

and str->getPos() returns an int. While setPos() may succeed, whatever it sets for the current position is too large for an int. The curious thing is why xpdf works and pdfinfo/pdffonts don't.
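To illustrate the overflow being described here, a minimal standalone sketch (hypothetical values; the real poppler stream classes are more involved): a byte offset into a 3.5GB file cannot be stored in a signed 32-bit int, so a position getter that returns int reports a negative value even though the seek itself succeeded.

    #include <cstdio>

    int main() {
        long long realPos = 3758096384LL; // ~3.5GB offset, fine as a 64-bit value
        int reportedPos = (int)realPos;   // truncated to 32 bits, wraps negative
        if (reportedPos < 0)
            printf("Document base stream is not seekable\n"); // the error seen above
        return 0;
    }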
Easy, because xpdf does not use poppler.

Now, I'll need you to tell me how to get such a big file if you want me to work on a fix.
(In reply to comment #3)
> Easy, because xpdf does not use poppler.

Argh, I thought somebody might have made the effort to retrofit it... you are right. Maybe I should give xpdf's pdfinfo/pdffonts a go and see whether they do the job.

> Now, i'll need you to tell me how to get such a big file if you want me to work
> on a fix.

It is probably not practical/legal for me to upload it. According to the xml header it was made by 'FreePic2Pdf - 1.26'. It is a scan of a book - the file is big because it consists of a lot of scanned images.

I suppose you can probably make one by joining many small ones with pdftk or ghostscript (assuming that they don't have 32-bit limits on file writing - AFAIK ghostscript can read/write files over 2GB, but please feel free to correct me and file a bug) or just pdflatex. A 24-bit color scan at 600dpi with very poor compression should be about 120MB, so just repeating that 20 times would do... I could try making one (or supply a recipe for making one) if you haven't got anything handy...
(In reply to comment #4)
> (In reply to comment #3)
> > Easy, because xpdf does not use poppler.
>
> Argh, I thought somebody might have made the effort to retrofit it... you are
> right. Maybe I should give xpdf's pdfinfo/pdffonts a go and see whether they do
> the job.

If the xpdf author was open to collaboration poppler would not exist.

> I could try making one (or supply a recipe of making such) if you haven't got
> anything handy...

It would be cool if you could provide the recipe :-)
(In reply to comment #5)
> If the xpdf author was open to collaboration poppler would not exist.

There is a half-empty/half-full difference in viewpoint... maybe one should be grateful to the xpdf author for providing the *starting point* of poppler? It would have been more difficult to start a pdf-parsing project from scratch. Look at the state (or lack thereof) of GNUPDF on the GNU web site...

> It would be cool if you could provide the recipe :-)

Here is a recipe - I don't know what the minimum is that you need, but pdflatex and movie15 (both included in texlive 2011) are necessary (and hyperref is a requirement of movie15). I have 7 avi files of approx 370MB each. I don't think they need to be avi files, since movie15 complains that it can't detect the mime type of those files (hence the mimetype= parameter to \includemovie). Presumably you can just stuff a linux distro DVD image in there and lie about the mimetype. I tried repeating the same file 7 times, but apparently movie15 is clever enough not to embed the same thing 7 times, so you *do* need to have 2GB of junk to make this work, rather than repeating the same thing.

Process this with pdflatex:
====================
\documentclass[12pt]{article}
\usepackage{movie15}
\usepackage{hyperref}
\begin{document}
movie15 can detect duplicates - use different files!
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{1.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{2.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{3.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{4.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{5.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{6.avi}
\end{figure}
\begin{figure}[ht]
\includemovie[mimetype=video/avi]{12cm}{6cm}{7.avi}
\end{figure}
\end{document}
============================

The result is a 2.5GB pdf with 3 pages. xpdf 3.03's pdfinfo/pdffonts did work correctly (I downloaded the source and built them), so poppler's pdfinfo/pdffonts would probably be seen as a regression... Here is what it says.

===============================================
# distro poppler
$ pdfinfo a.pdf
Error: Document base stream is not seekable
$ pdffonts a.pdf
Error: Document base stream is not seekable

$ ~/bin/xpdf-3.03/pdffonts a.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
EVLLHA+CMR12                         Type 1            yes yes no      78  0

$ ~/bin/xpdf-3.03/pdfinfo a.pdf
Title:
Subject:
Keywords:
Author:
Creator:        LaTeX with hyperref package
Producer:       pdfTeX-1.40.12
CreationDate:   Sat Dec 24 22:22:11 2011
ModDate:        Sat Dec 24 22:22:11 2011
Tagged:         no
Form:           none
Pages:          3
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      2543173309 bytes
Optimized:      no
PDF version:    1.5
======================================================
(In reply to comment #6)
> The result is a 2.5GB pdf, with 3 pages. xpdf 3.03's pdfinfo/pdffonts did work
> correctly (I downloaded the source and built), so poppler's pdfinfo/pdffonts is
> probably seem as a regression... Here is what it says.

Stream::getPos returns an int in xpdf too; it's just that they don't use it much, so they're lucky in this case.
I started working on some patches to fix this.

I used the method in comment 6 to create a test file. I used

  dd if=/dev/urandom bs=1M count=400 of=x.avi

to create the avi files. This gave me a 2.8GB pdf file. I also tried using a single 2.5GB avi file, but there seems to be a bug in pdflatex or movie15: it uses signed 32 bits for the stream length, so it wrote a pdf with a negative stream length for the avi file.

With the patches, pdfinfo, pdffonts, pdftoppm, and pdftocairo (image output only) are working with the 2.8GB test file.

I need to do more testing of the patches before they are ready. Some of the things I have not yet looked at are:
- pdftops, pdfimages, pdfunite, and pdfseparate
- testing on a 32-bit machine
- trying to create more interesting >2GB files for testing
- reg testing of <2GB pdfs

I'm attaching the current version of my patches. It would be great to get some feedback on the patches before I spend a lot of time on testing.
Created attachment 71620 [details] [review]
use getNum instead of getReal

Need to fix this before adding an Int64 object.
Created attachment 71621 [details] [review]
Add Int64 object

We need a 64-bit integer object type to handle 64-bit offsets and lengths in pdf files. I'm not sure if we can use long long unconditionally. If not, I can wrap it in an #ifdef.
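As a rough idea of what such an object type involves, here is a simplified, hypothetical sketch (the objInt64/getInt64 names follow identifiers that appear later in this thread; the real Object class has many more types and uses poppler's own conventions):

    // Simplified sketch only - not the actual patch.
    enum ObjType { objInt, objInt64, objReal /* , ... */ };

    class Object {
    public:
        void initInt64(long long v) { type = objInt64; int64 = v; }
        bool isInt64() const { return type == objInt64; }
        long long getInt64() const { return int64; }
    private:
        ObjType type;
        union {
            int intg;        // objInt
            long long int64; // objInt64
            double real;     // objReal
        };
    };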
Created attachment 71622 [details] [review]
Large file support

This defines a Goffset type that uses the same type as the offset parameter in the version of fseek/ftell we are using. I've used this type for any variable used for file offsets or file sizes.
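As a rough illustration of that idea (the macro and type names here are illustrative, not necessarily what the patch uses), the typedef simply tracks whichever seek/tell pair the build selects:

    // Hypothetical sketch: Goffset matches the offset type of the
    // fseek/ftell variant the build actually uses.
    #if defined(HAVE_FSEEKO)
    #  include <sys/types.h>
    typedef off_t Goffset;      // fseeko()/ftello() take off_t (64-bit with LFS)
    #elif defined(_WIN32)
    typedef long long Goffset;  // _fseeki64()/_ftelli64() take __int64
    #else
    typedef long Goffset;       // plain fseek()/ftell() take long
    #endif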
Created attachment 71623 [details]
windows large file support

Windows has these functions for large file support:

  int _fseeki64(FILE *stream, __int64 offset, int origin);
  __int64 _ftelli64(FILE *stream);

This patch adds windows support for large files. I have not tested this patch.

It may be better to create poppler_fseek/poppler_ftell functions instead of repeating the #ifdefs for every use of fseek/ftell.
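A minimal sketch of what such wrappers could look like, assuming a Goffset type as sketched above and the usual configure macros (the poppler_fseek/poppler_ftell names come from the suggestion in this comment; the bodies are illustrative, not committed code):

    #include <cstdio>

    // Hide the platform #ifdefs in one place instead of at every call site.
    static int poppler_fseek(FILE *f, Goffset offset, int whence) {
    #if defined(_WIN32)
        return _fseeki64(f, offset, whence);
    #elif defined(HAVE_FSEEKO)
        return fseeko(f, offset, whence);
    #else
        return fseek(f, (long)offset, whence);
    #endif
    }

    static Goffset poppler_ftell(FILE *f) {
    #if defined(_WIN32)
        return _ftelli64(f);
    #elif defined(HAVE_FSEEKO)
        return ftello(f);
    #else
        return ftell(f);
    #endif
    }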
The patches look ok-ish as a start (i.e. I haven't read them thoroughly, but they kind of make sense).

But now that I think about it, the pdf spec has an "Implementation Limits" section and it mentions all the time that 32-bit is the expected platform and thus stuff is not expected to be bigger than that. Does Adobe actually work with such big files? Should we strive to support them if Adobe does not?
(In reply to comment #13)
> But now that i think the pdf spec has a "Implementation Limits" section and
> it mentions all the time that 32bit is the expected platorm and thus stuff
> is not expected to be bigger than that.

That was not the way I interpreted the Implementation Limits section. The limits are the minimum for a 32-bit platform. 64-bit machines are now becoming common and the limits would be expected to be higher. I also did not see anything suggesting that the file size should be limited to the Integer size, particularly as it mentions that the maximum file size is 10GB (for standard XRef tables) or unlimited (for XRef streams).

> Does Adobe actually work with such big files? Should we strive to support
> them if Adobe does not?

I tested with a 3.5GB and a 5GB pdf. Adobe Reader 9.5.1 on Linux will open the 3.5GB file but fails to open the 5GB one. Adobe Reader 11.0.0 on Windows will open both.

However, since files >2GB are rare and there are a lot of changes required to make poppler handle large files, it may be better to hold off supporting large files until we see more demand for this feature. We should add a check for >2GB files and print a more user-friendly error message than "stream not seekable".
Nah, if Adobe supports > 2GB fine we should go for it, and you already seem to have something, so let's do it :-)

Patch 1 seems a bit weird though, why do we need it?
(In reply to comment #15)
> Patch 1 seems a bit weird though, why do we need it?

When I added the Int64 object I was checking that all the code used getNum when it needed a real number and noticed this one location that was using getReal. Obviously this won't work for numbers parsed as an integer.
Created attachment 72021 [details] [review]
Large file support

Updated patch to fix some bugs.
Created attachment 72023 [details]
6GB PDF test case

I've used a modified version of cairo to create a more interesting test case. It consists of 2000 pages. Each page contains an uncompressed 3MB image (1000x1000 pixels RGB). The images are all zeros (i.e. a black square). The end result is a 6GB file that can be compressed down to 1MB.

I used xz to compress it since it supports sparse files. Uncompressing with unxz will produce a 6GB file that takes up 18MB of disk space.
Created attachment 72428 [details] [review] Add Int64 object
Created attachment 72429 [details] [review] Large file support
Created attachment 72430 [details] [review] Add Int64 object
The current status of the patches after reg testing with splash is:

  1323 tests passed (99.47%)
     7 tests failed (0.53%)

Of the 7 failures, 5 are false negatives, i.e. there is no visual difference. The remaining 2 failures are:

  bug-poppler26533.pdf - some lines have shifted a pixel when rendered at 72ppi
  tauya.f8.pdf - some glyphs have shifted slightly

I have not yet been able to determine why these two tests are failing.
Created attachment 72448 [details] [review]
Add Int64 object

small update
Created attachment 72449 [details] [review]
fix uninitialised memory

I've found the rendering differences for bug-poppler26533.pdf to be caused by uninitialised memory in splash. This patch fixes the problem.

I don't have any idea why tauya.f8.pdf is showing differences, however I suspect it is an existing problem that is just failing in a different way with the large file patches. The tauya.f8 pdf is very broken. It has a broken font (corrupted Flate stream). Poppler does not seem to have error detection in the Flate decompression and renders mostly garbage. Ghostscript detects the broken stream and substitutes a different font, resulting in what appears to be correct output. If I remove the font descriptor from the pdf, forcing poppler to substitute the font, the output is the same as ghostscript with no regression when using the large file patches. Acroread complains that the font is broken and displays garbage.

So at this point I think the patches are ready for review.
Created attachment 72539 [details] [review]
fix uninitialized memory

(In reply to comment #24)
> Created attachment 72449 [details] [review] [review]
> fix uninitialised memory
>
> I've found the rendering differences for bug-poppler26533.pdf to be caused
> by uninitialised memory in splash. This patch fixes the problem.

Thanks for figuring out why almost all tests that use tiling patterns on a page failed. But I think your patch is slightly incomplete:

1. You remove the initialization of the bitmap if paintType = 2 is used. This paintType is seldom used; one example is page 745 of the spec PDF32000_2008.pdf.

2. I think the better way to initialize the memory is to use the paperColor (which in pdftoppm is the same one you're using: white) and use the existing Splash::clear function.

This patch changes that and therefore should replace your patch.
(In reply to comment #25)
> 2. I think the better way to initialize the memory is to use the paperColor
> (which in pdftoppm is the same You're using: white) and use the existing
> Splash::clear function.
> This patch changes it and therefore should replace Your patch for it.

I'm fairly certain that the initial color should make no difference when the alpha is all zero. I tried changing the color in splash->clear in SplashOutputDev::startPage, and as long as the alpha is 0 there is no change in the output. So why does changing the initial color in tilingPatternFill affect the output?
(In reply to comment #26)
> (In reply to comment #25)
> > 2. I think the better way to initialize the memory is to use the paperColor
> > (which in pdftoppm is the same You're using: white) and use the existing
> > Splash::clear function.
> > This patch changes it and therefore should replace Your patch for it.
>
> I'm fairly certain that the initial color should make no difference when the
> alpha is all zero. I tried changing the color in splash->clear in
> SplashOutputDev::startPage and as long as the alpha is 0 there is no change
> in the output. So why does changing the initial color in tilingPatternFill
> affect the output?

I think there is a misunderstanding here: my main point was to use the Splash::clear function, which is also used everywhere else to initialize a just-created bitmap. But if that had been the only point I wouldn't have uploaded a new patch.

What was definitely wrong in your patch was only that you replaced paintType 2 by paintType 1, which means we'll get problems with paintType 2 tiling patterns.
I don't have any issue with your patch. I was making the observation that changing the initial color, whether explicitly or as a result of uninitialized memory, was altering the output. I don't think this should happen when the alpha is all zero.
Oh, sorry. This is one of the profound mysteries in the pipeRun function: I never completely understood how the combination of source alpha, destination alpha, source color, destination color, blending mode and so on works in just one simple function for all cases. But the results seem to be almost okay, so I avoid touching it whenever possible.
Adrian, could you rebase the patches against current master? I tried applying them after Thomas' threading patches went in and there are a lot of patch "failures" I am not sure I am fixing correctly, so it'd be better if you can provide updated patches.
Created attachment 73505 [details] [review] Add Int64 object
Created attachment 73506 [details] [review] Large file support
Created attachment 73507 [details] [review] windows large file support
Seems that this breaks the lexer somehow. When running the ./check_lexer test in qt4, it does not compile and needs this patch:

diff --git a/qt4/tests/check_lexer.cpp b/qt4/tests/check_lexer.cpp
index 1ae849f..1835d93 100644
--- a/qt4/tests/check_lexer.cpp
+++ b/qt4/tests/check_lexer.cpp
@@ -46,8 +46,8 @@ void TestLexer::testNumbers()
     obj.free();
 
     lexer->getObj(&obj);
-    QCOMPARE(obj.getType(), objUint);
-    QCOMPARE(obj.getUint(), 2147483648u);
+    QCOMPARE(obj.getType(), objInt64);
+    QCOMPARE(obj.getInt64(), 2147483648ll);
     obj.free();
 
     lexer->getObj(&obj);

The test fails. It seems that -2147483647-1 is not detected as an objInt anymore but as an objInt64. Could you have a look?
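(Purely a guess at the cause, not taken from the actual Lexer code: a common way this exact boundary goes wrong is that the magnitude is accumulated before the sign is applied, and 2147483648 already exceeds INT_MAX even though -2147483648 still fits in an int.)

    #include <climits>

    // Hypothetical illustration: promoting to 64-bit as soon as the unsigned
    // magnitude exceeds INT_MAX misclassifies INT_MIN.
    bool fitsInInt(long long magnitude, bool negative) {
        if (negative)
            return -magnitude >= INT_MIN;  // -2147483648 still fits in an int
        return magnitude <= INT_MAX;       // 2147483648 does not
    }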
Created attachment 73557 [details] [review]
Add Int64 object

Updated to fix check_lexer failures. I've also added a couple of tests to check_lexer to test the Int64 object.
Works well :-)

Only one thing I'm not so sure about: isNum and getNum returning the int64 stuff. That is, you can't really represent all the int64 values in a double, so by using getNum we might lose data.

Proof:
    long long myLong = 9223372036854774700ll;
    double myDouble = myLong;
    printf("%lld\n", myLong);
    printf("%f\n", myDouble);

Result:
    9223372036854774700
    9223372036854774784.000000

On the other hand this only seems to happen for ridiculously high values, i.e. a double is perfectly fine to handle values like 160GB, so maybe just add a comment in getNum like

    // Warning: this may break for ridiculously big values of long long

What do you think?
I could add a comment, but it seemed obvious to me that since getNum returns a double it is not intended for getting exact integer values. The patch does not make the accuracy of getNum any worse than it was before. Anything needing exact values should be using getInt/getInt64.

As you say, this only affects huge values. A double has 53 bits of significand, so it is only integers with magnitude greater than 2^53 that may not be exactly representable.
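A tiny standalone demonstration of that boundary (not poppler code): 2^53 is the last point at which every integer is exactly representable in a double; 2^53 + 1 already rounds back down.

    #include <cstdio>

    int main() {
        long long exact  = 1LL << 53;  // 9007199254740992: representable
        long long beyond = exact + 1;  // 9007199254740993: not representable
        printf("%lld -> %.0f\n", exact,  (double)exact);   // same value back
        printf("%lld -> %.0f\n", beyond, (double)beyond);  // rounds to 9007199254740992
        return 0;
    }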
I don't think it would hurt to add the comment as you said, mentioning the 2^53. Better safe than sorry.

Besides that, feel free to commit from my side. Given this has been around for a while I guess we can assume no one else wants to review it :D
I've added a comment to getNum() and pushed all the patches.
Build is broken on my machine, CentOS 5.9.

fseeko is available, but in /usr/include/stdio.h I see that the 2nd parameter is of type __off_t, although `man fseeko` shows that it should be off_t.

off_t is defined in <sys/types.h> and <unistd.h>, and I see a few header files there that include one of these 2 files for off_t.
Created attachment 73688 [details] [review]
fix build on ancient systems

Try this patch.
It works. Thanks a lot!
I confirm that the patch in attachment 73688 [details] [review] is also needed on Ubuntu 10.04 with GCC 4.4.3.
Created attachment 73705 [details] [review]
use sys/types.h instead of stdio.h for off_t

I checked the POSIX specs at opengroup.org. In the Single UNIX Specification version 2 (UNIX98), off_t is defined in sys/types.h. In the Single UNIX Specification version 4 (POSIX:2008), stdio.h is also required to expose off_t. Glibc bug http://sourceware.org/bugzilla/show_bug.cgi?id=11125 added off_t to stdio.h; this fix was included in glibc 2.12, released in 2010.

As there are likely to be many systems not fully compliant with POSIX:2008, using sys/types.h to import the off_t definition would be the most portable solution. The attached patch switches from stdio.h to sys/types.h.
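For reference, a minimal sketch of the portable include being argued for here (names are illustrative; the actual patch edits poppler's own headers):

    // sys/types.h has provided off_t since SUSv2, whereas stdio.h is only
    // required to expose it as of POSIX:2008 (glibc gained that in 2.12).
    #include <sys/types.h>   // off_t
    #include <cstdio>        // FILE, fseeko, ftello

    typedef off_t Goffset;   // illustrative, matching the name used in this thread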
Created attachment 73711 [details] [review] a few more missing Goffsets
Makes sense to me :-)