107057 – [Patch] Skip XRef gaps in PDFDoc save methods

Bug 107057 - [Patch] Skip XRef gaps in PDFDoc save methods

Summary: [Patch] Skip XRef gaps in PDFDoc save methods

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general (show other bugs)
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-06-27 21:22 UTC by Tobias Deiminger
Modified:	2018-08-20 21:58 UTC (History)
CC List:	2 users (show)

See Also:
i915 platform:
i915 features:

Attachments
Skip XRef xrefEntryNone entries on incremental update and full rewrite (6.46 KB, patch) 2018-06-27 21:22 UTC, Tobias Deiminger	Details \| Splinter Review
Skip XRef xrefEntryNone entries on incremental update and full rewrite (v2) (13.48 KB, patch) 2018-07-06 21:39 UTC, Tobias Deiminger	Details \| Splinter Review
Skip XRef::constructXRef in XRef::readXRefUntil if last XRef section is a stream (4.36 KB, patch) 2018-07-09 22:02 UTC, Tobias Deiminger	Details \| Splinter Review
Show Obsolete (2) View All

Description Tobias Deiminger 2018-06-27 21:22:48 UTC

Created attachment 140372 [details] [review]
Skip XRef xrefEntryNone entries on incremental update and full rewrite

This patch aims to solve Okular bug https://bugs.kde.org/show_bug.cgi?id=395660. Actually there are several related problems, the patch fixes the easiest and sufficient part.

Currently, PDFDoc::saveIncrementalUpdate and saveCompleteRewrite iterate over the full continuous range of XRef::entries, ranging from object id 0 up to XRef::size. That seems problematic. The range may contain gaps (type == xrefEntryNone) which are merely details of popplers memory allocation, but don't contain objects from the PDF document. Hitting a gap with XRef::getEntry triggers "damaged file repair". It's implemented by XRef::constructXRef, and it makes things actually worse because constructXRef can't handle XRef streams nor objects from inside object streams. It wipes out formerly existing XRef information in that case.

The idea of this patch is to spare the gaps from iteration, and therefore avoid a bunch of consecutive faults early.

Atm. the patch is only tested for Okular test case. Poppler regression tests are yet outstanding.

Comment 1 Tobias Deiminger 2018-06-28 21:23:05 UTC

(In reply to Tobias Deiminger from comment #0)
> Poppler regression tests are yet outstanding.

All regression tests (that I'm aware of) are done now. Everything passes / behaves as expected.

# Automated tests pass
poppler$ make test
Running tests...
Test project /home/deiminge/cmake/poppler
      Start  1: check_qt5_attachments
 1/17 Test  #1: check_qt5_attachments ............   Passed    0.08 sec
... snip ...
17/17 Test #17: check_qt5_strings ................   Passed    0.02 sec

100% tests passed, 0 tests failed out of 17

Total Test time (real) =   0.83 sec


# Incremental update of *.pdf from poppler unittest documents works
# (exclude encrypted for loop simplicity)
test$ good=0; for f in ~/git/test/unittestcases/*.pdf; do [[ "$f" == *Encrypted* ]] || [[ "$f" == *Gday*open* ]] && continue; ./pdf-fullrewrite -check -i "$f" out.pdf && good=$((good+1)) || echo "$f failed"; done; echo "Passed: $good"
Passed: 20


# Full rewrite of *.pdf from poppler unittest documents
# (exclude encrypted for loop simplicity)
test$ good=0; for f in ~/git/test/unittestcases/*.pdf; do [[ "$f" == *Encrypted* ]] || [[ "$f" == *Gday*open* ]] && continue; ./pdf-fullrewrite -check "$f" out.pdf && good=$((good+1)) || echo "$f failed"; done; echo "Passed: $good"
Passed: 20


# Incremental update of malformed document (object 0 missing) works.
# out.pdf is readable in Okular and Adobe Reader.
# pdf-fullrewrite verify function complains, that's expected
test$ ./pdf-fullrewrite -check -i '1_PDFsam_Untitled 2.pdf' out.pdf
XRef table: Unexpected number of entries (18+1 != 18)
Syntax Error: Invalid XRef entry  # XRef::getEntry(0) on 1_PDFsam_Untitled 2.pdf
Syntax Error: Invalid XRef entry  # XRef::getEntry(0) on out.pdf
XRef entry 0: generation number was expected to be 65535 (0 != 65535)
 # Error expected: Incremental update doesn't add object 0.
 # That's exactly how Adobe Reader behaves on incremental update of this file.
Verification failed


# Test full rewrite of malformed PDF
# out.pdf is completely fine, object 0 is added.
test$ ./pdf-fullrewrite -check '1_PDFsam_Untitled 2.pdf' out.pdf
Syntax Error: Invalid XRef entry  # XRef::getEntry(0) on 1_PDFsam_Untitled 2.pdf

Comment 2 oliver.sander 2018-06-29 07:46:28 UTC

You are copying the objectIDs into a separate std::vector just to make your loops look pretty.  Could that have a negative impact on program run-time?

Comment 3 Tobias Deiminger 2018-06-29 08:27:16 UTC

(In reply to oliver.sander from comment #2)
> You are copying the objectIDs into a separate std::vector just to make your
> loops look pretty.  Could that have a negative impact on program run-time?

I initially felt a bit bad because of that too. But after all, the new XRef::getObjectIds method is only called once on save, and saving is nothing that happens at high frequency. There are just plain ints in the intermediate vector, and the number of entries is usually small, say in the range 1..1024 (however, some extreme documents could have an extreme number of objects). Further, 'return objectIds' is basically a non op, because of copy elision / RVO.

The alternative to assemble an intermediate vector are
- implement input iterator on XRef
- or change the underlying containment for XRef::entries to list or map
- or something about coroutines :)

I guess it's not worth it. But it needs a bit of poppler experience to judge about. What do you think?

Comment 4 Tobias Deiminger 2018-06-29 08:30:48 UTC

(In reply to oliver.sander from comment #2)
> just to make your loops look pretty

Ah, and by the way, it's not to make the loop "look pretty". It's to make the loop non-contiguous (like 1, 4, 5, 6, 18, 19, 20,...), which is essentially the reason why the patch fixes the bug.

Comment 5 oliver.sander 2018-06-29 08:43:58 UTC

I trust your judgement on the efficiency impact.

>Ah, and by the way, it's not to make the loop "look pretty". It's to make the 
> loop non-contiguous (like 1, 4, 5, 6, 18, 19, 20,...), which is essentially the 
> reason why the patch fixes the bug.

I know that. :-)  But you could have written something like

for(int i=0; i<xref->getNumObjects(); i++) {
  if (entries[i].type == xrefEntryNone)
    continue;

which does the same, but is not as pretty.

Comment 6 Tobias Deiminger 2018-06-29 10:19:16 UTC

(In reply to oliver.sander from comment #5)
> I trust your judgement on the efficiency impact.
> 
> >Ah, and by the way, it's not to make the loop "look pretty". It's to make the 
> > loop non-contiguous (like 1, 4, 5, 6, 18, 19, 20,...), which is essentially the 
> > reason why the patch fixes the bug.
> 
> I know that. :-)  But you could have written something like

Yeah, was quite sure that you knew :) And thanks for the review at all.

> for(int i=0; i<xref->getNumObjects(); i++) {
>   if (entries[i].type == xrefEntryNone)
>     continue;
> 
> which does the same, but is not as pretty.

Nope, XRef::entries is private, so no access from PDFDoc. And private is a good thing, we shouldn't expose implementation details outside of XRef. An input iterator could do what you suggested while maintaining proper design. But it's not that easy to write one. If there are votes to do it, I'll try.

Comment 7 Albert Astals Cid 2018-06-29 11:40:34 UTC

Took me a while to realize the problem was getEntry tries to reconstruct on the none entries (because i didn't read the comment before reading the patch).

I was wondering if changig getEntry would be easier for the next people that read the code next time?

i.e. a new bool (or changing the bool to a flag) to tell getEntry to don't try to reconstruct?

enum GetEntryFlags
{
   None,
   DoNotComplainIfMissing
   DoNotTryToRecoverIfNone
}

XRefEntry *getEntry(int i, GetEntryFlags flags = None);

then the code in saveXXX would be xref->getEntry(i, DoNotTryToRecoverIfNone) which may be easier to understand for third party readers?

Comment 8 Tobias Deiminger 2018-06-29 14:50:40 UTC

(In reply to Albert Astals Cid from comment #7)
> Took me a while to realize the problem was getEntry tries to reconstruct on
> the none entries (because i didn't read the comment before reading the
> patch).
> 
> I was wondering if changig getEntry would be easier for the next people that
> read the code next time?
> 
> i.e. a new bool (or changing the bool to a flag) to tell getEntry to don't
> try to reconstruct?
> 
> enum GetEntryFlags
> {
>    None,
>    DoNotComplainIfMissing
>    DoNotTryToRecoverIfNone
> }
> 
> XRefEntry *getEntry(int i, GetEntryFlags flags = None);
> 
> then the code in saveXXX would be xref->getEntry(i, DoNotTryToRecoverIfNone)
> which may be easier to understand for third party readers?

That's an interesting idea. What I woukd tell getEntry then (perhaps implicitly) is: "don't recover if none, but yes do recover if offset is wrong and yes do recover if...<other reasons>". Is this still understandable for third parties?

I'm traveling right now, will come back to that on Monday.

Comment 9 Albert Astals Cid 2018-07-03 22:55:36 UTC

(In reply to Tobias Deiminger from comment #8)
> (In reply to Albert Astals Cid from comment #7)
> > Took me a while to realize the problem was getEntry tries to reconstruct on
> > the none entries (because i didn't read the comment before reading the
> > patch).
> > 
> > I was wondering if changig getEntry would be easier for the next people that
> > read the code next time?
> > 
> > i.e. a new bool (or changing the bool to a flag) to tell getEntry to don't
> > try to reconstruct?
> > 
> > enum GetEntryFlags
> > {
> >    None,
> >    DoNotComplainIfMissing
> >    DoNotTryToRecoverIfNone
> > }
> > 
> > XRefEntry *getEntry(int i, GetEntryFlags flags = None);
> > 
> > then the code in saveXXX would be xref->getEntry(i, DoNotTryToRecoverIfNone)
> > which may be easier to understand for third party readers?
> 
> That's an interesting idea. What I woukd tell getEntry then (perhaps
> implicitly) is: "don't recover if none, but yes do recover if offset is
> wrong and yes do recover if...<other reasons>". Is this still understandable
> for third parties?

No, on save you always use DoNotTryToRecoverIfNone, it's what you're doing with your patch code, no?

Comment 10 Tobias Deiminger 2018-07-05 16:14:22 UTC

(In reply to Albert Astals Cid from comment #9)
> No, on save you always use DoNotTryToRecoverIfNone, it's what you're doing
> with your patch code, no?

Yes, that's perfectly fine, the code in saveXXX can always look like xref->getEntry(i, DoNotTryToRecoverIfNone).

I meant this: If we have a different problem besides "none" during saveXXX (say "offset mismatch"), we actually do want to start recovery, is it? That's how the current patch behaves. The getEntry API is not explicit about other error scenarios. The enum for example contains no DoNotTryToRecoverIfWrongOffset. My comment was about if this is consistent and understandable for other programmers. My own answer would be "no" to consistent, and "yes" to understandable.

I'm currently working on grasping XRef::readXRefUntil, only then I can provide an updated patch.

Comment 11 Albert Astals Cid 2018-07-06 15:29:28 UTC

But we can not have a different problem, no?

XRef::getEntry does

XRefEntry *XRef::getEntry(int i, GBool complainIfMissing)
{
  if (i >= size || entries[i].type == xrefEntryNone) {
    // stuff
  }
  return &entries[i];
}


and the saving code does

for(int i=0; i<xref->getNumObjects(); i++) {
    xref->getEntry(i);

and since xref->getNumObjects() is xref->size, if we make it that it ignores the entries[i].type == xrefEntryNone part because we pass DoNotTryToRecoverIfNone then it always just returns  &entries[i] directly, right?

Comment 12 Tobias Deiminger 2018-07-06 21:39:32 UTC

Created attachment 140484 [details] [review]
Skip XRef xrefEntryNone entries on incremental update and full rewrite (v2)

Skip XRef xrefEntryNone entries on incremental update and full rewrite

Add GetEntryOption "DontLoadIfMissing", and use it for PDFDoc::saveIncrementalUpdate and PDFDoc::saveCompleteRewrite.

Comment 13 Tobias Deiminger 2018-07-06 21:59:23 UTC

(In reply to Albert Astals Cid from comment #11)
> But we can not have a different problem, no?
Ah thanks, you're right of course. Sorry, I overlooked that.

The new patch in attachment 140484 [details] [review] does basically what you suggested, I just changed the enum names a bit. I thought it's a bit better to name the actions "load", instead of "recover", because "load from file if not yet in memory" is the primary intention of a getEntry() caller if they select the new LoadSilentlyIfMissing or LoadAndComplainIfMissing option. Recovery is only a rare secondary step if loading failed.

Had no time to test the new patch yet. I will repeat the steps from comment 1 on Mondays.

Comment 14 Tobias Deiminger 2018-07-08 19:36:01 UTC

There's a problem with both of my patches: I realized only by now that the XRef::XRef(BaseStream *strA, ...) Ctor is doing some kind of "lazy load". It only loads the first table/stream and its entries during construction. The 2..n-th streams (linked by Prev) are not yet loaded. They're actually loaded later on demand with first access to a missing entry by XRef::getEntry. We get to this scenario when working with incrementally updated or linearized documents, because in this case there are more then 1 xref tables/streams.

As a consequence, when a user calls saveXXX, we have to expect that XRef::entries list may not yet be fully populated. If we simply patch out "load if missing", we avoid not only the bad recovery, but we avoid also the required initial load of an entry into memory. I'm not yet sure how to solve it.

Comment 15 Tobias Deiminger 2018-07-09 22:02:55 UTC

Created attachment 140530 [details] [review]
Skip XRef::constructXRef in XRef::readXRefUntil if last XRef section is a stream

Here's an approach to avoid the problem described in comment 14. We let the lazy load during PDFDoc::saveXXX happen, but avoid corruption of the XRef by skipping constructXRef in XRef::readXRefUntil for situations where we know it won't work.

Afaikt XRef::fetch does the same for the same reason, checkout commit 57b7a52c.

When using the patch in the okular test case, we'll get a single "Error: Invalid XRef entry 0" on console, but saved file works. I think that's ok, as the original file is actually buggy and it should be noted somewhere.

Tests from comment 1 pass.

Comment 16 GitLab Migration User 2018-08-20 21:58:19 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/139.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.