Bug 64815

Summary:	[TAGGEDPDF] Parse the Tagged-PDF document structure tree when present
Product:	poppler	Reporter:	Adrian Perez de Castro <aperez>
Component:	general	Assignee:	Adrian Perez de Castro <aperez>
Status:	RESOLVED FIXED	QA Contact:
Severity:	enhancement
Priority:	medium	CC:	apinheiro, carlosgc, jdiggs, poppler-bugs
Version:	unspecified
Hardware:	All
OS:	All
Whiteboard:
i915 platform:		i915 features:
Bug Depends on:
Bug Blocks:	64813, 64821, 67710
Attachments:	[PATCH 1/6] Tagged-PDF: Accessors in Catalog for the MarkInfo [PATCH 2/6] Tagged-PDF: Interpret the document structure [PATCH v2 2/6] Tagged-PDF: Interpret the document structure [PATCH v3 2/7] Tagged-PDF: Interpret the document structure [PATCH v5 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary [PATCH v5 02/10] Tagged-PDF: Implement parsing of StructTreeRoot [PATCH v5 03/10] Tagged-PDF: Implement parsing of StructElem objects [PATCH v5 04/10] Tagged-PDF: Implement parsing of StructElem attributes Tagged-PDF: Text content extraction from structure elements [PATCH v6 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary [PATCH v6 02/10] Tagged-PDF: Implement parsing of StructTreeRoot [PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects [PATCH v6 04/10] Tagged-PDF: Implement parsing of StructElem attributes [PATCH v6 05/10] Tagged-PDF: Text content extraction from structure elements [PATCH v8 01/15] Tagged-PDF: Implement parsing of StructTreeRoot [PATCH v8 02/15] Implement Object::takeString() method [PATCH v8 03/15] Tagged-PDF: Implement parsing of StructElem objects [PATCH v8 04/15] Tagged-PDF: Parsing of StructElem standard types and [PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements [PATCH v8 03/15] Tagged-PDF: Implement parsing of StructElem objects [PATCH v9 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes [PATCH v10 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes [PATCH v10 02/12] Tagged-PDF: Text content extraction from structure elements Tagged-PDF: Text content extraction from structure elements [PATCH v12 01/11] Tagged-PDF: Text content extraction from structure elements [PATCH v13 01/11] Tagged-PDF: Text content extraction from structure elements

Description Adrian Perez de Castro 2013-05-21 07:49:48 UTC

Comment 1 Albert Astals Cid 2013-05-21 07:55:47 UTC

If you remove poppler-bugs@lists.freedesktop.org  from the list of asignee's or CC's no one is going to hear you, I suggest you don't do that :-)

Comment 2 Adrian Perez de Castro 2013-05-21 09:56:19 UTC

(In reply to comment #1)
> If you remove poppler-bugs@lists.freedesktop.org  from the list of asignee's
> or CC's no one is going to hear you, I suggest you don't do that :-)

Ouch, I just wanted to add people, not removing... I re-added poppler-bugs.

O:-)

Comment 3 Adrian Perez de Castro 2013-05-29 23:49:26 UTC

Created attachment 79990 [details] [review]
[PATCH 1/6] Tagged-PDF: Accessors in Catalog for the MarkInfo

Comment 4 Adrian Perez de Castro 2013-05-29 23:50:04 UTC

Created attachment 79991 [details] [review]
[PATCH 2/6] Tagged-PDF: Interpret the document structure

Comment 5 Adrian Perez de Castro 2013-05-31 18:06:29 UTC

Created attachment 80101 [details] [review]
[PATCH v2 2/6] Tagged-PDF: Interpret the document structure

This new version of patch 2/6 adds support for recursively resolving
the type of structure elements using the /RoleMap (it was marked as
“TODO” in the initial version of the patch).

Comment 6 Adrian Perez de Castro 2013-06-06 14:17:24 UTC

Created attachment 80407 [details] [review]
[PATCH v3 2/7] Tagged-PDF: Interpret the document structure

New version of the patch, with the following changes:

* Implemented parsing object-reference structure elements (type /OBJR).
  This was one of the TODO items in the previous versions of the patch.

* Implemented parsing of the /ParentTree entry in /StructTreeRoot. This
  allows to find out which structure element is the parent of objects
  which have a /StructParent(s) entry in their properties dictionary.

Comment 7 Carlos Garcia Campos 2013-06-12 12:46:32 UTC

Comment on attachment 79990 [details] [review]
[PATCH 1/6] Tagged-PDF: Accessors in Catalog for the MarkInfo

Review of attachment 79990 [details] [review]:
-----------------------------------------------------------------

LGTM

::: poppler/Catalog.h
@@ +228,4 @@
>    GooString *baseURI;		// base URI for URI-type links
>    Object metadata;		// metadata stream
>    Object structTreeRoot;	// structure tree root dictionary
> +  int markInfo;                 // Flags from MarkInfo dictionary

Maybe this should be Guint instead of int, since it's what getMarkInfo() returns too.

Comment 8 Carlos Garcia Campos 2013-06-12 12:58:57 UTC

Comment on attachment 80407 [details] [review]
[PATCH v3 2/7] Tagged-PDF: Interpret the document structure

Review of attachment 80407 [details] [review]:
-----------------------------------------------------------------

Maybe you can split the patch in those 3 things? to make it a bit easier to review?

Comment 9 Adrian Perez de Castro 2013-06-13 10:39:36 UTC

> Maybe you can split the patch in those 3 things? to make it a bit easier to
> review?

Ouch, the commit message does not really describe well the patch after
the rebasing/squashing done prior to uploading. But I can split it up
in three logical parts:

- Defining the StructElement class and parsing of the corresponding
  objects from the PDF.
- Defining the Attribute class and parsing the corresponding objects
  from the PDF.
- Text content extraction, a.k.a. the machinery needed for
  StructElement::getText() and StructElement::getMCOps()

If there's no objections, I would upload an updated version with the
patch split in three mentioned above soon.

Comment 10 Carlos Garcia Campos 2013-06-16 09:25:39 UTC

(In reply to comment #9)
> > Maybe you can split the patch in those 3 things? to make it a bit easier to
> > review?
> 
> Ouch, the commit message does not really describe well the patch after
> the rebasing/squashing done prior to uploading. But I can split it up
> in three logical parts:
> 
> - Defining the StructElement class and parsing of the corresponding
>   objects from the PDF.
> - Defining the Attribute class and parsing the corresponding objects
>   from the PDF.
> - Text content extraction, a.k.a. the machinery needed for
>   StructElement::getText() and StructElement::getMCOps()
> 
> If there's no objections, I would upload an updated version with the
> patch split in three mentioned above soon.

Sounds good to me, I think I would split it even more, since the patch is mixing 2 things, document logical structure and tagged PDFs. The first patch could define the StructTreeRoot and StructTreeElement without any tagged PDF info. And an top of logical structure support, add tagged PDF implementation. I'll do a first review of the current patch anyway, but upload a new version split.

Comment 11 Carlos Garcia Campos 2013-06-16 10:25:05 UTC

Comment on attachment 80407 [details] [review]
[PATCH v3 2/7] Tagged-PDF: Interpret the document structure

Review of attachment 80407 [details] [review]:
-----------------------------------------------------------------

Ok, I have just a few comments for now, I'll do a full review when the patch is split. In general, * are placed wrongly (close to the type instead of the argument/variable name). Some parts of the code are hard to follow with all those abbreviations.

::: poppler/Catalog.h
@@ +124,4 @@
>    GooString *readMetadata();
>  
>    // Return the structure tree root object.
> +  StructTreeRoot* getStructTreeRoot();

Note that this is used by pdfinfo, you need to update it too

::: poppler/StructElement.cc
@@ +197,5 @@
> +{
> +  return value->isNum();
> +}
> +
> +static GBool isNumber_or_AutoName(Object* value)

isNumerOrAutoName I guess or even just isNumberOrAuto

@@ +231,5 @@
> +      }                                                                 \
> +      return okay;                                                      \
> +    }
> +
> +ARRAY_CHECKER(isRGBColor_or_OptX4, isRGBColor,        4, gTrue,  gTrue );

what is OptX4?

@@ +247,5 @@
> +// Maps attributes to their names and whether the attribute can be inherited.
> +struct AttributeMapEntry {
> +  Attribute::Type    type;
> +  const char*        name;
> +  const Object*      defval;

Move the * to the variable name.

@@ +862,5 @@
> +StructElement::StructElement(Dict *element, StructTreeRoot *treeRootA, StructElement *parentA):
> +  type(Unknown),
> +  treeRoot(treeRootA),
> +  parent(parentA),
> +  pageRef(),

Do you really need this? I guess it does nothing.

@@ +863,5 @@
> +  type(Unknown),
> +  treeRoot(treeRootA),
> +  parent(parentA),
> +  pageRef(),
> +  s(new StructData())

Why having this as a separate struct?

@@ +900,5 @@
> +{
> +  if (isContent())
> +    delete c;
> +  else
> +    delete s;

I see why using separate structs now. I would just keep the mcid or object ref in the same StructElement and parse it on demand. I find very confusing using variable names like c and s

@@ +995,5 @@
> +    assert(map);
> +
> +    const MCOpArray& ops(getMCOps());
> +    if (!ops.size())
> +      return NULL;

This leaks the map, GlobalParams::getTextEncoding() returns a new reference. I would get the map below right before you need it.

@@ +1149,5 @@
> +
> +  // Language (optional).
> +  if (element->lookup("Lang", &obj)->isString()) {
> +    s->language = obj.getString()->getCString();
> +    obj.initNull(); // The StructElement takes ownership of the GooString

This is weird, I guess you are trying to avoid another string copy. I wonder why you didn't do the same for title. Maybe we could add "take" methods to Object and do something like:

language = obj.takeString();

hmm, but here you are getting the CString from the GooString and leaking the GooString.

@@ +1160,5 @@
> +  if (element->lookup("Alt", &obj)->isString()) {
> +    s->altText = obj.getString();
> +    obj.initNull(); // The StructElement takes ownership of the GooString
> +  } else if (!obj.isNull()) {
> +    error(errSyntaxWarning, -1, "Alt object is wrong type ({0:s})", obj.getTypeName());

I think we can just ignore invalid objects for optional fields.

Comment 12 Albert Astals Cid 2013-06-16 20:58:48 UTC

Wow, [PATCH v3 2/7] Tagged-PDF: Interpret the document structure is huge :D

I admit I haven't looked at the code in depth. But I'll assume there's some tree structure parsing somewhere because PDF are full of those trees, are we protecting against loop in the tree somehow?

Comment 13 Adrian Perez de Castro 2013-06-17 21:53:43 UTC

(In reply to comment #10)

> Sounds good to me, I think I would split it even more [...]

I went ahead and splitted the patch set even more, now my local branch is
looking like this (output from “git log --pretty=oneline --reverse master..”):

  84bdca9 Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary
  ef399c0 Tagged-PDF: Implement parsing of StructTreeRoot
  df1bd34 Tagged-PDF: Implement parsing of StructElem objects
  4698e47 Tagged-PDF: Implement parsing of StructElem attributes
  5a79abd Tagged-PDF: Text content extraction from structure elements
  ddff120 Tagged-PDF: Modify pdfinfo to show the document structure
  be378ca Tagged-PDF: Implement the utils/pdfstructtohtml tool
  6fb852f Tagged-PDF: Expose the structure tree in poppler-glib
  c5c01a4 Tagged-PDF: Pane in poppler-glib demo showing the structure
  0c6ba00 Tagged-PDF: Heuristics in poppler-glib for data/layout table identification

So the huge patch is now split in four. I am going over the rest of your
comments tomorrow before uploading the split patches.

Comment 14 Adrian Perez de Castro 2013-06-17 22:06:21 UTC

(In reply to comment #11)
> Comment on attachment 80407 [details] [review] [review]
> [PATCH v3 2/7] Tagged-PDF: Interpret the document structure
> 
> Review of attachment 80407 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> Ok, I have just a few comments for now, I'll do a full review when the patch
> is split. In general, * are placed wrongly (close to the type instead of the
> argument/variable name). Some parts of the code are hard to follow with all
> those abbreviations.
> 
> ::: poppler/Catalog.h
> @@ +124,4 @@
> >    GooString *readMetadata();
> >  
> >    // Return the structure tree root object.
> > +  StructTreeRoot* getStructTreeRoot();
> 
> Note that this is used by pdfinfo, you need to update it too

The corresponding patch for “pdfinfo” is attached to bug #64813

> ::: poppler/StructElement.cc
> @@ +231,5 @@
> > +      }                                                                 \
> > +      return okay;                                                      \
> > +    }
> > +
> > +ARRAY_CHECKER(isRGBColor_or_OptX4, isRGBColor,        4, gTrue,  gTrue );
> 
> what is OptX4?

Instead one single color value specifying e.g. the color of all the
four sides a box border, it is possible to *optionally* have four
values (hence *X4*) with one value for each side. I will add a comment
about that in the code and think about a better name.
 
> @@ +862,5 @@
> > +StructElement::StructElement(Dict *element, StructTreeRoot *treeRootA, StructElement *parentA):
> > +  type(Unknown),
> > +  treeRoot(treeRootA),
> > +  parent(parentA),
> > +  pageRef(),
> 
> Do you really need this? I guess it does nothing.

Explicitly initializes using the default constructor. I tend to prefer to
write code that is explicit, but I can just remove the initializer for
the “pageRef” attribute.
 
> @@ +863,5 @@
> > +  type(Unknown),
> > +  treeRoot(treeRootA),
> > +  parent(parentA),
> > +  pageRef(),
> > +  s(new StructData())
> 
> Why having this as a separate struct?

The idea is to make the StructElement instances smaller in memory when
just storing a MCID or an OBJR.

> @@ +900,5 @@
> > +{
> > +  if (isContent())
> > +    delete c;
> > +  else
> > +    delete s;
> 
> I see why using separate structs now. I would just keep the mcid or object
> ref in the same StructElement and parse it on demand. I find very confusing
> using variable names like c and s

Do you mean that you would keep them in the parent StructElement? IMHO that
would make the API of StructElement more cumbersome. Also, note that there
can be mixed MCID, OBJR and StructElem children in a structure object, so
that would also complicate to store the children (now it is just a simple
“std::vector<StructElement*>”). Or am I understanding your comment wrongly?

> @@ +1160,5 @@
> > +  if (element->lookup("Alt", &obj)->isString()) {
> > +    s->altText = obj.getString();
> > +    obj.initNull(); // The StructElement takes ownership of the GooString
> > +  } else if (!obj.isNull()) {
> > +    error(errSyntaxWarning, -1, "Alt object is wrong type ({0:s})", obj.getTypeName());
> 
> I think we can just ignore invalid objects for optional fields.

Do you mean to not even print warning messages?

FWIW, the rest of your comments are valid concerns (especially the ones
related to leaking objects) and I will be fixing the issues tomorrow.

Thanks for the review, and next round would be (hopefully) easier with
the split patches :)

Comment 15 Adrian Perez de Castro 2013-06-18 15:58:28 UTC

Created attachment 81008 [details] [review]
[PATCH v5 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary

Comment 16 Adrian Perez de Castro 2013-06-18 15:59:03 UTC

Created attachment 81009 [details] [review]
[PATCH v5 02/10] Tagged-PDF: Implement parsing of StructTreeRoot

Comment 17 Adrian Perez de Castro 2013-06-18 15:59:32 UTC

Created attachment 81010 [details] [review]
[PATCH v5 03/10] Tagged-PDF: Implement parsing of StructElem objects

Comment 18 Adrian Perez de Castro 2013-06-18 16:00:03 UTC

Created attachment 81011 [details] [review]
[PATCH v5 04/10] Tagged-PDF: Implement parsing of StructElem attributes

Comment 19 Adrian Perez de Castro 2013-06-18 16:00:28 UTC

Created attachment 81012 [details] [review]
Tagged-PDF: Text content extraction from structure elements

Comment 20 Adrian Perez de Castro 2013-06-18 16:13:24 UTC

(In reply to comment #12)
> Wow, [PATCH v3 2/7] Tagged-PDF: Interpret the document structure is huge :D
> 
> I admit I haven't looked at the code in depth [...]

I have uploaded an updated patch set with the outstanding issues
already mentioned by Carlos Garcia Campos fixed. This updated patch
set has the big patch split in four, I hope that eases the review
process  O:-)

> [...] But I'll assume there's some tree structure parsing somewhere
> because PDF are full of those trees, are we protecting against loop
> in the tree somehow?

I didn't see code in Poppler that would be particularly useful to parse
the document structure tree. WRT loops in the structure tree, there is
no protection and a malformed PDF file could potentially cause an
infinite loop when parsing the tree. If I am understanding the PDF spec
correctly, well-formed PDFs must not have loops in the tree. How critical
would you say having protection against loops in the tree would be?

Comment 21 Albert Astals Cid 2013-06-18 17:05:34 UTC

(In reply to comment #20)
> (In reply to comment #12)
> > [...] But I'll assume there's some tree structure parsing somewhere
> > because PDF are full of those trees, are we protecting against loop
> > in the tree somehow?
> 
> I didn't see code in Poppler that would be particularly useful to parse
> the document structure tree. WRT loops in the structure tree, there is
> no protection and a malformed PDF file could potentially cause an
> infinite loop when parsing the tree. If I am understanding the PDF spec
> correctly, well-formed PDFs must not have loops in the tree. How critical
> would you say having protection against loops in the tree would be?

Quite critical, people usually spend their time writing such kind of pdf since that way they can make us crash (heap exhaustion) and thus they can say "whooooo i found a CVE in poppler"...

That's why we have in various places the passing of std::set<int> with the refs of already parsed things in the tree so in case you find you have to process something that is already in the set you know you found a loop and simply bail out.

It should not be very hard to add so it'd be cool if you could add it.

Comment 22 Carlos Garcia Campos 2013-06-20 08:42:59 UTC

(In reply to comment #14)
> (In reply to comment #11)
> > Comment on attachment 80407 [details] [review] [review] [review]
> > [PATCH v3 2/7] Tagged-PDF: Interpret the document structure
> > 
> > Review of attachment 80407 [details] [review] [review] [review]:
> > -----------------------------------------------------------------
> > 
> > Ok, I have just a few comments for now, I'll do a full review when the patch
> > is split. In general, * are placed wrongly (close to the type instead of the
> > argument/variable name). Some parts of the code are hard to follow with all
> > those abbreviations.
> > 
> > ::: poppler/Catalog.h
> > @@ +124,4 @@
> > >    GooString *readMetadata();
> > >  
> > >    // Return the structure tree root object.
> > > +  StructTreeRoot* getStructTreeRoot();
> > 
> > Note that this is used by pdfinfo, you need to update it too
> 
> The corresponding patch for “pdfinfo” is attached to bug #64813

Every patch should build without depending on any other, if you change getStructTreeRoot to return a StructTreeRoot * instead of an Object * and pdutils still does doc->getStructTreeRoot()->isDict() this is not going to build. 

> > ::: poppler/StructElement.cc
> > @@ +231,5 @@
> > > +      }                                                                 \
> > > +      return okay;                                                      \
> > > +    }
> > > +
> > > +ARRAY_CHECKER(isRGBColor_or_OptX4, isRGBColor,        4, gTrue,  gTrue );
> > 
> > what is OptX4?
> 
> Instead one single color value specifying e.g. the color of all the
> four sides a box border, it is possible to *optionally* have four
> values (hence *X4*) with one value for each side. I will add a comment
> about that in the code and think about a better name.

Ok, it's a bit confusing I would say.
  
> > @@ +862,5 @@
> > > +StructElement::StructElement(Dict *element, StructTreeRoot *treeRootA, StructElement *parentA):
> > > +  type(Unknown),
> > > +  treeRoot(treeRootA),
> > > +  parent(parentA),
> > > +  pageRef(),
> > 
> > Do you really need this? I guess it does nothing.
> 
> Explicitly initializes using the default constructor. I tend to prefer to
> write code that is explicit, but I can just remove the initializer for
> the “pageRef” attribute.

The default constructor is used in any case, I think those unneeded explicit initializers are just noise, but I'm not C++ expert.
  
> > @@ +863,5 @@
> > > +  type(Unknown),
> > > +  treeRoot(treeRootA),
> > > +  parent(parentA),
> > > +  pageRef(),
> > > +  s(new StructData())
> > 
> > Why having this as a separate struct?
> 
> The idea is to make the StructElement instances smaller in memory when
> just storing a MCID or an OBJR.
>
> > @@ +900,5 @@
> > > +{
> > > +  if (isContent())
> > > +    delete c;
> > > +  else
> > > +    delete s;
> > 
> > I see why using separate structs now. I would just keep the mcid or object
> > ref in the same StructElement and parse it on demand. I find very confusing
> > using variable names like c and s
> 
> Do you mean that you would keep them in the parent StructElement? IMHO that
> would make the API of StructElement more cumbersome. Also, note that there
> can be mixed MCID, OBJR and StructElem children in a structure object, so
> that would also complicate to store the children (now it is just a simple
> “std::vector<StructElement*>”). Or am I understanding your comment wrongly?

I think we should try to delay most of this to be done in demand, note that in most of the cases we will be parsing al this for nothing, because poppler api users don't support structure/tagged pdfs. So, I would just keep the mcid or object ref, and parse it on demand if it's required only.
 
> > @@ +1160,5 @@
> > > +  if (element->lookup("Alt", &obj)->isString()) {
> > > +    s->altText = obj.getString();
> > > +    obj.initNull(); // The StructElement takes ownership of the GooString
> > > +  } else if (!obj.isNull()) {
> > > +    error(errSyntaxWarning, -1, "Alt object is wrong type ({0:s})", obj.getTypeName());
> > 
> > I think we can just ignore invalid objects for optional fields.
> 
> Do you mean to not even print warning messages?

Yes.

> FWIW, the rest of your comments are valid concerns (especially the ones
> related to leaking objects) and I will be fixing the issues tomorrow.
> 
> Thanks for the review, and next round would be (hopefully) easier with
> the split patches :)

Thanks

Comment 23 Carlos Garcia Campos 2013-06-20 10:08:30 UTC

Comment on attachment 81008 [details] [review]
[PATCH v5 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary

Review of attachment 81008 [details] [review]:
-----------------------------------------------------------------

This looks good to me.

::: poppler/Catalog.h
@@ +129,5 @@
> +  enum MarkInfoFlags {
> +    markInfoNull           = (1 << 0),
> +    markInfoMarked         = (1 << 1),
> +    markInfoUserProperties = (1 << 2),
> +    markInfoSuspects       = (1 << 3),

I think you don't need parentheses for the enum values.

Comment 24 Carlos Garcia Campos 2013-06-20 10:45:49 UTC

Comment on attachment 81009 [details] [review]
[PATCH v5 02/10] Tagged-PDF: Implement parsing of StructTreeRoot

Review of attachment 81009 [details] [review]:
-----------------------------------------------------------------

Looks good in general, but you should make sure it doesn't break the build. I have a few other comments.

::: poppler/Catalog.cc
@@ +854,5 @@
> +      return NULL;
> +    }
> +
> +    if (catalog.dictLookup("StructTreeRoot", &root)->isDict("StructTreeRoot")) {
> +      structTreeRoot = new StructTreeRoot(doc, root.getDict(), getMarkInfo() & markInfoMarked);

What should we do in case of suspects? I don't like bool parameters, maybe we could just pass the mark info flags and let the StructTreeRoot decide what to do depending on the flags. You don't even need to pass this as a parameter, since the StructTreeRoot keeps a pointer to the PDFDoc, you can get the mark inflo flags from the doc directly:

doc->getCatalog()->getMarkInfo()

::: poppler/Catalog.h
@@ +124,4 @@
>    GooString *readMetadata();
>  
>    // Return the structure tree root object.
> +  StructTreeRoot *getStructTreeRoot();

As I said, this breaks the build. It's very important that every patch builds, since it might affect any future bisection, for example.

::: poppler/StructTreeRoot.cc
@@ +110,5 @@
> +  // Parse the children StructElements
> +  Object kids;
> +  if (root->lookup("K", &kids)->isArray()) {
> +    if (marked && kids.arrayGetLength() > 1) {
> +      error(errSyntaxWarning, -1, "K in StructTreeRoot has more than one children in a tagged PDF");

I would leave this for patches to add tagged pdf support, adding first just the generic logical structure parsing.

@@ +118,5 @@
> +      kids.arrayGetNF(i, &ref);
> +      if (kids.arrayGet(i, &obj)->isDict()) {
> +        StructElement *child = new StructElement(obj.getDict(), this);
> +        if (child->isOk()) {
> +          if (marked && !(child->getType() == StructElement::Document ||

Ditto.

@@ +125,5 @@
> +                          child->getType() == StructElement::Div)) {
> +            error(errSyntaxWarning, -1, "StructTreeRoot element of tagged PDF is wrong type ({0:s})", child->getTypeName());
> +          }
> +          appendElement(child);
> +          if (ref.isRef()) parentTreeAdd(ref.getRef(), child);

Move the if body to its own line, please.

::: poppler/StructTreeRoot.h
@@ +36,5 @@
> +  unsigned getNumElements() const { return elements.size(); }
> +  const StructElement *getElement(int i) const { return elements.at(i); }
> +  StructElement *getElement(int i) { return elements.at(i); }
> +  void appendElement(StructElement *element)
> +  { if (element && element->isOk()) elements.push_back(element); }

We normally indent the body when it's in a new line.

@@ +39,5 @@
> +  void appendElement(StructElement *element)
> +  { if (element && element->isOk()) elements.push_back(element); }
> +
> +  const StructElement *findParentElement(unsigned index) const
> +  { return (index < parentTree.size() && parentTree[index].size() == 1) ? parentTree[index][0].element : NULL; }

Ditto. I wonder if this would be easier to read in the .cc file, though.

Comment 25 Carlos Garcia Campos 2013-06-20 10:59:51 UTC

Comment on attachment 81010 [details] [review]
[PATCH v5 03/10] Tagged-PDF: Implement parsing of StructElem objects

Review of attachment 81010 [details] [review]:
-----------------------------------------------------------------

Please, try to make sure patches build before submitting them.

::: poppler/StructElement.cc
@@ +82,5 @@
> +  { StructElement::Figure,     "Figure",     elementTypeUndefined },
> +  { StructElement::Formula,    "Formula",    elementTypeUndefined },
> +  { StructElement::Form,       "Form",       elementTypeUndefined },
> +  { StructElement::TOC,        "TOC",        elementTypeUndefined },
> +  { StructElement::TOCI,       "TOCI",       elementTypeUndefined },

Isn't all this also part of the standard attributes defined for tagged PDFs? If we leave the tagged pdf support for a different patch, maybe you can merge this and the StructTreeRoot into a single patch and you don't need to dummy impl of StructElement in the previous patch.

@@ +237,5 @@
> +
> +GooString* StructElement::getText(GooString *string, GBool recursive) const
> +{
> +  // TODO: Dummy implementation, complete
> +  return NULL;

Do we really need this? or can it be added in a follow up patch with the proper code instead?

@@ +338,5 @@
> +
> +  // Language (optional).
> +  if (element->lookup("Lang", &obj)->isString()) {
> +    s->language = obj.getString();
> +    obj.initNull(); // The StructElement takes ownership of the GooString

Same concern I had in previous review.

::: poppler/StructElement.h
@@ +74,5 @@
> +  const GooString *getID() const { return isContent() ? NULL : s->id; }
> +  GooString *getID() { return isContent() ? NULL : s->id; }
> +
> +  // Optional ISO language name, e.g. en_US
> +  GooString *getLang()

getLanguage() is not that long :-)

@@ +75,5 @@
> +  GooString *getID() { return isContent() ? NULL : s->id; }
> +
> +  // Optional ISO language name, e.g. en_US
> +  GooString *getLang()
> +  { return (!isContent() && s->language) ? s->language : (parent ? parent->getLang() : NULL); }

We normally indent this.

@@ +85,5 @@
> +  void setRevision(Guint revision) { if (isContent()) s->revision = revision; }
> +
> +  // Optional element title, in human-readable form.
> +  const GooString *getTitle() const { return isContent() ? NULL : s->title; }
> +  GooString *getTitle() { return isContent() ? NULL : s->title; }

Do we really need the two versions of every getter? I prefer to only add the code that is actually used and add more later when needed.

@@ +98,5 @@
> +
> +  void appendElement(StructElement *element)
> +  { if (!isContent() && element && element->isOk()) s->elements.push_back(element); }
> +
> +<<<<<<< HEAD

Please, at least try to build the patches before submitting them.

@@ +133,5 @@
> +  //
> +  // The text will be appended to the passed GooString. If NULL is passed,
> +  // a new string is returned, and the ownership passed to the caller.
> +  //
> +  GooString* getText(GooString *string = NULL, GBool recursive = gTrue) const;

Don't include this in this first patch if it's unused and unimplemented.

Comment 26 Adrian Perez de Castro 2013-06-25 17:46:13 UTC

(In reply to comment #25)
> Comment on attachment 81010 [details] [review] [review]
> [PATCH v5 03/10] Tagged-PDF: Implement parsing of StructElem objects
> 
> Review of attachment 81010 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> Please, try to make sure patches build before submitting them.
> 
> ::: poppler/StructElement.cc
> @@ +82,5 @@
> > +  { StructElement::Figure,     "Figure",     elementTypeUndefined },
> > +  { StructElement::Formula,    "Formula",    elementTypeUndefined },
> > +  { StructElement::Form,       "Form",       elementTypeUndefined },
> > +  { StructElement::TOC,        "TOC",        elementTypeUndefined },
> > +  { StructElement::TOCI,       "TOCI",       elementTypeUndefined },
> 
> Isn't all this also part of the standard attributes defined for tagged PDFs?
> If we leave the tagged pdf support for a different patch, maybe you can
> merge this and the StructTreeRoot into a single patch and you don't need to
> dummy impl of StructElement in the previous patch.

Yes, this is for the standard attributes, I can try to avoid having a
completely dummy StructElement class in the patch implementing the
StructTreeRoot one — but be aware that will move the parsing of
StructElement to the patch where StructTreeRoot is parsed and it
will grow bigger :-/

> @@ +237,5 @@
> > +
> > +GooString* StructElement::getText(GooString *string, GBool recursive) const
> > +{
> > +  // TODO: Dummy implementation, complete
> > +  return NULL;
> 
> Do we really need this? or can it be added in a follow up patch with the
> proper code instead?

Not really. I am changing this to not have a dummy StructElement::getText
and add it in a later patch.

> @@ +338,5 @@
> > +
> > +  // Language (optional).
> > +  if (element->lookup("Lang", &obj)->isString()) {
> > +    s->language = obj.getString();
> > +    obj.initNull(); // The StructElement takes ownership of the GooString
> 
> Same concern I had in previous review.

I have seen similar uses of Object::initNull() in other parts of the
Poppler code (to mention one example: Parser.cc, line 178), so this
is not a completely alien code pattern, I would say ;-)

Adding a Object::takeString() method (and related ones) is not really
needed. It would only make the code more self-documenting, avoiding
to have to add a comment about the ownership being transferred — which
is nice but hardly a priority, I guess.


> ::: poppler/StructElement.h
> @@ +74,5 @@
> > +  const GooString *getID() const { return isContent() ? NULL : s->id; }
> > +  GooString *getID() { return isContent() ? NULL : s->id; }
> > +
> > +  // Optional ISO language name, e.g. en_US
> > +  GooString *getLang()
> 
> getLanguage() is not that long :-)

I am changing it to getLanguage()


> @@ +75,5 @@
> > +  GooString *getID() { return isContent() ? NULL : s->id; }
> > +
> > +  // Optional ISO language name, e.g. en_US
> > +  GooString *getLang()
> > +  { return (!isContent() && s->language) ? s->language : (parent ? parent->getLang() : NULL); }
> 
> We normally indent this.

Fixing. Also, I am changing the body to this, to make it clearer:

  GooString *getLanguage() {
     if (!isContent() && s->language) return s->language;
     return parent ? parent->getLanguage() : NULL;
  }

> @@ +85,5 @@
> > +  void setRevision(Guint revision) { if (isContent()) s->revision = revision; }
> > +
> > +  // Optional element title, in human-readable form.
> > +  const GooString *getTitle() const { return isContent() ? NULL : s->title; }
> > +  GooString *getTitle() { return isContent() ? NULL : s->title; }
> 
> Do we really need the two versions of every getter? I prefer to only add the
> code that is actually used and add more later when needed.

Ideally only the “const” getter should be needed, but for that to work
GooString (or whatever the type the returned object has) must also tag
its own methods that do not modify the instance as “const”. Take for
example the following:

  const StructElement *elem = /* ... */
  printf("Language: %s\n", elem->getLanguage->getCString());

Here, getLanguage() would return a “const GooString*”, and following
the rules for constness in C++, only “const” methods can be invoked
using the returned pointer (because they are guaranteed to not modify
the GooString instance), but unfortunately GooString::getCString() is
not declared “const” (even when it does _not_ modify the GooString),
so the compiler assumes that the internals of the returned GooString
may change; and if they change, the internals of the StructElement
are also changing, therefore a “const” getter cannot be used...

The Good Solution™ for this is marking methods of existing classes
properly as “const” (where appropriate). Here I decided to try to
limit the impact in the rest of the code by having the non-“const”
getter.

There is one thing that could be to improve the readability, though.
I could rewrite the non-“const” getters like this:

  const GooString* getLanguage() const { /* ... */ }
  GooString* getLanguage() { 
     return const_cast<GooString*>(static_cast<GooString*>(this)->getLanguage());
  }

(Not sure which one is uglier, if the original or this one I have
just suggested... IMHO it is okay to repeat the simple getters,
and maybe use this suggestion I have done here for complex getters).


> @@ +133,5 @@
> > +  //
> > +  // The text will be appended to the passed GooString. If NULL is passed,
> > +  // a new string is returned, and the ownership passed to the caller.
> > +  //
> > +  GooString* getText(GooString *string = NULL, GBool recursive = gTrue) const;
> 
> Don't include this in this first patch if it's unused and unimplemented.

Okay.

Comment 27 Adrian Perez de Castro 2013-07-10 16:45:20 UTC

Created attachment 82286 [details] [review]
[PATCH v6 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary

Comment 28 Adrian Perez de Castro 2013-07-10 16:45:49 UTC

Created attachment 82287 [details] [review]
[PATCH v6 02/10] Tagged-PDF: Implement parsing of StructTreeRoot

Comment 29 Adrian Perez de Castro 2013-07-10 16:46:20 UTC

Created attachment 82288 [details] [review]
[PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects

Comment 30 Adrian Perez de Castro 2013-07-10 16:47:29 UTC

Created attachment 82289 [details] [review]
[PATCH v6 04/10] Tagged-PDF: Implement parsing of StructElem attributes

Comment 31 Adrian Perez de Castro 2013-07-10 16:48:23 UTC

Created attachment 82290 [details] [review]
[PATCH v6 05/10] Tagged-PDF: Text content extraction from structure elements

Comment 32 Adrian Perez de Castro 2013-07-11 19:29:50 UTC

JFTR, the updates patches I have uploaded should cover the comments
from the review (thanks for your comments!). Also I have implemented
Albert's suggestion about tracking the structure elements as they
are parsed, to avoid the case in which a malformed/crafted PDF with
an element referring to an ancestor in the tree would cause an
infinite loop.

Comment 33 Carlos Garcia Campos 2013-08-01 10:14:52 UTC

Comment on attachment 82286 [details] [review]
[PATCH v6 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary

Review of attachment 82286 [details] [review]:
-----------------------------------------------------------------

Pushed to git master, thanks!

Comment 34 Carlos Garcia Campos 2013-08-01 10:24:32 UTC

Comment on attachment 82287 [details] [review]
[PATCH v6 02/10] Tagged-PDF: Implement parsing of StructTreeRoot

Review of attachment 82287 [details] [review]:
-----------------------------------------------------------------

::: utils/pdfinfo.cc
@@ +229,5 @@
> +	  (doc->getCatalog()->getMarkInfo() & Catalog::markInfoMarked) ? "yes" : "no");
> +   printf("UserProperties: %s\n",
> +	  (doc->getCatalog()->getMarkInfo() & Catalog::markInfoUserProperties) ? "yes" : "no");
> +   printf("Suspects:       %s\n",
> +	  (doc->getCatalog()->getMarkInfo() & Catalog::markInfoSuspects) ? "yes" : "no");

This doesn't actually belongs to this patch, since it uses the getMarkInfo new API added in previous patch. I've pushed this as a separate patch.

Comment 35 Adrian Perez de Castro 2013-08-01 10:57:53 UTC

Great to see this merged, thanks to you for the thorough reviews. I hope to help improving this over time. There are still things that can be done, like for example adding support in Poppler to generate the document structure :)

Comment 36 Carlos Garcia Campos 2013-08-28 07:55:19 UTC

Comment on attachment 82287 [details] [review]
[PATCH v6 02/10] Tagged-PDF: Implement parsing of StructTreeRoot

Review of attachment 82287 [details] [review]:
-----------------------------------------------------------------

If we simplify this patch removing the tagged pdf bits and role/class maps that are currently unused maybe you can merge this with the following one avoiding the dummy class.

::: poppler/StructTreeRoot.cc
@@ +43,5 @@
> +  // The RoleMap/ClassMap dictionaries are needed by all the parsing
> +  // functions, which will resolve the custom names to canonical
> +  // standard names.
> +  root->lookup("RoleMap", &roleMap);
> +  root->lookup("ClassMap", &classMap);

Since these are actually unused in this patch, we can probably add them in the first patch that actually uses them, since they are optional entries in the end.

@@ +103,5 @@
> +  }
> +  obj.free();
> +
> +  // Parse the children StructElements
> +  const GBool marked = doc->getCatalog()->getMarkInfo() & Catalog::markInfoMarked;

As I commented previously I would focus first on logical structure (Section 14.7) only and then add the tagged pdf support (Section 14.8) on top of that. That would leave this patch even simpler and maybe you can even merge this and the following one instead of using a dummy class here just to make it build.

@@ +111,5 @@
> +      error(errSyntaxWarning, -1, "K in StructTreeRoot has more than one children in a tagged PDF");
> +    }
> +    for (int i = 0; i < kids.arrayGetLength(); i++) {
> +      Object obj, ref;
> +      kids.arrayGetNF(i, &ref);

This reference is only used in the if below.

@@ +122,5 @@
> +                          child->getType() == StructElement::Div)) {
> +            error(errSyntaxWarning, -1, "StructTreeRoot element of tagged PDF is wrong type ({0:s})", child->getTypeName());
> +          }
> +          appendElement(child);
> +          if (ref.isRef()) {

Here you could directly do:

if (kids.arrayGetNF(i, &ref)->isRef())

Comment 37 Carlos Garcia Campos 2013-08-28 08:35:21 UTC

Comment on attachment 82288 [details] [review]
[PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects

Review of attachment 82288 [details] [review]:
-----------------------------------------------------------------

I still don't see the point of having two structs heap allocated and all the API methods with 'if isFoo()' to use one or the other, but I'm not opposed either if you guys think it's better.

::: poppler/StructElement.cc
@@ +83,5 @@
> +  { StructElement::Formula,    "Formula",    elementTypeUndefined },
> +  { StructElement::Form,       "Form",       elementTypeUndefined },
> +  { StructElement::TOC,        "TOC",        elementTypeUndefined },
> +  { StructElement::TOCI,       "TOCI",       elementTypeUndefined },
> +};

All this is part of the standard types defined for tagged pdf, right? If we move this to a follow up patch and can merge this with the previous one, containing only logical structure implementation.

@@ +126,5 @@
> +//------------------------------------------------------------------------
> +// StructElement
> +//------------------------------------------------------------------------
> +
> +const Ref StructElement::InvalidRef = { -1, -1 };

This looks unused here

@@ +176,5 @@
> +  c(new ContentData(mcid))
> +{
> +  assert(treeRoot);
> +  assert(parent);
> +  assert(c->mcid != InvalidMCID);

Instead of creating the ContentData struct with an invalid mcid, and asserting here, I think we should ensure that StructElement is never created with an invalid mcid. I don't understand in which situation mcid can be InvalidMCID or why -1 can't be a valid MCID, the spec says it's an integer that uniquely identifies the marked content sequence.

@@ +188,5 @@
> +{
> +  assert(treeRoot);
> +  assert(parent);
> +  assert(c->ref.num >= 0);
> +  assert(c->ref.gen >= 0);

Same here, this should never be called with an invalid ref and I think it's already ensured by the callers.

@@ +226,5 @@
> +  if (parent)
> +    return parent->getPageRef();
> +
> +  static const Ref invalidRef = { -1, -1 };
> +  return invalidRef;

I would make this boolean returning the Ref as an out parameter instead of using this invalidRef

@@ +299,5 @@
> +  obj.free();
> +
> +  // Object ID (optional), to be looked at the IDTree in the tree root.
> +  if (element->lookup("ID", &obj)->isString()) {
> +    s->id = new GooString(obj.getString());

It's probably simpler to do obj.getString()->copy(), but it's exactly the same.

@@ +317,5 @@
> +  obj.free();
> +
> +  // Element title (optional).
> +  if (element->lookup("T", &obj)->isString()) {
> +    s->title = new GooString(obj.getString());

Ditto.

@@ +324,5 @@
> +
> +  // Language (optional).
> +  if (element->lookup("Lang", &obj)->isString()) {
> +    s->language = obj.getString();
> +    obj.initNull(); // The StructElement takes ownership of the GooString

I still don't understand why you are doing this here and not for the title for example. And I still think the proper way of doing this is adding Object::takeString() or something like that. Another way of avoid duplicating the string is to save the Object instead of getting the String contained.

@@ +331,5 @@
> +
> +  // Alternative text (optional).
> +  if (element->lookup("Alt", &obj)->isString()) {
> +    s->altText = obj.getString();
> +    obj.initNull(); // The StructElement takes ownership of the GooString

Ditto.

@@ +338,5 @@
> +
> +  // Expanded form of an abbreviation (optional).
> +  if (element->lookup("E", &obj)->isString()) {
> +    s->expandedAbbr = obj.getString();
> +    obj.initNull(); // The StructElement takes ownership of the GooString

Ditto.

@@ +345,5 @@
> +
> +  // Actual text (optional).
> +  if (element->lookup("ActualText", &obj)->isString()) {
> +    s->actualText = obj.getString();
> +    obj.initNull(); // The StructElement takes ownership of the GooString

Ditto.

@@ +378,5 @@
> +      mcidObj.free();
> +      return NULL;
> +    }
> +
> +    child = new StructElement(mcidObj.getInt(), treeRoot, this);

mcidObj.free() after this.

@@ +383,5 @@
> +
> +    if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) {
> +      child->pageRef = pageRefObj;
> +    } else {
> +      pageRefObj.free();

I don't think you need the else, you can call .free unconditionally:

if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef())
  child->pageRef = pageRefObj;
pageRefObj.free();

@@ +396,5 @@
> +
> +      if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) {
> +        child->pageRef = pageRefObj;
> +      } else {
> +        pageRefObj.free();

Ditto.

@@ +403,5 @@
> +      error(errSyntaxError, -1, "Obj object is wrong type ({0:s})", refObj.getTypeName());
> +    }
> +    refObj.free();
> +  } else if (childObj->isDict()) {
> +    assert(ref->isRef());

Why is this required to be a valid ref when child object is a dict?

::: poppler/StructElement.h
@@ +63,3 @@
>  
> +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }

I don't think we should consider the cases of invalid objects, isOk should return FALSE in all those cases

@@ +63,5 @@
>  
> +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> +
> +  int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; }

This is not actually public API, so I think it's simpler to assume this is only going to be called when type == MCID, so the caller should ensure that by checking the type first.

@@ +64,5 @@
> +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> +
> +  int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; }
> +  Ref getObjectRef() const { return type == OBJR ? c->ref : InvalidRef; }

Ditto.

::: poppler/StructTreeRoot.cc
@@ +115,4 @@
>      for (int i = 0; i < kids.arrayGetLength(); i++) {
>        Object obj, ref;
>        kids.arrayGetNF(i, &ref);
> +      assert(ref.isRef());

Why should this be a valid ref? Shouldn't we ignore invalid entries and continue parsing instead of crashing?

Comment 38 Carlos Garcia Campos 2013-08-28 08:36:01 UTC

Comment on attachment 82288 [details] [review]
[PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects

Review of attachment 82288 [details] [review]:
-----------------------------------------------------------------

I also think there are too many asserts in this patch that might not be necessary

Comment 39 Adrian Perez de Castro 2013-09-26 18:15:00 UTC

(In reply to comment #37)
> Comment on attachment 82288 [details] [review] [review]
> [PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects
> 
> Review of attachment 82288 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> I still don't see the point of having two structs heap allocated and all the
> API methods with 'if isFoo()' to use one or the other, but I'm not opposed
> either if you guys think it's better.
> 
> ::: poppler/StructElement.cc
> @@ +83,5 @@
> > +  { StructElement::Formula,    "Formula",    elementTypeUndefined },
> > +  { StructElement::Form,       "Form",       elementTypeUndefined },
> > +  { StructElement::TOC,        "TOC",        elementTypeUndefined },
> > +  { StructElement::TOCI,       "TOCI",       elementTypeUndefined },
> > +};
> 
> All this is part of the standard types defined for tagged pdf, right? If we
> move this to a follow up patch and can merge this with the previous one,
> containing only logical structure implementation.

Righy. I will split this to a separate patch.

> @@ +126,5 @@
> > +//------------------------------------------------------------------------
> > +// StructElement
> > +//------------------------------------------------------------------------
> > +
> > +const Ref StructElement::InvalidRef = { -1, -1 };
> 
> This looks unused here

Removed.

> @@ +176,5 @@
> > +  c(new ContentData(mcid))
> > +{
> > +  assert(treeRoot);
> > +  assert(parent);
> > +  assert(c->mcid != InvalidMCID);
> 
> Instead of creating the ContentData struct with an invalid mcid, and
> asserting here, I think we should ensure that StructElement is never created
> with an invalid mcid. I don't understand in which situation mcid can be
> InvalidMCID or why -1 can't be a valid MCID, the spec says it's an integer
> that uniquely identifies the marked content sequence.

You are right, instantiation inside StructElement::parseChild() checks that
the MCID is an integer.


> @@ +188,5 @@
> > +{
> > +  assert(treeRoot);
> > +  assert(parent);
> > +  assert(c->ref.num >= 0);
> > +  assert(c->ref.gen >= 0);
> 
> Same here, this should never be called with an invalid ref and I think it's
> already ensured by the callers.

Yes, StructElement::parseChild() also checks that this will be a Ref.

Personally, I think it does not hurt to have them even when with the
current implementation it is guaranteed that the passed values will
be correct. For example, if the API is extended at some point to also
allow creation of structure trees (i.e: not just reading them) those
constructors could/would be used by client code, and it would be good
to have the asserts to ensure that the API is used correctly… Anyway,
I am digressing; I will remove the extra asserts for now  O:-)


> @@ +226,5 @@
> > +  if (parent)
> > +    return parent->getPageRef();
> > +
> > +  static const Ref invalidRef = { -1, -1 };
> > +  return invalidRef;
> 
> I would make this boolean returning the Ref as an out parameter instead of
> using this invalidRef

Seems better, yes. I am changing this to:

  GBool StructElement::getPageRef(Ref& ref);

with the suggested semantics.

> @@ +299,5 @@
> > +  obj.free();
> > +
> > +  // Object ID (optional), to be looked at the IDTree in the tree root.
> > +  if (element->lookup("ID", &obj)->isString()) {
> > +    s->id = new GooString(obj.getString());
> 
> It's probably simpler to do obj.getString()->copy(), but it's exactly the
> same.

Changed.

> @@ +317,5 @@
> > +  obj.free();
> > +
> > +  // Element title (optional).
> > +  if (element->lookup("T", &obj)->isString()) {
> > +    s->title = new GooString(obj.getString());
> 
> Ditto.

Changed as well.

> @@ +324,5 @@
> > +
> > +  // Language (optional).
> > +  if (element->lookup("Lang", &obj)->isString()) {
> > +    s->language = obj.getString();
> > +    obj.initNull(); // The StructElement takes ownership of the GooString
> 
> I still don't understand why you are doing this here and not for the title
> for example. And I still think the proper way of doing this is adding
> Object::takeString() or something like that. Another way of avoid
> duplicating the string is to save the Object instead of getting the String
> contained.

I have added Object::takeString(), and also modified the rest of the
places where strings are used to use Object::takeString() if possible.

> @@ +378,5 @@
> > +      mcidObj.free();
> > +      return NULL;
> > +    }
> > +
> > +    child = new StructElement(mcidObj.getInt(), treeRoot, this);
> 
> mcidObj.free() after this.

Good catch, thanks. I have added it.

> @@ +383,5 @@
> > +
> > +    if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) {
> > +      child->pageRef = pageRefObj;
> > +    } else {
> > +      pageRefObj.free();
> 
> I don't think you need the else, you can call .free unconditionally:
> 
> if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef())
>   child->pageRef = pageRefObj;
> pageRefObj.free();

Note that “pageRefObj” is in Object, not a Ref. Note that “child->pageRef”
is a pointer to the Object, so if it is deinitialized by calling .free()
on it, it may be left in an undetermined state. The idea is to only call
.free() if the object is *not* a Ref. If it is a Ref, “child->pageRef” is
freed in the destructor for StructElement.

> @@ +403,5 @@
> > +      error(errSyntaxError, -1, "Obj object is wrong type ({0:s})", refObj.getTypeName());
> > +    }
> > +    refObj.free();
> > +  } else if (childObj->isDict()) {
> > +    assert(ref->isRef());
> 
> Why is this required to be a valid ref when child object is a dict?

Because the reference number is used to populate the “seen” set (it is
a “std::set<int>”) which tracks the seen nodes to avoid infinite loops
in malformed PDFs. I have changed this so instead of asserting, an error
is produced instead. According to the spec a properly formed /StructTree
must contain references for the /StrucElem items (not dictionaries
directly embedded in the tree), so causing an error should be reasonable.

> ::: poppler/StructElement.h
> @@ +63,3 @@
> >  
> > +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> > +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> 
> I don't think we should consider the cases of invalid objects, isOk should
> return FALSE in all those cases

Okay.

> @@ +63,5 @@
> >  
> > +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> > +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> > +
> > +  int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; }
> 
> This is not actually public API, so I think it's simpler to assume this is
> only going to be called when type == MCID, so the caller should ensure that
> by checking the type first.

Yep.

> @@ +64,5 @@
> > +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> > +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> > +
> > +  int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; }
> > +  Ref getObjectRef() const { return type == OBJR ? c->ref : InvalidRef; }
> 
> Ditto.

Yepyep.

> ::: poppler/StructTreeRoot.cc
> @@ +115,4 @@
> >      for (int i = 0; i < kids.arrayGetLength(); i++) {
> >        Object obj, ref;
> >        kids.arrayGetNF(i, &ref);
> > +      assert(ref.isRef());
> 
> Why should this be a valid ref? Shouldn't we ignore invalid entries and
> continue parsing instead of crashing?

Same reason as earlier. In the case of the structure tree it is okay to
skip adding the integer to the “seen” set, as the loop would be caught
later on, so I have removed this assertion.

Comment 40 Adrian Perez de Castro 2013-09-26 18:41:58 UTC

*** Bug 67710 has been marked as a duplicate of this bug. ***

Comment 41 Adrian Perez de Castro 2013-09-26 18:43:48 UTC

Created attachment 86674 [details] [review]
[PATCH v8 01/15] Tagged-PDF: Implement parsing of StructTreeRoot

Comment 42 Adrian Perez de Castro 2013-09-26 18:44:10 UTC

Created attachment 86675 [details] [review]
[PATCH v8 02/15] Implement Object::takeString() method

Comment 43 Adrian Perez de Castro 2013-09-26 18:44:48 UTC

Created attachment 86676 [details] [review]
[PATCH v8 03/15] Tagged-PDF: Implement parsing of StructElem objects

Comment 44 Adrian Perez de Castro 2013-09-26 18:45:14 UTC

Created attachment 86677 [details] [review]
[PATCH v8 04/15] Tagged-PDF: Parsing of StructElem standard types and

Comment 45 Adrian Perez de Castro 2013-09-26 18:45:38 UTC

Created attachment 86678 [details] [review]
[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements

Comment 46 Adrian Perez de Castro 2013-09-27 16:25:15 UTC

Created attachment 86730 [details] [review]
[PATCH v8 03/15] Tagged-PDF: Implement parsing of StructElem objects

Comment 47 Carlos Garcia Campos 2013-09-30 08:27:57 UTC

Comment on attachment 86675 [details] [review]
[PATCH v8 02/15] Implement Object::takeString() method

Review of attachment 86675 [details] [review]:
-----------------------------------------------------------------

::: poppler/Object.h
@@ +199,5 @@
>    double getNum() { OBJECT_3TYPES_CHECK(objInt, objInt64, objReal);
>      return type == objInt ? (double)intg : type == objInt64 ? (double)int64g : real; }
>    GooString *getString() { OBJECT_TYPE_CHECK(objString); return string; }
> +  GooString *takeString() {
> +    OBJECT_TYPE_CHECK(objString); GooString *s = string; initNull(); return s; }

I don't think we should call initNull. We are just stealing the object contents, but the object type should not change. Also calling initNull here would unbalance the numAllocs array increments/decrements when memory debug is enabled. I think we should just set string to NULL, so that when free() is called it will do nothing.

Comment 48 Carlos Garcia Campos 2013-09-30 08:32:08 UTC

The three first patches look good to me in general, but I would squash 1 and 3 and make 2 be the first one with the modifications I suggested. I've done this in the tagged-pdf branch after rebasing it. All other patches have been merged unreviewed. I'll continue reviewing the patches when I have time, but for now, if nobody objects I can merge the two first patches as they are in tagged-pdf branch into master.

http://cgit.freedesktop.org/poppler/poppler/log/?h=tagged-pdf

Comment 49 Carlos Garcia Campos 2013-10-04 08:26:43 UTC

Comment on attachment 86677 [details] [review]
[PATCH v8 04/15] Tagged-PDF: Parsing of StructElem standard types and

Review of attachment 86677 [details] [review]:
-----------------------------------------------------------------

::: poppler/Makefile.am
@@ +299,2 @@
>  	StructElement.cc	\
> +	StructTreeRoot.cc	\

This change looks unrelated.

::: poppler/StructElement.cc
@@ +260,5 @@
> +  Object Auto;
> +  Object Start;
> +  Object None;
> +  Object Before;
> +  Object Nat1;

It's weird that struct members start with a capital letter, but I guess it looks better in the  macros below.

@@ +282,5 @@
> +static const AttributeDefaults attributeDefaults;
> +
> +
> +#define ATTR_LIST_END      { Attribute::Unknown, NULL, NULL, gFalse, NULL }
> +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c }

What D stand for? Default?

@@ +283,5 @@
> +
> +
> +#define ATTR_LIST_END      { Attribute::Unknown, NULL, NULL, gFalse, NULL }
> +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c }
> +#define ATTR_N(x, i, c)    { Attribute::x, #x, NULL, i, c }

And N? Normal? NotDefault? What about using ATTR and ATTR_WITH_DEFAULT? Also I would use words instead of abbreviations to improve the readability type, inheritable, check, defaultValue.

@@ +288,5 @@
> +
> +static const AttributeMapEntry attributeMapCommonShared[] =
> +{
> +  ATTR_D(Placement,       gFalse, isPlacementName, Inline),
> +  ATTR_D(WritingMode,     gFalse, isWritingModeName, LrTb),

WritingMode is inheritable

@@ +295,5 @@
> +  ATTR_D(BorderStyle,     gFalse, isBorderStyle, None),
> +  ATTR_N(BorderThickness, gTrue,  isPositiveOrOptionalArray4),
> +  ATTR_D(Padding,         gFalse, isPositiveOrArray4, Zero),
> +  ATTR_N(Color,           gTrue,  isRGBColor),
> +  ATTR_LIST_END

Do we really need this being the array statically allocated?

@@ +341,5 @@
> +  ATTR_LIST_END
> +};
> +
> +static const AttributeMapEntry attributeMapCommonList[] = {
> +  ATTR_D(ListNumbering, gFalse, isListNumberingName, None),

This is inheritable

@@ +740,5 @@
> +      formatted->Set(formattedA);
> +    else
> +      formatted = new GooString(formattedA);
> +  } else {
> +    delete formatted;

You should set formatted to NULL here

@@ +744,5 @@
> +    delete formatted;
> +  }
> +}
> +
> +GBool Attribute::typeCheck(StructElement *element)

checkType sounds a bit better to me

@@ +747,5 @@
> +
> +GBool Attribute::typeCheck(StructElement *element)
> +{
> +  // If an element is passed, tighther type-checking can be done.
> +  if (element) {

You could use an early return here, to save an indentation level.

@@ +765,5 @@
> +
> +  return gTrue;
> +}
> +
> +Attribute::Type Attribute::typeForName(const char *name, StructElement *element)

You are using get for all other methods, use getTypeForName here for consistency.

@@ +936,5 @@
> +{
> +  if (isContent())
> +    return parent->findAttribute(attributeType, inherit, attributeOwner);
> +
> +  if (attributeType != Attribute::Unknown && attributeType != Attribute::UserProperty) {

I would return early here.

@@ +1132,5 @@
> +            for (unsigned j = attrIndex; j < getNumAttributes(); j++)
> +              getAttribute(j)->setRevision(revision);
> +          } else {
> +            error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName());
> +          }

iobj.free()

@@ +1133,5 @@
> +              getAttribute(j)->setRevision(revision);
> +          } else {
> +            error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName());
> +          }
> +        }

attr.free()

::: poppler/StructElement.h
@@ +27,2 @@
>  class StructTreeRoot;
> +class TextWordList;

What do you need this for?

@@ +116,5 @@
> +  static Type typeForName(const char *name, StructElement *element = NULL);
> +  static Attribute *parseUserProperty(Dict *property);
> +
> +  friend class StructElement;
> +};

The Attribute class implementation is huge, you might consider moving it to its own file

Comment 50 Carlos Garcia Campos 2013-10-09 08:26:25 UTC

Comment on attachment 86678 [details] [review]
[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements

Review of attachment 86678 [details] [review]:
-----------------------------------------------------------------

Looks good in general, my main concern is that some code is duplicated with TextOutputDev, so I prefer if we could somehoe add support for marked content to TextOutputDev instead of inventing yet another output dev. All those abbreviations makes the code a bit more difficult to read for me.

::: poppler/MCOutputDev.cc
@@ +34,5 @@
> +};
> +
> +
> +MCOutputDev::MCOutputDev(int mcid):
> +  p(new Priv(mcid))

Why do we need a heap allocated private struct?

@@ +66,5 @@
> +void MCOutputDev::beginMarkedContent(char *name, Dict *properties)
> +{
> +  int id = -1;
> +  if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid)
> +    p->capturing = true;

capturing sounds a bit confusing to me, I would use something like inMarkedContent

@@ +149,5 @@
> +  if (p->lastFlags != flags) {
> +    if (p->capturing)
> +      p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> +    p->lastFlags = flags;
> +  }

It seems you are not adding any color "op". Either remove the color type or add it when it changes in ::drawChar (or beginWord)

@@ +150,5 @@
> +    if (p->capturing)
> +      p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> +    p->lastFlags = flags;
> +  }
> +}

I wonder if we can integrate all this with TextOutputDev somehow. There's some code duplication and this is about extracting text after all. Maybe we could add a MarkedContentText object, similar to TextPage or ActualText objects, and use it from the TextOutputDev. We could add a new constructor that only creates a MarkedContentText and receives the marked content id, and then you can get the text and attributes.

::: poppler/MCOutputDev.h
@@ +17,5 @@
> +class GfxState;
> +class GooString;
> +class Dict;
> +
> +struct MCOp {

What's op? operator?

@@ +33,5 @@
> +      return ((Guint) (r * 255) & 0xFF) << 16
> +           | ((Guint) (g * 255) & 0xFF) << 8
> +           | ((Guint) (b * 255) & 0xFF);
> +    }
> +  };

Why not using GfxColor? or even GfxRGB if only RGB is supported?

@@ +67,5 @@
> +  }
> +  MCOp(): type(FontName), value(NULL) {}
> +  MCOp(Unicode u): type(Unichar), unichar(u) {}
> +  MCOp(const char *s): type(FontName), value(strdup(s)) {}
> +  MCOp(Type t, Guint f = 0): type(t), flags(f) {}

Why do you need to pass also the Type for the flags constructor?

::: poppler/StructElement.h
@@ +233,5 @@
> +  //
> +  // The text will be appended to the passed GooString. If NULL is passed,
> +  // a new string is returned, and the ownership passed to the caller.
> +  //
> +  GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const;

appendText maybe?

Comment 51 Adrian Perez de Castro 2013-10-15 21:37:54 UTC

(In reply to comment #49)
> Comment on attachment 86677 [details] [review] [review]
> [PATCH v8 04/15] Tagged-PDF: Parsing of StructElem standard types and
> 
> Review of attachment 86677 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> ::: poppler/Makefile.am
> @@ +299,2 @@
> >  	StructElement.cc	\
> > +	StructTreeRoot.cc	\
> 
> This change looks unrelated.

Yep.

> ::: poppler/StructElement.cc
> @@ +260,5 @@
> > +  Object Auto;
> > +  Object Start;
> > +  Object None;
> > +  Object Before;
> > +  Object Nat1;
> 
> It's weird that struct members start with a capital letter, but I guess it
> looks better in the  macros below.

Yes, they do look easier in the eyes when used in the macros below
(at least for me). Should I change them?

> @@ +282,5 @@
> > +static const AttributeDefaults attributeDefaults;
> > +
> > +
> > +#define ATTR_LIST_END      { Attribute::Unknown, NULL, NULL, gFalse, NULL }
> > +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c }
> 
> What D stand for? Default?

Yes, it stands for “default”. ATTR_D defines an attribute with a 
default value.

> @@ +283,5 @@
> > +
> > +
> > +#define ATTR_LIST_END      { Attribute::Unknown, NULL, NULL, gFalse, NULL }
> > +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c }
> > +#define ATTR_N(x, i, c)    { Attribute::x, #x, NULL, i, c }
> 
> And N? Normal? NotDefault? What about using ATTR and ATTR_WITH_DEFAULT? Also
> I would use words instead of abbreviations to improve the readability type,
> inheritable, check, defaultValue.

“N” is for “no-default“. Sure I will change this to use descriptive names.

> @@ +288,5 @@
> > +
> > +static const AttributeMapEntry attributeMapCommonShared[] =
> > +{
> > +  ATTR_D(Placement,       gFalse, isPlacementName, Inline),
> > +  ATTR_D(WritingMode,     gFalse, isWritingModeName, LrTb),
> 
> WritingMode is inheritable

Good catch. Fixing.

> @@ +295,5 @@
> > +  ATTR_D(BorderStyle,     gFalse, isBorderStyle, None),
> > +  ATTR_N(BorderThickness, gTrue,  isPositiveOrOptionalArray4),
> > +  ATTR_D(Padding,         gFalse, isPositiveOrArray4, Zero),
> > +  ATTR_N(Color,           gTrue,  isRGBColor),
> > +  ATTR_LIST_END
> 
> Do we really need this being the array statically allocated?

With “this” I suppose you mean the ATTR_LIST_END entry. Removing
the end marker would make it necessary to store the length of the
array in the entries of the “attributeMapAll” (and related) arrays.
The amount or static data is just a few bytes more having the end
markers in the arrays, and makes the code simpler. If there is any
good reason to remove the end markers, it could be done, but I
would rather leave them there.

> @@ +341,5 @@
> > +  ATTR_LIST_END
> > +};
> > +
> > +static const AttributeMapEntry attributeMapCommonList[] = {
> > +  ATTR_D(ListNumbering, gFalse, isListNumberingName, None),
> 
> This is inheritable

You're right. Fixing.

> @@ +740,5 @@
> > +      formatted->Set(formattedA);
> > +    else
> > +      formatted = new GooString(formattedA);
> > +  } else {
> > +    delete formatted;
> 
> You should set formatted to NULL here

Good catch, thanks.

> @@ +744,5 @@
> > +    delete formatted;
> > +  }
> > +}
> > +
> > +GBool Attribute::typeCheck(StructElement *element)
> 
> checkType sounds a bit better to me

Sure, I am changing it to checkType().

> @@ +747,5 @@
> > +
> > +GBool Attribute::typeCheck(StructElement *element)
> > +{
> > +  // If an element is passed, tighther type-checking can be done.
> > +  if (element) {
> 
> You could use an early return here, to save an indentation level.

Sure, it will be fixed.

> @@ +765,5 @@
> > +
> > +  return gTrue;
> > +}
> > +
> > +Attribute::Type Attribute::typeForName(const char *name, StructElement *element)
> 
> You are using get for all other methods, use getTypeForName here for
> consistency.

Sure.

> @@ +936,5 @@
> > +{
> > +  if (isContent())
> > +    return parent->findAttribute(attributeType, inherit, attributeOwner);
> > +
> > +  if (attributeType != Attribute::Unknown && attributeType != Attribute::UserProperty) {
> 
> I would return early here.

Okay. I am changing it.

> @@ +1132,5 @@
> > +            for (unsigned j = attrIndex; j < getNumAttributes(); j++)
> > +              getAttribute(j)->setRevision(revision);
> > +          } else {
> > +            error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName());
> > +          }
> 
> iobj.free()

Fixed.

> @@ +1133,5 @@
> > +              getAttribute(j)->setRevision(revision);
> > +          } else {
> > +            error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName());
> > +          }
> > +        }
> 
> attr.free()

Fixed as well.

> ::: poppler/StructElement.h
> @@ +27,2 @@
> >  class StructTreeRoot;
> > +class TextWordList;
> 
> What do you need this for?

It had sense at some point when the feature was WIP, but it ended
up not being needed. I will remove it.

> @@ +116,5 @@
> > +  static Type typeForName(const char *name, StructElement *element = NULL);
> > +  static Attribute *parseUserProperty(Dict *property);
> > +
> > +  friend class StructElement;
> > +};
> 
> The Attribute class implementation is huge, you might consider moving it to
> its own file

Sure thing.

Comment 52 Adrian Perez de Castro 2013-10-15 21:48:08 UTC

(In reply to comment #51)
> (In reply to comment #49)
> > 
> > The Attribute class implementation is huge, you might consider moving it to
> > its own file
> 
> Sure thing.

...with the problem that it is not that easy. The “typeMap” table and the
private static functions at StructElement.cc that use the TypeMapEntry use
AttributeMapEntry. So the tables would still need to be in StructElement.cc.
Also, some Attribute methods use the static helper functions, and I would
rather have the Attribute class in StructElement.cc than duplicating code
for those.

Comment 53 Adrian Perez de Castro 2013-10-15 21:50:40 UTC

Created attachment 87701 [details] [review]
[PATCH v9 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes

Comment 54 Carlos Garcia Campos 2013-10-17 09:48:39 UTC

Comment on attachment 87701 [details] [review]
[PATCH v9 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes

Review of attachment 87701 [details] [review]:
-----------------------------------------------------------------

Looks good, I have just a few more comments

::: poppler/Makefile.am
@@ +299,2 @@
>  	StructElement.cc	\
> +	StructTreeRoot.cc	\

This is still unrelated to this patch.

::: poppler/StructElement.cc
@@ +74,5 @@
> +      || value->isName("End")
> +      || value->isName("Center");
> +}
> +
> +static GBool isNumber(Object *value);

Simply move the implementation of isNumber before it's needed so that we don't need prototypes. You could move all isFoo methods for the basic types at the beginning.

@@ +445,5 @@
> +  attributeMapCommonList,
> +  NULL,
> +};
> +
> +static const AttributeMapEntry *attributeMapPrintField[] = {

This is unused.

@@ +493,5 @@
> +
> +
> +static GBool ownerHasMorePriority(Attribute::Owner a, Attribute::Owner b)
> +{
> +  unsigned a_index, b_index;

aIndex, bIndex

@@ +577,5 @@
> +//------------------------------------------------------------------------
> +// Helpers for the attribute and structure type tables
> +//------------------------------------------------------------------------
> +
> +static inline const AttributeMapEntry*

AttributeMapEntry* -> AttributeMapEntry *

@@ +594,5 @@
> +  }
> +  return NULL;
> +}
> +
> +static inline const AttributeMapEntry*

AttributeMapEntry* -> AttributeMapEntry *

@@ +635,5 @@
> +  const OwnerMapEntry *entry = getOwnerMapEntry(owner);
> +  return entry ? entry->name : "UnknownOwner";
> +}
> +
> +Attribute::Owner nameToOwner(const char *name)

This should be static

@@ +695,5 @@
> +
> +  if (copyValue)
> +    valueA->copy(&value);
> +  else
> +    valueA->shallowCopy(&value);

When does this happen? It seems that attributes are always created with copyValue = gFalse.

@@ +712,5 @@
> +
> +  if (copyValue)
> +    valueA->copy(&value);
> +  else
> +    valueA->shallowCopy(&value);

Ditto.

@@ +716,5 @@
> +    valueA->shallowCopy(&value);
> +
> +  if (!checkType()) {
> +    type = Unknown;
> +  }

You are not using braces for all other single statement ifs

@@ +1322,5 @@
> +            if (attribute->isOk() && (typeCheckOk = attribute->checkType(this))) {
> +              appendAttribute(attribute);
> +            } else {
> +              // It is not needed to free "value", the Attribute instance
> +              // owns the contents, so deleting "attribute" is enough.

I think it's fine to always copy the value to make the code simpler. For heavy to copy objects we use reference counter.

Comment 55 Adrian Perez de Castro 2013-11-08 18:15:40 UTC

(In reply to comment #50)
> Comment on attachment 86678 [details] [review] [review]
> [PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements
> 
> Review of attachment 86678 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> Looks good in general, my main concern is that some code is duplicated with
> TextOutputDev, so I prefer if we could somehoe add support for marked
> content to TextOutputDev instead of inventing yet another output dev. All
> those abbreviations makes the code a bit more difficult to read for me.

New version of the patch ditches the “MC” abbreviation and it appears
expanded to “MarkedContent”.

> ::: poppler/MCOutputDev.cc
> @@ +34,5 @@
> > +};
> > +
> > +
> > +MCOutputDev::MCOutputDev(int mcid):
> > +  p(new Priv(mcid))
> 
> Why do we need a heap allocated private struct?

We don't, I first wrote this before being aware that the Poppler low-level
API is not considered public (as in “public stable API”).

> @@ +66,5 @@
> > +void MCOutputDev::beginMarkedContent(char *name, Dict *properties)
> > +{
> > +  int id = -1;
> > +  if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid)
> > +    p->capturing = true;
> 
> capturing sounds a bit confusing to me, I would use something like
> inMarkedContent

Changed.

> @@ +149,5 @@
> > +  if (p->lastFlags != flags) {
> > +    if (p->capturing)
> > +      p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> > +    p->lastFlags = flags;
> > +  }
> 
> It seems you are not adding any color "op". Either remove the color type or
> add it when it changes in ::drawChar (or beginWord)

Ooops. New version of the patch adds the missing code.

> @@ +150,5 @@
> > +    if (p->capturing)
> > +      p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> > +    p->lastFlags = flags;
> > +  }
> > +}
> 
> I wonder if we can integrate all this with TextOutputDev somehow. There's
> some code duplication and this is about extracting text after all. Maybe we
> could add a MarkedContentText object, similar to TextPage or ActualText
> objects, and use it from the TextOutputDev. We could add a new constructor
> that only creates a MarkedContentText and receives the marked content id,
> and then you can get the text and attributes.
> 
> ::: poppler/MCOutputDev.h
> @@ +17,5 @@
> > +class GfxState;
> > +class GooString;
> > +class Dict;
> > +
> > +struct MCOp {
> 
> What's op? operator?

Operation.

> @@ +33,5 @@
> > +      return ((Guint) (r * 255) & 0xFF) << 16
> > +           | ((Guint) (g * 255) & 0xFF) << 8
> > +           | ((Guint) (b * 255) & 0xFF);
> > +    }
> > +  };
> 
> Why not using GfxColor? or even GfxRGB if only RGB is supported?

GfxRGB seems a resonable compromise. Next version will use it.

> @@ +67,5 @@
> > +  }
> > +  MCOp(): type(FontName), value(NULL) {}
> > +  MCOp(Unicode u): type(Unichar), unichar(u) {}
> > +  MCOp(const char *s): type(FontName), value(strdup(s)) {}
> > +  MCOp(Type t, Guint f = 0): type(t), flags(f) {}
> 
> Why do you need to pass also the Type for the flags constructor?

Type “Unicode” is also a “Guint”, so there would be two overloaded
functions with one argument of the same type. The compiler cannot
resolve the ambiguity and generates an error. Adding the extra
“Type” argument is a workaround for that.

> ::: poppler/StructElement.h
> @@ +233,5 @@
> > +  //
> > +  // The text will be appended to the passed GooString. If NULL is passed,
> > +  // a new string is returned, and the ownership passed to the caller.
> > +  //
> > +  GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const;
> 
> appendText maybe?

After thinking twice about this, it looks like it is wrong to have a
single function that can work in two ways. The reason for those two
behaviours is that client code will usually call “getText()”, without
passing arguments, so internally the function will create a GooString
and then iterate over the children recursively, passing that one string
to avoid allocating one GooString per child element.

IMHO the best is to split this in two functions:

  public:
    GooString* getText(GBool recursive = gTrue);
  private:
    void appendSubTreeText(GooString *string);

The new “appendSubTreeText()” function would be a private helper
function used when “getText()” is called with “recusive = gTrue”.

Comment 56 Adrian Perez de Castro 2013-11-11 20:34:20 UTC

Created attachment 89046 [details] [review]
[PATCH v10 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes

Comment 57 Adrian Perez de Castro 2013-11-11 20:35:16 UTC

Created attachment 89047 [details] [review]
[PATCH v10 02/12] Tagged-PDF: Text content extraction from structure elements

Comment 58 Carlos Garcia Campos 2013-11-12 08:43:46 UTC

Comment on attachment 89046 [details] [review]
[PATCH v10 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes

Review of attachment 89046 [details] [review]:
-----------------------------------------------------------------

LGTM, pushed to git master. Thanks!

Comment 59 Adrian Perez de Castro 2013-11-27 20:36:00 UTC

Created attachment 89917 [details] [review]
Tagged-PDF: Text content extraction from structure elements

Comment 60 Adrian Perez de Castro 2013-11-27 20:38:13 UTC

(In reply to comment #59)
> Created attachment 89917 [details] [review] [review]
> Tagged-PDF: Text content extraction from structure elements

In this latest version of the patch the MCOp/MarkedContentOp code is
removed, and the API is now more high-level: text spans are returned by 
MarkedContentOutputDev::getTextSpans(), instead of doing the conversion
later in poppler-glib.

Comment 61 Carlos Garcia Campos 2013-11-28 11:48:14 UTC

Comment on attachment 89917 [details] [review]
Tagged-PDF: Text content extraction from structure elements

Review of attachment 89917 [details] [review]:
-----------------------------------------------------------------

This looks much better and simpler :-)

::: poppler/MarkedContentOutputDev.cc
@@ +43,5 @@
> +
> +void MarkedContentOutputDev::endSpan()
> +{
> +  if (currentText && currentText->getLength()) {
> +    TextSpan span(currentText, currentFont, currentLink);

Add a comment here saying that TextSpan adopts the pointers, because at a first glance it looks like this method leaks memory.

@@ +44,5 @@
> +void MarkedContentOutputDev::endSpan()
> +{
> +  if (currentText && currentText->getLength()) {
> +    TextSpan span(currentText, currentFont, currentLink);
> +    memcpy(&span.data->color, &currentColor, sizeof(GfxRGB));

Why don't you pass the color to the constructor too? Can't you simply do span.data->color = currentColor instead of the memcpy in any case?

@@ +73,5 @@
> +void MarkedContentOutputDev::beginMarkedContent(char *name, Dict *properties)
> +{
> +  int id = -1;
> +  if (properties && properties->lookupInt("MCID", NULL, &id) && id == mcid)
> +    inMarkedContent = true;

I wonder what happens in case of nested marked contents, should we set inMarkedContent to false until the inner marked contents ends? or do we want also the contents of any nested marked content?

@@ +144,5 @@
> +
> +  if (font == currentFont)
> +    return;
> +
> +  endSpan();

Do we want to call endSpan even when inMarkedContent = false?

@@ +158,5 @@
> +
> +
> +void MarkedContentOutputDev::updateFillColor(GfxState *state)
> +{
> +  GfxRGB color;

Same here, should we ignore fill color updates when not in marked content?

@@ +167,5 @@
> +      color.b == currentColor.b)
> +    return;
> +
> +  endSpan();
> +  currentColor = color;

Note that the color of the text depends on the current text rendering mode, for rendering mode 1 (stroke) what we want is the stroke color instead of the fill color. Maybe we could get the current fill/stroke color from the state depending on the current render mode in drawChar instead of using updateFillColor and updateStrokeColor

::: poppler/MarkedContentOutputDev.h
@@ +25,5 @@
> +  enum {
> +    Italic = 1 << 0,
> +    Bold   = 1 << 1,
> +    Fixed  = 1 << 2,
> +  };

This looks unused.

@@ +46,5 @@
> +  }
> +
> +  GfxFont* getFont() const { return data->font; }
> +  GooString* getText() const { return data->text; }
> +  GooString* getLink() const { return data->link; }

What's exactly this link? a named destination? It doesn't seem to be assigned anywhere. If this si something that will be used in a future patch I prefer to add it in that patch instead of adding unused code here.

@@ +52,5 @@
> +
> +private:
> +  // NOTE: Takes ownership of strings, increases refcount for font.
> +  TextSpan(GooString *text, GfxFont *font = NULL, GooString *link = NULL)
> +      : data(new Data) {

Why do you need this heap allocated data struct instead of using members of the class directly? Why is this constructor private?

Comment 62 Adrian Perez de Castro 2013-12-04 16:36:06 UTC

(In reply to comment #61)

> ::: poppler/MarkedContentOutputDev.cc
> @@ +43,5 @@
> > +
> > +void MarkedContentOutputDev::endSpan()
> > +{
> > +  if (currentText && currentText->getLength()) {
> > +    TextSpan span(currentText, currentFont, currentLink);
> 
> Add a comment here saying that TextSpan adopts the pointers, because at a
> first glance it looks like this method leaks memory.

Sure, I am adding a comment. There is already a note in the header file
next to the TextSpan constructor but it will not hurt to have a note
also here.

> @@ +44,5 @@
> > +void MarkedContentOutputDev::endSpan()
> > +{
> > +  if (currentText && currentText->getLength()) {
> > +    TextSpan span(currentText, currentFont, currentLink);
> > +    memcpy(&span.data->color, &currentColor, sizeof(GfxRGB));
> 
> Why don't you pass the color to the constructor too? Can't you simply do
> span.data->color = currentColor instead of the memcpy in any case?

Changed.

> @@ +73,5 @@
> > +void MarkedContentOutputDev::beginMarkedContent(char *name, Dict *properties)
> > +{
> > +  int id = -1;
> > +  if (properties && properties->lookupInt("MCID", NULL, &id) && id == mcid)
> > +    inMarkedContent = true;
> 
> I wonder what happens in case of nested marked contents, should we set
> inMarkedContent to false until the inner marked contents ends? or do we want
> also the contents of any nested marked content?

I cannot think of an use-case where one would not be interested in the
nested content. For the moment I think we can keep it as-is, and later
on if the use-case surfaces, add the change.

On a different note, this comment of yours made me notice that there
was a bug with nested marked content sequences: beginMarkedContent()
would be called again for the nested sequence, and when the inner
sequence finishes endMarkedContent() would set inMarkedContent=gFalse
even when the content from the outer marked content sequence has not
been yet extracted. I have changed this so a “std::vector<int>” is
used as a stack to keep track of the nested sequences.

> @@ +144,5 @@
> > +
> > +  if (font == currentFont)
> > +    return;
> > +
> > +  endSpan();
> 
> Do we want to call endSpan even when inMarkedContent = false?

Well, the string with the text content would by NULL or empty, so it is
harmless; but you are right that how this works is expressed better in
the code by checking inMarkedContent before calling endSpan().

> @@ +158,5 @@
> > +
> > +
> > +void MarkedContentOutputDev::updateFillColor(GfxState *state)
> > +{
> > +  GfxRGB color;
> 
> Same here, should we ignore fill color updates when not in marked content?

Not really. Color changes can affect upcoming elements, so if a color
changes before a marked content sequence, the new color also affects
it. Or am I interpreting the spec wrong?

> @@ +167,5 @@
> > +      color.b == currentColor.b)
> > +    return;
> > +
> > +  endSpan();
> > +  currentColor = color;
> 
> Note that the color of the text depends on the current text rendering mode,
> for rendering mode 1 (stroke) what we want is the stroke color instead of
> the fill color. Maybe we could get the current fill/stroke color from the
> state depending on the current render mode in drawChar instead of using
> updateFillColor and updateStrokeColor

Moved into drawChar() as suggested.

> ::: poppler/MarkedContentOutputDev.h
> @@ +25,5 @@
> > +  enum {
> > +    Italic = 1 << 0,
> > +    Bold   = 1 << 1,
> > +    Fixed  = 1 << 2,
> > +  };
> 
> This looks unused.

Removed.

> @@ +46,5 @@
> > +  }
> > +
> > +  GfxFont* getFont() const { return data->font; }
> > +  GooString* getText() const { return data->text; }
> > +  GooString* getLink() const { return data->link; }
> 
> What's exactly this link? a named destination? It doesn't seem to be
> assigned anywhere. If this si something that will be used in a future patch
> I prefer to add it in that patch instead of adding unused code here.

This is a leftover from the past, when I didn't have clear how links
would end up in the structure tree. Links get their own StructElement
so it is not needed to have this in the TextSpan in the end. I have
removed the “data->link” member and the code accessing it.

> @@ +52,5 @@
> > +
> > +private:
> > +  // NOTE: Takes ownership of strings, increases refcount for font.
> > +  TextSpan(GooString *text, GfxFont *font = NULL, GooString *link = NULL)
> > +      : data(new Data) {
> 
> Why do you need this heap allocated data struct instead of using members of
> the class directly? Why is this constructor private?

TextSpanArray does not contain pointers to TextSpans, but the TextSpans
themselves. The Data member is to share the data across them when the
TextSpans are copied/assigned.

Comment 63 Adrian Perez de Castro 2013-12-04 16:42:19 UTC

Created attachment 90249 [details] [review]
[PATCH v12 01/11] Tagged-PDF: Text content extraction from structure elements

Comment 64 Carlos Garcia Campos 2013-12-05 11:51:20 UTC

Comment on attachment 90249 [details] [review]
[PATCH v12 01/11] Tagged-PDF: Text content extraction from structure elements

Review of attachment 90249 [details] [review]:
-----------------------------------------------------------------

::: poppler/MarkedContentOutputDev.cc
@@ +135,5 @@
> +  // the render mode (for mode 1 stroke color is used), so there is no need
> +  // to implement both updateFillColor() and updateStrokeColor().
> +  GBool colorChange = gFalse;
> +  GfxRGB color;
> +  if (state->getRender() == 1)

This is a mask, it should be if ((state->getRender() & 3) == 1)

Comment 65 Adrian Perez de Castro 2013-12-05 14:14:59 UTC

(In reply to comment #64)
> Comment on attachment 90249 [details] [review] [review]
> [PATCH v12 01/11] Tagged-PDF: Text content extraction from structure elements
> 
> Review of attachment 90249 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> ::: poppler/MarkedContentOutputDev.cc
> @@ +135,5 @@
> > +  // the render mode (for mode 1 stroke color is used), so there is no need
> > +  // to implement both updateFillColor() and updateStrokeColor().
> > +  GBool colorChange = gFalse;
> > +  GfxRGB color;
> > +  if (state->getRender() == 1)
> 
> This is a mask, it should be if ((state->getRender() & 3) == 1)

Done. Also, the updated version I am uploading has better code for GfxFont
comparison.

(See http://lists.freedesktop.org/archives/poppler/2013-December/010702.html)

Comment 66 Adrian Perez de Castro 2013-12-05 14:16:01 UTC

Created attachment 90311 [details] [review]
[PATCH v13 01/11] Tagged-PDF: Text content extraction from structure elements

Comment 67 Carlos Garcia Campos 2013-12-05 16:44:02 UTC

(In reply to comment #66)
> Created attachment 90311 [details] [review] [review]
> [PATCH v13 01/11] Tagged-PDF: Text content extraction from structure elements

Pushed to git master! This was the last patch pending to land, so I'm closing this bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.