If you remove poppler-bugs@lists.freedesktop.org from the list of asignee's or CC's no one is going to hear you, I suggest you don't do that :-)
(In reply to comment #1) > If you remove poppler-bugs@lists.freedesktop.org from the list of asignee's > or CC's no one is going to hear you, I suggest you don't do that :-) Ouch, I just wanted to add people, not removing... I re-added poppler-bugs. O:-)
Created attachment 79990 [details] [review] [PATCH 1/6] Tagged-PDF: Accessors in Catalog for the MarkInfo
Created attachment 79991 [details] [review] [PATCH 2/6] Tagged-PDF: Interpret the document structure
Created attachment 80101 [details] [review] [PATCH v2 2/6] Tagged-PDF: Interpret the document structure This new version of patch 2/6 adds support for recursively resolving the type of structure elements using the /RoleMap (it was marked as “TODO” in the initial version of the patch).
Created attachment 80407 [details] [review] [PATCH v3 2/7] Tagged-PDF: Interpret the document structure New version of the patch, with the following changes: * Implemented parsing object-reference structure elements (type /OBJR). This was one of the TODO items in the previous versions of the patch. * Implemented parsing of the /ParentTree entry in /StructTreeRoot. This allows to find out which structure element is the parent of objects which have a /StructParent(s) entry in their properties dictionary.
Comment on attachment 79990 [details] [review] [PATCH 1/6] Tagged-PDF: Accessors in Catalog for the MarkInfo Review of attachment 79990 [details] [review]: ----------------------------------------------------------------- LGTM ::: poppler/Catalog.h @@ +228,4 @@ > GooString *baseURI; // base URI for URI-type links > Object metadata; // metadata stream > Object structTreeRoot; // structure tree root dictionary > + int markInfo; // Flags from MarkInfo dictionary Maybe this should be Guint instead of int, since it's what getMarkInfo() returns too.
Comment on attachment 80407 [details] [review] [PATCH v3 2/7] Tagged-PDF: Interpret the document structure Review of attachment 80407 [details] [review]: ----------------------------------------------------------------- Maybe you can split the patch in those 3 things? to make it a bit easier to review?
> Maybe you can split the patch in those 3 things? to make it a bit easier to > review? Ouch, the commit message does not really describe well the patch after the rebasing/squashing done prior to uploading. But I can split it up in three logical parts: - Defining the StructElement class and parsing of the corresponding objects from the PDF. - Defining the Attribute class and parsing the corresponding objects from the PDF. - Text content extraction, a.k.a. the machinery needed for StructElement::getText() and StructElement::getMCOps() If there's no objections, I would upload an updated version with the patch split in three mentioned above soon.
(In reply to comment #9) > > Maybe you can split the patch in those 3 things? to make it a bit easier to > > review? > > Ouch, the commit message does not really describe well the patch after > the rebasing/squashing done prior to uploading. But I can split it up > in three logical parts: > > - Defining the StructElement class and parsing of the corresponding > objects from the PDF. > - Defining the Attribute class and parsing the corresponding objects > from the PDF. > - Text content extraction, a.k.a. the machinery needed for > StructElement::getText() and StructElement::getMCOps() > > If there's no objections, I would upload an updated version with the > patch split in three mentioned above soon. Sounds good to me, I think I would split it even more, since the patch is mixing 2 things, document logical structure and tagged PDFs. The first patch could define the StructTreeRoot and StructTreeElement without any tagged PDF info. And an top of logical structure support, add tagged PDF implementation. I'll do a first review of the current patch anyway, but upload a new version split.
Comment on attachment 80407 [details] [review] [PATCH v3 2/7] Tagged-PDF: Interpret the document structure Review of attachment 80407 [details] [review]: ----------------------------------------------------------------- Ok, I have just a few comments for now, I'll do a full review when the patch is split. In general, * are placed wrongly (close to the type instead of the argument/variable name). Some parts of the code are hard to follow with all those abbreviations. ::: poppler/Catalog.h @@ +124,4 @@ > GooString *readMetadata(); > > // Return the structure tree root object. > + StructTreeRoot* getStructTreeRoot(); Note that this is used by pdfinfo, you need to update it too ::: poppler/StructElement.cc @@ +197,5 @@ > +{ > + return value->isNum(); > +} > + > +static GBool isNumber_or_AutoName(Object* value) isNumerOrAutoName I guess or even just isNumberOrAuto @@ +231,5 @@ > + } \ > + return okay; \ > + } > + > +ARRAY_CHECKER(isRGBColor_or_OptX4, isRGBColor, 4, gTrue, gTrue ); what is OptX4? @@ +247,5 @@ > +// Maps attributes to their names and whether the attribute can be inherited. > +struct AttributeMapEntry { > + Attribute::Type type; > + const char* name; > + const Object* defval; Move the * to the variable name. @@ +862,5 @@ > +StructElement::StructElement(Dict *element, StructTreeRoot *treeRootA, StructElement *parentA): > + type(Unknown), > + treeRoot(treeRootA), > + parent(parentA), > + pageRef(), Do you really need this? I guess it does nothing. @@ +863,5 @@ > + type(Unknown), > + treeRoot(treeRootA), > + parent(parentA), > + pageRef(), > + s(new StructData()) Why having this as a separate struct? @@ +900,5 @@ > +{ > + if (isContent()) > + delete c; > + else > + delete s; I see why using separate structs now. I would just keep the mcid or object ref in the same StructElement and parse it on demand. I find very confusing using variable names like c and s @@ +995,5 @@ > + assert(map); > + > + const MCOpArray& ops(getMCOps()); > + if (!ops.size()) > + return NULL; This leaks the map, GlobalParams::getTextEncoding() returns a new reference. I would get the map below right before you need it. @@ +1149,5 @@ > + > + // Language (optional). > + if (element->lookup("Lang", &obj)->isString()) { > + s->language = obj.getString()->getCString(); > + obj.initNull(); // The StructElement takes ownership of the GooString This is weird, I guess you are trying to avoid another string copy. I wonder why you didn't do the same for title. Maybe we could add "take" methods to Object and do something like: language = obj.takeString(); hmm, but here you are getting the CString from the GooString and leaking the GooString. @@ +1160,5 @@ > + if (element->lookup("Alt", &obj)->isString()) { > + s->altText = obj.getString(); > + obj.initNull(); // The StructElement takes ownership of the GooString > + } else if (!obj.isNull()) { > + error(errSyntaxWarning, -1, "Alt object is wrong type ({0:s})", obj.getTypeName()); I think we can just ignore invalid objects for optional fields.
Wow, [PATCH v3 2/7] Tagged-PDF: Interpret the document structure is huge :D I admit I haven't looked at the code in depth. But I'll assume there's some tree structure parsing somewhere because PDF are full of those trees, are we protecting against loop in the tree somehow?
(In reply to comment #10) > Sounds good to me, I think I would split it even more [...] I went ahead and splitted the patch set even more, now my local branch is looking like this (output from “git log --pretty=oneline --reverse master..”): 84bdca9 Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary ef399c0 Tagged-PDF: Implement parsing of StructTreeRoot df1bd34 Tagged-PDF: Implement parsing of StructElem objects 4698e47 Tagged-PDF: Implement parsing of StructElem attributes 5a79abd Tagged-PDF: Text content extraction from structure elements ddff120 Tagged-PDF: Modify pdfinfo to show the document structure be378ca Tagged-PDF: Implement the utils/pdfstructtohtml tool 6fb852f Tagged-PDF: Expose the structure tree in poppler-glib c5c01a4 Tagged-PDF: Pane in poppler-glib demo showing the structure 0c6ba00 Tagged-PDF: Heuristics in poppler-glib for data/layout table identification So the huge patch is now split in four. I am going over the rest of your comments tomorrow before uploading the split patches.
(In reply to comment #11) > Comment on attachment 80407 [details] [review] [review] > [PATCH v3 2/7] Tagged-PDF: Interpret the document structure > > Review of attachment 80407 [details] [review] [review]: > ----------------------------------------------------------------- > > Ok, I have just a few comments for now, I'll do a full review when the patch > is split. In general, * are placed wrongly (close to the type instead of the > argument/variable name). Some parts of the code are hard to follow with all > those abbreviations. > > ::: poppler/Catalog.h > @@ +124,4 @@ > > GooString *readMetadata(); > > > > // Return the structure tree root object. > > + StructTreeRoot* getStructTreeRoot(); > > Note that this is used by pdfinfo, you need to update it too The corresponding patch for “pdfinfo” is attached to bug #64813 > ::: poppler/StructElement.cc > @@ +231,5 @@ > > + } \ > > + return okay; \ > > + } > > + > > +ARRAY_CHECKER(isRGBColor_or_OptX4, isRGBColor, 4, gTrue, gTrue ); > > what is OptX4? Instead one single color value specifying e.g. the color of all the four sides a box border, it is possible to *optionally* have four values (hence *X4*) with one value for each side. I will add a comment about that in the code and think about a better name. > @@ +862,5 @@ > > +StructElement::StructElement(Dict *element, StructTreeRoot *treeRootA, StructElement *parentA): > > + type(Unknown), > > + treeRoot(treeRootA), > > + parent(parentA), > > + pageRef(), > > Do you really need this? I guess it does nothing. Explicitly initializes using the default constructor. I tend to prefer to write code that is explicit, but I can just remove the initializer for the “pageRef” attribute. > @@ +863,5 @@ > > + type(Unknown), > > + treeRoot(treeRootA), > > + parent(parentA), > > + pageRef(), > > + s(new StructData()) > > Why having this as a separate struct? The idea is to make the StructElement instances smaller in memory when just storing a MCID or an OBJR. > @@ +900,5 @@ > > +{ > > + if (isContent()) > > + delete c; > > + else > > + delete s; > > I see why using separate structs now. I would just keep the mcid or object > ref in the same StructElement and parse it on demand. I find very confusing > using variable names like c and s Do you mean that you would keep them in the parent StructElement? IMHO that would make the API of StructElement more cumbersome. Also, note that there can be mixed MCID, OBJR and StructElem children in a structure object, so that would also complicate to store the children (now it is just a simple “std::vector<StructElement*>”). Or am I understanding your comment wrongly? > @@ +1160,5 @@ > > + if (element->lookup("Alt", &obj)->isString()) { > > + s->altText = obj.getString(); > > + obj.initNull(); // The StructElement takes ownership of the GooString > > + } else if (!obj.isNull()) { > > + error(errSyntaxWarning, -1, "Alt object is wrong type ({0:s})", obj.getTypeName()); > > I think we can just ignore invalid objects for optional fields. Do you mean to not even print warning messages? FWIW, the rest of your comments are valid concerns (especially the ones related to leaking objects) and I will be fixing the issues tomorrow. Thanks for the review, and next round would be (hopefully) easier with the split patches :)
Created attachment 81008 [details] [review] [PATCH v5 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary
Created attachment 81009 [details] [review] [PATCH v5 02/10] Tagged-PDF: Implement parsing of StructTreeRoot
Created attachment 81010 [details] [review] [PATCH v5 03/10] Tagged-PDF: Implement parsing of StructElem objects
Created attachment 81011 [details] [review] [PATCH v5 04/10] Tagged-PDF: Implement parsing of StructElem attributes
Created attachment 81012 [details] [review] Tagged-PDF: Text content extraction from structure elements
(In reply to comment #12) > Wow, [PATCH v3 2/7] Tagged-PDF: Interpret the document structure is huge :D > > I admit I haven't looked at the code in depth [...] I have uploaded an updated patch set with the outstanding issues already mentioned by Carlos Garcia Campos fixed. This updated patch set has the big patch split in four, I hope that eases the review process O:-) > [...] But I'll assume there's some tree structure parsing somewhere > because PDF are full of those trees, are we protecting against loop > in the tree somehow? I didn't see code in Poppler that would be particularly useful to parse the document structure tree. WRT loops in the structure tree, there is no protection and a malformed PDF file could potentially cause an infinite loop when parsing the tree. If I am understanding the PDF spec correctly, well-formed PDFs must not have loops in the tree. How critical would you say having protection against loops in the tree would be?
(In reply to comment #20) > (In reply to comment #12) > > [...] But I'll assume there's some tree structure parsing somewhere > > because PDF are full of those trees, are we protecting against loop > > in the tree somehow? > > I didn't see code in Poppler that would be particularly useful to parse > the document structure tree. WRT loops in the structure tree, there is > no protection and a malformed PDF file could potentially cause an > infinite loop when parsing the tree. If I am understanding the PDF spec > correctly, well-formed PDFs must not have loops in the tree. How critical > would you say having protection against loops in the tree would be? Quite critical, people usually spend their time writing such kind of pdf since that way they can make us crash (heap exhaustion) and thus they can say "whooooo i found a CVE in poppler"... That's why we have in various places the passing of std::set<int> with the refs of already parsed things in the tree so in case you find you have to process something that is already in the set you know you found a loop and simply bail out. It should not be very hard to add so it'd be cool if you could add it.
(In reply to comment #14) > (In reply to comment #11) > > Comment on attachment 80407 [details] [review] [review] [review] > > [PATCH v3 2/7] Tagged-PDF: Interpret the document structure > > > > Review of attachment 80407 [details] [review] [review] [review]: > > ----------------------------------------------------------------- > > > > Ok, I have just a few comments for now, I'll do a full review when the patch > > is split. In general, * are placed wrongly (close to the type instead of the > > argument/variable name). Some parts of the code are hard to follow with all > > those abbreviations. > > > > ::: poppler/Catalog.h > > @@ +124,4 @@ > > > GooString *readMetadata(); > > > > > > // Return the structure tree root object. > > > + StructTreeRoot* getStructTreeRoot(); > > > > Note that this is used by pdfinfo, you need to update it too > > The corresponding patch for “pdfinfo” is attached to bug #64813 Every patch should build without depending on any other, if you change getStructTreeRoot to return a StructTreeRoot * instead of an Object * and pdutils still does doc->getStructTreeRoot()->isDict() this is not going to build. > > ::: poppler/StructElement.cc > > @@ +231,5 @@ > > > + } \ > > > + return okay; \ > > > + } > > > + > > > +ARRAY_CHECKER(isRGBColor_or_OptX4, isRGBColor, 4, gTrue, gTrue ); > > > > what is OptX4? > > Instead one single color value specifying e.g. the color of all the > four sides a box border, it is possible to *optionally* have four > values (hence *X4*) with one value for each side. I will add a comment > about that in the code and think about a better name. Ok, it's a bit confusing I would say. > > @@ +862,5 @@ > > > +StructElement::StructElement(Dict *element, StructTreeRoot *treeRootA, StructElement *parentA): > > > + type(Unknown), > > > + treeRoot(treeRootA), > > > + parent(parentA), > > > + pageRef(), > > > > Do you really need this? I guess it does nothing. > > Explicitly initializes using the default constructor. I tend to prefer to > write code that is explicit, but I can just remove the initializer for > the “pageRef” attribute. The default constructor is used in any case, I think those unneeded explicit initializers are just noise, but I'm not C++ expert. > > @@ +863,5 @@ > > > + type(Unknown), > > > + treeRoot(treeRootA), > > > + parent(parentA), > > > + pageRef(), > > > + s(new StructData()) > > > > Why having this as a separate struct? > > The idea is to make the StructElement instances smaller in memory when > just storing a MCID or an OBJR. > > > @@ +900,5 @@ > > > +{ > > > + if (isContent()) > > > + delete c; > > > + else > > > + delete s; > > > > I see why using separate structs now. I would just keep the mcid or object > > ref in the same StructElement and parse it on demand. I find very confusing > > using variable names like c and s > > Do you mean that you would keep them in the parent StructElement? IMHO that > would make the API of StructElement more cumbersome. Also, note that there > can be mixed MCID, OBJR and StructElem children in a structure object, so > that would also complicate to store the children (now it is just a simple > “std::vector<StructElement*>”). Or am I understanding your comment wrongly? I think we should try to delay most of this to be done in demand, note that in most of the cases we will be parsing al this for nothing, because poppler api users don't support structure/tagged pdfs. So, I would just keep the mcid or object ref, and parse it on demand if it's required only. > > @@ +1160,5 @@ > > > + if (element->lookup("Alt", &obj)->isString()) { > > > + s->altText = obj.getString(); > > > + obj.initNull(); // The StructElement takes ownership of the GooString > > > + } else if (!obj.isNull()) { > > > + error(errSyntaxWarning, -1, "Alt object is wrong type ({0:s})", obj.getTypeName()); > > > > I think we can just ignore invalid objects for optional fields. > > Do you mean to not even print warning messages? Yes. > FWIW, the rest of your comments are valid concerns (especially the ones > related to leaking objects) and I will be fixing the issues tomorrow. > > Thanks for the review, and next round would be (hopefully) easier with > the split patches :) Thanks
Comment on attachment 81008 [details] [review] [PATCH v5 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary Review of attachment 81008 [details] [review]: ----------------------------------------------------------------- This looks good to me. ::: poppler/Catalog.h @@ +129,5 @@ > + enum MarkInfoFlags { > + markInfoNull = (1 << 0), > + markInfoMarked = (1 << 1), > + markInfoUserProperties = (1 << 2), > + markInfoSuspects = (1 << 3), I think you don't need parentheses for the enum values.
Comment on attachment 81009 [details] [review] [PATCH v5 02/10] Tagged-PDF: Implement parsing of StructTreeRoot Review of attachment 81009 [details] [review]: ----------------------------------------------------------------- Looks good in general, but you should make sure it doesn't break the build. I have a few other comments. ::: poppler/Catalog.cc @@ +854,5 @@ > + return NULL; > + } > + > + if (catalog.dictLookup("StructTreeRoot", &root)->isDict("StructTreeRoot")) { > + structTreeRoot = new StructTreeRoot(doc, root.getDict(), getMarkInfo() & markInfoMarked); What should we do in case of suspects? I don't like bool parameters, maybe we could just pass the mark info flags and let the StructTreeRoot decide what to do depending on the flags. You don't even need to pass this as a parameter, since the StructTreeRoot keeps a pointer to the PDFDoc, you can get the mark inflo flags from the doc directly: doc->getCatalog()->getMarkInfo() ::: poppler/Catalog.h @@ +124,4 @@ > GooString *readMetadata(); > > // Return the structure tree root object. > + StructTreeRoot *getStructTreeRoot(); As I said, this breaks the build. It's very important that every patch builds, since it might affect any future bisection, for example. ::: poppler/StructTreeRoot.cc @@ +110,5 @@ > + // Parse the children StructElements > + Object kids; > + if (root->lookup("K", &kids)->isArray()) { > + if (marked && kids.arrayGetLength() > 1) { > + error(errSyntaxWarning, -1, "K in StructTreeRoot has more than one children in a tagged PDF"); I would leave this for patches to add tagged pdf support, adding first just the generic logical structure parsing. @@ +118,5 @@ > + kids.arrayGetNF(i, &ref); > + if (kids.arrayGet(i, &obj)->isDict()) { > + StructElement *child = new StructElement(obj.getDict(), this); > + if (child->isOk()) { > + if (marked && !(child->getType() == StructElement::Document || Ditto. @@ +125,5 @@ > + child->getType() == StructElement::Div)) { > + error(errSyntaxWarning, -1, "StructTreeRoot element of tagged PDF is wrong type ({0:s})", child->getTypeName()); > + } > + appendElement(child); > + if (ref.isRef()) parentTreeAdd(ref.getRef(), child); Move the if body to its own line, please. ::: poppler/StructTreeRoot.h @@ +36,5 @@ > + unsigned getNumElements() const { return elements.size(); } > + const StructElement *getElement(int i) const { return elements.at(i); } > + StructElement *getElement(int i) { return elements.at(i); } > + void appendElement(StructElement *element) > + { if (element && element->isOk()) elements.push_back(element); } We normally indent the body when it's in a new line. @@ +39,5 @@ > + void appendElement(StructElement *element) > + { if (element && element->isOk()) elements.push_back(element); } > + > + const StructElement *findParentElement(unsigned index) const > + { return (index < parentTree.size() && parentTree[index].size() == 1) ? parentTree[index][0].element : NULL; } Ditto. I wonder if this would be easier to read in the .cc file, though.
Comment on attachment 81010 [details] [review] [PATCH v5 03/10] Tagged-PDF: Implement parsing of StructElem objects Review of attachment 81010 [details] [review]: ----------------------------------------------------------------- Please, try to make sure patches build before submitting them. ::: poppler/StructElement.cc @@ +82,5 @@ > + { StructElement::Figure, "Figure", elementTypeUndefined }, > + { StructElement::Formula, "Formula", elementTypeUndefined }, > + { StructElement::Form, "Form", elementTypeUndefined }, > + { StructElement::TOC, "TOC", elementTypeUndefined }, > + { StructElement::TOCI, "TOCI", elementTypeUndefined }, Isn't all this also part of the standard attributes defined for tagged PDFs? If we leave the tagged pdf support for a different patch, maybe you can merge this and the StructTreeRoot into a single patch and you don't need to dummy impl of StructElement in the previous patch. @@ +237,5 @@ > + > +GooString* StructElement::getText(GooString *string, GBool recursive) const > +{ > + // TODO: Dummy implementation, complete > + return NULL; Do we really need this? or can it be added in a follow up patch with the proper code instead? @@ +338,5 @@ > + > + // Language (optional). > + if (element->lookup("Lang", &obj)->isString()) { > + s->language = obj.getString(); > + obj.initNull(); // The StructElement takes ownership of the GooString Same concern I had in previous review. ::: poppler/StructElement.h @@ +74,5 @@ > + const GooString *getID() const { return isContent() ? NULL : s->id; } > + GooString *getID() { return isContent() ? NULL : s->id; } > + > + // Optional ISO language name, e.g. en_US > + GooString *getLang() getLanguage() is not that long :-) @@ +75,5 @@ > + GooString *getID() { return isContent() ? NULL : s->id; } > + > + // Optional ISO language name, e.g. en_US > + GooString *getLang() > + { return (!isContent() && s->language) ? s->language : (parent ? parent->getLang() : NULL); } We normally indent this. @@ +85,5 @@ > + void setRevision(Guint revision) { if (isContent()) s->revision = revision; } > + > + // Optional element title, in human-readable form. > + const GooString *getTitle() const { return isContent() ? NULL : s->title; } > + GooString *getTitle() { return isContent() ? NULL : s->title; } Do we really need the two versions of every getter? I prefer to only add the code that is actually used and add more later when needed. @@ +98,5 @@ > + > + void appendElement(StructElement *element) > + { if (!isContent() && element && element->isOk()) s->elements.push_back(element); } > + > +<<<<<<< HEAD Please, at least try to build the patches before submitting them. @@ +133,5 @@ > + // > + // The text will be appended to the passed GooString. If NULL is passed, > + // a new string is returned, and the ownership passed to the caller. > + // > + GooString* getText(GooString *string = NULL, GBool recursive = gTrue) const; Don't include this in this first patch if it's unused and unimplemented.
(In reply to comment #25) > Comment on attachment 81010 [details] [review] [review] > [PATCH v5 03/10] Tagged-PDF: Implement parsing of StructElem objects > > Review of attachment 81010 [details] [review] [review]: > ----------------------------------------------------------------- > > Please, try to make sure patches build before submitting them. > > ::: poppler/StructElement.cc > @@ +82,5 @@ > > + { StructElement::Figure, "Figure", elementTypeUndefined }, > > + { StructElement::Formula, "Formula", elementTypeUndefined }, > > + { StructElement::Form, "Form", elementTypeUndefined }, > > + { StructElement::TOC, "TOC", elementTypeUndefined }, > > + { StructElement::TOCI, "TOCI", elementTypeUndefined }, > > Isn't all this also part of the standard attributes defined for tagged PDFs? > If we leave the tagged pdf support for a different patch, maybe you can > merge this and the StructTreeRoot into a single patch and you don't need to > dummy impl of StructElement in the previous patch. Yes, this is for the standard attributes, I can try to avoid having a completely dummy StructElement class in the patch implementing the StructTreeRoot one — but be aware that will move the parsing of StructElement to the patch where StructTreeRoot is parsed and it will grow bigger :-/ > @@ +237,5 @@ > > + > > +GooString* StructElement::getText(GooString *string, GBool recursive) const > > +{ > > + // TODO: Dummy implementation, complete > > + return NULL; > > Do we really need this? or can it be added in a follow up patch with the > proper code instead? Not really. I am changing this to not have a dummy StructElement::getText and add it in a later patch. > @@ +338,5 @@ > > + > > + // Language (optional). > > + if (element->lookup("Lang", &obj)->isString()) { > > + s->language = obj.getString(); > > + obj.initNull(); // The StructElement takes ownership of the GooString > > Same concern I had in previous review. I have seen similar uses of Object::initNull() in other parts of the Poppler code (to mention one example: Parser.cc, line 178), so this is not a completely alien code pattern, I would say ;-) Adding a Object::takeString() method (and related ones) is not really needed. It would only make the code more self-documenting, avoiding to have to add a comment about the ownership being transferred — which is nice but hardly a priority, I guess. > ::: poppler/StructElement.h > @@ +74,5 @@ > > + const GooString *getID() const { return isContent() ? NULL : s->id; } > > + GooString *getID() { return isContent() ? NULL : s->id; } > > + > > + // Optional ISO language name, e.g. en_US > > + GooString *getLang() > > getLanguage() is not that long :-) I am changing it to getLanguage() > @@ +75,5 @@ > > + GooString *getID() { return isContent() ? NULL : s->id; } > > + > > + // Optional ISO language name, e.g. en_US > > + GooString *getLang() > > + { return (!isContent() && s->language) ? s->language : (parent ? parent->getLang() : NULL); } > > We normally indent this. Fixing. Also, I am changing the body to this, to make it clearer: GooString *getLanguage() { if (!isContent() && s->language) return s->language; return parent ? parent->getLanguage() : NULL; } > @@ +85,5 @@ > > + void setRevision(Guint revision) { if (isContent()) s->revision = revision; } > > + > > + // Optional element title, in human-readable form. > > + const GooString *getTitle() const { return isContent() ? NULL : s->title; } > > + GooString *getTitle() { return isContent() ? NULL : s->title; } > > Do we really need the two versions of every getter? I prefer to only add the > code that is actually used and add more later when needed. Ideally only the “const” getter should be needed, but for that to work GooString (or whatever the type the returned object has) must also tag its own methods that do not modify the instance as “const”. Take for example the following: const StructElement *elem = /* ... */ printf("Language: %s\n", elem->getLanguage->getCString()); Here, getLanguage() would return a “const GooString*”, and following the rules for constness in C++, only “const” methods can be invoked using the returned pointer (because they are guaranteed to not modify the GooString instance), but unfortunately GooString::getCString() is not declared “const” (even when it does _not_ modify the GooString), so the compiler assumes that the internals of the returned GooString may change; and if they change, the internals of the StructElement are also changing, therefore a “const” getter cannot be used... The Good Solution™ for this is marking methods of existing classes properly as “const” (where appropriate). Here I decided to try to limit the impact in the rest of the code by having the non-“const” getter. There is one thing that could be to improve the readability, though. I could rewrite the non-“const” getters like this: const GooString* getLanguage() const { /* ... */ } GooString* getLanguage() { return const_cast<GooString*>(static_cast<GooString*>(this)->getLanguage()); } (Not sure which one is uglier, if the original or this one I have just suggested... IMHO it is okay to repeat the simple getters, and maybe use this suggestion I have done here for complex getters). > @@ +133,5 @@ > > + // > > + // The text will be appended to the passed GooString. If NULL is passed, > > + // a new string is returned, and the ownership passed to the caller. > > + // > > + GooString* getText(GooString *string = NULL, GBool recursive = gTrue) const; > > Don't include this in this first patch if it's unused and unimplemented. Okay.
Created attachment 82286 [details] [review] [PATCH v6 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary
Created attachment 82287 [details] [review] [PATCH v6 02/10] Tagged-PDF: Implement parsing of StructTreeRoot
Created attachment 82288 [details] [review] [PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects
Created attachment 82289 [details] [review] [PATCH v6 04/10] Tagged-PDF: Implement parsing of StructElem attributes
Created attachment 82290 [details] [review] [PATCH v6 05/10] Tagged-PDF: Text content extraction from structure elements
JFTR, the updates patches I have uploaded should cover the comments from the review (thanks for your comments!). Also I have implemented Albert's suggestion about tracking the structure elements as they are parsed, to avoid the case in which a malformed/crafted PDF with an element referring to an ancestor in the tree would cause an infinite loop.
Comment on attachment 82286 [details] [review] [PATCH v6 01/10] Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary Review of attachment 82286 [details] [review]: ----------------------------------------------------------------- Pushed to git master, thanks!
Comment on attachment 82287 [details] [review] [PATCH v6 02/10] Tagged-PDF: Implement parsing of StructTreeRoot Review of attachment 82287 [details] [review]: ----------------------------------------------------------------- ::: utils/pdfinfo.cc @@ +229,5 @@ > + (doc->getCatalog()->getMarkInfo() & Catalog::markInfoMarked) ? "yes" : "no"); > + printf("UserProperties: %s\n", > + (doc->getCatalog()->getMarkInfo() & Catalog::markInfoUserProperties) ? "yes" : "no"); > + printf("Suspects: %s\n", > + (doc->getCatalog()->getMarkInfo() & Catalog::markInfoSuspects) ? "yes" : "no"); This doesn't actually belongs to this patch, since it uses the getMarkInfo new API added in previous patch. I've pushed this as a separate patch.
Great to see this merged, thanks to you for the thorough reviews. I hope to help improving this over time. There are still things that can be done, like for example adding support in Poppler to generate the document structure :)
Comment on attachment 82287 [details] [review] [PATCH v6 02/10] Tagged-PDF: Implement parsing of StructTreeRoot Review of attachment 82287 [details] [review]: ----------------------------------------------------------------- If we simplify this patch removing the tagged pdf bits and role/class maps that are currently unused maybe you can merge this with the following one avoiding the dummy class. ::: poppler/StructTreeRoot.cc @@ +43,5 @@ > + // The RoleMap/ClassMap dictionaries are needed by all the parsing > + // functions, which will resolve the custom names to canonical > + // standard names. > + root->lookup("RoleMap", &roleMap); > + root->lookup("ClassMap", &classMap); Since these are actually unused in this patch, we can probably add them in the first patch that actually uses them, since they are optional entries in the end. @@ +103,5 @@ > + } > + obj.free(); > + > + // Parse the children StructElements > + const GBool marked = doc->getCatalog()->getMarkInfo() & Catalog::markInfoMarked; As I commented previously I would focus first on logical structure (Section 14.7) only and then add the tagged pdf support (Section 14.8) on top of that. That would leave this patch even simpler and maybe you can even merge this and the following one instead of using a dummy class here just to make it build. @@ +111,5 @@ > + error(errSyntaxWarning, -1, "K in StructTreeRoot has more than one children in a tagged PDF"); > + } > + for (int i = 0; i < kids.arrayGetLength(); i++) { > + Object obj, ref; > + kids.arrayGetNF(i, &ref); This reference is only used in the if below. @@ +122,5 @@ > + child->getType() == StructElement::Div)) { > + error(errSyntaxWarning, -1, "StructTreeRoot element of tagged PDF is wrong type ({0:s})", child->getTypeName()); > + } > + appendElement(child); > + if (ref.isRef()) { Here you could directly do: if (kids.arrayGetNF(i, &ref)->isRef())
Comment on attachment 82288 [details] [review] [PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects Review of attachment 82288 [details] [review]: ----------------------------------------------------------------- I still don't see the point of having two structs heap allocated and all the API methods with 'if isFoo()' to use one or the other, but I'm not opposed either if you guys think it's better. ::: poppler/StructElement.cc @@ +83,5 @@ > + { StructElement::Formula, "Formula", elementTypeUndefined }, > + { StructElement::Form, "Form", elementTypeUndefined }, > + { StructElement::TOC, "TOC", elementTypeUndefined }, > + { StructElement::TOCI, "TOCI", elementTypeUndefined }, > +}; All this is part of the standard types defined for tagged pdf, right? If we move this to a follow up patch and can merge this with the previous one, containing only logical structure implementation. @@ +126,5 @@ > +//------------------------------------------------------------------------ > +// StructElement > +//------------------------------------------------------------------------ > + > +const Ref StructElement::InvalidRef = { -1, -1 }; This looks unused here @@ +176,5 @@ > + c(new ContentData(mcid)) > +{ > + assert(treeRoot); > + assert(parent); > + assert(c->mcid != InvalidMCID); Instead of creating the ContentData struct with an invalid mcid, and asserting here, I think we should ensure that StructElement is never created with an invalid mcid. I don't understand in which situation mcid can be InvalidMCID or why -1 can't be a valid MCID, the spec says it's an integer that uniquely identifies the marked content sequence. @@ +188,5 @@ > +{ > + assert(treeRoot); > + assert(parent); > + assert(c->ref.num >= 0); > + assert(c->ref.gen >= 0); Same here, this should never be called with an invalid ref and I think it's already ensured by the callers. @@ +226,5 @@ > + if (parent) > + return parent->getPageRef(); > + > + static const Ref invalidRef = { -1, -1 }; > + return invalidRef; I would make this boolean returning the Ref as an out parameter instead of using this invalidRef @@ +299,5 @@ > + obj.free(); > + > + // Object ID (optional), to be looked at the IDTree in the tree root. > + if (element->lookup("ID", &obj)->isString()) { > + s->id = new GooString(obj.getString()); It's probably simpler to do obj.getString()->copy(), but it's exactly the same. @@ +317,5 @@ > + obj.free(); > + > + // Element title (optional). > + if (element->lookup("T", &obj)->isString()) { > + s->title = new GooString(obj.getString()); Ditto. @@ +324,5 @@ > + > + // Language (optional). > + if (element->lookup("Lang", &obj)->isString()) { > + s->language = obj.getString(); > + obj.initNull(); // The StructElement takes ownership of the GooString I still don't understand why you are doing this here and not for the title for example. And I still think the proper way of doing this is adding Object::takeString() or something like that. Another way of avoid duplicating the string is to save the Object instead of getting the String contained. @@ +331,5 @@ > + > + // Alternative text (optional). > + if (element->lookup("Alt", &obj)->isString()) { > + s->altText = obj.getString(); > + obj.initNull(); // The StructElement takes ownership of the GooString Ditto. @@ +338,5 @@ > + > + // Expanded form of an abbreviation (optional). > + if (element->lookup("E", &obj)->isString()) { > + s->expandedAbbr = obj.getString(); > + obj.initNull(); // The StructElement takes ownership of the GooString Ditto. @@ +345,5 @@ > + > + // Actual text (optional). > + if (element->lookup("ActualText", &obj)->isString()) { > + s->actualText = obj.getString(); > + obj.initNull(); // The StructElement takes ownership of the GooString Ditto. @@ +378,5 @@ > + mcidObj.free(); > + return NULL; > + } > + > + child = new StructElement(mcidObj.getInt(), treeRoot, this); mcidObj.free() after this. @@ +383,5 @@ > + > + if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) { > + child->pageRef = pageRefObj; > + } else { > + pageRefObj.free(); I don't think you need the else, you can call .free unconditionally: if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) child->pageRef = pageRefObj; pageRefObj.free(); @@ +396,5 @@ > + > + if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) { > + child->pageRef = pageRefObj; > + } else { > + pageRefObj.free(); Ditto. @@ +403,5 @@ > + error(errSyntaxError, -1, "Obj object is wrong type ({0:s})", refObj.getTypeName()); > + } > + refObj.free(); > + } else if (childObj->isDict()) { > + assert(ref->isRef()); Why is this required to be a valid ref when child object is a dict? ::: poppler/StructElement.h @@ +63,3 @@ > > + inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); } > + inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); } I don't think we should consider the cases of invalid objects, isOk should return FALSE in all those cases @@ +63,5 @@ > > + inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); } > + inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); } > + > + int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; } This is not actually public API, so I think it's simpler to assume this is only going to be called when type == MCID, so the caller should ensure that by checking the type first. @@ +64,5 @@ > + inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); } > + inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); } > + > + int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; } > + Ref getObjectRef() const { return type == OBJR ? c->ref : InvalidRef; } Ditto. ::: poppler/StructTreeRoot.cc @@ +115,4 @@ > for (int i = 0; i < kids.arrayGetLength(); i++) { > Object obj, ref; > kids.arrayGetNF(i, &ref); > + assert(ref.isRef()); Why should this be a valid ref? Shouldn't we ignore invalid entries and continue parsing instead of crashing?
Comment on attachment 82288 [details] [review] [PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects Review of attachment 82288 [details] [review]: ----------------------------------------------------------------- I also think there are too many asserts in this patch that might not be necessary
(In reply to comment #37) > Comment on attachment 82288 [details] [review] [review] > [PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects > > Review of attachment 82288 [details] [review] [review]: > ----------------------------------------------------------------- > > I still don't see the point of having two structs heap allocated and all the > API methods with 'if isFoo()' to use one or the other, but I'm not opposed > either if you guys think it's better. > > ::: poppler/StructElement.cc > @@ +83,5 @@ > > + { StructElement::Formula, "Formula", elementTypeUndefined }, > > + { StructElement::Form, "Form", elementTypeUndefined }, > > + { StructElement::TOC, "TOC", elementTypeUndefined }, > > + { StructElement::TOCI, "TOCI", elementTypeUndefined }, > > +}; > > All this is part of the standard types defined for tagged pdf, right? If we > move this to a follow up patch and can merge this with the previous one, > containing only logical structure implementation. Righy. I will split this to a separate patch. > @@ +126,5 @@ > > +//------------------------------------------------------------------------ > > +// StructElement > > +//------------------------------------------------------------------------ > > + > > +const Ref StructElement::InvalidRef = { -1, -1 }; > > This looks unused here Removed. > @@ +176,5 @@ > > + c(new ContentData(mcid)) > > +{ > > + assert(treeRoot); > > + assert(parent); > > + assert(c->mcid != InvalidMCID); > > Instead of creating the ContentData struct with an invalid mcid, and > asserting here, I think we should ensure that StructElement is never created > with an invalid mcid. I don't understand in which situation mcid can be > InvalidMCID or why -1 can't be a valid MCID, the spec says it's an integer > that uniquely identifies the marked content sequence. You are right, instantiation inside StructElement::parseChild() checks that the MCID is an integer. > @@ +188,5 @@ > > +{ > > + assert(treeRoot); > > + assert(parent); > > + assert(c->ref.num >= 0); > > + assert(c->ref.gen >= 0); > > Same here, this should never be called with an invalid ref and I think it's > already ensured by the callers. Yes, StructElement::parseChild() also checks that this will be a Ref. Personally, I think it does not hurt to have them even when with the current implementation it is guaranteed that the passed values will be correct. For example, if the API is extended at some point to also allow creation of structure trees (i.e: not just reading them) those constructors could/would be used by client code, and it would be good to have the asserts to ensure that the API is used correctly… Anyway, I am digressing; I will remove the extra asserts for now O:-) > @@ +226,5 @@ > > + if (parent) > > + return parent->getPageRef(); > > + > > + static const Ref invalidRef = { -1, -1 }; > > + return invalidRef; > > I would make this boolean returning the Ref as an out parameter instead of > using this invalidRef Seems better, yes. I am changing this to: GBool StructElement::getPageRef(Ref& ref); with the suggested semantics. > @@ +299,5 @@ > > + obj.free(); > > + > > + // Object ID (optional), to be looked at the IDTree in the tree root. > > + if (element->lookup("ID", &obj)->isString()) { > > + s->id = new GooString(obj.getString()); > > It's probably simpler to do obj.getString()->copy(), but it's exactly the > same. Changed. > @@ +317,5 @@ > > + obj.free(); > > + > > + // Element title (optional). > > + if (element->lookup("T", &obj)->isString()) { > > + s->title = new GooString(obj.getString()); > > Ditto. Changed as well. > @@ +324,5 @@ > > + > > + // Language (optional). > > + if (element->lookup("Lang", &obj)->isString()) { > > + s->language = obj.getString(); > > + obj.initNull(); // The StructElement takes ownership of the GooString > > I still don't understand why you are doing this here and not for the title > for example. And I still think the proper way of doing this is adding > Object::takeString() or something like that. Another way of avoid > duplicating the string is to save the Object instead of getting the String > contained. I have added Object::takeString(), and also modified the rest of the places where strings are used to use Object::takeString() if possible. > @@ +378,5 @@ > > + mcidObj.free(); > > + return NULL; > > + } > > + > > + child = new StructElement(mcidObj.getInt(), treeRoot, this); > > mcidObj.free() after this. Good catch, thanks. I have added it. > @@ +383,5 @@ > > + > > + if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) { > > + child->pageRef = pageRefObj; > > + } else { > > + pageRefObj.free(); > > I don't think you need the else, you can call .free unconditionally: > > if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) > child->pageRef = pageRefObj; > pageRefObj.free(); Note that “pageRefObj” is in Object, not a Ref. Note that “child->pageRef” is a pointer to the Object, so if it is deinitialized by calling .free() on it, it may be left in an undetermined state. The idea is to only call .free() if the object is *not* a Ref. If it is a Ref, “child->pageRef” is freed in the destructor for StructElement. > @@ +403,5 @@ > > + error(errSyntaxError, -1, "Obj object is wrong type ({0:s})", refObj.getTypeName()); > > + } > > + refObj.free(); > > + } else if (childObj->isDict()) { > > + assert(ref->isRef()); > > Why is this required to be a valid ref when child object is a dict? Because the reference number is used to populate the “seen” set (it is a “std::set<int>”) which tracks the seen nodes to avoid infinite loops in malformed PDFs. I have changed this so instead of asserting, an error is produced instead. According to the spec a properly formed /StructTree must contain references for the /StrucElem items (not dictionaries directly embedded in the tree), so causing an error should be reasonable. > ::: poppler/StructElement.h > @@ +63,3 @@ > > > > + inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); } > > + inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); } > > I don't think we should consider the cases of invalid objects, isOk should > return FALSE in all those cases Okay. > @@ +63,5 @@ > > > > + inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); } > > + inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); } > > + > > + int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; } > > This is not actually public API, so I think it's simpler to assume this is > only going to be called when type == MCID, so the caller should ensure that > by checking the type first. Yep. > @@ +64,5 @@ > > + inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); } > > + inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); } > > + > > + int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; } > > + Ref getObjectRef() const { return type == OBJR ? c->ref : InvalidRef; } > > Ditto. Yepyep. > ::: poppler/StructTreeRoot.cc > @@ +115,4 @@ > > for (int i = 0; i < kids.arrayGetLength(); i++) { > > Object obj, ref; > > kids.arrayGetNF(i, &ref); > > + assert(ref.isRef()); > > Why should this be a valid ref? Shouldn't we ignore invalid entries and > continue parsing instead of crashing? Same reason as earlier. In the case of the structure tree it is okay to skip adding the integer to the “seen” set, as the loop would be caught later on, so I have removed this assertion.
*** Bug 67710 has been marked as a duplicate of this bug. ***
Created attachment 86674 [details] [review] [PATCH v8 01/15] Tagged-PDF: Implement parsing of StructTreeRoot
Created attachment 86675 [details] [review] [PATCH v8 02/15] Implement Object::takeString() method
Created attachment 86676 [details] [review] [PATCH v8 03/15] Tagged-PDF: Implement parsing of StructElem objects
Created attachment 86677 [details] [review] [PATCH v8 04/15] Tagged-PDF: Parsing of StructElem standard types and
Created attachment 86678 [details] [review] [PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements
Created attachment 86730 [details] [review] [PATCH v8 03/15] Tagged-PDF: Implement parsing of StructElem objects
Comment on attachment 86675 [details] [review] [PATCH v8 02/15] Implement Object::takeString() method Review of attachment 86675 [details] [review]: ----------------------------------------------------------------- ::: poppler/Object.h @@ +199,5 @@ > double getNum() { OBJECT_3TYPES_CHECK(objInt, objInt64, objReal); > return type == objInt ? (double)intg : type == objInt64 ? (double)int64g : real; } > GooString *getString() { OBJECT_TYPE_CHECK(objString); return string; } > + GooString *takeString() { > + OBJECT_TYPE_CHECK(objString); GooString *s = string; initNull(); return s; } I don't think we should call initNull. We are just stealing the object contents, but the object type should not change. Also calling initNull here would unbalance the numAllocs array increments/decrements when memory debug is enabled. I think we should just set string to NULL, so that when free() is called it will do nothing.
The three first patches look good to me in general, but I would squash 1 and 3 and make 2 be the first one with the modifications I suggested. I've done this in the tagged-pdf branch after rebasing it. All other patches have been merged unreviewed. I'll continue reviewing the patches when I have time, but for now, if nobody objects I can merge the two first patches as they are in tagged-pdf branch into master. http://cgit.freedesktop.org/poppler/poppler/log/?h=tagged-pdf
Comment on attachment 86677 [details] [review] [PATCH v8 04/15] Tagged-PDF: Parsing of StructElem standard types and Review of attachment 86677 [details] [review]: ----------------------------------------------------------------- ::: poppler/Makefile.am @@ +299,2 @@ > StructElement.cc \ > + StructTreeRoot.cc \ This change looks unrelated. ::: poppler/StructElement.cc @@ +260,5 @@ > + Object Auto; > + Object Start; > + Object None; > + Object Before; > + Object Nat1; It's weird that struct members start with a capital letter, but I guess it looks better in the macros below. @@ +282,5 @@ > +static const AttributeDefaults attributeDefaults; > + > + > +#define ATTR_LIST_END { Attribute::Unknown, NULL, NULL, gFalse, NULL } > +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c } What D stand for? Default? @@ +283,5 @@ > + > + > +#define ATTR_LIST_END { Attribute::Unknown, NULL, NULL, gFalse, NULL } > +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c } > +#define ATTR_N(x, i, c) { Attribute::x, #x, NULL, i, c } And N? Normal? NotDefault? What about using ATTR and ATTR_WITH_DEFAULT? Also I would use words instead of abbreviations to improve the readability type, inheritable, check, defaultValue. @@ +288,5 @@ > + > +static const AttributeMapEntry attributeMapCommonShared[] = > +{ > + ATTR_D(Placement, gFalse, isPlacementName, Inline), > + ATTR_D(WritingMode, gFalse, isWritingModeName, LrTb), WritingMode is inheritable @@ +295,5 @@ > + ATTR_D(BorderStyle, gFalse, isBorderStyle, None), > + ATTR_N(BorderThickness, gTrue, isPositiveOrOptionalArray4), > + ATTR_D(Padding, gFalse, isPositiveOrArray4, Zero), > + ATTR_N(Color, gTrue, isRGBColor), > + ATTR_LIST_END Do we really need this being the array statically allocated? @@ +341,5 @@ > + ATTR_LIST_END > +}; > + > +static const AttributeMapEntry attributeMapCommonList[] = { > + ATTR_D(ListNumbering, gFalse, isListNumberingName, None), This is inheritable @@ +740,5 @@ > + formatted->Set(formattedA); > + else > + formatted = new GooString(formattedA); > + } else { > + delete formatted; You should set formatted to NULL here @@ +744,5 @@ > + delete formatted; > + } > +} > + > +GBool Attribute::typeCheck(StructElement *element) checkType sounds a bit better to me @@ +747,5 @@ > + > +GBool Attribute::typeCheck(StructElement *element) > +{ > + // If an element is passed, tighther type-checking can be done. > + if (element) { You could use an early return here, to save an indentation level. @@ +765,5 @@ > + > + return gTrue; > +} > + > +Attribute::Type Attribute::typeForName(const char *name, StructElement *element) You are using get for all other methods, use getTypeForName here for consistency. @@ +936,5 @@ > +{ > + if (isContent()) > + return parent->findAttribute(attributeType, inherit, attributeOwner); > + > + if (attributeType != Attribute::Unknown && attributeType != Attribute::UserProperty) { I would return early here. @@ +1132,5 @@ > + for (unsigned j = attrIndex; j < getNumAttributes(); j++) > + getAttribute(j)->setRevision(revision); > + } else { > + error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName()); > + } iobj.free() @@ +1133,5 @@ > + getAttribute(j)->setRevision(revision); > + } else { > + error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName()); > + } > + } attr.free() ::: poppler/StructElement.h @@ +27,2 @@ > class StructTreeRoot; > +class TextWordList; What do you need this for? @@ +116,5 @@ > + static Type typeForName(const char *name, StructElement *element = NULL); > + static Attribute *parseUserProperty(Dict *property); > + > + friend class StructElement; > +}; The Attribute class implementation is huge, you might consider moving it to its own file
Comment on attachment 86678 [details] [review] [PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements Review of attachment 86678 [details] [review]: ----------------------------------------------------------------- Looks good in general, my main concern is that some code is duplicated with TextOutputDev, so I prefer if we could somehoe add support for marked content to TextOutputDev instead of inventing yet another output dev. All those abbreviations makes the code a bit more difficult to read for me. ::: poppler/MCOutputDev.cc @@ +34,5 @@ > +}; > + > + > +MCOutputDev::MCOutputDev(int mcid): > + p(new Priv(mcid)) Why do we need a heap allocated private struct? @@ +66,5 @@ > +void MCOutputDev::beginMarkedContent(char *name, Dict *properties) > +{ > + int id = -1; > + if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid) > + p->capturing = true; capturing sounds a bit confusing to me, I would use something like inMarkedContent @@ +149,5 @@ > + if (p->lastFlags != flags) { > + if (p->capturing) > + p->mcOps.push_back(MCOp(MCOp::Flags, flags)); > + p->lastFlags = flags; > + } It seems you are not adding any color "op". Either remove the color type or add it when it changes in ::drawChar (or beginWord) @@ +150,5 @@ > + if (p->capturing) > + p->mcOps.push_back(MCOp(MCOp::Flags, flags)); > + p->lastFlags = flags; > + } > +} I wonder if we can integrate all this with TextOutputDev somehow. There's some code duplication and this is about extracting text after all. Maybe we could add a MarkedContentText object, similar to TextPage or ActualText objects, and use it from the TextOutputDev. We could add a new constructor that only creates a MarkedContentText and receives the marked content id, and then you can get the text and attributes. ::: poppler/MCOutputDev.h @@ +17,5 @@ > +class GfxState; > +class GooString; > +class Dict; > + > +struct MCOp { What's op? operator? @@ +33,5 @@ > + return ((Guint) (r * 255) & 0xFF) << 16 > + | ((Guint) (g * 255) & 0xFF) << 8 > + | ((Guint) (b * 255) & 0xFF); > + } > + }; Why not using GfxColor? or even GfxRGB if only RGB is supported? @@ +67,5 @@ > + } > + MCOp(): type(FontName), value(NULL) {} > + MCOp(Unicode u): type(Unichar), unichar(u) {} > + MCOp(const char *s): type(FontName), value(strdup(s)) {} > + MCOp(Type t, Guint f = 0): type(t), flags(f) {} Why do you need to pass also the Type for the flags constructor? ::: poppler/StructElement.h @@ +233,5 @@ > + // > + // The text will be appended to the passed GooString. If NULL is passed, > + // a new string is returned, and the ownership passed to the caller. > + // > + GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const; appendText maybe?
(In reply to comment #49) > Comment on attachment 86677 [details] [review] [review] > [PATCH v8 04/15] Tagged-PDF: Parsing of StructElem standard types and > > Review of attachment 86677 [details] [review] [review]: > ----------------------------------------------------------------- > > ::: poppler/Makefile.am > @@ +299,2 @@ > > StructElement.cc \ > > + StructTreeRoot.cc \ > > This change looks unrelated. Yep. > ::: poppler/StructElement.cc > @@ +260,5 @@ > > + Object Auto; > > + Object Start; > > + Object None; > > + Object Before; > > + Object Nat1; > > It's weird that struct members start with a capital letter, but I guess it > looks better in the macros below. Yes, they do look easier in the eyes when used in the macros below (at least for me). Should I change them? > @@ +282,5 @@ > > +static const AttributeDefaults attributeDefaults; > > + > > + > > +#define ATTR_LIST_END { Attribute::Unknown, NULL, NULL, gFalse, NULL } > > +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c } > > What D stand for? Default? Yes, it stands for “default”. ATTR_D defines an attribute with a default value. > @@ +283,5 @@ > > + > > + > > +#define ATTR_LIST_END { Attribute::Unknown, NULL, NULL, gFalse, NULL } > > +#define ATTR_D(x, i, c, v) { Attribute::x, #x, &attributeDefaults.v, i, c } > > +#define ATTR_N(x, i, c) { Attribute::x, #x, NULL, i, c } > > And N? Normal? NotDefault? What about using ATTR and ATTR_WITH_DEFAULT? Also > I would use words instead of abbreviations to improve the readability type, > inheritable, check, defaultValue. “N” is for “no-default“. Sure I will change this to use descriptive names. > @@ +288,5 @@ > > + > > +static const AttributeMapEntry attributeMapCommonShared[] = > > +{ > > + ATTR_D(Placement, gFalse, isPlacementName, Inline), > > + ATTR_D(WritingMode, gFalse, isWritingModeName, LrTb), > > WritingMode is inheritable Good catch. Fixing. > @@ +295,5 @@ > > + ATTR_D(BorderStyle, gFalse, isBorderStyle, None), > > + ATTR_N(BorderThickness, gTrue, isPositiveOrOptionalArray4), > > + ATTR_D(Padding, gFalse, isPositiveOrArray4, Zero), > > + ATTR_N(Color, gTrue, isRGBColor), > > + ATTR_LIST_END > > Do we really need this being the array statically allocated? With “this” I suppose you mean the ATTR_LIST_END entry. Removing the end marker would make it necessary to store the length of the array in the entries of the “attributeMapAll” (and related) arrays. The amount or static data is just a few bytes more having the end markers in the arrays, and makes the code simpler. If there is any good reason to remove the end markers, it could be done, but I would rather leave them there. > @@ +341,5 @@ > > + ATTR_LIST_END > > +}; > > + > > +static const AttributeMapEntry attributeMapCommonList[] = { > > + ATTR_D(ListNumbering, gFalse, isListNumberingName, None), > > This is inheritable You're right. Fixing. > @@ +740,5 @@ > > + formatted->Set(formattedA); > > + else > > + formatted = new GooString(formattedA); > > + } else { > > + delete formatted; > > You should set formatted to NULL here Good catch, thanks. > @@ +744,5 @@ > > + delete formatted; > > + } > > +} > > + > > +GBool Attribute::typeCheck(StructElement *element) > > checkType sounds a bit better to me Sure, I am changing it to checkType(). > @@ +747,5 @@ > > + > > +GBool Attribute::typeCheck(StructElement *element) > > +{ > > + // If an element is passed, tighther type-checking can be done. > > + if (element) { > > You could use an early return here, to save an indentation level. Sure, it will be fixed. > @@ +765,5 @@ > > + > > + return gTrue; > > +} > > + > > +Attribute::Type Attribute::typeForName(const char *name, StructElement *element) > > You are using get for all other methods, use getTypeForName here for > consistency. Sure. > @@ +936,5 @@ > > +{ > > + if (isContent()) > > + return parent->findAttribute(attributeType, inherit, attributeOwner); > > + > > + if (attributeType != Attribute::Unknown && attributeType != Attribute::UserProperty) { > > I would return early here. Okay. I am changing it. > @@ +1132,5 @@ > > + for (unsigned j = attrIndex; j < getNumAttributes(); j++) > > + getAttribute(j)->setRevision(revision); > > + } else { > > + error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName()); > > + } > > iobj.free() Fixed. > @@ +1133,5 @@ > > + getAttribute(j)->setRevision(revision); > > + } else { > > + error(errSyntaxWarning, -1, "C item is wrong type ({0:s})", iobj.getTypeName()); > > + } > > + } > > attr.free() Fixed as well. > ::: poppler/StructElement.h > @@ +27,2 @@ > > class StructTreeRoot; > > +class TextWordList; > > What do you need this for? It had sense at some point when the feature was WIP, but it ended up not being needed. I will remove it. > @@ +116,5 @@ > > + static Type typeForName(const char *name, StructElement *element = NULL); > > + static Attribute *parseUserProperty(Dict *property); > > + > > + friend class StructElement; > > +}; > > The Attribute class implementation is huge, you might consider moving it to > its own file Sure thing.
(In reply to comment #51) > (In reply to comment #49) > > > > The Attribute class implementation is huge, you might consider moving it to > > its own file > > Sure thing. ...with the problem that it is not that easy. The “typeMap” table and the private static functions at StructElement.cc that use the TypeMapEntry use AttributeMapEntry. So the tables would still need to be in StructElement.cc. Also, some Attribute methods use the static helper functions, and I would rather have the Attribute class in StructElement.cc than duplicating code for those.
Created attachment 87701 [details] [review] [PATCH v9 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes
Comment on attachment 87701 [details] [review] [PATCH v9 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes Review of attachment 87701 [details] [review]: ----------------------------------------------------------------- Looks good, I have just a few more comments ::: poppler/Makefile.am @@ +299,2 @@ > StructElement.cc \ > + StructTreeRoot.cc \ This is still unrelated to this patch. ::: poppler/StructElement.cc @@ +74,5 @@ > + || value->isName("End") > + || value->isName("Center"); > +} > + > +static GBool isNumber(Object *value); Simply move the implementation of isNumber before it's needed so that we don't need prototypes. You could move all isFoo methods for the basic types at the beginning. @@ +445,5 @@ > + attributeMapCommonList, > + NULL, > +}; > + > +static const AttributeMapEntry *attributeMapPrintField[] = { This is unused. @@ +493,5 @@ > + > + > +static GBool ownerHasMorePriority(Attribute::Owner a, Attribute::Owner b) > +{ > + unsigned a_index, b_index; aIndex, bIndex @@ +577,5 @@ > +//------------------------------------------------------------------------ > +// Helpers for the attribute and structure type tables > +//------------------------------------------------------------------------ > + > +static inline const AttributeMapEntry* AttributeMapEntry* -> AttributeMapEntry * @@ +594,5 @@ > + } > + return NULL; > +} > + > +static inline const AttributeMapEntry* AttributeMapEntry* -> AttributeMapEntry * @@ +635,5 @@ > + const OwnerMapEntry *entry = getOwnerMapEntry(owner); > + return entry ? entry->name : "UnknownOwner"; > +} > + > +Attribute::Owner nameToOwner(const char *name) This should be static @@ +695,5 @@ > + > + if (copyValue) > + valueA->copy(&value); > + else > + valueA->shallowCopy(&value); When does this happen? It seems that attributes are always created with copyValue = gFalse. @@ +712,5 @@ > + > + if (copyValue) > + valueA->copy(&value); > + else > + valueA->shallowCopy(&value); Ditto. @@ +716,5 @@ > + valueA->shallowCopy(&value); > + > + if (!checkType()) { > + type = Unknown; > + } You are not using braces for all other single statement ifs @@ +1322,5 @@ > + if (attribute->isOk() && (typeCheckOk = attribute->checkType(this))) { > + appendAttribute(attribute); > + } else { > + // It is not needed to free "value", the Attribute instance > + // owns the contents, so deleting "attribute" is enough. I think it's fine to always copy the value to make the code simpler. For heavy to copy objects we use reference counter.
(In reply to comment #50) > Comment on attachment 86678 [details] [review] [review] > [PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements > > Review of attachment 86678 [details] [review] [review]: > ----------------------------------------------------------------- > > Looks good in general, my main concern is that some code is duplicated with > TextOutputDev, so I prefer if we could somehoe add support for marked > content to TextOutputDev instead of inventing yet another output dev. All > those abbreviations makes the code a bit more difficult to read for me. New version of the patch ditches the “MC” abbreviation and it appears expanded to “MarkedContent”. > ::: poppler/MCOutputDev.cc > @@ +34,5 @@ > > +}; > > + > > + > > +MCOutputDev::MCOutputDev(int mcid): > > + p(new Priv(mcid)) > > Why do we need a heap allocated private struct? We don't, I first wrote this before being aware that the Poppler low-level API is not considered public (as in “public stable API”). > @@ +66,5 @@ > > +void MCOutputDev::beginMarkedContent(char *name, Dict *properties) > > +{ > > + int id = -1; > > + if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid) > > + p->capturing = true; > > capturing sounds a bit confusing to me, I would use something like > inMarkedContent Changed. > @@ +149,5 @@ > > + if (p->lastFlags != flags) { > > + if (p->capturing) > > + p->mcOps.push_back(MCOp(MCOp::Flags, flags)); > > + p->lastFlags = flags; > > + } > > It seems you are not adding any color "op". Either remove the color type or > add it when it changes in ::drawChar (or beginWord) Ooops. New version of the patch adds the missing code. > @@ +150,5 @@ > > + if (p->capturing) > > + p->mcOps.push_back(MCOp(MCOp::Flags, flags)); > > + p->lastFlags = flags; > > + } > > +} > > I wonder if we can integrate all this with TextOutputDev somehow. There's > some code duplication and this is about extracting text after all. Maybe we > could add a MarkedContentText object, similar to TextPage or ActualText > objects, and use it from the TextOutputDev. We could add a new constructor > that only creates a MarkedContentText and receives the marked content id, > and then you can get the text and attributes. > > ::: poppler/MCOutputDev.h > @@ +17,5 @@ > > +class GfxState; > > +class GooString; > > +class Dict; > > + > > +struct MCOp { > > What's op? operator? Operation. > @@ +33,5 @@ > > + return ((Guint) (r * 255) & 0xFF) << 16 > > + | ((Guint) (g * 255) & 0xFF) << 8 > > + | ((Guint) (b * 255) & 0xFF); > > + } > > + }; > > Why not using GfxColor? or even GfxRGB if only RGB is supported? GfxRGB seems a resonable compromise. Next version will use it. > @@ +67,5 @@ > > + } > > + MCOp(): type(FontName), value(NULL) {} > > + MCOp(Unicode u): type(Unichar), unichar(u) {} > > + MCOp(const char *s): type(FontName), value(strdup(s)) {} > > + MCOp(Type t, Guint f = 0): type(t), flags(f) {} > > Why do you need to pass also the Type for the flags constructor? Type “Unicode” is also a “Guint”, so there would be two overloaded functions with one argument of the same type. The compiler cannot resolve the ambiguity and generates an error. Adding the extra “Type” argument is a workaround for that. > ::: poppler/StructElement.h > @@ +233,5 @@ > > + // > > + // The text will be appended to the passed GooString. If NULL is passed, > > + // a new string is returned, and the ownership passed to the caller. > > + // > > + GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const; > > appendText maybe? After thinking twice about this, it looks like it is wrong to have a single function that can work in two ways. The reason for those two behaviours is that client code will usually call “getText()”, without passing arguments, so internally the function will create a GooString and then iterate over the children recursively, passing that one string to avoid allocating one GooString per child element. IMHO the best is to split this in two functions: public: GooString* getText(GBool recursive = gTrue); private: void appendSubTreeText(GooString *string); The new “appendSubTreeText()” function would be a private helper function used when “getText()” is called with “recusive = gTrue”.
Created attachment 89046 [details] [review] [PATCH v10 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes
Created attachment 89047 [details] [review] [PATCH v10 02/12] Tagged-PDF: Text content extraction from structure elements
Comment on attachment 89046 [details] [review] [PATCH v10 01/12] Tagged-PDF: Parsing of StructElem standard types and attributes Review of attachment 89046 [details] [review]: ----------------------------------------------------------------- LGTM, pushed to git master. Thanks!
Created attachment 89917 [details] [review] Tagged-PDF: Text content extraction from structure elements
(In reply to comment #59) > Created attachment 89917 [details] [review] [review] > Tagged-PDF: Text content extraction from structure elements In this latest version of the patch the MCOp/MarkedContentOp code is removed, and the API is now more high-level: text spans are returned by MarkedContentOutputDev::getTextSpans(), instead of doing the conversion later in poppler-glib.
Comment on attachment 89917 [details] [review] Tagged-PDF: Text content extraction from structure elements Review of attachment 89917 [details] [review]: ----------------------------------------------------------------- This looks much better and simpler :-) ::: poppler/MarkedContentOutputDev.cc @@ +43,5 @@ > + > +void MarkedContentOutputDev::endSpan() > +{ > + if (currentText && currentText->getLength()) { > + TextSpan span(currentText, currentFont, currentLink); Add a comment here saying that TextSpan adopts the pointers, because at a first glance it looks like this method leaks memory. @@ +44,5 @@ > +void MarkedContentOutputDev::endSpan() > +{ > + if (currentText && currentText->getLength()) { > + TextSpan span(currentText, currentFont, currentLink); > + memcpy(&span.data->color, ¤tColor, sizeof(GfxRGB)); Why don't you pass the color to the constructor too? Can't you simply do span.data->color = currentColor instead of the memcpy in any case? @@ +73,5 @@ > +void MarkedContentOutputDev::beginMarkedContent(char *name, Dict *properties) > +{ > + int id = -1; > + if (properties && properties->lookupInt("MCID", NULL, &id) && id == mcid) > + inMarkedContent = true; I wonder what happens in case of nested marked contents, should we set inMarkedContent to false until the inner marked contents ends? or do we want also the contents of any nested marked content? @@ +144,5 @@ > + > + if (font == currentFont) > + return; > + > + endSpan(); Do we want to call endSpan even when inMarkedContent = false? @@ +158,5 @@ > + > + > +void MarkedContentOutputDev::updateFillColor(GfxState *state) > +{ > + GfxRGB color; Same here, should we ignore fill color updates when not in marked content? @@ +167,5 @@ > + color.b == currentColor.b) > + return; > + > + endSpan(); > + currentColor = color; Note that the color of the text depends on the current text rendering mode, for rendering mode 1 (stroke) what we want is the stroke color instead of the fill color. Maybe we could get the current fill/stroke color from the state depending on the current render mode in drawChar instead of using updateFillColor and updateStrokeColor ::: poppler/MarkedContentOutputDev.h @@ +25,5 @@ > + enum { > + Italic = 1 << 0, > + Bold = 1 << 1, > + Fixed = 1 << 2, > + }; This looks unused. @@ +46,5 @@ > + } > + > + GfxFont* getFont() const { return data->font; } > + GooString* getText() const { return data->text; } > + GooString* getLink() const { return data->link; } What's exactly this link? a named destination? It doesn't seem to be assigned anywhere. If this si something that will be used in a future patch I prefer to add it in that patch instead of adding unused code here. @@ +52,5 @@ > + > +private: > + // NOTE: Takes ownership of strings, increases refcount for font. > + TextSpan(GooString *text, GfxFont *font = NULL, GooString *link = NULL) > + : data(new Data) { Why do you need this heap allocated data struct instead of using members of the class directly? Why is this constructor private?
(In reply to comment #61) > ::: poppler/MarkedContentOutputDev.cc > @@ +43,5 @@ > > + > > +void MarkedContentOutputDev::endSpan() > > +{ > > + if (currentText && currentText->getLength()) { > > + TextSpan span(currentText, currentFont, currentLink); > > Add a comment here saying that TextSpan adopts the pointers, because at a > first glance it looks like this method leaks memory. Sure, I am adding a comment. There is already a note in the header file next to the TextSpan constructor but it will not hurt to have a note also here. > @@ +44,5 @@ > > +void MarkedContentOutputDev::endSpan() > > +{ > > + if (currentText && currentText->getLength()) { > > + TextSpan span(currentText, currentFont, currentLink); > > + memcpy(&span.data->color, ¤tColor, sizeof(GfxRGB)); > > Why don't you pass the color to the constructor too? Can't you simply do > span.data->color = currentColor instead of the memcpy in any case? Changed. > @@ +73,5 @@ > > +void MarkedContentOutputDev::beginMarkedContent(char *name, Dict *properties) > > +{ > > + int id = -1; > > + if (properties && properties->lookupInt("MCID", NULL, &id) && id == mcid) > > + inMarkedContent = true; > > I wonder what happens in case of nested marked contents, should we set > inMarkedContent to false until the inner marked contents ends? or do we want > also the contents of any nested marked content? I cannot think of an use-case where one would not be interested in the nested content. For the moment I think we can keep it as-is, and later on if the use-case surfaces, add the change. On a different note, this comment of yours made me notice that there was a bug with nested marked content sequences: beginMarkedContent() would be called again for the nested sequence, and when the inner sequence finishes endMarkedContent() would set inMarkedContent=gFalse even when the content from the outer marked content sequence has not been yet extracted. I have changed this so a “std::vector<int>” is used as a stack to keep track of the nested sequences. > @@ +144,5 @@ > > + > > + if (font == currentFont) > > + return; > > + > > + endSpan(); > > Do we want to call endSpan even when inMarkedContent = false? Well, the string with the text content would by NULL or empty, so it is harmless; but you are right that how this works is expressed better in the code by checking inMarkedContent before calling endSpan(). > @@ +158,5 @@ > > + > > + > > +void MarkedContentOutputDev::updateFillColor(GfxState *state) > > +{ > > + GfxRGB color; > > Same here, should we ignore fill color updates when not in marked content? Not really. Color changes can affect upcoming elements, so if a color changes before a marked content sequence, the new color also affects it. Or am I interpreting the spec wrong? > @@ +167,5 @@ > > + color.b == currentColor.b) > > + return; > > + > > + endSpan(); > > + currentColor = color; > > Note that the color of the text depends on the current text rendering mode, > for rendering mode 1 (stroke) what we want is the stroke color instead of > the fill color. Maybe we could get the current fill/stroke color from the > state depending on the current render mode in drawChar instead of using > updateFillColor and updateStrokeColor Moved into drawChar() as suggested. > ::: poppler/MarkedContentOutputDev.h > @@ +25,5 @@ > > + enum { > > + Italic = 1 << 0, > > + Bold = 1 << 1, > > + Fixed = 1 << 2, > > + }; > > This looks unused. Removed. > @@ +46,5 @@ > > + } > > + > > + GfxFont* getFont() const { return data->font; } > > + GooString* getText() const { return data->text; } > > + GooString* getLink() const { return data->link; } > > What's exactly this link? a named destination? It doesn't seem to be > assigned anywhere. If this si something that will be used in a future patch > I prefer to add it in that patch instead of adding unused code here. This is a leftover from the past, when I didn't have clear how links would end up in the structure tree. Links get their own StructElement so it is not needed to have this in the TextSpan in the end. I have removed the “data->link” member and the code accessing it. > @@ +52,5 @@ > > + > > +private: > > + // NOTE: Takes ownership of strings, increases refcount for font. > > + TextSpan(GooString *text, GfxFont *font = NULL, GooString *link = NULL) > > + : data(new Data) { > > Why do you need this heap allocated data struct instead of using members of > the class directly? Why is this constructor private? TextSpanArray does not contain pointers to TextSpans, but the TextSpans themselves. The Data member is to share the data across them when the TextSpans are copied/assigned.
Created attachment 90249 [details] [review] [PATCH v12 01/11] Tagged-PDF: Text content extraction from structure elements
Comment on attachment 90249 [details] [review] [PATCH v12 01/11] Tagged-PDF: Text content extraction from structure elements Review of attachment 90249 [details] [review]: ----------------------------------------------------------------- ::: poppler/MarkedContentOutputDev.cc @@ +135,5 @@ > + // the render mode (for mode 1 stroke color is used), so there is no need > + // to implement both updateFillColor() and updateStrokeColor(). > + GBool colorChange = gFalse; > + GfxRGB color; > + if (state->getRender() == 1) This is a mask, it should be if ((state->getRender() & 3) == 1)
(In reply to comment #64) > Comment on attachment 90249 [details] [review] [review] > [PATCH v12 01/11] Tagged-PDF: Text content extraction from structure elements > > Review of attachment 90249 [details] [review] [review]: > ----------------------------------------------------------------- > > ::: poppler/MarkedContentOutputDev.cc > @@ +135,5 @@ > > + // the render mode (for mode 1 stroke color is used), so there is no need > > + // to implement both updateFillColor() and updateStrokeColor(). > > + GBool colorChange = gFalse; > > + GfxRGB color; > > + if (state->getRender() == 1) > > This is a mask, it should be if ((state->getRender() & 3) == 1) Done. Also, the updated version I am uploading has better code for GfxFont comparison. (See http://lists.freedesktop.org/archives/poppler/2013-December/010702.html)
Created attachment 90311 [details] [review] [PATCH v13 01/11] Tagged-PDF: Text content extraction from structure elements
(In reply to comment #66) > Created attachment 90311 [details] [review] [review] > [PATCH v13 01/11] Tagged-PDF: Text content extraction from structure elements Pushed to git master! This was the last patch pending to land, so I'm closing this bug.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.