In poppler, TextWord has a lot of information about the text, like font family, color, etc. It could be great be able to get these attributes from glib interface to make applications like evince more accessibles.
Created attachment 42440 [details] [review] Get text attributes in glib Returns text attributes as an array of PangoAttrList for each character. Currently the information sended is: * Font size * Font name * Underlined * Foreground color
Review of attachment 42440 [details] [review]: We don't want to add a pango dependency, so I think we could create a PopplerTextInfo object containing the text attributes as exposed by the poppler core API. ::: glib/poppler-page.cc @@ +2220,3 @@ + * Return value: %TRUE if the page contains text, %FALSE otherwise + * + * Since: 0.16 This should be 0.18 @@ +2252,3 @@ + { + word = wordlist->get (i); + *n_attributes += word->getLength () + 1; Since text attributes are per word, why don't we return a list of words instead of a list of characters? @@ +2288,3 @@ + word->getColor (&r, &g, &b); + color_attr = pango_attr_foreground_new ((int)(r * 65535), (int)(g * 65535), (int)(b * 65535)); + I guess that in most of the cases a lot of words will share the text attributes, I think we could use a cache and simply add a reference to the existing text attrs objects to avoid duplicating the same info all the time.
(In reply to comment #2) > Review of attachment 42440 [details] [review]: > > @@ +2252,3 @@ > + { > + word = wordlist->get (i); > + *n_attributes += word->getLength () + 1; > > Since text attributes are per word, why don't we return a list of words instead > of a list of characters? > Because we need to know each character attributes, to be mapped over get_text output.
Created attachment 42859 [details] [review] Text attributes in glib using PopplerTextInfo Created a new type PopplerTextInfo to return text attributes and created only one instance per different PopplerTextInfo. The return value now is a GArray where each element is a PopplerTextinfo for each char returned by poppler_page_get_text.
Created attachment 44011 [details] [review] checking null pointer At some pdf, font name is null, so we need to check that.
Ping. :-) (What's the status of this?)
(In reply to comment #4) > Created an attachment (id=42859) [details] > Text attributes in glib using PopplerTextInfo > > Created a new type PopplerTextInfo to return text attributes and created only > one instance per different PopplerTextInfo. > > The return value now is a GArray where each element is a PopplerTextinfo for > each char returned by poppler_page_get_text. I still don't like the idea of returning an array of attrs for each character, would it be possible to return a list of attrs containing information about what characters the apply to? like start - end?
Created attachment 50405 [details] [review] Adds get text attributes function to poppler glib This patch changes the GArray of PopplerTextInfo for a GList of PopplerTextInfo and now PopplerTextInfo has start and end gint to indicate the start and end offset of the TextInfo group. I also modified poppler-glib-demo to show text attributes in text section.
Created attachment 50406 [details] [review] glib-Added-get-text-attributes GList at poppler_page_free_text_attributes in previous patch I forgot to change.
Review of attachment 50406 [details] [review]: Patch looks good to me, I have a few comments, though. Since we are in a hurry to have this new api in 0.18 I've modified the patch and pushed to git master. Thanks! ::: glib/poppler-page.cc @@ +1485,3 @@ + poppler_text_info_copy, + poppler_text_info_free) + The name of the struct is inconsistent with the api poppler_page_get_text_attributes(), I think the struct name should be PopplerTextAttributes. @@ +1493,3 @@ + * Returns: a new #PopplerTextInfo with + * poppler_text_info_free() to free it + */ Since: 0.18 tag is missing here. @@ +1498,3 @@ +{ + PopplerTextInfo *tinfo = g_slice_new0 (PopplerTextInfo); + tinfo->color = poppler_color_new (); Since the color is never null, we can allocate it in the stack instead. @@ +1538,3 @@ + fontname++; + } + This code to get the font name can be factored out, also you are iterating the string twice to look for +, if there's a subset, i is already the index of + in the string. @@ +2030,3 @@ +poppler_page_free_text_attributes (GList *list) +{ + g_list_free_full (list, (GDestroyNotify)poppler_text_info_free); This is new api added in glib 2.28, poppler depends on 2.18 @@ +2077,3 @@ + + // each char of the word has the same attributes + textinfo2 = poppler_text_info_new_from_word (word); Instead of always creating a new text info, we can compare TextWord objects to decide whether to create a new text info or not, so that we avoid unneeded create/destroy. @@ +2084,3 @@ + textinfo2->color->red != textinfo->color->red || + textinfo2->color->green != textinfo->color->green || + textinfo2->color->blue != textinfo->color->blue) This can be factored out into a method that compares attrs of two words @@ +2090,3 @@ + textinfo = textinfo2; + textinfo->start = offset; + attributes = g_list_append (attributes, textinfo); Using g_list_prepend() + g_list_reverse() is more efficient than g_list_append()
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.