Summary: | Expose text attributes in glib interface | ||
---|---|---|---|
Product: | poppler | Reporter: | Daniel Garcia <dani> |
Component: | glib frontend | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | apinheiro |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Get text attributes in glib
Text attributes in glib using PopplerTextInfo checking null pointer Adds get text attributes function to poppler glib glib-Added-get-text-attributes |
Description
Daniel Garcia
2011-01-19 03:02:03 UTC
Created attachment 42440 [details] [review] Get text attributes in glib Returns text attributes as an array of PangoAttrList for each character. Currently the information sended is: * Font size * Font name * Underlined * Foreground color Review of attachment 42440 [details] [review]: We don't want to add a pango dependency, so I think we could create a PopplerTextInfo object containing the text attributes as exposed by the poppler core API. ::: glib/poppler-page.cc @@ +2220,3 @@ + * Return value: %TRUE if the page contains text, %FALSE otherwise + * + * Since: 0.16 This should be 0.18 @@ +2252,3 @@ + { + word = wordlist->get (i); + *n_attributes += word->getLength () + 1; Since text attributes are per word, why don't we return a list of words instead of a list of characters? @@ +2288,3 @@ + word->getColor (&r, &g, &b); + color_attr = pango_attr_foreground_new ((int)(r * 65535), (int)(g * 65535), (int)(b * 65535)); + I guess that in most of the cases a lot of words will share the text attributes, I think we could use a cache and simply add a reference to the existing text attrs objects to avoid duplicating the same info all the time. (In reply to comment #2) > Review of attachment 42440 [details] [review]: > > @@ +2252,3 @@ > + { > + word = wordlist->get (i); > + *n_attributes += word->getLength () + 1; > > Since text attributes are per word, why don't we return a list of words instead > of a list of characters? > Because we need to know each character attributes, to be mapped over get_text output. Created attachment 42859 [details] [review] Text attributes in glib using PopplerTextInfo Created a new type PopplerTextInfo to return text attributes and created only one instance per different PopplerTextInfo. The return value now is a GArray where each element is a PopplerTextinfo for each char returned by poppler_page_get_text. Created attachment 44011 [details] [review] checking null pointer At some pdf, font name is null, so we need to check that. Ping. :-) (What's the status of this?) (In reply to comment #4) > Created an attachment (id=42859) [details] > Text attributes in glib using PopplerTextInfo > > Created a new type PopplerTextInfo to return text attributes and created only > one instance per different PopplerTextInfo. > > The return value now is a GArray where each element is a PopplerTextinfo for > each char returned by poppler_page_get_text. I still don't like the idea of returning an array of attrs for each character, would it be possible to return a list of attrs containing information about what characters the apply to? like start - end? Created attachment 50405 [details] [review] Adds get text attributes function to poppler glib This patch changes the GArray of PopplerTextInfo for a GList of PopplerTextInfo and now PopplerTextInfo has start and end gint to indicate the start and end offset of the TextInfo group. I also modified poppler-glib-demo to show text attributes in text section. Created attachment 50406 [details] [review] glib-Added-get-text-attributes GList at poppler_page_free_text_attributes in previous patch I forgot to change. Review of attachment 50406 [details] [review]: Patch looks good to me, I have a few comments, though. Since we are in a hurry to have this new api in 0.18 I've modified the patch and pushed to git master. Thanks! ::: glib/poppler-page.cc @@ +1485,3 @@ + poppler_text_info_copy, + poppler_text_info_free) + The name of the struct is inconsistent with the api poppler_page_get_text_attributes(), I think the struct name should be PopplerTextAttributes. @@ +1493,3 @@ + * Returns: a new #PopplerTextInfo with + * poppler_text_info_free() to free it + */ Since: 0.18 tag is missing here. @@ +1498,3 @@ +{ + PopplerTextInfo *tinfo = g_slice_new0 (PopplerTextInfo); + tinfo->color = poppler_color_new (); Since the color is never null, we can allocate it in the stack instead. @@ +1538,3 @@ + fontname++; + } + This code to get the font name can be factored out, also you are iterating the string twice to look for +, if there's a subset, i is already the index of + in the string. @@ +2030,3 @@ +poppler_page_free_text_attributes (GList *list) +{ + g_list_free_full (list, (GDestroyNotify)poppler_text_info_free); This is new api added in glib 2.28, poppler depends on 2.18 @@ +2077,3 @@ + + // each char of the word has the same attributes + textinfo2 = poppler_text_info_new_from_word (word); Instead of always creating a new text info, we can compare TextWord objects to decide whether to create a new text info or not, so that we avoid unneeded create/destroy. @@ +2084,3 @@ + textinfo2->color->red != textinfo->color->red || + textinfo2->color->green != textinfo->color->green || + textinfo2->color->blue != textinfo->color->blue) This can be factored out into a method that compares attrs of two words @@ +2090,3 @@ + textinfo = textinfo2; + textinfo->start = offset; + attributes = g_list_append (attributes, textinfo); Using g_list_prepend() + g_list_reverse() is more efficient than g_list_append() |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.