Bug 33269

Summary: Expose text attributes in glib interface
Product: poppler Reporter: Daniel Garcia <dani>
Component: glib frontendAssignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: apinheiro
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Get text attributes in glib
Text attributes in glib using PopplerTextInfo
checking null pointer
Adds get text attributes function to poppler glib
glib-Added-get-text-attributes

Description Daniel Garcia 2011-01-19 03:02:03 UTC
In poppler, TextWord has a lot of information about the text, like font family, color, etc. It could be great be able to get these attributes from glib interface to make applications like evince more accessibles.
Comment 1 Daniel Garcia 2011-01-25 02:01:12 UTC
Created attachment 42440 [details] [review]
Get text attributes in glib

Returns text attributes as an array of PangoAttrList for each character. Currently the information sended is:
 * Font size
 * Font name
 * Underlined
 * Foreground color
Comment 2 Carlos Garcia Campos 2011-01-29 04:27:27 UTC
Review of attachment 42440 [details] [review]:

We don't want to add a pango dependency, so I think we could create a PopplerTextInfo object containing the text attributes as exposed by the poppler core API.

::: glib/poppler-page.cc
@@ +2220,3 @@
+ * Return value: %TRUE if the page contains text, %FALSE otherwise
+ *
+ * Since: 0.16

This should be 0.18

@@ +2252,3 @@
+    {
+      word = wordlist->get (i);
+      *n_attributes += word->getLength () + 1;

Since text attributes are per word, why don't we return a list of words instead of a list of characters?

@@ +2288,3 @@
+      word->getColor (&r, &g, &b);
+      color_attr = pango_attr_foreground_new ((int)(r * 65535), (int)(g * 65535), (int)(b * 65535));
+

I guess that in most of the cases a lot of words will share the text attributes, I think we could use a cache and simply add a reference to the existing text attrs objects to avoid duplicating the same info all the time.
Comment 3 Daniel Garcia 2011-01-29 05:17:03 UTC
(In reply to comment #2)
> Review of attachment 42440 [details] [review]:
> 
> @@ +2252,3 @@
> +    {
> +      word = wordlist->get (i);
> +      *n_attributes += word->getLength () + 1;
> 
> Since text attributes are per word, why don't we return a list of words instead
> of a list of characters?
> 

Because we need to know each character attributes, to be mapped over get_text output.
Comment 4 Daniel Garcia 2011-02-02 08:44:26 UTC
Created attachment 42859 [details] [review]
Text attributes in glib using PopplerTextInfo

Created a new type PopplerTextInfo to return text attributes and created only one instance per different PopplerTextInfo.

The return value now is a GArray where each element is a PopplerTextinfo for each char returned by poppler_page_get_text.
Comment 5 Daniel Garcia 2011-03-02 02:10:24 UTC
Created attachment 44011 [details] [review]
checking null pointer

At some pdf, font name is null, so we need to check that.
Comment 6 Joanmarie Diggs 2011-08-14 09:19:09 UTC
Ping. :-) (What's the status of this?)
Comment 7 Carlos Garcia Campos 2011-08-18 05:58:48 UTC
(In reply to comment #4)
> Created an attachment (id=42859) [details]
> Text attributes in glib using PopplerTextInfo
> 
> Created a new type PopplerTextInfo to return text attributes and created only
> one instance per different PopplerTextInfo.
> 
> The return value now is a GArray where each element is a PopplerTextinfo for
> each char returned by poppler_page_get_text.

I still don't like the idea of returning an array of attrs for each character, would it be possible to return a list of attrs containing information about what characters the apply to? like start - end?
Comment 8 Daniel Garcia 2011-08-20 11:08:33 UTC
Created attachment 50405 [details] [review]
Adds get text attributes function to poppler glib

This patch changes the GArray of PopplerTextInfo for a GList of PopplerTextInfo and now PopplerTextInfo has start and end gint to indicate the start and end offset of the TextInfo group.

I also modified poppler-glib-demo to show text attributes in text section.
Comment 9 Daniel Garcia 2011-08-20 12:05:13 UTC
Created attachment 50406 [details] [review]
glib-Added-get-text-attributes

GList at poppler_page_free_text_attributes in previous patch I forgot to change.
Comment 10 Carlos Garcia Campos 2011-08-22 04:45:03 UTC
Review of attachment 50406 [details] [review]:

Patch looks good to me, I have a few comments, though. Since we are in a hurry to have this new api in 0.18 I've modified the patch and pushed to git master. Thanks!

::: glib/poppler-page.cc
@@ +1485,3 @@
+			   poppler_text_info_copy,
+			   poppler_text_info_free)
+

The name of the struct is inconsistent with the api poppler_page_get_text_attributes(), I think the struct name should be PopplerTextAttributes.

@@ +1493,3 @@
+ * Returns: a new #PopplerTextInfo with
+ * poppler_text_info_free() to free it
+ */

Since: 0.18 tag is missing here.

@@ +1498,3 @@
+{
+  PopplerTextInfo *tinfo = g_slice_new0 (PopplerTextInfo);
+  tinfo->color = poppler_color_new ();

Since the color is never null, we can allocate it in the stack instead.

@@ +1538,3 @@
+    fontname++;
+  }
+

This code to get the font name can be factored out, also you are iterating the string twice to look for +, if there's a subset, i is already the index of + in the string.

@@ +2030,3 @@
+poppler_page_free_text_attributes (GList *list)
+{
+  g_list_free_full (list, (GDestroyNotify)poppler_text_info_free);

This is new api added in glib 2.28, poppler depends on 2.18

@@ +2077,3 @@
+
+      // each char of the word has the same attributes
+      textinfo2 = poppler_text_info_new_from_word (word);

Instead of always creating a new text info, we can compare TextWord objects to decide whether to create a new text info or not, so that we avoid unneeded create/destroy.

@@ +2084,3 @@
+          textinfo2->color->red != textinfo->color->red ||
+          textinfo2->color->green != textinfo->color->green ||
+          textinfo2->color->blue != textinfo->color->blue)

This can be factored out into a method that compares attrs of two words

@@ +2090,3 @@
+          textinfo = textinfo2;
+          textinfo->start = offset;
+          attributes = g_list_append (attributes, textinfo);

Using g_list_prepend() + g_list_reverse() is more efficient than g_list_append()

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.