Bug 33269 - Expose text attributes in glib interface
Summary: Expose text attributes in glib interface
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: glib frontend (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-01-19 03:02 UTC by Daniel Garcia
Modified: 2011-08-22 04:45 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Get text attributes in glib (5.57 KB, patch)
2011-01-25 02:01 UTC, Daniel Garcia
Details | Splinter Review
Text attributes in glib using PopplerTextInfo (9.29 KB, patch)
2011-02-02 08:44 UTC, Daniel Garcia
Details | Splinter Review
checking null pointer (941 bytes, patch)
2011-03-02 02:10 UTC, Daniel Garcia
Details | Splinter Review
Adds get text attributes function to poppler glib (11.90 KB, patch)
2011-08-20 11:08 UTC, Daniel Garcia
Details | Splinter Review
glib-Added-get-text-attributes (11.90 KB, patch)
2011-08-20 12:05 UTC, Daniel Garcia
Details | Splinter Review

Description Daniel Garcia 2011-01-19 03:02:03 UTC
In poppler, TextWord has a lot of information about the text, like font family, color, etc. It could be great be able to get these attributes from glib interface to make applications like evince more accessibles.
Comment 1 Daniel Garcia 2011-01-25 02:01:12 UTC
Created attachment 42440 [details] [review]
Get text attributes in glib

Returns text attributes as an array of PangoAttrList for each character. Currently the information sended is:
 * Font size
 * Font name
 * Underlined
 * Foreground color
Comment 2 Carlos Garcia Campos 2011-01-29 04:27:27 UTC
Review of attachment 42440 [details] [review]:

We don't want to add a pango dependency, so I think we could create a PopplerTextInfo object containing the text attributes as exposed by the poppler core API.

::: glib/poppler-page.cc
@@ +2220,3 @@
+ * Return value: %TRUE if the page contains text, %FALSE otherwise
+ *
+ * Since: 0.16

This should be 0.18

@@ +2252,3 @@
+    {
+      word = wordlist->get (i);
+      *n_attributes += word->getLength () + 1;

Since text attributes are per word, why don't we return a list of words instead of a list of characters?

@@ +2288,3 @@
+      word->getColor (&r, &g, &b);
+      color_attr = pango_attr_foreground_new ((int)(r * 65535), (int)(g * 65535), (int)(b * 65535));
+

I guess that in most of the cases a lot of words will share the text attributes, I think we could use a cache and simply add a reference to the existing text attrs objects to avoid duplicating the same info all the time.
Comment 3 Daniel Garcia 2011-01-29 05:17:03 UTC
(In reply to comment #2)
> Review of attachment 42440 [details] [review]:
> 
> @@ +2252,3 @@
> +    {
> +      word = wordlist->get (i);
> +      *n_attributes += word->getLength () + 1;
> 
> Since text attributes are per word, why don't we return a list of words instead
> of a list of characters?
> 

Because we need to know each character attributes, to be mapped over get_text output.
Comment 4 Daniel Garcia 2011-02-02 08:44:26 UTC
Created attachment 42859 [details] [review]
Text attributes in glib using PopplerTextInfo

Created a new type PopplerTextInfo to return text attributes and created only one instance per different PopplerTextInfo.

The return value now is a GArray where each element is a PopplerTextinfo for each char returned by poppler_page_get_text.
Comment 5 Daniel Garcia 2011-03-02 02:10:24 UTC
Created attachment 44011 [details] [review]
checking null pointer

At some pdf, font name is null, so we need to check that.
Comment 6 Joanmarie Diggs 2011-08-14 09:19:09 UTC
Ping. :-) (What's the status of this?)
Comment 7 Carlos Garcia Campos 2011-08-18 05:58:48 UTC
(In reply to comment #4)
> Created an attachment (id=42859) [details]
> Text attributes in glib using PopplerTextInfo
> 
> Created a new type PopplerTextInfo to return text attributes and created only
> one instance per different PopplerTextInfo.
> 
> The return value now is a GArray where each element is a PopplerTextinfo for
> each char returned by poppler_page_get_text.

I still don't like the idea of returning an array of attrs for each character, would it be possible to return a list of attrs containing information about what characters the apply to? like start - end?
Comment 8 Daniel Garcia 2011-08-20 11:08:33 UTC
Created attachment 50405 [details] [review]
Adds get text attributes function to poppler glib

This patch changes the GArray of PopplerTextInfo for a GList of PopplerTextInfo and now PopplerTextInfo has start and end gint to indicate the start and end offset of the TextInfo group.

I also modified poppler-glib-demo to show text attributes in text section.
Comment 9 Daniel Garcia 2011-08-20 12:05:13 UTC
Created attachment 50406 [details] [review]
glib-Added-get-text-attributes

GList at poppler_page_free_text_attributes in previous patch I forgot to change.
Comment 10 Carlos Garcia Campos 2011-08-22 04:45:03 UTC
Review of attachment 50406 [details] [review]:

Patch looks good to me, I have a few comments, though. Since we are in a hurry to have this new api in 0.18 I've modified the patch and pushed to git master. Thanks!

::: glib/poppler-page.cc
@@ +1485,3 @@
+			   poppler_text_info_copy,
+			   poppler_text_info_free)
+

The name of the struct is inconsistent with the api poppler_page_get_text_attributes(), I think the struct name should be PopplerTextAttributes.

@@ +1493,3 @@
+ * Returns: a new #PopplerTextInfo with
+ * poppler_text_info_free() to free it
+ */

Since: 0.18 tag is missing here.

@@ +1498,3 @@
+{
+  PopplerTextInfo *tinfo = g_slice_new0 (PopplerTextInfo);
+  tinfo->color = poppler_color_new ();

Since the color is never null, we can allocate it in the stack instead.

@@ +1538,3 @@
+    fontname++;
+  }
+

This code to get the font name can be factored out, also you are iterating the string twice to look for +, if there's a subset, i is already the index of + in the string.

@@ +2030,3 @@
+poppler_page_free_text_attributes (GList *list)
+{
+  g_list_free_full (list, (GDestroyNotify)poppler_text_info_free);

This is new api added in glib 2.28, poppler depends on 2.18

@@ +2077,3 @@
+
+      // each char of the word has the same attributes
+      textinfo2 = poppler_text_info_new_from_word (word);

Instead of always creating a new text info, we can compare TextWord objects to decide whether to create a new text info or not, so that we avoid unneeded create/destroy.

@@ +2084,3 @@
+          textinfo2->color->red != textinfo->color->red ||
+          textinfo2->color->green != textinfo->color->green ||
+          textinfo2->color->blue != textinfo->color->blue)

This can be factored out into a method that compares attrs of two words

@@ +2090,3 @@
+          textinfo = textinfo2;
+          textinfo->start = offset;
+          attributes = g_list_append (attributes, textinfo);

Using g_list_prepend() + g_list_reverse() is more efficient than g_list_append()


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.