Created attachment 66623 [details]
sample pdf for testing

The indexes in a PopplerTextAttributes struct don't seem to map correctly to the respective characters in the string returned by poppler_page_get_text(). A trivial test program that reproduces the unexpected behavior can be found here: https://gist.github.com/3624542; test file attached.
Created attachment 70475 [details] [review]
check if words end with spaces

poppler_page_get_text_layout and poppler_page_get_text_attributes assume that each word ends with a space or newline, which causes their indexes to become mismatched from the text whenever a word is not followed by one. This patch adds a check to TextWord::getSpaceAfter.

This fixes the problem for this file, but there are still other situations where the indexes can become mismatched because of the way TextSelectionDumper::getText aligns tables. I don't know how to fix that unless you want to modify poppler_page_get_text to return something simpler, instead of calling TextPage::getSelectionText to get the physical layout.
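To illustrate the kind of mismatch described in this comment, here is a minimal, self-contained sketch in plain C. It does not use poppler at all: the `Word` struct, `build_text` and `word_offsets` are hypothetical names invented for this example, and `space_after` only stands in for what TextWord::getSpaceAfter reports. It only shows how counting a separator after every word makes the computed offsets drift away from the actual string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a word on a line; this is NOT poppler's TextWord. */
typedef struct {
    const char *text;   /* word characters (ASCII only, for simplicity) */
    int space_after;    /* nonzero if a separator follows the word in the output */
} Word;

/* Build the output string: each word, followed by a space only when
 * space_after is set. Returns the number of characters written. */
size_t build_text(const Word *words, size_t n, char *out, size_t cap)
{
    size_t pos = 0;

    assert(cap > 0);
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(words[i].text);
        size_t need = len + (words[i].space_after ? 1 : 0);

        if (pos + need >= cap)
            break;      /* stop rather than overflow the buffer */
        memcpy(out + pos, words[i].text, len);
        pos += len;
        if (words[i].space_after)
            out[pos++] = ' ';
    }
    out[pos] = '\0';
    return pos;
}

/* Compute the start offset of each word.
 * assume_separator != 0: count one separator after every word
 *                        (the assumption described in this report).
 * assume_separator == 0: count a separator only when space_after is set. */
void word_offsets(const Word *words, size_t n, int assume_separator,
                  size_t *starts)
{
    size_t pos = 0;

    for (size_t i = 0; i < n; i++) {
        starts[i] = pos;
        pos += strlen(words[i].text);
        if (assume_separator || words[i].space_after)
            pos++;
    }
}
```

With the words "foo" (no space after), "bar" (space after) and "baz", the text is "foobar baz". Assuming a separator after every word puts "baz" at offset 8, while checking the flag gives 7, which is where it actually starts. This sketch does not cover the table-alignment case mentioned above.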
Sorry for the delay in reviewing this. The patch looks great; I've just pushed it to git master (with some minor coding-style changes). Thanks!