12808 – form text input paced wrong

Status:	RESOLVED FIXED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	cairo backend
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2007-10-15 06:39 UTC by Sebastien Bacher
Modified:	2008-02-13 10:00 UTC
CC List:	1 user

Attachments
Patch to add better Unicode support to form fields, fixing some alignment bugs (22.75 KB, patch) 2008-02-08 14:57 UTC, Michael Vrable

Description Sebastien Bacher 2007-10-15 06:39:11 UTC

This bug was originally reported at https://bugs.launchpad.net/bugs/152929

"Binary package hint: evince

The new evince supports form input. This particular form (attached) has the following problem (on Windows, with Acrobat, it displays correctly):
when I type a letter into a 1-character field, the letter is displayed under the image "cell"; it should be placed in the center of the cell instead.
I will attach a screenshot for a more accurate description.

http://launchpadlibrarian.net/10008166/F23.pdf
F23.pdf  (330.7 KiB, application/pdf)"

Comment 1 Michael Vrable 2008-01-29 20:48:07 UTC

I have run into the same problem, and can verify that it is still present in the most recent (as of 2008-01-29) development sources from git. I spent some time debugging, and have found what I think is the problem.

Form field contents are displayed in Annot::drawText. The text to display (passed as text) is a UTF-16 string, with byte-order mark (BOM). The field value is, I believe, set in FormWidgetText::setContent, which explicitly adds a BOM to the string.

When generating the appearance stream, the text is converted (in Annot::writeTextString) from Unicode to the appropriate 8-bit characters needed for the selected font. However, the string width is calculated before this, in the main body of Annot::drawText, treating the original unconverted UTF-16 string as an 8-bit string. The two bytes in the BOM (FE FF) are treated as characters to display, so the computed width is too large. This doesn't affect left-justified form fields, but centered and right-justified fields are placed incorrectly.
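The miscalculation described above can be illustrated with a minimal sketch. This is written in Python rather than poppler's C++, and it is not the code in Annot::drawText: the function names, the flat glyph width of 500 units, and the zero width for NUL bytes are all assumptions made only to show the effect.

```python
# Hypothetical glyph metrics: every glyph is 500/1000 em wide, and a NUL
# byte is assumed to have no width. Real values come from the font.
GLYPH_WIDTH = 500


def byte_width(b: int) -> int:
    return 0 if b == 0 else GLYPH_WIDTH


def width_as_bytes(raw: bytes, font_size: float) -> float:
    """Buggy measurement: every byte of the UTF-16 string counts as a glyph."""
    return sum(byte_width(b) for b in raw) * font_size / 1000


def width_as_text(raw: bytes, font_size: float) -> float:
    """Measurement after stripping the BOM and decoding to characters."""
    text = raw[2:].decode("utf-16-be") if raw.startswith(b"\xfe\xff") else raw.decode("latin-1")
    return len(text) * GLYPH_WIDTH * font_size / 1000


def centered_x(field_width: float, text_width: float) -> float:
    """Left edge of the text when it is centered in the field."""
    return (field_width - text_width) / 2


value = b"\xfe\xff" + "A".encode("utf-16-be")  # FE FF 00 41
wrong = width_as_bytes(value, 10)   # counts FE, FF and 41 as glyphs
right = width_as_text(value, 10)    # counts only "A"
print(centered_x(20, wrong), centered_x(20, right))
```

With these assumed metrics the byte-based width is three times the real one, so a centered single character is drawn well to the left of where it belongs. A left-justified field is unaffected because its starting position does not depend on the width.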

I currently have an ugly patch which works around this bug, and field alignment appears correct after applying it. But I don't yet handle anything other than a simple single-line form field, so there are other cases which are probably still buggy.

A larger issue, which I'm trying to figure out, is how the form field contents are supposed to be interpreted. Section 8.6.3 of the PDF 1.6 specification says "The field's text is held in a text string (or, beginning with PDF 1.5, a stream) in the V (value) entry of the field dictionary. The contents of this text string or stream are used to construct an appearance stream for displaying the field...". The phrase "text string" seems to imply that the string is either in PDFDocEncoding or UTF-16, which is what poppler seems to assume. However, from a little experimentation it seems Acrobat Reader (sorry, I forget which version) simply treats the field value as a string to be interpreted according to whatever encoding is used by the font for the field, not PDFDocEncoding. I'm currently trying to make some sense of this, and to figure out what the correct fix for the problem is.
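For reference, the "text string" interpretation mentioned above (UTF-16 when the string starts with the FE FF byte-order mark, PDFDocEncoding otherwise) might be sketched as follows. The function names are hypothetical, and Latin-1 is used only as a stand-in for PDFDocEncoding: the two agree for printable ASCII and most of the upper range, but not everywhere, so this is an approximation and not a full decoder.

```python
BOM = b"\xfe\xff"


def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string: UTF-16BE if BOM-prefixed, else 8-bit."""
    if raw.startswith(BOM):
        return raw[len(BOM):].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for this sketch.
    return raw.decode("latin-1")


def encode_text_string(text: str) -> bytes:
    """Encode a field value as BOM-prefixed UTF-16BE."""
    return BOM + text.encode("utf-16-be")


print(encode_text_string("A"))
print(decode_text_string(encode_text_string("Zo\u00eb")))
```

A reader that instead interprets the same bytes through the field font's encoding, as Acrobat Reader appeared to do in the experiment above, would produce different characters for any value that is not plain ASCII.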

Comment 2 Michael Vrable 2008-01-30 19:48:10 UTC

Some more investigation of the behavior of Adobe Reader 7.0.9 (Windows):

I'm not sure I should use Adobe Reader as a guide for proper behavior.  My test file is http://www.irs.gov/pub/irs-pdf/f1040.pdf.  Adobe Reader is exhibiting some rather strange behavior here: the default appearance string for most form fields specifies /HeBo (Helvetica-Bold) as the font, but when editing a field and saving the resulting file, it looks like Adobe Reader is using /Helvetica-Condensed-Bold as the font.  Additionally, the two fonts have different encodings specified (WinAnsiEncoding vs. StandardEncoding), so it's not so surprising that some character encoding issues are coming up.
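To show why the encoding mismatch matters, here is a small sketch of how the same bytes map to different characters under the two encodings. It is an illustration only: Python's cp1252 codec is used as a stand-in for WinAnsiEncoding, and the StandardEncoding table lists just four code points where the two differ, so the helper is not a complete decoder.

```python
# Assumption: cp1252 approximates WinAnsiEncoding for the bytes used here.
WIN_ANSI_CODEC = "cp1252"

# A few StandardEncoding code points that differ from WinAnsiEncoding.
STANDARD_DIFFS = {
    0x27: "\u2019",  # quoteright (WinAnsi: quotesingle)
    0x60: "\u2018",  # quoteleft  (WinAnsi: grave)
    0xA4: "\u2044",  # fraction   (WinAnsi: currency)
    0xA8: "\u00a4",  # currency   (WinAnsi: dieresis)
}


def decode_standard(raw: bytes) -> str:
    """Partial StandardEncoding decoder; unlisted bytes fall back to chr()."""
    return "".join(STANDARD_DIFFS.get(b, chr(b)) for b in raw)


sample = b"it's"
print(sample.decode(WIN_ANSI_CODEC), decode_standard(sample))
```

Even an apostrophe is rendered as a different glyph depending on which encoding is applied, so a value written for one font can display incorrectly after the reader switches to the other.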

At the very least, it does seem that Adobe Reader will decode form field values that are encoded in UTF-16 (though it still displays them incorrectly).  So, using UTF-16 for form field values in poppler seems reasonable.

I'll see if I can't clean up my earlier patch a bit and post something.

Comment 3 Michael Vrable 2008-02-08 14:57:28 UTC

Created attachment 14224
Patch to add better Unicode support to form fields, fixing some alignment bugs

This is a patch I've created which fixes (for me) the mis-alignment of text in form fields.  This has also been posted to the mailing list, but I'm including it here as well so there is a record with the bug itself.

This may not be committed to the poppler tree until after some reorganization (to split forms code out from the core annotations support), but the attached patch should apply cleanly to git commit 3e994e8586fa1c87ef7e7f82af1cdacf2cd36310.

Comment 4 Carlos Garcia Campos 2008-02-13 10:00:11 UTC

Patch has been committed to git master. Thank you very much.