Bug 53925

Summary: poppler feeds invalid UTF-8 to cairo
Product: poppler
Reporter: Christian Persch (GNOME) <chpe>
Component: cairo backend
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED
QA Contact:
Severity: major    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Attachments:
    increase tolerance for overlapping glyphs
    move text to unicode conversion to a separate function

Description Christian Persch (GNOME) 2012-08-22 12:10:51 UTC
(Originally filed as https://bugzilla.gnome.org/show_bug.cgi?id=682451 )

Using poppler from git master and evince from git master (both updated today to check the bug still occurs), when trying to print pages 3-4 of http://www.unicode.org/Public/6.2.0/charts/blocks/U25A0.pdf I get two

Internal Error: cairo context error: input string not valid UTF-8

errors on console (and no output from the printer...). 

Breaking on _cairo_error in gdb, this is the first error:

Breakpoint 1, _cairo_error (status=CAIRO_STATUS_INVALID_STRING) at cairo.c:171
171	{
(gdb) where
#0  _cairo_error (status=CAIRO_STATUS_INVALID_STRING) at cairo.c:171
#1  0xb60c5119 in _cairo_validate_text_clusters (
    utf8=utf8@entry=0x82c7480 "\355\240\275\355\264\273@\b\020", utf8_len=utf8_len@entry=6, 
    glyphs=glyphs@entry=0x84b2500, num_glyphs=num_glyphs@entry=1, 
    clusters=clusters@entry=0x87d13e8, num_clusters=num_clusters@entry=1, 
    cluster_flags=cluster_flags@entry=(unknown: 0)) at cairo-misc.c:319
#2  0xb60ad0cb in cairo_show_text_glyphs (cr=0xb614b460, 
    utf8=0x82c7480 "\355\240\275\355\264\273@\b\020", utf8_len=6, glyphs=0x84b2500, 
    num_glyphs=1, clusters=0x87d13e8, num_clusters=1, cluster_flags=(unknown: 0))
    at cairo.c:3593
#3  0xa022937f in CairoOutputDev::endString (this=0x83a9d48, state=0x8748718)
    at CairoOutputDev.cc:1222
#4  0x9f7b2f0d in Gfx::doShowText (this=this@entry=0x874a400, s=0x87d41d8) at Gfx.cc:4036
#5  0x9f7b3ef9 in Gfx::opShowText (this=0x874a400, args=0xa00fe784, numArgs=1) at Gfx.cc:3737
#6  0x9f7a4366 in Gfx::execOp (this=this@entry=0x874a400, cmd=cmd@entry=0xa00fe764, 
    args=args@entry=0xa00fe784, numArgs=numArgs@entry=1) at Gfx.cc:857
#7  0x9f7aba5e in Gfx::go (this=this@entry=0x874a400, topLevel=topLevel@entry=false)
    at Gfx.cc:716
#8  0x9f7abf05 in Gfx::display (this=this@entry=0x874a400, obj=obj@entry=0xa00feb84, 
    topLevel=topLevel@entry=false) at Gfx.cc:682
#9  0x9f7ac30b in Gfx::drawForm (this=this@entry=0x874a400, str=str@entry=0xa00feb84, 
    resDict=resDict@entry=0x83f72f0, matrix=matrix@entry=0xa00feb00, 
    bbox=bbox@entry=0xa00feae0, transpGroup=transpGroup@entry=false, 
    softMask=softMask@entry=false, blendingColorSpace=blendingColorSpace@entry=0x0, 
    isolated=isolated@entry=false, knockout=knockout@entry=false, alpha=alpha@entry=false, 
    transferFunc=transferFunc@entry=0x0, backdropColor=backdropColor@entry=0x0) at Gfx.cc:4830
#10 0x9f7ad3e5 in Gfx::doForm (this=this@entry=0x874a400, str=str@entry=0xa00feb84)
    at Gfx.cc:4753
#11 0x9f7b0123 in Gfx::opXObject (this=0x874a400, args=0xa00feca4, numArgs=1) at Gfx.cc:4127
#12 0x9f7a4366 in Gfx::execOp (this=this@entry=0x874a400, cmd=cmd@entry=0xa00fec84, 
    args=args@entry=0xa00feca4, numArgs=numArgs@entry=1) at Gfx.cc:857
#13 0x9f7aba5e in Gfx::go (this=this@entry=0x874a400, topLevel=topLevel@entry=true)
    at Gfx.cc:716
#14 0x9f7abf05 in Gfx::display (this=0x874a400, obj=0xa00fef34, topLevel=true) at Gfx.cc:682
#15 0x9f7f0b93 in Page::displaySlice (this=0x834fd80, out=0x83a9d48, hDPI=72, vDPI=72, 
    rotate=0, useMediaBox=false, crop=true, sliceX=-1, sliceY=-1, sliceW=-1, sliceH=-1, 
    printing=true, abortCheckCbk=0, abortCheckCbkData=0x0, 
    annotDisplayDecideCbk=0xa021da60 <poppler_print_annot_cb(Annot*, void*)>, 
    annotDisplayDecideCbkData=0x1) at Page.cc:520
#16 0xa021e3f5 in _poppler_page_render (page=0x833b700, cairo=0xb614b460, printing=true, 
    print_flags=POPPLER_PRINT_MARKUP_ANNOTS) at poppler-page.cc:358
#17 0xa0247167 in pdf_document_print_print_page (document=0x8316b50, page=0x833bc30, 
    cr=0xb614b460) at ev-poppler.cc:1934
#18 0xb7f719cb in ev_document_print_print_page (document_print=0x8316b50, page=0x833bc30, 
    cr=0xb614b460) at ev-document-print.c:40
#19 0xb7f274ed in ev_job_print_run (job=0x8669c60) at ev-jobs.c:1866
#20 0xb7f232ba in ev_job_run (job=0x8669c60) at ev-jobs.c:215
#21 0xb7f27c2b in ev_job_thread (job=0x8669c60) at ev-job-scheduler.c:184
#22 0xb7f27d38 in ev_job_thread_proxy (data=0x0) at ev-job-scheduler.c:217
#23 0xb5c4e47f in g_thread_proxy (data=0x832d720) at gthread.c:801
#24 0xb5bc4adf in start_thread (arg=0xa00ffb40) at pthread_create.c:309
#25 0xb5ab754e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:133
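
For reference, the six bytes that gdb prints for `utf8` in frames #1 and #2 can be inspected outside poppler. The following Python sketch is not part of poppler or cairo, and its variable names are made up for illustration. It suggests that a strict UTF-8 decoder rejects the bytes, and that they are the byte-wise encoding of two UTF-16 surrogate code units:

```python
# The first utf8_len=6 bytes of the string from the backtrace
# (gdb shows them as octal escapes).
raw = b"\355\240\275\355\264\273"

# A strict UTF-8 decoder refuses the sequence, as cairo does.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" lets the surrogates through so they can be inspected.
units = [ord(ch) for ch in raw.decode("utf-8", "surrogatepass")]

print(strict_ok, [hex(u) for u in units])
# prints: False ['0xd83d', '0xdd3b']
```

The two values fall in the high (D800-DBFF) and low (DC00-DFFF) surrogate ranges, so they look like one UTF-16 surrogate pair that was encoded to UTF-8 one code unit at a time.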
Comment 1 Adrian Johnson 2012-08-24 13:46:37 UTC
The problem is that surrogate pairs are not decoded before converting to UTF-8. The patch https://bugs.freedesktop.org/attachment.cgi?id=58178 (bug 46603 "convert utf-16 to ucs-4 when reading ToUnicode") fixes this by moving all surrogate pair handling to the point where the UTF-16 characters are read, so that the internal Unicode type contains only UTF-32 values.
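
As an illustration of the decoding step, here is a minimal Python sketch that combines UTF-16 surrogate pairs into UCS-4 code points before encoding to UTF-8. It is not poppler's code: `utf16_to_ucs4` is a hypothetical helper, and substituting U+FFFD for an unpaired surrogate is an assumption of this sketch. The input pair is the one that the six bytes in the backtrace encode.

```python
def utf16_to_ucs4(units):
    """Combine UTF-16 code units into UCS-4 code points (hypothetical helper)."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by a low surrogate: one code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: substitute U+FFFD (assumption of this sketch).
            out.append(0xFFFD)
            i += 1
        else:
            out.append(u)
            i += 1
    return out


code_points = utf16_to_ucs4([0xD83D, 0xDD3B])
utf8 = "".join(chr(cp) for cp in code_points).encode("utf-8")
print([hex(cp) for cp in code_points], utf8)
# prints: ['0x1f53b'] b'\xf0\x9f\x94\xbb'
```

Once the pair is combined first, the result is a single four-byte UTF-8 sequence rather than two three-byte sequences that a validator rejects.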
Comment 2 Albert Astals Cid 2012-08-26 22:40:29 UTC
I'll have a look to see whether integrating the patch Adrian mentions breaks anything, "soon" by some definition of "soon" :D
Comment 3 Albert Astals Cid 2012-08-27 20:47:36 UTC
Regression in pdftotext output in

https://bugs.freedesktop.org/attachment.cgi?id=58045

-In our disk model ̃
-𝑃 of the projective plane, we have obtained four bundles of half
+̃ of the projective plane, we have obtained four bundles of half
+In our disk model 𝑃

It is true that the original is not perfect, but at least it is in the correct order; the new output swaps the order of the text (i.e. "In our disk model" has to come before "of the projective plane", not after).
Comment 4 Adrian Johnson 2012-08-28 13:05:04 UTC
Created attachment 66222
increase tolerance for overlapping glyphs

This patch fixes the regression.
Comment 5 Adrian Johnson 2012-08-28 13:10:01 UTC
Created attachment 66223
move text to unicode conversion to a separate function

As a result of the first patch, ActualText also needs to convert UTF-16 to UCS-4. This patch (from bug 46603, with a small fix) factors out the code for converting text to Unicode that was duplicated in ActualText and pdfinfo.
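
A rough Python sketch of what such a shared conversion function might look like follows. It is not the code from the attachment: `text_string_to_ucs4` is a hypothetical name, the byte order mark check follows the PDF convention that text strings starting with FE FF are UTF-16BE, and Latin-1 is used here only as a stand-in for PDFDocEncoding (the two differ for some byte values).

```python
def text_string_to_ucs4(data):
    """Convert a PDF text string (bytes) to a list of UCS-4 code points.

    Hypothetical helper for illustration only.
    """
    if data[:2] == b"\xfe\xff":
        # UTF-16BE with a byte order mark; Python's decoder combines
        # surrogate pairs, and "replace" maps malformed input to U+FFFD.
        text = data[2:].decode("utf-16-be", "replace")
    else:
        # Assumption: Latin-1 approximates PDFDocEncoding.
        text = data.decode("latin-1")
    return [ord(ch) for ch in text]


# Two callers (for example an ActualText handler and a metadata printer)
# could share the same function instead of duplicating the logic.
actual_text = text_string_to_ucs4(b"\xfe\xff\xd8\x3d\xdd\x3b")
info_title = text_string_to_ucs4(b"abc")
print(actual_text, info_title)
```

Keeping the surrogate handling in one place means every consumer receives code points and none of them has to remember to decode pairs itself.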
Comment 6 Albert Astals Cid 2012-08-30 20:37:41 UTC
I've committed these patches, but only to master (i.e. 0.22.0), since they change pdftotext output for a lot of files (around 400 in my test suite). It is true that most are improvements, but with such a huge change I don't feel like putting it in 0.20.x.

P.S.: My eyes bleed after looking at the diffs of all those pdftotext outputs
Comment 7 Carlos Garcia Campos 2012-08-31 07:05:31 UTC
(In reply to comment #6)
> I've committed these patches, but only to master (i.e. 0.22.0), since they change
> pdftotext output for a lot of files (around 400 in my test suite). It is true
> that most are improvements, but with such a huge change I don't feel like
> putting it in 0.20.x.
> 
> P.S.: My eyes bleed after looking at the diffs of all those pdftotext outputs

Thanks both!
