(Originally filed as https://bugzilla.gnome.org/show_bug.cgi?id=682451 )

Using poppler from git master and evince from git master (both updated today to check the bug still occurs), when trying to print pages 3-4 of http://www.unicode.org/Public/6.2.0/charts/blocks/U25A0.pdf I get two

Internal Error: cairo context error: input string not valid UTF-8

errors on console (and no output from the printer...).

Breaking on _cairo_error in gdb, this is the first error:

Breakpoint 1, _cairo_error (status=CAIRO_STATUS_INVALID_STRING) at cairo.c:171
171     {
(gdb) where
#0  _cairo_error (status=CAIRO_STATUS_INVALID_STRING) at cairo.c:171
#1  0xb60c5119 in _cairo_validate_text_clusters (utf8=utf8@entry=0x82c7480 "\355\240\275\355\264\273@\b\020", utf8_len=utf8_len@entry=6, glyphs=glyphs@entry=0x84b2500, num_glyphs=num_glyphs@entry=1, clusters=clusters@entry=0x87d13e8, num_clusters=num_clusters@entry=1, cluster_flags=cluster_flags@entry=(unknown: 0)) at cairo-misc.c:319
#2  0xb60ad0cb in cairo_show_text_glyphs (cr=0xb614b460, utf8=0x82c7480 "\355\240\275\355\264\273@\b\020", utf8_len=6, glyphs=0x84b2500, num_glyphs=1, clusters=0x87d13e8, num_clusters=1, cluster_flags=(unknown: 0)) at cairo.c:3593
#3  0xa022937f in CairoOutputDev::endString (this=0x83a9d48, state=0x8748718) at CairoOutputDev.cc:1222
#4  0x9f7b2f0d in Gfx::doShowText (this=this@entry=0x874a400, s=0x87d41d8) at Gfx.cc:4036
#5  0x9f7b3ef9 in Gfx::opShowText (this=0x874a400, args=0xa00fe784, numArgs=1) at Gfx.cc:3737
#6  0x9f7a4366 in Gfx::execOp (this=this@entry=0x874a400, cmd=cmd@entry=0xa00fe764, args=args@entry=0xa00fe784, numArgs=numArgs@entry=1) at Gfx.cc:857
#7  0x9f7aba5e in Gfx::go (this=this@entry=0x874a400, topLevel=topLevel@entry=false) at Gfx.cc:716
#8  0x9f7abf05 in Gfx::display (this=this@entry=0x874a400, obj=obj@entry=0xa00feb84, topLevel=topLevel@entry=false) at Gfx.cc:682
#9  0x9f7ac30b in Gfx::drawForm (this=this@entry=0x874a400, str=str@entry=0xa00feb84, resDict=resDict@entry=0x83f72f0, matrix=matrix@entry=0xa00feb00, bbox=bbox@entry=0xa00feae0, transpGroup=transpGroup@entry=false, softMask=softMask@entry=false, blendingColorSpace=blendingColorSpace@entry=0x0, isolated=isolated@entry=false, knockout=knockout@entry=false, alpha=alpha@entry=false, transferFunc=transferFunc@entry=0x0, backdropColor=backdropColor@entry=0x0) at Gfx.cc:4830
#10 0x9f7ad3e5 in Gfx::doForm (this=this@entry=0x874a400, str=str@entry=0xa00feb84) at Gfx.cc:4753
#11 0x9f7b0123 in Gfx::opXObject (this=0x874a400, args=0xa00feca4, numArgs=1) at Gfx.cc:4127
#12 0x9f7a4366 in Gfx::execOp (this=this@entry=0x874a400, cmd=cmd@entry=0xa00fec84, args=args@entry=0xa00feca4, numArgs=numArgs@entry=1) at Gfx.cc:857
#13 0x9f7aba5e in Gfx::go (this=this@entry=0x874a400, topLevel=topLevel@entry=true) at Gfx.cc:716
#14 0x9f7abf05 in Gfx::display (this=0x874a400, obj=0xa00fef34, topLevel=true) at Gfx.cc:682
#15 0x9f7f0b93 in Page::displaySlice (this=0x834fd80, out=0x83a9d48, hDPI=72, vDPI=72, rotate=0, useMediaBox=false, crop=true, sliceX=-1, sliceY=-1, sliceW=-1, sliceH=-1, printing=true, abortCheckCbk=0, abortCheckCbkData=0x0, annotDisplayDecideCbk=0xa021da60 <poppler_print_annot_cb(Annot*, void*)>, annotDisplayDecideCbkData=0x1) at Page.cc:520
#16 0xa021e3f5 in _poppler_page_render (page=0x833b700, cairo=0xb614b460, printing=true, print_flags=POPPLER_PRINT_MARKUP_ANNOTS) at poppler-page.cc:358
#17 0xa0247167 in pdf_document_print_print_page (document=0x8316b50, page=0x833bc30, cr=0xb614b460) at ev-poppler.cc:1934
#18 0xb7f719cb in ev_document_print_print_page (document_print=0x8316b50, page=0x833bc30, cr=0xb614b460) at ev-document-print.c:40
#19 0xb7f274ed in ev_job_print_run (job=0x8669c60) at ev-jobs.c:1866
#20 0xb7f232ba in ev_job_run (job=0x8669c60) at ev-jobs.c:215
#21 0xb7f27c2b in ev_job_thread (job=0x8669c60) at ev-job-scheduler.c:184
#22 0xb7f27d38 in ev_job_thread_proxy (data=0x0) at ev-job-scheduler.c:217
#23 0xb5c4e47f in g_thread_proxy (data=0x832d720) at gthread.c:801
#24 0xb5bc4adf in start_thread (arg=0xa00ffb40) at pthread_create.c:309
#25 0xb5ab754e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:133
The problem is that surrogate pairs are not decoded before converting to utf8. The patch https://bugs.freedesktop.org/attachment.cgi?id=58178 (bug 46603 "convert utf-16 to ucs-4 when reading ToUnicode") fixes this issue by moving all instances of the surrogate pair handling to where the UTF-16 characters are read to ensure that the internal Unicode type contains only UTF-32 values.
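To make the failure concrete: the six bytes shown in the backtrace (\355\240\275\355\264\273, i.e. ED A0 BD ED B4 BB) are what you get when the UTF-16 surrogates 0xD83D and 0xDD3B are each encoded to UTF-8 on their own, which is not valid UTF-8. The following is only a sketch of the idea, not poppler's code; appendUtf8, utf16ToUcs4 and toUtf8 are hypothetical helper names introduced for illustration, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not poppler API): append one value as UTF-8.
// It deliberately does not reject surrogates, so the bug can be reproduced.
static void appendUtf8(std::string &out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF); // outside the Unicode range is a caller error
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: combine UTF-16 surrogate pairs into UCS-4 values.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
static std::vector<std::uint32_t> utf16ToUcs4(const std::vector<std::uint16_t> &in)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint32_t u = in[i];
        const bool isHigh = u >= 0xD800 && u < 0xDC00;
        const bool isLow = u >= 0xDC00 && u < 0xE000;
        if (isHigh && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] < 0xE000) {
            // Valid pair: 10 bits from each half, offset by 0x10000.
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00));
            ++i; // the low surrogate has been consumed
        } else if (isHigh || isLow) {
            out.push_back(0xFFFD);
        } else {
            out.push_back(u);
        }
    }
    return out;
}

// Hypothetical helper: encode a UCS-4 sequence as UTF-8.
static std::string toUtf8(const std::vector<std::uint32_t> &cps)
{
    std::string out;
    for (std::uint32_t cp : cps) {
        appendUtf8(out, cp);
    }
    return out;
}
```

Feeding {0xD83D, 0xDD3B} unit by unit to appendUtf8 reproduces the six-byte string from the backtrace, while toUtf8(utf16ToUcs4(...)) on the same input gives the four-byte encoding of the single code point U+1F53B.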
I'll have a look to see if integrating the patch Adrian mentions breaks something, "soon" by some definition of "soon" :D
Regression in pdftotext output in https://bugs.freedesktop.org/attachment.cgi?id=58045

-In our disk model ̃
-𝑃 of the projective plane, we have obtained four bundles of half
+̃ of the projective plane, we have obtained four bundles of half
+In our disk model 𝑃

It is true that the original is not perfect, but at least it is in the correct order; your new output swaps the order of the text (i.e. "In our disk model" has to come before "of the projective plane", not after).
Created attachment 66222
increase tolerance for overlapping glyphs

This patch fixes the regression.
Created attachment 66223
move text to unicode conversion to a separate function

As a result of the first patch, ActualText also needs to convert UTF-16 to UCS-4. This patch (from bug 46603 with a small fix) factors out the duplicated code in ActualText and pdfinfo for converting text to Unicode.
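As a rough illustration of what such a shared conversion function might look like (this is a sketch, not the code in the attachment): a PDF text string is either UTF-16BE introduced by the byte order mark FE FF, or PDFDocEncoding. The name textStringToUcs4 is hypothetical, the PDFDocEncoding branch is simplified to a byte-for-byte copy (a real implementation needs a lookup table), and mapping unpaired surrogates to U+FFFD is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not the function from the attachment.
static std::vector<std::uint32_t> textStringToUcs4(const std::string &s)
{
    std::vector<std::uint32_t> out;
    const bool isUtf16 = s.size() >= 2
        && static_cast<unsigned char>(s[0]) == 0xFE
        && static_cast<unsigned char>(s[1]) == 0xFF;

    if (!isUtf16) {
        // Simplification: treat each byte as its own code point.
        // Real PDFDocEncoding differs from this for some byte values.
        for (unsigned char c : s) {
            out.push_back(c);
        }
        return out;
    }

    // Read the big-endian 16-bit unit starting at byte offset i.
    auto unitAt = [&s](std::size_t i) -> std::uint32_t {
        return (static_cast<std::uint32_t>(static_cast<unsigned char>(s[i])) << 8)
            | static_cast<unsigned char>(s[i + 1]);
    };

    // Start after the 2-byte BOM; a trailing odd byte is ignored.
    for (std::size_t i = 2; i + 1 < s.size(); i += 2) {
        std::uint32_t u = unitAt(i);
        if (u >= 0xD800 && u < 0xDC00 && i + 3 < s.size()) {
            const std::uint32_t lo = unitAt(i + 2);
            if (lo >= 0xDC00 && lo < 0xE000) {
                // Surrogate pair: emit a single UCS-4 value.
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                i += 2; // skip the low surrogate
                continue;
            }
        }
        if (u >= 0xD800 && u < 0xE000) {
            u = 0xFFFD; // assumption: replace unpaired surrogates
        }
        out.push_back(u);
    }
    return out;
}
```

With one function like this, both callers receive UCS-4 values only, so neither has to repeat the surrogate handling.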
I've committed these patches, but only to master (i.e. 0.22.0), since they change pdftotext output for a lot of files (around 400 in my test suite). It is true that most are improvements, but with such a huge change I don't feel like putting it in 0.20.x.

P.S.: My eyes bleed after looking at the diffs of all those pdftotext outputs
(In reply to comment #6)
> I've committed these patches, but only to master (i.e. 0.22.0), since they change
> pdftotext output for a lot of files (around 400 in my test suite). It is true
> that most are improvements, but with such a huge change I don't feel like
> putting it in 0.20.x.
>
> P.S.: My eyes bleed after looking at the diffs of all those pdftotext outputs

Thanks both!