Created attachment 55439 [details] text rendered with wrong background (read after write hazard?) When runnig the SwingSet2 demo located at the URL mentioned below, I get two different artifacts: - The color-chooser demo flickers - Several hard-to-reproduce rendering errors, like shown on the screenshot. Sometimes those errors happen when scrolling on tabs containing text, sometimes I also get similar looking errors for other intel: 2.17.0-341-gc05e90a URL: http://java.sun.com/products/plugin/1.5.0/demos/plugin/jfc/SwingSet2/SwingSet2.html
Created attachment 55440 [details] artifact while scrolling text
Created attachment 55441 [details] spontaneous corruption in java2demo
Created attachment 55442 [details] color_chooser_demo flickering (reproduceable)
forgot to mention, this is on my i945GM.
Yes, lots of little bugs all coming together in one big hit... (Increasing the use of upload buffers which I had neglected to damage along with the extra fallback avoidance code suddenly started causing fallbacks and trying to get damage that wasn't there...) Can you pull the latest code and retest? Quick inspection shows that things look back to normal.
With 2.17.0-378-gb82851e I get still corruptions (even more) and it reliably hangs the GPU after a short time. [ 160.080107] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung [ 160.080119] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state [ 160.081329] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 47204 at 47201, next 47205)
Created attachment 55463 [details] reg dump of wedged gpu
Created attachment 55464 [details] intel error decode log
So I found the second bug that caused random corruption in the icons and images, and is likely the cause of the GPU hang. commit 112b8959261712aaa82f92af0aca4b97fa7c7f03 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Jan 12 01:38:18 2012 +0000 sna: Prevent shrinking a partial buffer stolen for a read If we reuse a partial buffer for a read, we cannot shrink it during upload to the device as we do not track how many bytes we actually need for the read operation. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Unfourtunatly I still sometimes see corruptions like shown in https://bugs.freedesktop.org/attachment.cgi?id=55440. Additionally I now get glyph corruptions in firefox sometimes. I turned of semaphores/rc6/fbc I had enabled previously to make sure its not caused by my settings, but that didn't help.
Created attachment 55479 [details] glyph corruption in firefox
I've seen the glyph cache corruption, but only on gen4 so far -- I have not nailed down precisely which path it is creeping in yet nor why it is so sporadic. I've not seen the damage corruption at all since the first update.
Of course, just the mere act of mentioning it on a bug report, I've now found examples on gen3/4/5. ;-) At least that means it is on a common path!
Found a much nastier damage bug: commit 983b755313df8a0d256c59c32ec4106e35f237aa Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Jan 12 21:05:16 2012 +0000 sna/damage: Fix union of extents with dirty damage but no region By failing to account for certain paths which would create a damage elt without fully initialisating the damage region (only the damage extents), we would later overwrite the damage extents with only the extents for this operation (rather than the union of this operation with the current damage). This fixes a regression from 098592ca5d, (sna: Remove the independent tracking of elts from boxes). Include the associated damage migration debugging code of the callers. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
still get the text-corruption when scrolling, like the one shown in "artifact while scrolling text". I found out the corruption only occurs when dragging the scrollbars, when scrolling using the scroll-wheel everything works as expected.
Can you refresh me on your setup? At the moment, I have just one issue with SwingSet2 and that appears that I get a planeshift in the images (I suspect an XYPixmap issue). I haven't seen that text corruption which looks like the damage issue from so long ago... If everything is working correctly, we should not be hitting the CPU at all for SwingSet2. If you can find the time to grab an --enable-debug=full log, that will help me understand why we have pixmap migration at all. Thanks.
My setup is a Toshiba Tecra A8 running linux-3.1.7, as well as the latest pixmap and whatever Xorg shipped currently in Fedora-16-updates. The "traditional" Java/X11 backend doesn't use Xrender, so for a lot operations (e.g. antialiased text) it falls back to its own software fallbacks - issuing a lot of GetImage/PutImage calls. With debug=full, Xorg unfourtunatly dies very quickly: #0 0x00249416 in __kernel_vsyscall () #1 0x49d4998f in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #2 0x49d4b2d5 in __GI_abort () at abort.c:91 #3 0x49d426a5 in __assert_fail_base (fmt=0x49e82c48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x63c0d6 "(size & (4096-1)) == 0", file=0x63bf5c "kgem.c", line=328, function=0x63de58 "gem_create") at assert.c:94 #4 0x49d42757 in __GI___assert_fail (assertion=0x63c0d6 "(size & (4096-1)) == 0", file=0x63bf5c "kgem.c", line=328, function=0x63de58 "gem_create") at assert.c:103 #5 0x00567f13 in gem_create (fd=11, size=262148) at kgem.c:328 #6 0x0057082f in kgem_create_buffer (kgem=0xa0c6e54, size=140268, flags=1, ret=0xbfde43c8) at kgem.c:3145 #7 0x005adaf6 in sna_write_boxes (sna=0xa0c6d18, dst=0xa341530, dst_bo=0xa3415b8, dst_dx=0, dst_dy=0, src=0xb5b95008, stride=7680, src_dx=0, src_dy=0, box=0xa4c8310, nbox=0) at sna_io.c:475 #8 0x005774ef in sna_pixmap_move_to_gpu (pixmap=0xa341530, flags=2) at sna_accel.c:1855 #9 0x00590dc4 in sna_accel_flush (sna=0xa0c6d18) at sna_accel.c:9904 #10 0x0059180d in sna_accel_block_handler (sna=0xa0c6d18) at sna_accel.c:10157 #11 0x005a5b21 in sna_block_handler (i=0, data=0xa0c6d18, timeout=0xbfde47cc, read_mask=0x8235c20) at sna_driver.c:593 #12 0x0807a438 in BlockHandler (pTimeout=0xbfde47cc, pReadmask=0x8235c20) at dixutils.c:385 #13 0x080a5891 in WaitForSomething (pClientsReady=0xa295c98) at WaitFor.c:219 #14 0x08075eae in Dispatch () at dispatch.c:381 #15 0x0806439a in main (argc=6, argv=0xbfde4934, envp=0xbfde4950) at main.c:287
Created attachment 55586 [details] log (not involved in the crash logged below)
As with quickly hit assertions, the oversight was rather simple. In this case, I had assumed that everybody would have a power-of-two aperture. In your case, some of that space is probably reserved through the PCI config. commit 421ee0bb53321e9fb2f17c850c51b80cc385bf57 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Sun Jan 15 00:46:47 2012 +0000 sna: Do not assume that the mappable aperture size is a power of two And instead derive a power-of-two alignment value for partial buffer sizes from the mappable aperture size and use that during kgem_create_buffer() Reported-by: Clemens Eisserer <linuxhippy@gmail.com> References: https://bugs.freedesktop.org/show_bug.cgi?id=44682 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> As you said, the log is simply GetImage followed by PutImage, nothing too scary. It is a path for which I had hoped to get better kernel support for by now, so I had not put much effort into tuning it. But that is a digression to the issue at hand...
Thanks to your fix, --enable-debug=full now works properly. I switched back from a VT to X at 5169.79 and immediatly started scrolling the text, where I was able to reproduce the artifacts. Note that sometimes the artifacts occur, and sometimes I can scroll for a few seconds with everything looking fine.
file is too large for bugzilla, please download it from here: http://93.83.133.214/Xorg.0.log.old.bz2
Thank you, I've downloaded the log file.
Valgrind found something that looks suspiciously like what you describe: commit 4b5c9affd4968fd5b3d85551cf5a0b4d600a7892 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jan 17 20:23:26 2012 +0000 sna: Restore orginal shadow pointer before uploading CPU damage Detected by valgrind: ==22012== Source and destination overlap in memcpy(0xd101000, 0xd101000, 783360) ==22012== at 0x402A180: memcpy (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==22012== by 0x89BD4ED: memcpy_blt (blt.c:209) ==22012== by 0x89F2921: sna_write_boxes (sna_io.c:364) ==22012== by 0x89CFABF: sna_pixmap_move_to_gpu (sna_accel.c:1900) ==22012== by 0x89F49B0: sna_render_pixmap_bo (sna_render.c:571) ==22012== by 0x8A268CE: gen5_composite_picture (gen5_render.c:1908) ==22012== by 0x8A29B8A: gen5_render_composite (gen5_render.c:2252) ==22012== by 0x89E6762: sna_composite (sna_composite.c:485) ==22012== by 0x21D3C3: damageComposite (damage.c:569) ==22012== by 0x215963: ProcRenderComposite (render.c:728) ==22012== by 0x159FB0: Dispatch (dispatch.c:437) ==22012== by 0x1491D9: main (main.c:287) ==22012== Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Unfourtunatly no change - I still can easily reproduce the artifacts when scrolling. I also noticed some odd rendered button-borders (black), which only sometimes crop up when the applet initializes (I have the java logging console enabled, thats the window on top) - and are drawn fine when causing a repaint.
Created attachment 55703 [details] black border arround buttons after initialization
Created attachment 55704 [details] artifacts on desktop, when resizing misbehaving window I was able to even trigger a rendering corruption on my desktop by resizing a misbehaving window. The rendering-errors (including the black button borders) seem to only happen when resizing the window - when moving the mouse over the buttons to highlight/dehighlight them everything is painted fine.
ARGH.
Was able to reproduce the problem also with Java2Demo, where it occurs in some animations, sometimes rendering is fast and no artifacts occur, sometimes it degrades to a state where it performs very slow and rendering artifacts occur. I digged a bit in Xorg's log: [ 2516.044] region_inplace: (32x1), inplace? 0 [ 2516.044] sna_put_zpixmap_blt: marking damage [ 2516.045] _sna_damage_subtract([[(0, 0), (722, 544)]: [(0, 0), (722, 544) x 929: ((0, 0), (722, 101)), ((0, 101), (569, 128)), ((704, 101), (722, 128)), ((0, 128), (192, 130)), ((288, 128), (569, 130)), ((704, 128), (722, 130)), ((0, 130), (569, 160)), ((704, 130), (722, 160)), ((21, 160), (463, 173)), ((480, 160), (569, 173)), ((704, 160), (722, 173)), ((21, 173), (463, 201)), ((480, 173), (722, 201)), ((21, 201), (82, 202)), ((83, 201), (463, 202)), ((480, 201), (722, 202)), ((21, 202), (79, 203)), ((87, 202), (463, 203)), ...] -[ ...]] - [(294, 403), (326, 404)])... [ 2516.234] _sna_damage_create_elt: prev=(remain 7), count=1 [ 2516.234] = [[(0, 0), (722, 544)]: [(0, 0), (722, 544) x 929: (( Can it be _sna_damage_subtract really took 200ms? Maybe this is damage related after all?
Ok, I've hit the issue. And it reproduced first time (xfce4-session + SwingSet2), so hopefully I'll be able to start narrowing down the suspect code paths.
fwiw, now I have the recipe, it also fails on SNB. So that rules out the gen3 specific paths and the tiny CPU paths.
(In reply to comment #28) > Was able to reproduce the problem also with Java2Demo, where it occurs in some > animations, sometimes rendering is fast and no artifacts occur, sometimes it > degrades to a state where it performs very slow and rendering artifacts occur. > > I digged a bit in Xorg's log: > > [ 2516.044] region_inplace: (32x1), inplace? 0 > [ 2516.044] sna_put_zpixmap_blt: marking damage > [ 2516.045] _sna_damage_subtract([[(0, 0), (722, 544)]: [(0, 0), (722, 544) x > 929: ((0, 0), (722, 101)), ((0, 101), (569, 128)), ((704, 101), (722, 128)), > ((0, 128), (192, 130)), ((288, 128), (569, 130)), ((704, 128), (722, 130)), > ((0, 130), (569, 160)), ((704, 130), (722, 160)), ((21, 160), (463, 173)), > ((480, 160), (569, 173)), ((704, 160), (722, 173)), ((21, 173), (463, 201)), > ((480, 173), (722, 201)), ((21, 201), (82, 202)), ((83, 201), (463, 202)), > ((480, 201), (722, 202)), ((21, 202), (79, 203)), ((87, 202), (463, 203)), ...] > -[ ...]] - [(294, 403), (326, 404)])... > [ 2516.234] _sna_damage_create_elt: prev=(remain 7), count=1 > [ 2516.234] = [[(0, 0), (722, 544)]: [(0, 0), (722, 544) x 929: (( > > Can it be _sna_damage_subtract really took 200ms? > Maybe this is damage related after all? No, it should not have taken 0.2s, the whole point of the sna_damage layer is to try and avoid having to compute subtractions and instead stores the damage region plus list of rectangles to remove later if necessary. In this case, it should have just checked that the region to remove overlaps with the dirty region (by checking the extents against each other) and then copy the actual box. The 0.2s could have been a side-effect of logging. I hope it was...
So the actual bug turns out to have been quite trivial and embarrassing: commit 35f81005f91d294e61bb4ced7cbddd1a76ccb324 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Jan 19 00:41:08 2012 +0000 sna/damage: Always mark the damage as dirty when recording new boxes A few of the create_elts() routines missed marking the damage as dirty so that if only part of the emebbed box was used (i.e. the damage contained less than 8 rectangles that needed to included in the damage region) then those were being ignored during migration and testing. Reported-by: Clemens Eisserer <linuxhippy@gmail.com> References: https://bugs.freedesktop.org/show_bug.cgi?id=44682 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> I do hope that's the last of the damage tracking bugs for a while!
I'm going to guess the missing border is likely the same damage problem. However, the desktop corruption looks like it might involve tiling, and I fixed a related problem earlier: commit ab387a89cf64ace9f7d7fa52b7743e78b625f47d Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Jan 18 18:32:51 2012 +0000 sna: Update bo->tiling during search_linear_cache search_linear_cache() was updated to track the first good match whilst it continued to search for a better match. This resulted in the first good bo being modified and a record of those modifications lost, in particular the change in tiling. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
I can confirm the artifacts are gone :) Thanks!
Moving on to the next bug... Thanks!
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.