Created attachment 90327 [details]
dockapp with black squares

Hardware: Thinkpad T420s, Sandybridge CPU/graphics. LVDS + HDMI (through DP -> HDMI dumb adapter); LVDS = 1600x900, HDMI = 1920x1200, configured above LVDS.
Driver: xf86-video-intel-2.99.906, SNA accel
Xorg 1.14.3

[622671.551] (**) intel(0): Framebuffer tiled
[622671.551] (**) intel(0): Pixmaps tiled
[622671.551] (**) intel(0): "Tear free" disabled
[622671.551] (**) intel(0): Forcing per-crtc-pixmaps? no
[622671.552] (II) intel(0): SNA initialized with Sandybridge (gen6, gt2) backend
[622671.552] (==) intel(0): Backing store disabled
[622671.552] (==) intel(0): Silken mouse enabled
[622671.552] (II) intel(0): HW Cursor enabled

I'm not sure if this is the only time this happens, but after running xlock, a particular dockapp has black squares in it (see screenshot). This happens every time, but the precise corruption pattern differs from time to time. Seemingly only that dockapp is affected, although I've seen it happen elsewhere in the past (but I would just eventually notice it, whereas here it's 100% reproducible).

I'm using plain windowmaker-0.92, no funny compositor stuff going on. Dragging a window over the dockapp makes it redraw properly. The dockapp is wmauda-0.9.

In case it matters, I'm using xlockmore-5.43. The screen is black when it asks for the password.
Nothing obvious. It could be a missing flush or an invalid command. If you haven't already, please do test with the latest version from git and attach your Xorg.0.log. I will set about trying to reproduce tomorrow.
Oh wow, lots of commits since .906 in git! I'm more used to nouveau's DDX change rate :) I'll be able to test this out tomorrow, will report back then. BTW, it should be noted that the screenshot that I made was using 'xwd' applied to the relevant X window, so the corruption existed in whatever buffer it used. (I'm not that familiar with these things, perhaps this was fully expected, but thought I'd mention it.)
Created attachment 90365 [details]
X log with git intel ddx

Hrmph. Well, try as I might, I was unable to repro this with the latest code. It also didn't want to repro easily with .906 -- turns out if I run xlock from the second workspace rather than the first it does repro (?? why would anything care or even be able to detect it? they both just had an identically-configured aterm on them). But with latest git (46256fa5a0ca) I couldn't get it to have the little black squares. I'll switch over to it as soon as is convenient, and see if I get any more artifacts.

I'm attaching a Xorg log with the git DDX code for posterity, but I guess this bug can be closed. Sorry for the trouble, and thanks for improving the intel DDX :) [But don't worry, I'll file more issues.]
Ok, let's attack this again when/if it reappears.
Created attachment 91476 [details]
Xorg.0.log with .907

Well, I just tested out .907, and while the corruption is harder to trigger, I can still do it. Just not as deterministically. Happy to do any debugging/tracing/whatever, just provide instructions.
Still using a 3.10? Please also try updating your kernel. Is the corruption appearing the same, does it occur over a larger area? (is it possible to grab a clear screenshot or photograph?)
The other task is to make sure you have a reasonable success rate at reproducing the bug. My suspicion is that it is a pipeline flush issue like bug 68410, but if we cannot be sure of reproducing the corruption, we cannot be sure whether any patch fixes it. (Also, if I am able to reproduce it myself, that speeds up debugging immensely.)
The corruption I've seen so far is limited to the wmauda app. I hope that it's not something that it's doing wrong -- but I don't get the same corruption with nouveau on my desktop and everything else the same (except only one screen, whereas I had the corruption with 2 screens on the laptop). My setup is windowmaker + wmauda (running, so you need audacious as well) + a few workspaces. Now, in one workspace, place a window over the wmauda app, so that it is hidden. Have a bunch more windows open in the various workspaces too, I think that helps reproduce. Then flip between the workspaces. After a flip from the workspace with the wmauda app hidden by the other window to another where it is visible, it will show the black squares over the volume bar (see the first attachment) some of the time (like 1/2 to 1/3rd of the time). It used to be 100% deterministic, but not anymore. In the meanwhile I'll grab 3.13-rc6. Or should I look at drm-next or some intel-specific tree?
The obvious corruption bug I'm thinking of in the kernel was definitely fixed in 3.11 (but I'm never sure which backports it ended up in). As always, the later the better when testing kernels...
Running with 3.13-rc7+ (git sha ef350bb7c5e), and xf86-video-intel 2.99.907 (same as before), I can still reproduce the issue. Although the reproducibility has decreased... now there can be long spurts of "it happens repeatedly" and "it doesn't happen at all". But it definitely does still happen :(
There was a dangling pointer fixed in

commit 5f3ee21a307a4ff4db189bd53e58a70ec01ee6bc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 17 08:40:34 2014 +0000

    sna: Nullify pixmap->devPrivate.ptr after promoting CPU bo to GPU

which might be relevant?
It's invincible. Same deal as before. I just tested it with 32010ed86 ("sna: Assert that the fill box is within bounds"), 3.13-rc7. From the log, just to be sure it loaded the right version: [ 117.088] (II) intel(0): SNA compiled from 2.99.907-48-g32010ed
Perhaps

commit e916c922ce3913712cd8a9b76ab037840b7f07f1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 23 17:30:29 2014 +0000

    sna: Avoid erroneous discarding operations for partial composites

    Composite operations were presumed to cover their entire width x height
    area. However, a few paths submit boxes that do not cover the clip
    region and so the optimisation made during prepare to discard completely
    overwritten data is incorrect (and leads to corruption - stale data is
    seen which the client expected to have been overdrawn). So along these
    more unusual paths, we must add a flag to prevent the overzealous
    discard.

    Notably, xfce4 triggers this as it uses a lot of unantialiased
    trapezoids in its theme drawing.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=69528
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

?
Spotted one last issue with CPU damage for partial redraws before tagging 2.99.908. Maybe that helps?
That last little fix before .908 opened a can of worms. Please, please try testing xf86-video-intel.git master.
[ 139.343] (II) intel(0): SNA compiled from 2.99.909-7-g1cbc59a Still there. If anything, it was easier to hit it. But that could just be coincidence. I did glance at wmauda source -- it's pretty simple. Not 100% sure it's the latest, but I doubt that this aspect has changed much: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/saucy/wmauda/saucy/view/head:/wmauda.c The issue where it's most visible is the volslider. Not 100% sure whether other bits are affected -- they're generally much darker (or black) so little black squares aren't as apparent. Looks like it's just blitting some pixmap into the gc -- nothing fancy. I think this is the pixmap in case it matters: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/saucy/wmauda/saucy/view/head:/wmauda.xpm As a reminder, I'm not actually touching the volslider when this happens (in fact, touching it fixes everything since the affected areas are redrawn).
I've confirmed that other parts of the window are also affected (e.g. the play/pause/etc buttons). Based on no knowledge of the hw or driver structure (aka feel free to ignore), I'd guess this has something to do with either drawing to off-screen pixmaps or saving of pixmaps when they go off-screen or reloading them. (e.g. what happens if someone's drawing to a pixmap that's off-screen and then it gets shown, while the drawing is happening, etc. Perhaps that's a super-common case that you've thought of already though.) The only times I've seen this happen is if the window is hidden by something (e.g. xlock, or another window), and then becomes visible (e.g. as a result of a desktop switch).
Mind updating the Xorg.0.log? (and please, please make sure you are using a post-3.10 kernel :).
Created attachment 93505 [details]
Xorg.0.log with .909-1cbc59a

Still running 3.13-rc7. Log attached.
Do you have any recollection of when you first saw this? Do you mind trying 2.21.15 to see if that is also affected? If you feel confident, perhaps bisecting?
2.21.15 is fine (looks like it also defaults to UXA, but I turned that off and enabled SNA :) ). I did the bisect, this is the result:

82e6d41c2f4f343bd1854d3d8ee4b624b5d68971 is the first bad commit
commit 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Oct 31 13:35:59 2013 +0000

    sna/gen6: Tweak flush around CC state changes

    In order to fix some font corruption, it appears that we need an extra
    flush in the Sandybridge pipeline when we change the CC stage and the
    render cache is dirty. We previously triggered a full pipeline stall for
    this case.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

:040000 040000 0dd3c4fbc25c10ba3baf194321493a32234c31d6 c7d59848db85bec0600d0731f4b548ad37c90ea3 M	src

I'm going to go double-check the bisection result, but I was usually able to repro corruption pretty quickly for the 'bad' commits. Here is the full log btw:

# bad: [7468a6b740af14d95e8f9bacd2e352ec98a9acf2] 2.99.906 snapshot
# good: [f57ce6ef9ca735d5cb428b2f12f1f9413a70506a] 2.21.15 release
git bisect start '2.99.906' 'HEAD'
# good: [394b2ce51d8378893f14896e600713544d473925] sna: Add a few more DBG to track refcnts
git bisect good 394b2ce51d8378893f14896e600713544d473925
# good: [f0da01aa907d488ae32dfda206ea8a66564bc430] sna: Remove stale mappings when replacing GPU bo
git bisect good f0da01aa907d488ae32dfda206ea8a66564bc430
# bad: [8f6e227ba8127a2ca034271f2a660c24abbe056f] sna: Apply the BLT source offset for individual copies
git bisect bad 8f6e227ba8127a2ca034271f2a660c24abbe056f
# good: [c6b0e3fe0c299488932ba0392847f1faf298d079] sna: Detect and handle mi recursion
git bisect good c6b0e3fe0c299488932ba0392847f1faf298d079
# bad: [1c34ea0894e42bf344b0edad62cf2e142840af88] sna: Tweak estimate of maximum usable aperture for strict fencing
git bisect bad 1c34ea0894e42bf344b0edad62cf2e142840af88
# bad: [d580a30aaf97687f9669ea72fbc3310c2fea26f1] sna/gen7: Flush render cache when changing CC state
git bisect bad d580a30aaf97687f9669ea72fbc3310c2fea26f1
# good: [ed16e34c00d5eb5ca4ec643f66fedbf1a5112e90] sna: Allow limited recursion within sigtrapped routines
git bisect good ed16e34c00d5eb5ca4ec643f66fedbf1a5112e90
# bad: [82e6d41c2f4f343bd1854d3d8ee4b624b5d68971] sna/gen6: Tweak flush around CC state changes
git bisect bad 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971
# good: [5cdc2bbc9c66d4c8c6fdb1b552c32177d070bf7b] sna: Tweak deletion of used buffers
git bisect good 5cdc2bbc9c66d4c8c6fdb1b552c32177d070bf7b
# first bad commit: [82e6d41c2f4f343bd1854d3d8ee4b624b5d68971] sna/gen6: Tweak flush around CC state changes
OK, I'm _pretty_ confident in that bisection result. 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971 is definitely bad. I have tried several times and couldn't get 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971^ (i.e. 5cdc2bbc9c66d4c8c6fdb1b552c32177d070bf7b) to show the issue.
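As a rough aid for judging how many clean attempts make a "good" verdict trustworthy when the bug only shows up some of the time (the thread mentions roughly 1/2 to 1/3 of workspace flips), here is a minimal sketch. It assumes every attempt hits the bug independently with the same probability p, which is only an approximation given the "long spurts" reported earlier. The function name `clean_runs_needed` and the 5% threshold are illustrative choices, not anything taken from the driver or from git.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Smallest number of consecutive clean attempts n such that the chance of a
 * bad commit passing all of them, (1 - p)^n, is at most max_miss.
 * p is the assumed per-attempt chance of seeing the corruption.
 */
int clean_runs_needed(double p, double max_miss)
{
	assert(p > 0.0 && p <= 1.0);
	assert(max_miss > 0.0 && max_miss < 1.0);

	double miss = 1.0;	/* chance that a bad commit still looks good */
	int n = 0;

	while (miss > max_miss) {
		miss *= 1.0 - p;
		n++;
	}
	return n;
}

/* Print the attempt counts for the two hit rates mentioned in the thread. */
void print_examples(void)
{
	printf("p = 1/2: %d clean attempts\n", clean_runs_needed(0.5, 0.05));
	printf("p = 1/3: %d clean attempts\n", clean_runs_needed(1.0 / 3.0, 0.05));
}
```

Under these assumptions, a 1-in-2 hit rate needs 5 clean attempts and a 1-in-3 hit rate needs 8 before the chance of mislabelling a bad commit as good drops to 5% or less.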
Oh, and a couple more things... (a) I pulled the very latest and greatest: [ 3005.798] (II) intel(0): SNA compiled from 2.99.909-17-g823382d Still repro's the issue. (b) The bisect and (a) above were done with a kernel booted with i915.i915_enable_rc6=0 (from the other bug). Not sure if this affects anything.
Ok, that looks like a plausible commit for that type of error. Can you please try:

diff --git a/src/sna/gen6_render.c b/src/sna/gen6_render.c
index aadc6f7..64707ce 100644
--- a/src/sna/gen6_render.c
+++ b/src/sna/gen6_render.c
@@ -870,15 +870,19 @@ gen6_emit_state(struct sna *sna,
 	assert(op->dst.bo->exec);
-	need_flush =
-		gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags)) &&
-		wm_binding_table & 1;
+	need_stall = need_flush = false;
+	if (gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags))) {
+		if (wm_binding_table & 1)
+			need_flush = true;
+		else
+			need_stall = true;
+	}
 	gen6_emit_sampler(sna, GEN6_SAMPLER(op->u.gen6.flags));
 	gen6_emit_sf(sna, GEN6_VERTEX(op->u.gen6.flags) >> 2);
 	gen6_emit_wm(sna, GEN6_KERNEL(op->u.gen6.flags), GEN6_VERTEX(op->u.gen6.flags) >> 2);
 	gen6_emit_vertex_elements(sna, op);
-	need_stall = gen6_emit_binding_table(sna, wm_binding_table & ~1);
+	need_stall |= gen6_emit_binding_table(sna, wm_binding_table & ~1);
 	if (gen6_emit_drawing_rectangle(sna, op))
 		need_stall = false;
 	if (need_flush || kgem_bo_is_dirty(op->src.bo) || kgem_bo_is_dirty(op->mask.bo)) {

which will restore the stall we had previously in the cases where we don't emit the flush.
That didn't work. I won't pretend to even know what a flush or a stall is, however by pure code analysis, I think your patch isn't restoring the old behaviour. I changed it to

-	need_flush =
-		gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags)) &&
-		wm_binding_table & 1;
+	need_stall = need_flush = false;
+	if (gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags))) {
+		if (wm_binding_table & 1)
+			need_flush = true;
+	}
+	else
+		need_stall = wm_binding_table & 1;

and that appears to work fine. Note that before, need_stall was always set, and if gen6_emit_cc(), it was reset to false. I think a cleaner way to write this might be

	need_stall = wm_binding_table & 1;
	if (gen6_emit_cc(...))
		need_flush = need_stall;

[and then the flush logic clears need_stall, which is nice]
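To make the difference between the variants easier to see, here is a small stand-alone model of just the flag computation. It is a sketch, not driver code: `cc_changed` stands in for the return value of gen6_emit_cc(), and `dirty` stands in for `wm_binding_table & 1`, which I read as a "target is dirty" bit based on the wording of the final commit message (an assumption). The later `gen6_emit_binding_table()` and `gen6_emit_drawing_rectangle()` adjustments are deliberately left out, and the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* The two flags computed at the top of the state emission. */
struct flags {
	bool need_flush;
	bool need_stall;
};

/* Variant from the test patch: stall only when CC changed and target is clean. */
struct flags proposed_patch(bool cc_changed, bool dirty)
{
	struct flags f = { false, false };

	if (cc_changed) {
		if (dirty)
			f.need_flush = true;
		else
			f.need_stall = true;
	}
	return f;
}

/* Variant reported to work: stall when CC did not change but target is dirty. */
struct flags working_variant(bool cc_changed, bool dirty)
{
	struct flags f = { false, false };

	if (cc_changed) {
		if (dirty)
			f.need_flush = true;
	} else {
		f.need_stall = dirty;
	}
	return f;
}

/* The suggested "cleaner" spelling of the working variant. */
struct flags cleaner_variant(bool cc_changed, bool dirty)
{
	struct flags f = { false, dirty };

	if (cc_changed)
		f.need_flush = f.need_stall;
	return f;
}

/* Assumption taken from the comment: emitting the flush clears the stall. */
bool stalls_after_flush(struct flags f)
{
	return f.need_stall && !f.need_flush;
}

/* Print all four input combinations and check the two working forms agree. */
void print_table(void)
{
	printf("cc dirty | patch(flush,stall) | working(flush,stall)\n");
	for (int cc = 0; cc <= 1; cc++) {
		for (int dirty = 0; dirty <= 1; dirty++) {
			struct flags p = proposed_patch(cc, dirty);
			struct flags w = working_variant(cc, dirty);
			struct flags c = cleaner_variant(cc, dirty);

			assert(w.need_flush == c.need_flush);
			assert(stalls_after_flush(w) == stalls_after_flush(c));
			printf("%d  %d     | %d,%d                | %d,%d\n",
			       cc, dirty, p.need_flush, stalls_after_flush(p),
			       w.need_flush, stalls_after_flush(w));
		}
	}
}
```

In this model the two variants differ in exactly two rows: with CC unchanged and a dirty target, only the working variant stalls; with CC changed and a clean target, only the test patch stalls. The working variant and its cleaner spelling agree on every row once the flush is allowed to clear the stall.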
commit 37d8566ee78c67647b159a96ddb2675d1506b967
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Feb 9 12:28:27 2014 +0000

    sna/gen6: Restore stall dropped when not flushing instead

    commit 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Thu Oct 31 13:35:59 2013 +0000

        sna/gen6: Tweak flush around CC state changes

    Replaced the pipeline stall with a flush - but only when the target was
    dirty. The missing stall however seems to be required as well.

    v2: Actually emit the stall for all CC state changes [Ilia Mirkin]

    Reported-by: Ilia Mirkin <imirkin@alum.mit.edu>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=72375
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Double-checked that the committed version also works. Well that was a fun ride. Hopefully never again :) Thanks for sticking with it!