72375 – Corruption in some windows after returning from xlock

Bug 72375 - Corruption in some windows after returning from xlock

Summary: Corruption in some windows after returning from xlock

Status:	VERIFIED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel (show other bugs)
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-12-05 22:30 UTC by Ilia Mirkin
Modified:	2014-02-10 09:08 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
dockapp with black squares (1.48 KB, image/png) 2013-12-05 22:30 UTC, Ilia Mirkin	no flags	Details
X log with git intel ddx (21.67 KB, text/plain) 2013-12-06 15:12 UTC, Ilia Mirkin	no flags	Details
Xorg.0.log with .907 (18.09 KB, text/plain) 2014-01-03 23:45 UTC, Ilia Mirkin	no flags	Details
Xorg.0.log with .909-1cbc59a (19.09 KB, text/plain) 2014-02-06 04:23 UTC, Ilia Mirkin	no flags	Details
View All

Description Ilia Mirkin 2013-12-05 22:30:21 UTC

Created attachment 90327 [details]
dockapp with black squares

Hardware: Thinkpad T420s, Sandybridge CPU/graphics. LVDS + HDMI (through DP -> HDMI dumb adapter); LVDS = 1600x900, HDMI = 1920x1200, configured above LVDS.
Driver: xf86-video-intel-2.99.906, SNA accel
Xorg 1.14.3

[622671.551] (**) intel(0): Framebuffer tiled
[622671.551] (**) intel(0): Pixmaps tiled
[622671.551] (**) intel(0): "Tear free" disabled
[622671.551] (**) intel(0): Forcing per-crtc-pixmaps? no
[622671.552] (II) intel(0): SNA initialized with Sandybridge (gen6, gt2) backend
[622671.552] (==) intel(0): Backing store disabled
[622671.552] (==) intel(0): Silken mouse enabled
[622671.552] (II) intel(0): HW Cursor enabled

I'm not sure if this is the only time this happens, but after running xlock, a particular dockapp has black squares in it (see screenshot). This happens every time, but the precise corruption pattern differs from time to time. Seemingly only that dockapp is affected, although I've seen it happen elsewhere in the past (but I would just eventually notice it, whereas here it's 100% reproducible). I'm using plain windowmaker-0.92, no funny compositor stuff going on. Dragging a window over the dockapp makes it redraw properly. The dockapp is wmauda-0.9.

In case it matters, I'm using xlockmore-5.43. The screen is black when it asks for the password.

Comment 1 Chris Wilson 2013-12-05 22:59:49 UTC

Nothing obvious. It could be a missing flush or an invalid command. If you haven't already, please do test with the latest version from git and attach your Xorg.0.log. I will set about trying to reproduce tomorrow.

Comment 2 Ilia Mirkin 2013-12-06 00:25:05 UTC

Oh wow, lots of commits since .906 in git! I'm more used to nouveau's DDX change rate :) I'll be able to test this out tomorrow, will report back then.

BTW, it should be noted that the screenshot that I made was using 'xwd' applied to the relevant X window, so the corruption existed in whatever buffer it used. (I'm not that familiar with these things, perhaps this was fully expected, but thought I'd mention it.)

Comment 3 Ilia Mirkin 2013-12-06 15:12:43 UTC

Created attachment 90365 [details]
X log with git intel ddx

Hrmph. Well, try as I might, I was unable to repro this with the latest code. It also didn't want to repro easily with .906 -- turns out if I run xlock from the second workspace rather than the first it does repro (?? why would anything care or even be able to detect it? they both just had an identically-configured aterm on them). But with latest git (46256fa5a0ca) I couldn't get it to have the little black squares. I'll switch over to it as soon as is convenient, and see if I get any more artifacts.

I'm attaching a Xorg log with the git DDX code for posterity, but I guess this bug can be closed. Sorry for the trouble, and thanks for improving the intel DDX :) [But don't worry, I'll file more issues.]

Comment 4 Chris Wilson 2013-12-06 17:27:29 UTC

Ok, let's attack this again when/if it reappears.

Comment 5 Ilia Mirkin 2014-01-03 23:45:01 UTC

Created attachment 91476 [details]
Xorg.0.log with .907

Well, I just tested out .907, and while the corruption is harder to trigger, I can still do it. Just not as deterministically. Happy to do any debugging/tracing/whatever, just provide instructions.

Comment 6 Chris Wilson 2014-01-04 09:46:20 UTC

Still using a 3.10? Please also try updating your kernel. Is the corruption appearing the same, does it occur over a larger area? (is it possible to grab a clear screenshot or photograph?)

Comment 7 Chris Wilson 2014-01-04 11:39:17 UTC

The other task is to make sure you have a reasonable success rate at reproducing the bug. My suspicion is that it is a pipeline flush issue like bug 68410, but if we can not be sure of reproducing the corruption, we can be sure if any patch fixes it. (Also if I am able to reproduce it myself, that speeds up debugging immensely.)

Comment 8 Ilia Mirkin 2014-01-04 19:52:48 UTC

The corruption I've seen so far is limited to the wmauda app. I hope that it's not something that it's doing wrong -- but I don't get the same corruption with nouveau on my desktop and everything else the same (except only one screen, whereas I had the corruption with 2 screens on the laptop).

My setup is windowmaker + wmauda (running, so you need audacious as well) + a few workspaces. Now, in one workspace, place a window over the wmauda app, so that it is hidden. Have a bunch more windows open in the various workspaces too, I think that helps reproduce. Then flip between the workspaces. After a flip from the workspace with the wmauda app hidden by the other window to another where it is visible, it will show the black squares over the volume bar (see the first attachment) some of the time (like 1/2 to 1/3rd of the time). It used to be 100% deterministic, but not anymore.

In the meanwhile I'll grab 3.13-rc6. Or should I look at drm-next or some intel-specific tree?

Comment 9 Chris Wilson 2014-01-06 14:46:09 UTC

The obvious corruption bug I'm thinking of in the kernel was definitely fixed in 3.11 (but I'm never sure which backports it ended up in). As always, the later the better when testing kernels...

Comment 10 Ilia Mirkin 2014-01-07 19:49:49 UTC

Running with 3.13-rc7+ (git sha ef350bb7c5e), and xf86-video-intel 2.99.907 (same as before), I can still reproduce the issue. Although the reproducibility has decreased... now there can be long spurts of "it happens repeatedly" and "it doesn't happen at all". But it definitely does still happen :(

Comment 11 Chris Wilson 2014-01-17 11:01:21 UTC

There was a dangling pointer fixed in 

commit 5f3ee21a307a4ff4db189bd53e58a70ec01ee6bc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 17 08:40:34 2014 +0000

    sna: Nullify pixmap->devPrivate.ptr after promoting CPU bo to GPU

which might be relevant?

Comment 12 Ilia Mirkin 2014-01-22 17:56:04 UTC

It's invincible. Same deal as before. I just tested it with 32010ed86 ("sna: Assert that the fill box is within bounds"), 3.13-rc7. From the log, just to be sure it loaded the right version:

[   117.088] (II) intel(0): SNA compiled from 2.99.907-48-g32010ed

Comment 13 Chris Wilson 2014-01-23 17:40:05 UTC

Perhaps

commit e916c922ce3913712cd8a9b76ab037840b7f07f1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 23 17:30:29 2014 +0000

    sna: Avoid erroneous discarding operations for partial composites
    
    Composite operations were presumed to cover their entire width x height
    area. However, a few paths submit boxes that do not cover the clip
    region and so the optimisation made during prepare to discard completely
    overwritten data is incorrect (and leads to corruption - stale data is
    seen which the client expected to have been overdrawn). So along these
    more unusual paths, we must add a flag to prevent the overzealous
    discard. Notably, xfce4 triggers this as it uses a lot of unantialiased
    trapezoids in its theme drawing.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=69528
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

?

Comment 14 Chris Wilson 2014-02-01 11:17:39 UTC

Spotted one last issue with CPU damage for partial redraws before tagging 2.99.908. Maybe that helps?

Comment 15 Chris Wilson 2014-02-01 23:06:56 UTC

That last little fix before .908 opened a can of worms. Please, please try testing xf86-video-intel.git master.

Comment 16 Ilia Mirkin 2014-02-03 19:14:40 UTC

[   139.343] (II) intel(0): SNA compiled from 2.99.909-7-g1cbc59a

Still there. If anything, it was easier to hit it. But that could just be coincidence.

I did glance at wmauda source -- it's pretty simple. Not 100% sure it's the latest, but I doubt that this aspect has changed much:

http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/saucy/wmauda/saucy/view/head:/wmauda.c

The issue where it's most visible is the volslider. Not 100% sure whether other bits are affected -- they're generally much darker (or black) so little black squares aren't as apparent. Looks like it's just blitting some pixmap into the gc -- nothing fancy. I think this is the pixmap in case it matters:

http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/saucy/wmauda/saucy/view/head:/wmauda.xpm

As a reminder, I'm not actually touching the volslider when this happens (in fact, touching it fixes everything since the affected areas are redrawn).

Comment 17 Ilia Mirkin 2014-02-03 19:48:12 UTC

I've confirmed that other parts of the window are also affected (e.g. the play/pause/etc buttons).

Based on no knowledge of the hw or driver structure (aka feel free to ignore), I'd guess this has something to do with either drawing to off-screen pixmaps or saving of pixmaps when they go off-screen or reloading them. (e.g. what happens if someone's drawing to a pixmap that's off-screen and then it gets shown, while the drawing is happening, etc. Perhaps that's a super-common case that you've thought of already though.) The only times I've seen this happen is if the window is hidden by something (e.g. xlock, or another window), and then becomes visible (e.g. as a result of a desktop switch).

Comment 18 Chris Wilson 2014-02-05 10:53:51 UTC

Mind updating the Xorg.0.log? (and please, please make sure you are using a post-3.10 kernel :).

Comment 19 Ilia Mirkin 2014-02-06 04:23:58 UTC

Created attachment 93505 [details]
Xorg.0.log with .909-1cbc59a

Still running 3.13-rc7. Log attached.

Comment 20 Chris Wilson 2014-02-06 10:22:41 UTC

Do you have any recollection of when you first saw this? Do you mind trying 2.21.15 to see if that is also affected? If you feel confident, perhaps bisecting?

Comment 21 Ilia Mirkin 2014-02-09 07:32:51 UTC

2.21.15 is fine (looks like it also defaults to UXA, but I turned that off and enabled SNA :) ). I did the bisect, this is the result:

82e6d41c2f4f343bd1854d3d8ee4b624b5d68971 is the first bad commit
commit 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Oct 31 13:35:59 2013 +0000

    sna/gen6: Tweak flush around CC state changes
    
    In order to fix some font corruption, it appears that we need an extra
    flush in the Sandybridge pipeline when we change the CC stage and the
    render cache is dirty. We previously triggered a full pipeline stall
    for this case.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

:040000 040000 0dd3c4fbc25c10ba3baf194321493a32234c31d6 c7d59848db85bec0600d0731f4b548ad37c90ea3 M      src

I'm going to go double-check the bisection result, but I was usually able to repro corruption pretty quickly for the 'bad' commits. Here is the full log btw:

# bad: [7468a6b740af14d95e8f9bacd2e352ec98a9acf2] 2.99.906 snapshot
# good: [f57ce6ef9ca735d5cb428b2f12f1f9413a70506a] 2.21.15 release
git bisect start '2.99.906' 'HEAD'
# good: [394b2ce51d8378893f14896e600713544d473925] sna: Add a few more DBG to track refcnts
git bisect good 394b2ce51d8378893f14896e600713544d473925
# good: [f0da01aa907d488ae32dfda206ea8a66564bc430] sna: Remove stale mappings when replacing GPU bo
git bisect good f0da01aa907d488ae32dfda206ea8a66564bc430
# bad: [8f6e227ba8127a2ca034271f2a660c24abbe056f] sna: Apply the BLT source offset for individual copies
git bisect bad 8f6e227ba8127a2ca034271f2a660c24abbe056f
# good: [c6b0e3fe0c299488932ba0392847f1faf298d079] sna: Detect and handle mi recursion
git bisect good c6b0e3fe0c299488932ba0392847f1faf298d079
# bad: [1c34ea0894e42bf344b0edad62cf2e142840af88] sna: Tweak estimate of maximum usable aperture for strict fencing
git bisect bad 1c34ea0894e42bf344b0edad62cf2e142840af88
# bad: [d580a30aaf97687f9669ea72fbc3310c2fea26f1] sna/gen7: Flush render cache when changing CC state
git bisect bad d580a30aaf97687f9669ea72fbc3310c2fea26f1
# good: [ed16e34c00d5eb5ca4ec643f66fedbf1a5112e90] sna: Allow limited recursion within sigtrapped routines
git bisect good ed16e34c00d5eb5ca4ec643f66fedbf1a5112e90
# bad: [82e6d41c2f4f343bd1854d3d8ee4b624b5d68971] sna/gen6: Tweak flush around CC state changes
git bisect bad 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971
# good: [5cdc2bbc9c66d4c8c6fdb1b552c32177d070bf7b] sna: Tweak deletion of used buffers
git bisect good 5cdc2bbc9c66d4c8c6fdb1b552c32177d070bf7b
# first bad commit: [82e6d41c2f4f343bd1854d3d8ee4b624b5d68971] sna/gen6: Tweak flush around CC state changes

Comment 22 Ilia Mirkin 2014-02-09 07:38:51 UTC

OK, I'm _pretty_ confident in that bisection result. 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971 is definitely bad. I have tried several times and couldn't get 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971^ (i.e. 5cdc2bbc9c66d4c8c6fdb1b552c32177d070bf7b) to show the issue.

Comment 23 Ilia Mirkin 2014-02-09 07:44:30 UTC

Oh, and a couple more things...

(a) I pulled the very latest and greatest:

[  3005.798] (II) intel(0): SNA compiled from 2.99.909-17-g823382d

Still repro's the issue.

(b) The bisect and (a) above were done with a kernel booted with i915.i915_enable_rc6=0 (from the other bug). Not sure if this affects anything.

Comment 24 Chris Wilson 2014-02-09 12:30:07 UTC

Ok, that looks like a plausible commit for that type of error.

Can you please try:

diff --git a/src/sna/gen6_render.c b/src/sna/gen6_render.c
index aadc6f7..64707ce 100644
--- a/src/sna/gen6_render.c
+++ b/src/sna/gen6_render.c
@@ -870,15 +870,19 @@ gen6_emit_state(struct sna *sna,
 
        assert(op->dst.bo->exec);
 
-       need_flush =
-               gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags)) &&
-               wm_binding_table & 1;
+       need_stall = need_flush = false;
+       if (gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags))) {
+               if (wm_binding_table & 1)
+                       need_flush = true;
+               else
+                       need_stall = true;
+       }
        gen6_emit_sampler(sna, GEN6_SAMPLER(op->u.gen6.flags));
        gen6_emit_sf(sna, GEN6_VERTEX(op->u.gen6.flags) >> 2);
        gen6_emit_wm(sna, GEN6_KERNEL(op->u.gen6.flags), GEN6_VERTEX(op->u.gen6.flags) >> 2);
        gen6_emit_vertex_elements(sna, op);
 
-       need_stall = gen6_emit_binding_table(sna, wm_binding_table & ~1);
+       need_stall |= gen6_emit_binding_table(sna, wm_binding_table & ~1);
        if (gen6_emit_drawing_rectangle(sna, op))
                need_stall = false;
        if (need_flush || kgem_bo_is_dirty(op->src.bo) || kgem_bo_is_dirty(op->mask.bo)) {


which will restore the stall we had previously in the cases where we don't emit the flush.

Comment 25 Ilia Mirkin 2014-02-09 22:29:58 UTC

That didn't work. I won't pretend to even know what a flush or a stall is, however by pure code analysis, I think your patch isn't restoring the old behaviour. I changed it to

-       need_flush =
-               gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags)) &&
-               wm_binding_table & 1;
+       need_stall = need_flush = false;
+       if (gen6_emit_cc(sna, GEN6_BLEND(op->u.gen6.flags))) {
+               if (wm_binding_table & 1)
+                       need_flush = true;
+       }
+               else
+                       need_stall = wm_binding_table & 1;

and that appears to work fine. Note that before, need_stall was always set, and if gen6_emit_cc(), it was reset to false.

I think a cleaner way to write this might be

need_stall = wm_binding_table & 1;
if (gen6_emit_cc(...))
   need_flush = need_stall;

[and then the flush logic clears need_stall, which is nice]

Comment 26 Chris Wilson 2014-02-10 08:59:17 UTC

commit 37d8566ee78c67647b159a96ddb2675d1506b967
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Feb 9 12:28:27 2014 +0000

    sna/gen6: Restore stall dropped when not flushing instead
    
    commit 82e6d41c2f4f343bd1854d3d8ee4b624b5d68971
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Thu Oct 31 13:35:59 2013 +0000
    
        sna/gen6: Tweak flush around CC state changes
    
    Replaced the pipeline stall with a flush - but only when the target was
    dirty. The missing stall however seems to be required as well.
    
    v2: Actually emit the stall for all CC state changes [Ilia Mirkin]
    
    Reported-by: Ilia Mirkin <imirkin@alum.mit.edu>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=72375
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 27 Ilia Mirkin 2014-02-10 09:08:25 UTC

Double-checked that the committed version also works. Well that was a fun ride. Hopefully never again :) Thanks for sticking with it!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.