Bug 94125 - [SNA HSW bisected] git commit "sna: Skip sync if we have an idle coherent bo" causes corruption
Summary: [SNA HSW bisected] git commit "sna: Skip sync if we have an idle coherent bo"...
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-12 19:35 UTC by Matti Hämäläinen
Modified: 2016-02-18 10:17 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Example screenshot of corruption. (513.42 KB, image/png)
2016-02-12 19:35 UTC, Matti Hämäläinen
no flags Details

Description Matti Hämäläinen 2016-02-12 19:35:46 UTC
Created attachment 121723 [details]
Example screenshot of corruption.

DDX git commit 963b5bb1be874623f06caa421e06cdd2134204dc causes partially invalid screen updates / weirdness in some applications ("xv" image viewer and (paused) html5 video in Firefox at least).

Most easily reproduced when switching between virtual desktops (in WindowMaker) and having "xv" on one desktop and something else, like Firefox on another vdesktop. Not sure if so "complex" setup is required, but it seemed to trigger the issue more easily.

Reverting this commit on top of git head fixes the issue.


-- Window manager: WindowMaker 0.95.7-3
-- chipset: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
-- system architecture: x86-64 / 64bit
-- xf86-video-intel: GIT head/master de7407d20703615c5bf0b1d9f522e93dcc13ba82
-- xserver: X.Org X Server 1.18.0 from latest Debian testing
-- mesa: 11.1.1-2 from Debian testing
-- libpixman: 0.33.6-1
-- libdrm version: 2.4.66-2
-- kernel version: 4.3.5 (vanilla+grsec)
-- Linux distribution: current Debian Testing
-- Machine or mobo model: Asus H97M-PLUS, core i7-4770
-- Display connector: DVI
Comment 1 Chris Wilson 2016-02-12 21:30:14 UTC
Do you mind reverting that commit chunk-by-chunk to narrow down which bo are suspect? It implies the request tracking is wrong either at the ddx or kernel level, but most likely it is a foriegn bo which the ddx has mistakenly assumed is idle. Along those lines

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 8e5ec51..f593abd 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -7003,8 +7003,10 @@ void kgem_bo_sync__cpu(struct kgem *kgem, struct kgem_bo *bo)
        assert(!bo->scanout);
        assert_tiling(kgem, bo);
 
-       if (bo->rq == NULL && (kgem->has_llc || bo->snoop))
+       if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) {
+               assert(!__kgem_busy(kgem, bo->handle));
                return;
+       }
 
        kgem_bo_submit(kgem, bo);
 
@@ -7043,8 +7045,10 @@ void kgem_bo_sync__cpu_full(struct kgem *kgem, struct kgem_bo *bo, bool write)
        assert(!bo->scanout || !write);
        assert_tiling(kgem, bo);
 
-       if (bo->rq == NULL && (kgem->has_llc || bo->snoop))
+       if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) {
+               assert(!__kgem_busy(kgem, bo->handle));
                return;
+       }
 
        if (write || bo->needs_flush)
                kgem_bo_submit(kgem, bo);
@@ -7093,8 +7097,10 @@ void kgem_bo_sync__gtt(struct kgem *kgem, struct kgem_bo *bo)
        assert(bo->proxy == NULL);
        assert_tiling(kgem, bo);
 
-       if (bo->rq == NULL)
+       if (bo->rq == NULL) {
+               assert(!__kgem_busy(kgem, bo->handle));
                return;
+       }
 
        kgem_bo_submit(kgem, bo);

plus a full-debug log would help.

Also would be nice to check if

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 8e5ec51..0b621b6 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -7003,7 +7003,7 @@ void kgem_bo_sync__cpu(struct kgem *kgem, struct kgem_bo *bo)
        assert(!bo->scanout);
        assert_tiling(kgem, bo);
 
-       if (bo->rq == NULL && (kgem->has_llc || bo->snoop))
+       if (bo->rq == NULL && (kgem->has_llc || bo->snoop) && !bo->flush)
                return;
 
        kgem_bo_submit(kgem, bo);
@@ -7043,7 +7043,7 @@ void kgem_bo_sync__cpu_full(struct kgem *kgem, struct kgem_bo *bo, bool write)
        assert(!bo->scanout || !write);
        assert_tiling(kgem, bo);
 
-       if (bo->rq == NULL && (kgem->has_llc || bo->snoop))
+       if (bo->rq == NULL && (kgem->has_llc || bo->snoop) && !bo->flush)
                return;
 
        if (write || bo->needs_flush)
@@ -7093,7 +7093,7 @@ void kgem_bo_sync__gtt(struct kgem *kgem, struct kgem_bo *bo)
        assert(bo->proxy == NULL);
        assert_tiling(kgem, bo);
 
-       if (bo->rq == NULL)
+       if (bo->rq == NULL && !bo->flush)
                return;
 
        kgem_bo_submit(kgem, bo);

affected the rendering.
Comment 2 Matti Hämäläinen 2016-02-12 21:58:12 UTC
Tested with the first patch, built with debug=full (and as expected, hit one of the asserts) and produced this log: 

http://tnsp.org/~ccr/intel-gfx/sna_skip_sync/Xorg.0.log.old.xz

With the latter patch the problem seems to disappear. Do you want me to test with partial reverts in addition to those? Or something else?
Comment 3 Chris Wilson 2016-02-13 08:34:49 UTC
No, the full-debug should be enough to trace back the origin of the busy handle and confirm if it is an external bo (and verify if the mitigation I have in mind will work).
Comment 4 Chris Wilson 2016-02-13 08:46:32 UTC
commit 9df8119e1e8216d11a116fdc28c1b499a521ef8b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Feb 13 08:44:15 2016 +0000

    sna: Reorder busyness checks after finding the proxy's parents
    
    Due to the way request tracking on proxies work, we need to inspect the
    parent and not the proxy itself.
    
    Reported-by: Matti Hämäläinen <ccr@tnsp.org>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=94125
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Probably needs another pass through with the asserts though.
Comment 5 Matti Hämäläinen 2016-02-13 14:36:54 UTC
Pulled and now testing with added asserts. The issue is not presenting (with 1h of testing), nor am I hitting the asserts - so it is working, but that kinda raises the question whether I _should_ hit the asserts at some point? E.g. how to hit the situation the original changeset was meant to address?
Comment 6 Chris Wilson 2016-02-13 16:24:44 UTC
(In reply to Matti Hämäläinen from comment #5)
> Pulled and now testing with added asserts. The issue is not presenting (with
> 1h of testing), nor am I hitting the asserts - so it is working, but that
> kinda raises the question whether I _should_ hit the asserts at some point?
> E.g. how to hit the situation the original changeset was meant to address?

The original regressing patch was a micro-optimisation to avoid a syscall and  walking a potentially long list of requests in the kernel (trying to keep that list alive to reduce future work).

Any time we read back from a pixmap, I expect (at least the second time) to hit the short-circuit so given a workload where you saw a failure, I would expect it still be taking the short-circuit, just less often (having avoided the bug).
Comment 7 Matti Hämäläinen 2016-02-14 00:14:23 UTC
Thanks for the explanation. I realized that as I was using the assert() conditions from your original patch, they might not trigger, so added some appropriate printf-debugs and those did show that the short-circuit paths are working. ( http://tnsp.org/~ccr/intel-gfx/sna_skip_sync/test_patch.diff )

In conclusion, I guess this works as expected now? :)
Comment 8 Chris Wilson 2016-02-18 10:17:58 UTC
Decided to revert the shortcircuit on the write side:

commit aacc3442bb8c280a6ebc1fc1e34451613a72d7c4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Feb 18 10:15:11 2016 +0000

    sna: Restrict set-domain short-circuit to readonly
    
    If we write to the bo, we do want the kernel to mark it as dirty (for
    the purposes of swap invalidation and the like), so we can't apply the
    shortcircuit.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

that should almost completely nerf the commit (except iirc you will still hit the shortcircuit!).


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.