Summary: | [SNA HSW bisected] git commit "sna: Skip sync if we have an idle coherent bo" causes corruption | ||||||
---|---|---|---|---|---|---|---|
Product: | xorg | Reporter: | Matti Hämäläinen <ccr> | ||||
Component: | Driver/intel | Assignee: | Chris Wilson <chris> | ||||
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||
Severity: | normal | ||||||
Priority: | medium | ||||||
Version: | unspecified | ||||||
Hardware: | x86-64 (AMD64) | ||||||
OS: | Linux (All) | ||||||
Whiteboard: | |||||||
i915 platform: | i915 features: | ||||||
Attachments: |
|
Do you mind reverting that commit chunk-by-chunk to narrow down which bo are suspect? It implies the request tracking is wrong either at the ddx or kernel level, but most likely it is a foriegn bo which the ddx has mistakenly assumed is idle. Along those lines diff --git a/src/sna/kgem.c b/src/sna/kgem.c index 8e5ec51..f593abd 100644 --- a/src/sna/kgem.c +++ b/src/sna/kgem.c @@ -7003,8 +7003,10 @@ void kgem_bo_sync__cpu(struct kgem *kgem, struct kgem_bo *bo) assert(!bo->scanout); assert_tiling(kgem, bo); - if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) + if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) { + assert(!__kgem_busy(kgem, bo->handle)); return; + } kgem_bo_submit(kgem, bo); @@ -7043,8 +7045,10 @@ void kgem_bo_sync__cpu_full(struct kgem *kgem, struct kgem_bo *bo, bool write) assert(!bo->scanout || !write); assert_tiling(kgem, bo); - if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) + if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) { + assert(!__kgem_busy(kgem, bo->handle)); return; + } if (write || bo->needs_flush) kgem_bo_submit(kgem, bo); @@ -7093,8 +7097,10 @@ void kgem_bo_sync__gtt(struct kgem *kgem, struct kgem_bo *bo) assert(bo->proxy == NULL); assert_tiling(kgem, bo); - if (bo->rq == NULL) + if (bo->rq == NULL) { + assert(!__kgem_busy(kgem, bo->handle)); return; + } kgem_bo_submit(kgem, bo); plus a full-debug log would help. Also would be nice to check if diff --git a/src/sna/kgem.c b/src/sna/kgem.c index 8e5ec51..0b621b6 100644 --- a/src/sna/kgem.c +++ b/src/sna/kgem.c @@ -7003,7 +7003,7 @@ void kgem_bo_sync__cpu(struct kgem *kgem, struct kgem_bo *bo) assert(!bo->scanout); assert_tiling(kgem, bo); - if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) + if (bo->rq == NULL && (kgem->has_llc || bo->snoop) && !bo->flush) return; kgem_bo_submit(kgem, bo); @@ -7043,7 +7043,7 @@ void kgem_bo_sync__cpu_full(struct kgem *kgem, struct kgem_bo *bo, bool write) assert(!bo->scanout || !write); assert_tiling(kgem, bo); - if (bo->rq == NULL && (kgem->has_llc || bo->snoop)) + if (bo->rq == NULL && (kgem->has_llc || bo->snoop) && !bo->flush) return; if (write || bo->needs_flush) @@ -7093,7 +7093,7 @@ void kgem_bo_sync__gtt(struct kgem *kgem, struct kgem_bo *bo) assert(bo->proxy == NULL); assert_tiling(kgem, bo); - if (bo->rq == NULL) + if (bo->rq == NULL && !bo->flush) return; kgem_bo_submit(kgem, bo); affected the rendering. Tested with the first patch, built with debug=full (and as expected, hit one of the asserts) and produced this log: http://tnsp.org/~ccr/intel-gfx/sna_skip_sync/Xorg.0.log.old.xz With the latter patch the problem seems to disappear. Do you want me to test with partial reverts in addition to those? Or something else? No, the full-debug should be enough to trace back the origin of the busy handle and confirm if it is an external bo (and verify if the mitigation I have in mind will work). commit 9df8119e1e8216d11a116fdc28c1b499a521ef8b Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Sat Feb 13 08:44:15 2016 +0000 sna: Reorder busyness checks after finding the proxy's parents Due to the way request tracking on proxies work, we need to inspect the parent and not the proxy itself. Reported-by: Matti Hämäläinen <ccr@tnsp.org> References: https://bugs.freedesktop.org/show_bug.cgi?id=94125 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Probably needs another pass through with the asserts though. Pulled and now testing with added asserts. The issue is not presenting (with 1h of testing), nor am I hitting the asserts - so it is working, but that kinda raises the question whether I _should_ hit the asserts at some point? E.g. how to hit the situation the original changeset was meant to address? (In reply to Matti Hämäläinen from comment #5) > Pulled and now testing with added asserts. The issue is not presenting (with > 1h of testing), nor am I hitting the asserts - so it is working, but that > kinda raises the question whether I _should_ hit the asserts at some point? > E.g. how to hit the situation the original changeset was meant to address? The original regressing patch was a micro-optimisation to avoid a syscall and walking a potentially long list of requests in the kernel (trying to keep that list alive to reduce future work). Any time we read back from a pixmap, I expect (at least the second time) to hit the short-circuit so given a workload where you saw a failure, I would expect it still be taking the short-circuit, just less often (having avoided the bug). Thanks for the explanation. I realized that as I was using the assert() conditions from your original patch, they might not trigger, so added some appropriate printf-debugs and those did show that the short-circuit paths are working. ( http://tnsp.org/~ccr/intel-gfx/sna_skip_sync/test_patch.diff ) In conclusion, I guess this works as expected now? :) Decided to revert the shortcircuit on the write side: commit aacc3442bb8c280a6ebc1fc1e34451613a72d7c4 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Feb 18 10:15:11 2016 +0000 sna: Restrict set-domain short-circuit to readonly If we write to the bo, we do want the kernel to mark it as dirty (for the purposes of swap invalidation and the like), so we can't apply the shortcircuit. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> that should almost completely nerf the commit (except iirc you will still hit the shortcircuit!). |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 121723 [details] Example screenshot of corruption. DDX git commit 963b5bb1be874623f06caa421e06cdd2134204dc causes partially invalid screen updates / weirdness in some applications ("xv" image viewer and (paused) html5 video in Firefox at least). Most easily reproduced when switching between virtual desktops (in WindowMaker) and having "xv" on one desktop and something else, like Firefox on another vdesktop. Not sure if so "complex" setup is required, but it seemed to trigger the issue more easily. Reverting this commit on top of git head fixes the issue. -- Window manager: WindowMaker 0.95.7-3 -- chipset: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06) -- system architecture: x86-64 / 64bit -- xf86-video-intel: GIT head/master de7407d20703615c5bf0b1d9f522e93dcc13ba82 -- xserver: X.Org X Server 1.18.0 from latest Debian testing -- mesa: 11.1.1-2 from Debian testing -- libpixman: 0.33.6-1 -- libdrm version: 2.4.66-2 -- kernel version: 4.3.5 (vanilla+grsec) -- Linux distribution: current Debian Testing -- Machine or mobo model: Asus H97M-PLUS, core i7-4770 -- Display connector: DVI