Continuing with the series of "Initial findings" with Intel-GFX-CI and i915 selftests.

igt@drv_selftest@live_coherency occasionally fails on GDG with dmesg:

[ 414.340496] Setting dangerous option live_selftests - tainting kernel
[ 414.651331] Value[10/13] mismatch, (overwrite with gtt) wrote [wc] 9be7af9d read [gtt] 64185062 (inverse 64185062), at offset a64
[ 414.655515] i915/i915_gem_coherency_live_selftests: igt_gem_coherency failed with error -22
[ 414.713948] i915: probe of 0000:00:02.0 failed with error -22

Full trace at:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4444/fi-gdg-551/igt@drv_selftest@live_coherency.html
Two bugs here, this is the dmesg splat from the earlier WC one.

commit add00e6d896fab882e6115ed4908b2456f1b3a85
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 6 12:54:02 2018 +0100

    drm/i915: Flush the WCB following a WC write

    If we have just completed a WC write, we must ensure that the WCB
    (Write Combining Buffer) is flushed out to main memory before we can
    expect to see the results. This is especially important when mixing
    WC with GTT as the physical paths are different and cachelines are
    not naturally flushed.

    Testcase: igt/drv_selftests/live_coherency #gdg
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180706115402.18547-1-chris@chris-wilson.co.uk

commit 3a32497f0dbe170794e1506deb41dc44c4fea8d9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 6 18:49:26 2018 +0100

    drm/i915/selftests: Provide full mb() around clflush

    clflush is an unserialised instruction and the IA manual strongly
    advises you to serialise it with a mb. To be cautious, apply one
    before and one after, so that it is serialised with both writes and
    reads without worrying too much about the required direction.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180706174926.4712-1-chris@chris-wilson.co.uk
It looks like this issue is still happening sporadically on CI. The last failure was on July 16th:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4499/fi-gdg-551/igt@drv_selftest@live_coherency.html

dmesg:

[ 525.761393] Setting dangerous option live_selftests - tainting kernel
[ 526.007603] Value[14/23] mismatch, (overwrite with gtt) wrote [cpu] 75bc439e read [gtt] 8a43bc61 (inverse 8a43bc61), at offset 598
[ 526.011501] i915/i915_gem_coherency_live_selftests: igt_gem_coherency failed with error -22
[ 526.058075] i915: probe of 0000:00:02.0 failed with error -22

12 of the last 60 CI runs had this failure signature, so the failure rate is currently around 20%.
Back for another soak.

commit a8bd3b884dd79dcc9a89dedd0e24b7554de4fe79 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jul 17 10:26:55 2018 +0100

    drm/i915: Flush chipset caches after GGTT writes

    Our I915g (early gen3, the oldest machine we have in the farm) is
    still reporting occasional incoherency performing the following
    operations:

     1) write through GGTT (indirect write into memory)
     2) write through either CPU or WC (direct write into memory)
     3) read from GGTT (indirect read)

    Instead of reporting the value from (2), the read from GGTT reports
    the earlier value written via the GGTT. We have made sure that the
    writes are flushed from the CPU (commit 3a32497f0dbe
    ("drm/i915/selftests: Provide full mb() around clflush") and commit
    add00e6d896f ("drm/i915: Flush the WCB following a WC write")), but
    still see the error, just less frequently. The only remaining cache
    that might be affected here is a chipset cache, so flush that as
    well.

    Testcase: igt/drv_selftest/live_coherency #gdg
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180717092655.28417-1-chris@chris-wilson.co.uk
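To make the three-step sequence in that commit message concrete, here is a toy model in Python. It is not driver code and makes no claim about how the hardware is actually built: the class `ToyChipset`, its per-offset "chipset cache" and the `flush_chipset` method are all hypothetical, invented only to illustrate the hypothesis that a stale copy on the GGTT path could explain reading back the earlier value. The offset and values are taken from the July 16th dmesg.

```python
class ToyChipset:
    """Toy model (hypothetical): main memory plus a cache on the GGTT path only."""

    def __init__(self):
        self.memory = {}      # backing store, keyed by offset
        self.ggtt_cache = {}  # imagined chipset cache used by GGTT accesses

    def ggtt_write(self, offset, value):
        # Indirect write: reaches memory, and a copy lingers in the cache.
        self.memory[offset] = value
        self.ggtt_cache[offset] = value

    def cpu_write(self, offset, value):
        # Direct write (CPU or WC): goes straight to memory, bypassing the cache.
        self.memory[offset] = value

    def ggtt_read(self, offset):
        # Indirect read: served from the cache if a copy is still there.
        if offset in self.ggtt_cache:
            return self.ggtt_cache[offset]
        return self.memory[offset]

    def flush_chipset(self):
        # Stand-in for "flush chipset caches after GGTT writes".
        self.ggtt_cache.clear()


def run_sequence(flush_after_ggtt_write):
    """Run steps 1-3 from the commit message and return what step 3 reads."""
    chip = ToyChipset()
    offset = 0x598
    value = 0x75BC439E
    inverse = ~value & 0xFFFFFFFF

    chip.ggtt_write(offset, inverse)   # 1) write through GGTT
    if flush_after_ggtt_write:
        chip.flush_chipset()
    chip.cpu_write(offset, value)      # 2) write through CPU or WC
    return chip.ggtt_read(offset)      # 3) read from GGTT


if __name__ == "__main__":
    print(hex(run_sequence(flush_after_ggtt_write=False)))
    print(hex(run_sequence(flush_after_ggtt_write=True)))
```

Without the flush the toy read returns the value from step 1, which is the symptom in the dmesg; with the flush it returns the value from step 2. Whether the real chipset behaves anything like this is exactly what the soak is meant to find out.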
Tomi, is it OK to close this?
Still seeing the issue: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4616/fi-gdg-551/igt@drv_selftest@live_coherency.html
Also failed with noclflush:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5369/fi-gdg-551/igt@i915_selftest@live_coherency.html

Value[0/1] mismatch, (overwrite with gtt) wrote [wc] ff9499e4 read [gtt] 6b661b (inverse 6b661b), at offset 2ec
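A quick check on the three mismatch lines quoted in this thread so far: in each one the value read back through the GTT is exactly the 32-bit bitwise complement of the value that was written, which is the "inverse" the selftest prints. That fits a read returning an older value rather than random garbage. The Python sketch below does the arithmetic. The helper names `parse_mismatch` and `read_is_complement` are made up for illustration, and the regex is only fitted to the lines quoted here (an assumption, not taken from the selftest's format string). It also handles the last line, where the value is printed without leading zeros.

```python
import re

# Pattern fitted to the mismatch lines quoted in this thread (an assumption,
# not derived from the selftest source).
MISMATCH = re.compile(
    r"wrote \[(?P<wdom>\w+)\] (?P<wrote>[0-9a-f]+) "
    r"read \[(?P<rdom>\w+)\] (?P<read>[0-9a-f]+) "
    r"\(inverse (?P<inverse>[0-9a-f]+)\)"
)


def parse_mismatch(line):
    """Return (wrote, read, inverse) as integers, or None if the line does not match."""
    m = MISMATCH.search(line)
    if m is None:
        return None
    return tuple(int(m.group(k), 16) for k in ("wrote", "read", "inverse"))


def read_is_complement(line):
    """True if the value read back equals the 32-bit complement of the written value."""
    parsed = parse_mismatch(line)
    if parsed is None:
        return False
    wrote, read, inverse = parsed
    return read == inverse == (~wrote & 0xFFFFFFFF)


# The three mismatch lines quoted in this thread.
LINES = [
    "Value[10/13] mismatch, (overwrite with gtt) wrote [wc] 9be7af9d "
    "read [gtt] 64185062 (inverse 64185062), at offset a64",
    "Value[14/23] mismatch, (overwrite with gtt) wrote [cpu] 75bc439e "
    "read [gtt] 8a43bc61 (inverse 8a43bc61), at offset 598",
    "Value[0/1] mismatch, (overwrite with gtt) wrote [wc] ff9499e4 "
    "read [gtt] 6b661b (inverse 6b661b), at offset 2ec",
]

if __name__ == "__main__":
    for line in LINES:
        print(read_is_complement(line), line)
```

All three lines pass the check, for both the [wc] and the [cpu] write paths.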
It's been a while, I proclaim this fixed! (That should be enough for it to spontaneously occur again.)

commit a679f58d051025db6fa86226c4d35650b75e990f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 21 16:19:07 2019 +0000

    drm/i915: Flush pages on acquisition

    When we return pages to the system, we ensure that they are marked as
    being in the CPU domain since any external access is uncontrolled and
    we must assume the worst. This means that we need to always flush the
    pages on acquisition if we need to use them on the GPU, and from the
    beginning have used set-domain. Set-domain is overkill for the
    purpose as it is a general synchronisation barrier, but our intent is
    to only flush the pages being swapped in. If we move that flush into
    the pages acquisition phase, we know then that when we have
    obj->mm.pages, they are coherent with the GPU and need only maintain
    that status without resorting to heavy handed use of set-domain.
Closing this issue as Fixed.
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.