107164 – [CI][GDG] igt@drv_selftest@live_coherency igt_gem_coherency failed with error -22

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

Reported:	2018-07-09 08:07 UTC by Tomi Sarvela
Modified:	2019-07-31 13:30 UTC
CC List:	1 user

i915 platform:	I915G
i915 features:	GEM/Other

Description Tomi Sarvela 2018-07-09 08:07:35 UTC

Continuing with the series of "Initial findings" with Intel-GFX-CI and i915 selftests.

igt@drv_selftest@live_coherency occasionally fails on GDG with dmesg:

[  414.340496] Setting dangerous option live_selftests - tainting kernel
[  414.651331] Value[10/13] mismatch, (overwrite with gtt) wrote [wc] 9be7af9d read [gtt] 64185062 (inverse 64185062), at offset a64
[  414.655515] i915/i915_gem_coherency_live_selftests: igt_gem_coherency failed with error -22
[  414.713948] i915: probe of 0000:00:02.0 failed with error -22

Full trace at:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4444/fi-gdg-551/igt@drv_selftest@live_coherency.html
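For triage, the mismatch line in the dmesg above can be checked mechanically. The sketch below is only an illustration: the regular expression is derived from the single log line quoted in this report, the function name `classify_mismatch` is hypothetical, and 32-bit values are assumed. It reports whether the value read back is the bitwise inverse of the value written, which would mean the read returned the earlier "overwrite" value rather than garbage.

```python
import re

# Pattern derived from the log line quoted in this report (an assumption,
# not taken from the selftest source).
MISMATCH_RE = re.compile(
    r"wrote \[(?P<write_path>\w+)\] (?P<wrote>[0-9a-f]+) "
    r"read \[(?P<read_path>\w+)\] (?P<got>[0-9a-f]+) "
    r"\(inverse (?P<inverse>[0-9a-f]+)\), at offset (?P<offset>[0-9a-f]+)"
)


def classify_mismatch(line):
    """Return details of a coherency mismatch line, or None if it does not match."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    wrote = int(m["wrote"], 16)
    got = int(m["got"], 16)
    return {
        "write_path": m["write_path"],
        "read_path": m["read_path"],
        "offset": int(m["offset"], 16),
        # True when the read returned the inverted (previously written) value;
        # values are assumed to be 32 bits wide.
        "stale": got == (~wrote & 0xFFFFFFFF),
    }


line = ("[  414.651331] Value[10/13] mismatch, (overwrite with gtt) wrote [wc] "
        "9be7af9d read [gtt] 64185062 (inverse 64185062), at offset a64")
print(classify_mismatch(line))
```

For the line above, the read value 64185062 is exactly the inverse of 9be7af9d, so the GTT read returned the older value rather than the WC write.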

Comment 1 Chris Wilson 2018-07-09 08:24:53 UTC

Two bugs here; this is the dmesg splat from the earlier WC one.

commit add00e6d896fab882e6115ed4908b2456f1b3a85
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 6 12:54:02 2018 +0100

    drm/i915: Flush the WCB following a WC write
    
    If we have just completed a WC write, we must ensure that the WCB (Write
    Combining Buffer) is flushed out to main memory before we can expect to
    see the results. This is especially important when mixing WC with GTT as
    the physical paths are different and cachelines are not naturally flushed.
    
    Testcase: igt/drv_selftests/live_coherency #gdg
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180706115402.18547-1-chris@chris-wilson.co.uk


commit 3a32497f0dbe170794e1506deb41dc44c4fea8d9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 6 18:49:26 2018 +0100

    drm/i915/selftests: Provide full mb() around clflush
    
    clflush is an unserialised instruction and the IA manual strongly advises
    you to serialise it with a mb. To be cautious, apply one before and one
    after, so that it is serialised with both writes and reads without
    worrying too much about the required direction.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180706174926.4712-1-chris@chris-wilson.co.uk

Comment 2 James Ausmus 2018-07-18 00:54:53 UTC

This issue is still happening sporadically on CI. The last failure was on July 16th: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4499/fi-gdg-551/igt@drv_selftest@live_coherency.html

dmesg: 

[  525.761393] Setting dangerous option live_selftests - tainting kernel
[  526.007603] Value[14/23] mismatch, (overwrite with gtt) wrote [cpu] 75bc439e read [gtt] 8a43bc61 (inverse 8a43bc61), at offset 598
[  526.011501] i915/i915_gem_coherency_live_selftests: igt_gem_coherency failed with error -22
[  526.058075] i915: probe of 0000:00:02.0 failed with error -22


12 of the last 60 CI runs had this failure signature, so the failure rate is currently around 20%.

Comment 3 Chris Wilson 2018-07-18 07:25:10 UTC

Back for another soak.

commit a8bd3b884dd79dcc9a89dedd0e24b7554de4fe79 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jul 17 10:26:55 2018 +0100

    drm/i915: Flush chipset caches after GGTT writes
    
    Our I915g (early gen3, the oldest machine we have in the farm) is still
    reporting occasional incoherency performing the following operations:
    
      1) write through GGTT (indirect write into memory)
      2) write through either CPU or WC (direct write into memory)
      3) read from GGTT (indirect read)
    
    Instead of reporting the value from (2), the read from GGTT reports the
    earlier value written via the GGTT. We have made sure that the writes are
    flushed from the CPU (commit 3a32497f0dbe ("drm/i915/selftests: Provide
    full mb() around clflush") and commit add00e6d896f ("drm/i915: Flush the
    WCB following a WC write")), but still see the error, just less
    frequently. The only remaining cache that might be affected here is a
    chipset cache, so flush that as well.
    
    Testcase: igt/drv_selftest/live_coherency #gdg
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180717092655.28417-1-chris@chris-wilson.co.uk
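The three-step sequence in the commit message above can be pictured with a toy model. This is a sketch under loud assumptions: the class `ToyGgttPath`, its methods, and the idea that the chipset cache holds the GGTT write and serves later GGTT reads are all hypothetical simplifications for illustration, not a description of the real gen3 hardware or of the i915 code.

```python
MASK32 = 0xFFFFFFFF


class ToyGgttPath:
    """Toy model: GGTT accesses pass through a chipset-side cache (assumed)."""

    def __init__(self):
        self.memory = {}         # main memory, as seen by direct CPU/WC access
        self.chipset_cache = {}  # hypothetical cache on the GGTT path

    def ggtt_write(self, offset, value):
        # Indirect write: lands in the chipset cache, not yet in memory.
        self.chipset_cache[offset] = value & MASK32

    def flush_chipset(self):
        # Push cached GGTT writes out to memory and empty the cache.
        self.memory.update(self.chipset_cache)
        self.chipset_cache.clear()

    def cpu_write(self, offset, value):
        # Direct write into memory, bypassing the chipset cache.
        self.memory[offset] = value & MASK32

    def ggtt_read(self, offset):
        # Indirect read: a cached entry wins over memory.
        if offset in self.chipset_cache:
            return self.chipset_cache[offset]
        return self.memory[offset]


def run_sequence(value, offset, flush_after_ggtt_write):
    path = ToyGgttPath()
    path.ggtt_write(offset, ~value)   # 1) write through GGTT
    if flush_after_ggtt_write:
        path.flush_chipset()
    path.cpu_write(offset, value)     # 2) direct write via CPU or WC
    return path.ggtt_read(offset)     # 3) read from GGTT


print(hex(run_sequence(0x75bc439e, 0x598, flush_after_ggtt_write=False)))
print(hex(run_sequence(0x75bc439e, 0x598, flush_after_ggtt_write=True)))
```

Without the flush, the toy read returns the inverted value from step 1, matching the shape of the reported mismatch; with the flush, it returns the value from step 2.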

Comment 4 Francesco Balestrieri 2018-08-04 09:23:18 UTC

Tomi, is it OK to close this?

Comment 5 Tomi Sarvela 2018-08-06 07:39:26 UTC

Still seeing the issue:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4616/fi-gdg-551/igt@drv_selftest@live_coherency.html

Comment 6 Chris Wilson 2019-01-07 22:13:27 UTC

Also failed with noclflush: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5369/fi-gdg-551/igt@i915_selftest@live_coherency.html

Value[0/1] mismatch, (overwrite with gtt) wrote [wc] ff9499e4 read [gtt] 6b661b (inverse 6b661b), at offset 2ec

Comment 7 Chris Wilson 2019-04-11 10:10:32 UTC

It's been a while, I proclaim this fixed! (That should be enough for it to spontaneously occur again.)


commit a679f58d051025db6fa86226c4d35650b75e990f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 21 16:19:07 2019 +0000

    drm/i915: Flush pages on acquisition
    
    When we return pages to the system, we ensure that they are marked as
    being in the CPU domain since any external access is uncontrolled and we
    must assume the worst. This means that we need to always flush the pages
    on acquisition if we need to use them on the GPU, and from the beginning
    have used set-domain. Set-domain is overkill for the purpose as it is a
    general synchronisation barrier, but our intent is to only flush the
    pages being swapped in. If we move that flush into the pages acquisition
    phase, we know then that when we have obj->mm.pages, they are coherent
    with the GPU and need only maintain that status without resorting to
    heavy handed use of set-domain.

Comment 8 Lakshmi 2019-07-31 13:28:54 UTC

Closing this issue as Fixed.

Comment 9 CI Bug Log 2019-07-31 13:29:02 UTC

The CI Bug Log issue associated with this bug has been archived.

New failures matching the above filters will no longer be associated with this bug.
