107769 – [CI][DRMTIP] igt@gem_persistent_relocs@forked-faulting-reloc-thrashing - fail - Failed assertion: test == 0xdeadbeef

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Duplicates (2):	107715 107759
Depends on:
Blocks:

Reported:	2018-08-31 12:37 UTC by Martin Peres
Modified:	2018-10-14 14:36 UTC
CC List:	1 user

See Also:
i915 platform:	BYT, IVB
i915 features:	GEM/PPGTT

Description Martin Peres 2018-08-31 12:37:31 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_91/fi-byt-j1900/igt@gem_persistent_relocs@forked-faulting-reloc-thrashing.html

(gem_persistent_relocs:1233) CRITICAL: Test assertion failure function do_test, file ../tests/gem_persistent_relocs.c:256:
(gem_persistent_relocs:1233) CRITICAL: Failed assertion: test == 0xdeadbeef
(gem_persistent_relocs:1233) CRITICAL: mismatch in buffer 0: 0x00000000 instead of 0xdeadbeef at offset 0
Subtest forked-faulting-reloc-thrashing failed.
No log.


https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_94/fi-ivb-3770/igt@gem_persistent_relocs@forked-interruptible-faulting-reloc-thrashing.html

(gem_persistent_relocs:1400) CRITICAL: Test assertion failure function do_test, file ../tests/gem_persistent_relocs.c:256:
(gem_persistent_relocs:1400) CRITICAL: Failed assertion: test == 0xdeadbeef
(gem_persistent_relocs:1400) CRITICAL: mismatch in buffer 0: 0x00000000 instead of 0xdeadbeef at offset 0
Subtest forked-interruptible-faulting-reloc-thrashing failed.
No log.

Comment 1 Chris Wilson 2018-08-31 12:56:28 UTC

I couldn't find fault in the reloc test (it only relies on reloc.presumed_offset, which is maintained by the kernel while applying the reloc); so why now, and why a pair of gen7? If we assume gen7 is significant, that points at the cmdparser. Could there be a race between maintaining the reloc (inside a user GTT mmap) and running the cmdparser?

Comment 2 Chris Wilson 2018-08-31 15:24:05 UTC

while sudo ./tests/gem_persistent_relocs --run forked-interruptible-thrashing ; do :; done

fails within an hour. Let's see what we can see.
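
A small variant of the loop above that also reports how many runs passed and how long it took before the first failure can make "fails within an hour" easier to compare between kernels. This is only a sketch: run_until_fail is a hypothetical helper name, not part of igt, and the iteration cap is an added assumption so the loop can terminate on a kernel that no longer fails.

```shell
#!/bin/sh
# run_until_fail MAX CMD [ARGS...]
# Hypothetical helper: re-run CMD until it exits non-zero, or until MAX
# runs have passed (MAX=0 means no cap). The command's own output is left
# untouched so the failing log is still visible.
# Prints a one-line summary and returns 1 if a failure was hit, 0 otherwise.
# Note: "date +%s" is not strictly POSIX but is available in GNU, BSD and
# busybox date.
run_until_fail() {
    max=$1; shift
    start=$(date +%s)
    n=0
    while [ "$max" -eq 0 ] || [ "$n" -lt "$max" ]; do
        n=$((n + 1))
        if ! "$@"; then
            echo "failed after $n runs, $(( $(date +%s) - start )) seconds"
            return 1
        fi
    done
    echo "survived $n runs, $(( $(date +%s) - start )) seconds"
    return 0
}
```

With the command from this comment it would be invoked as: run_until_fail 0 sudo ./tests/gem_persistent_relocs --run forked-interruptible-thrashing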

Comment 3 Chris Wilson 2018-08-31 15:38:23 UTC

For the negative checklist: not cmdparser related.

Comment 4 Chris Wilson 2018-08-31 15:59:57 UTC

Hmm, gem_set_domain(GTT, GTT) seems to be hiding it. So, my guess is that we are not flagging the object as dirty on the execbuf write into the reloc, and so losing the update if it gets thrashed.

Comment 5 Chris Wilson 2018-09-01 10:50:58 UTC

Another note for why gen7 now: full-ppgtt. Could be either a slight variation in the relocation path or the relocation itself not being coherent with the ppGTT.

Comment 6 Chris Wilson 2018-09-03 08:37:20 UTC

i915.enable_ppgtt=1 /* disable full-ppgtt */ ftw
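
For context, on kernels of this period i915.enable_ppgtt selected the PPGTT mode, and 1 meant aliasing PPGTT, which is why setting it to 1 turns full-ppgtt off. A minimal sketch for checking which mode a running system reports is given here. The helper name ppgtt_mode is hypothetical, the sysfs path is the usual module-parameter location and is assumed to exist and be readable (typically root only), and the value meanings follow the module parameter description of that era; the parameter was removed from later kernels.

```shell
#!/bin/sh
# ppgtt_mode [FILE]
# Hypothetical helper: translate the value of i915.enable_ppgtt into a label.
# FILE defaults to the assumed sysfs location of the module parameter.
ppgtt_mode() {
    file=${1:-/sys/module/i915/parameters/enable_ppgtt}
    if [ ! -r "$file" ]; then
        echo "unknown (cannot read $file)"
        return 1
    fi
    case "$(cat "$file")" in
        -1) echo "auto" ;;
        0)  echo "disabled" ;;
        1)  echo "aliasing ppgtt" ;;
        2)  echo "full ppgtt" ;;
        3)  echo "full ppgtt, extended address space" ;;
        *)  echo "unrecognised value"; return 1 ;;
    esac
}
```

Running ppgtt_mode as root before starting the reproduction loop records which mode a given failure rate belongs to.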

Comment 7 Chris Wilson 2018-09-03 11:25:04 UTC

*** Bug 107759 has been marked as a duplicate of this bug. ***

Comment 8 Chris Wilson 2018-09-03 11:26:47 UTC

*** Bug 107715 has been marked as a duplicate of this bug. ***

Comment 9 Chris Wilson 2018-09-04 13:34:28 UTC

More fingers crossed, worksforme at least,

commit 06348d3086a3b34f2db6c7692b4327fb7fc0b6c7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Sep 4 07:38:02 2018 +0100

    drm/i915/ringbuffer: Move double invalidate to after pd flush
    
    Continuing the fun of trying to find exactly the delay that is
    sufficient to ensure that the page directory is fully loaded between
    context switches, move the extra flush added in commit 70b73f9ac113
    ("drm/i915/ringbuffer: Delay after invalidating gen6+ xcs") to just
    after we flush the pd. Entirely based on the empirical data of running
    failing tests in a loop until we survive a day (before the mtbf is 10-30
    minutes).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107769
    References: 70b73f9ac113 ("drm/i915/ringbuffer: Delay after invalidating gen6+ xcs")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180904063802.13880-1-chris@chris-wilson.co.uk

Let's see what test case remains troublesome after that!

Comment 10 Lakshmi 2018-10-14 14:36:39 UTC

Closing this as fixed.
This issue was seen twice, 2 months 3 weeks ago, in drm-tip runs on different platforms (IVB/BYT).
