https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_91/fi-byt-j1900/igt@gem_persistent_relocs@forked-faulting-reloc-thrashing.html

(gem_persistent_relocs:1233) CRITICAL: Test assertion failure function do_test, file ../tests/gem_persistent_relocs.c:256:
(gem_persistent_relocs:1233) CRITICAL: Failed assertion: test == 0xdeadbeef
(gem_persistent_relocs:1233) CRITICAL: mismatch in buffer 0: 0x00000000 instead of 0xdeadbeef at offset 0
Subtest forked-faulting-reloc-thrashing failed.
No log.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_94/fi-ivb-3770/igt@gem_persistent_relocs@forked-interruptible-faulting-reloc-thrashing.html

(gem_persistent_relocs:1400) CRITICAL: Test assertion failure function do_test, file ../tests/gem_persistent_relocs.c:256:
(gem_persistent_relocs:1400) CRITICAL: Failed assertion: test == 0xdeadbeef
(gem_persistent_relocs:1400) CRITICAL: mismatch in buffer 0: 0x00000000 instead of 0xdeadbeef at offset 0
Subtest forked-interruptible-faulting-reloc-thrashing failed.
No log.
I couldn't find fault in the reloc test (it relies only on reloc.presumed_offset, which is maintained by the kernel while applying the reloc); so why now, and why a pair of gen7? If we assume some significance to gen7 => cmdparser. Could there be a race between maintaining the reloc (inside a user GTT mmap) and running the cmdparser?
Running

  while sudo ./tests/gem_persistent_relocs --run forked-interruptible-thrashing ; do :; done

fails within an hour. Let's see what we can see.
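For anyone repeating the loop above, a small wrapper that also reports how many runs passed before the failure makes it easier to compare time-to-failure between kernels. This is only a sketch: run_until_fail is a hypothetical helper name, not part of IGT, and it assumes the test binary exits non-zero on failure, as the original loop does.

```shell
#!/bin/sh
# Hypothetical helper (not part of IGT): run the given command
# repeatedly until it exits non-zero, then report how many runs
# passed and how many seconds elapsed.
run_until_fail() {
    count=0
    start=$(date +%s)
    while "$@"; do
        count=$((count + 1))
    done
    end=$(date +%s)
    echo "failed after $count passing runs, $((end - start))s"
}

# Example invocation (same command as the loop above):
#   run_until_fail sudo ./tests/gem_persistent_relocs --run forked-interruptible-thrashing
```

The elapsed time has one-second resolution, which is enough for a failure that takes minutes to hours to appear.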
For the negative checklist; not cmdparser related.
Hmm, gem_set_domain(GTT, GTT) seems to be hiding it. So, my guess is that we are not flagging the object as dirty on the execbuf write into the reloc, and so losing the update if it gets thrashed.
Another note for why gen7 now: full-ppgtt. Could be either slight variation in relocation path or the relocation itself not being coherent with the ppGTT.
i915.enable_ppgtt=1 /* disable full-ppgtt */ ftw
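When comparing runs with and without full-ppgtt, it helps to confirm which mode the kernel actually reports. The sketch below assumes a kernel of this era that still exposes the parameter under /sys/module/i915/parameters/enable_ppgtt, and the value meanings (0 disabled, 1 aliasing, 2 full, 3 full with extended address space) are taken from the module parameter description as I remember it, so treat them as an assumption. ppgtt_mode and its optional path argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: describe the i915.enable_ppgtt value.
# The optional argument overrides the sysfs path (useful for testing);
# the value-to-mode mapping is an assumption, see the note above.
ppgtt_mode() {
    param=${1:-/sys/module/i915/parameters/enable_ppgtt}
    if [ ! -r "$param" ]; then
        echo "unknown (cannot read $param)"
        return 1
    fi
    case "$(cat "$param")" in
        0) echo "disabled" ;;
        1) echo "aliasing ppgtt (full-ppgtt off)" ;;
        2) echo "full ppgtt" ;;
        3) echo "full ppgtt, extended address space" ;;
        *) echo "auto or unrecognised" ;;
    esac
}
```

Calling ppgtt_mode with no argument before starting the test loop records which configuration a given failure (or survival) belongs to.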
*** Bug 107759 has been marked as a duplicate of this bug. ***
*** Bug 107715 has been marked as a duplicate of this bug. ***
More fingers crossed, worksforme at least,

commit 06348d3086a3b34f2db6c7692b4327fb7fc0b6c7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Sep 4 07:38:02 2018 +0100

    drm/i915/ringbuffer: Move double invalidate to after pd flush

    Continuing the fun of trying to find exactly the delay that is
    sufficient to ensure that the page directory is fully loaded between
    context switches, move the extra flush added in commit 70b73f9ac113
    ("drm/i915/ringbuffer: Delay after invalidating gen6+ xcs") to just
    after we flush the pd. Entirely based on the empirical data of running
    failing tests in a loop until we survive a day (before the mtbf is
    10-30 minutes).

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107769
    References: 70b73f9ac113 ("drm/i915/ringbuffer: Delay after invalidating gen6+ xcs")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180904063802.13880-1-chris@chris-wilson.co.uk

Let's see what test case remains troublesome after that!
Closing this as fixed. The issue was seen twice, 2 months 3 weeks ago, in drm-tip runs on two different platforms (ivb/byt).