https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_59/fi-cfl-u2/igt@gem_ppgtt@blt-vs-render-ctxn.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_60/fi-cfl-u2/igt@gem_ppgtt@blt-vs-render-ctxn.html
Also seen on WHL: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_65/fi-whl-u/igt@gem_ppgtt@blt-vs-render-ctxn.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_64/fi-whl-u/igt@gem_ppgtt@blt-vs-render-ctxn.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_64/fi-whl-u/igt@gem_ppgtt@blt-vs-render-ctx0.html
Also seen on SNB: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_64/fi-snb-2600/igt@gem_ppgtt@blt-vs-render-ctx0.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_65/fi-snb-2600/igt@gem_ppgtt@blt-vs-render-ctx0.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_105/fi-bsw-n3050/igt@gem_ppgtt@blt-vs-render-ctxn.html
Also on SKL: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4952/shard-skl2/igt@gem_ppgtt@blt-vs-render-ctxn.html
Also seen on KBL: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_133/fi-kbl-8809g/igt@gem_ppgtt@blt-vs-render-ctxn.html
*** Bug 106023 has been marked as a duplicate of this bug. ***
These are the steps to manually reproduce the bug (from the igt directory):

./build/tests/gem_exec_capture --run-subtest userptr
./build/tests/gem_ppgtt --run-subtest blt-vs-render-ctxn

It always fails. The failure happens in the gen9_render_copyfunc() function (because I'm using Kaby Lake). The test *might* fail at a random point out of 8 times 32768 (0x8000) iterations.
Just to clarify what I wrote above:

The function gen9_render_copyfunc() is called in a for loop that runs 32768/8 = 4096 times. This loop is executed by 8 processes simultaneously. In my tests I always got a failure.

To simplify debugging, I run it in a single process by applying the following:

diff --git a/tests/i915/gem_ppgtt.c b/tests/i915/gem_ppgtt.c
index af5e3e07..66b71a68 100644
--- a/tests/i915/gem_ppgtt.c
+++ b/tests/i915/gem_ppgtt.c
@@ -294,7 +294,7 @@ static void flink_and_exit(void)
 	close(fd);
 }
 
-#define N_CHILD 8
+#define N_CHILD 1
 int main(int argc, char **argv)
 {
 	igt_subtest_init(argc, argv);

That makes the loop iterate exactly 32768 times in a single process.

NOTE: in a single process gen9_render_copyfunc() doesn't always fail, but it does fail at a high rate.
(In reply to Andi from comment #8)
> Just to clarify what I wrote above:
>
> The function gen9_render_copyfunc() is called under a for loop that runs
> (32768/8 = ) 4096. This loop is executed by 8 processes simultaneously. In
> my tests I always got a failure.

You still haven't said what failure.
(In reply to Chris Wilson from comment #9)
> (In reply to Andi from comment #8)
> > Just to clarify what I wrote above:
> >
> > The function gen9_render_copyfunc() is called under a for loop that runs
> > (32768/8 = ) 4096. This loop is executed by 8 processes simultaneously. In
> > my tests I always got a failure.
>
> You still haven't said what failure.

the system crashes and I need to reboot by removing the power.
Could it be

commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Nov 8 08:17:38 2018 +0000

    drm/i915/execlists: Force write serialisation into context image vs execution

    Ensure that the writes into the context image are completed prior to the
    register mmio to trigger execution. Although previously we were assured
    by the SDM that all writes are flushed before an uncached memory
    transaction (our mmio write to submit the context to HW for execution),
    we have empirical evidence to believe that this is not actually the
    case.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181108081740.25615-1-chris@chris-wilson.co.uk
    Cc: stable@vger.kernel.org

?
(In reply to Chris Wilson from comment #11)
> Could it be
>
> commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD ->
> drm-intel-next-queued, drm-intel/for-linux-next,
> drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Thu Nov 8 08:17:38 2018 +0000
>
> drm/i915/execlists: Force write serialisation into context image vs
> execution

Nope, struck again after.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5108/shard-kbl6/igt@gem_ppgtt@blt-vs-render-ctx0.html
Optimistically,

commit 4a15c75c42460252a63d30f03b4766a52945fb47
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date: Mon Dec 3 13:33:41 2018 +0000

    drm/i915: Introduce per-engine workarounds

    We stopped re-applying the GT workarounds after engine reset since commit
    59b449d5c82a ("drm/i915: Split out functions for different kinds of
    workarounds").

    Issue with this is that some of the GT workarounds live in the MMIO space
    which gets lost during engine resets. So far the registers in 0x2xxx and
    0xbxxx address range have been identified to be affected.

    This losing of applied workarounds has obvious negative effects and can
    even lead to hard system hangs (see the linked Bugzilla).

    Rather than just restoring this re-application, because we have also
    observed that it is not safe to just re-write all GT workarounds after
    engine resets (GPU might be live and weird hardware states can happen),
    we introduce a new class of per-engine workarounds and move only the
    affected GT workarounds over.

    Using the framework introduced in the previous patch, we therefore after
    engine reset, re-apply only the workarounds living in the affected MMIO
    address ranges.

    v2:
     * Move Wa_1406609255:icl to engine workarounds as well.
     * Rename API. (Chris Wilson)
     * Drop redundant IS_KABYLAKE. (Chris Wilson)
     * Re-order engine wa/ init so latest platforms are first. (Rodrigo Vivi)

    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Bugzilla: https://bugzilla.freedesktop.org/show_bug.cgi?id=107945
    Fixes: 59b449d5c82a ("drm/i915: Split out functions for different kinds of workarounds")
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Jani Nikula <jani.nikula@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Cc: intel-gfx@lists.freedesktop.org
    Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181203133341.10258-1-tvrtko.ursulin@linux.intel.com
Also seen on ICL https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/igt@gem_ppgtt@blt-vs-render-ctx0.html Is this a different issue?
Latest occurrence ~4 weeks ago: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/igt@gem_ppgtt@blt-vs-render-ctx0.html Let's keep monitoring...
(In reply to Lakshmi from comment #14)
> Also seen on ICL
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/igt@gem_ppgtt@blt-vs-render-ctx0.html
>
> Is this a different issue?

Yes. icl is beset by a number of problems, but there is a definite system hang (hard machine lockup) on kbl that is very likely kbl specific. Any complete system lockup is likely machine specific and is best kept separate until root caused.
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.