https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-icl-u3/igt@pm_rps@reset.html

Starting subtest: reset
(pm_rps:1245) ioctl_wrappers-CRITICAL: Test assertion failure function gem_execbuf_wr, file ../lib/ioctl_wrappers.c:641:
(pm_rps:1245) ioctl_wrappers-CRITICAL: Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0
(pm_rps:1245) ioctl_wrappers-CRITICAL: error: -5 != 0
Subtest reset failed.
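For reference, the failing assertion comes from IGT's execbuf wrapper, and error -5 is -EIO, which the kernel returns for new execbufs once the device has been wedged. A minimal sketch of what the wrapper checks, simplified from lib/ioctl_wrappers.c (the igt_assert plumbing and errno bookkeeping are omitted):

```c
/* Simplified sketch of IGT's __gem_execbuf_wr(): submit an execbuf via the
 * _WR (read/write) ioctl variant and return 0 or -errno, which the caller
 * then asserts to be 0. Include paths may need libdrm's uapi headers. */
#include <errno.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int __gem_execbuf_wr(int fd, struct drm_i915_gem_execbuffer2 *execbuf)
{
	if (ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, execbuf))
		return -errno;	/* here: -EIO (-5), the device is wedged */
	return 0;
}
```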
The CI Bug Log issue associated with this bug has been updated.

### New filters associated
* ICL: igt@pm_rps@reset - dmesg-fail - Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0\n[^\n]+error: -5 != 0
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-icl-u3/igt@pm_rps@reset.html
<7>[ 104.332734] [IGT] pm_rps: starting subtest reset
<5>[ 104.333284] Setting dangerous option reset - tainting kernel
<6>[ 105.594685] i915 0000:00:02.0: GPU HANG: ecode 11:0:0x00000000, Manually set wedged engine mask = ffffffffffffffff
<6>[ 105.594790] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[ 105.594793] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[ 105.594795] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[ 105.594797] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[ 105.594800] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<5>[ 105.594849] i915 0000:00:02.0: Resetting rcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[ 105.596370] i915 0000:00:02.0: Resetting bcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[ 105.596494] i915 0000:00:02.0: Resetting vcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[ 105.596627] i915 0000:00:02.0: Resetting vcs2 for Manually set wedged engine mask = ffffffffffffffff
<5>[ 105.596756] i915 0000:00:02.0: Resetting vecs0 for Manually set wedged engine mask = ffffffffffffffff
<7>[ 107.207564] [drm:edp_panel_vdd_off_sync [i915]] Turning eDP port A VDD off
<7>[ 107.207783] [drm:edp_panel_vdd_off_sync [i915]] PP_STATUS: 0x80000008 PP_CONTROL: 0x00000067
<7>[ 120.391709] hangcheck rcs0
<7>[ 120.391740] hangcheck current seqno 9eb, last a1d, hangcheck 9eb [14016 ms]
<7>[ 120.391745] hangcheck Reset count: 1 (global 0)
<7>[ 120.391751] hangcheck Requests:
<7>[ 120.391773] hangcheck first a0c [27:1402] prio=2 @ 14797ms: pm_rps[1244]/0
<7>[ 120.391781] hangcheck last a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.391806] hangcheck RING_START: 0x0000b000
<7>[ 120.391813] hangcheck RING_HEAD: 0x000000c8
<7>[ 120.391820] hangcheck RING_TAIL: 0x00001b10
<7>[ 120.391829] hangcheck RING_CTL: 0x00003001
<7>[ 120.391838] hangcheck RING_MODE: 0x00000000
<7>[ 120.391844] hangcheck RING_IMR: 00000000
<7>[ 120.391855] hangcheck ACTHD: 0x00000005_443a9d90
<7>[ 120.391866] hangcheck BBADDR: 0x00000005_443aec41
<7>[ 120.391878] hangcheck DMA_FADDR: 0x00000005_443b3980
<7>[ 120.391884] hangcheck IPEIR: 0x00000000
<7>[ 120.391891] hangcheck IPEHR: 0x18800101
<7>[ 120.391900] hangcheck Execlist status: 0x00202098 00000040
<7>[ 120.391908] hangcheck Execlist CSB read 5, write 5 [mmio:5], tasklet queued? no (enabled)
<7>[ 120.391918] hangcheck ELSP[0] count=1, ring:{start:0000b000, hwsp:fffee280}, rq: a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.391923] hangcheck ELSP[1] idle
<7>[ 120.391927] hangcheck HW active? 0x5
<7>[ 120.391983] hangcheck E a0c [27:1402] prio=2 @ 14797ms: pm_rps[1244]/0
<7>[ 120.392047] hangcheck E a0d [27:1404] prio=1 @ 13793ms: pm_rps[1244]/0
<7>[ 120.392054] hangcheck E a0e [27:1406] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.392061] hangcheck E a0f [27:1408] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.392068] hangcheck E a10 [27:140a] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.392074] hangcheck E a11 [27:140c] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.392081] hangcheck E a12 [27:140e] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.392087] hangcheck ...skipping 10 executing requests...
<7>[ 120.392094] hangcheck E a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[ 120.392098] hangcheck Queue priority hint: 1
<7>[ 120.392102] hangcheck HWSP:
<7>[ 120.392111] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 120.392115] hangcheck *
<7>[ 120.392123] hangcheck [0040] 10008002 00000040 10008002 00000040 10008002 00000040 10008002 00000040
<7>[ 120.392130] hangcheck [0060] 10008002 00000040 10008002 00000040 00000000 00000000 00000000 00000000
<7>[ 120.392137] hangcheck [0080] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 120.392144] hangcheck [00a0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000005
<7>[ 120.392151] hangcheck [00c0] 000009eb 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 120.392158] hangcheck [00e0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 120.392162] hangcheck *
<7>[ 120.392167] hangcheck Idle? no
<7>[ 120.392171] hangcheck Signals:
<7>[ 120.392200] hangcheck [27:1424] @ 13792ms
<5>[ 120.392420] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0

The final reset, "Resetting rcs0 for no progress on rcs0", is peculiar, as our writes into the global HWSP simply vanish, and we quite rightly conclude that we are unable to recover. That error seems related to bug #109605.
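To make the "writes into the global HWSP simply vanish" observation concrete, here is a hedged sketch of the progress check the dump reflects; the struct, names, and HWSP slot are illustrative, not the driver's actual layout. Completed requests post their seqno into a dword of the hardware status page, and hangcheck samples it periodically. In the dump above the sampled value is stuck at 0x9eb while requests up to 0xa1d were submitted, so progress is never observed:

```c
/* Illustrative sketch (not the i915 code) of a seqno-based progress check. */
#include <stdint.h>
#include <stdbool.h>

struct engine_state {
	const volatile uint32_t *hwsp;	/* CPU view of the status page */
	int seqno_dword;		/* dword index of the seqno slot */
	uint32_t last_submitted;	/* e.g. 0xa1d in the dump above */
	uint32_t prev_sample;		/* e.g. stuck at 0x9eb */
};

static bool made_progress(struct engine_state *e)
{
	uint32_t completed = e->hwsp[e->seqno_dword];
	bool moved = completed != e->prev_sample;

	e->prev_sample = completed;
	/* If the write of a newer seqno into the HWSP is lost, this stays
	 * false on every sample and the engine is (re)declared hung. */
	return moved || completed == e->last_submitted;
}
```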
Fwiw, this issue is fixed by removing the global_seqno itself, e.g. https://patchwork.freedesktop.org/patch/286898/
commit 89531e7d8ee8602b2723431a581250d5d0ec2913
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 26 09:49:19 2019 +0000

    drm/i915: Replace global_seqno with a hangcheck heartbeat seqno

    To determine whether an engine has 'stuck', we simply check whether or
    not it is still on the same seqno for several seconds. To keep this
    simple mechanism intact over the loss of a global seqno, we can simply
    add a new global heartbeat seqno instead. As we cannot know the
    sequence in which requests will then be completed, we use a primitive
    random number generator instead (with a cycle long enough to not matter
    over an interval of a few thousand requests between hangcheck samples).

    The alternative to using a dedicated seqno on every request is to issue
    a heartbeat request and query its progress through the system. Sadly
    this requires us to reduce struct_mutex so that we can issue requests
    without requiring that bkl.

    v2: And without the extra CS_STALL for the hangcheck seqno -- we don't
    need strict serialisation with what comes later, we just need to be
    sure we don't write the hangcheck seqno before our batch is flushed.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190226094922.31617-1-chris@chris-wilson.co.uk
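As an illustrative sketch of the heuristic the commit describes (not the kernel code itself): each request additionally writes a fresh pseudo-random "heartbeat" value to a dedicated status-page slot, and hangcheck only needs to see that value change between samples. A simple xorshift32 generator has a period of 2^32 - 1, comfortably longer than the few thousand requests between samples:

```c
/* Illustrative sketch of a heartbeat seqno; all names are hypothetical. */
#include <stdint.h>
#include <stdbool.h>

/* xorshift32 (Marsaglia): full period of 2^32 - 1 for any non-zero seed. */
static uint32_t next_heartbeat(uint32_t x)
{
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return x;
}

struct engine {
	uint32_t hws_heartbeat;		/* written by each request on the GPU */
	uint32_t last_sample;		/* what hangcheck saw last time */
	unsigned int stalled_samples;	/* consecutive unchanged samples */
};

/* Sampled every few seconds; the engine is "stuck" once the heartbeat
 * stops changing for several samples in a row. */
static bool hangcheck_sample(struct engine *e, unsigned int threshold)
{
	if (e->hws_heartbeat == e->last_sample)
		return ++e->stalled_samples >= threshold;

	e->last_sample = e->hws_heartbeat;
	e->stalled_samples = 0;
	return false;
}
```

The design point is that the heartbeat value no longer needs to be monotonic or globally ordered; it only needs to keep changing while the engine makes progress, which is why a primitive random generator suffices.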
(In reply to Chris Wilson from comment #4)
> commit 89531e7d8ee8602b2723431a581250d5d0ec2913
> [...]

Fair enough! Thanks!
The CI Bug Log issue associated with this bug has been archived.

New failures matching the above filters will no longer be associated with this bug.