Bug 109606

Summary: [CI][DRMTIP] igt@pm_rps@reset - dmesg-fail - Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0
Product: DRI
Component: DRM/Intel
Version: DRI git
Hardware: Other
OS: All
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Reporter: Lakshmi <lakshminarayana.vudum>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs
Whiteboard: ReadyForDev
i915 platform: ICL
i915 features:

Description Lakshmi 2019-02-11 11:30:02 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-icl-u3/igt@pm_rps@reset.html

Starting subtest: reset
(pm_rps:1245) ioctl_wrappers-CRITICAL: Test assertion failure function gem_execbuf_wr, file ../lib/ioctl_wrappers.c:641:
(pm_rps:1245) ioctl_wrappers-CRITICAL: Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0
(pm_rps:1245) ioctl_wrappers-CRITICAL: error: -5 != 0
Subtest reset failed.
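
For context, a minimal sketch of what the failing IGT wrapper boils down to (paraphrased, not the verbatim lib/ioctl_wrappers.c source): the wrapper issues the execbuffer2 ioctl and returns -errno on failure, so the "-5 != 0" above is -EIO, i.e. the kernel refused the submission because the GPU had been declared wedged.

    /* Paraphrased sketch of the failing IGT wrapper, not the verbatim source. */
    #include <errno.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    static int __gem_execbuf_wr(int fd, struct drm_i915_gem_execbuffer2 *execbuf)
    {
            int err = 0;

            /* The _WR variant lets the kernel write results back into execbuf. */
            if (ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, execbuf))
                    err = -errno;   /* here: -EIO (-5), the device is wedged */

            return err;
    }

gem_execbuf_wr() then asserts this returns 0, which is the igt_assert that fires in the log above.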
Comment 1 CI Bug Log 2019-02-11 11:33:59 UTC
The CI Bug Log issue associated with this bug has been updated.

### New filters associated

* ICL: igt@pm_rps@reset - dmesg-fail - Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0\n[^\n]+error: -5 != 0
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-icl-u3/igt@pm_rps@reset.html
Comment 2 Chris Wilson 2019-02-11 13:12:32 UTC
<7>[  104.332734] [IGT] pm_rps: starting subtest reset
<5>[  104.333284] Setting dangerous option reset - tainting kernel
<6>[  105.594685] i915 0000:00:02.0: GPU HANG: ecode 11:0:0x00000000, Manually set wedged engine mask = ffffffffffffffff
<6>[  105.594790] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[  105.594793] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[  105.594795] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[  105.594797] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[  105.594800] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<5>[  105.594849] i915 0000:00:02.0: Resetting rcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596370] i915 0000:00:02.0: Resetting bcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596494] i915 0000:00:02.0: Resetting vcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596627] i915 0000:00:02.0: Resetting vcs2 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596756] i915 0000:00:02.0: Resetting vecs0 for Manually set wedged engine mask = ffffffffffffffff
<7>[  107.207564] [drm:edp_panel_vdd_off_sync [i915]] Turning eDP port A VDD off
<7>[  107.207783] [drm:edp_panel_vdd_off_sync [i915]] PP_STATUS: 0x80000008 PP_CONTROL: 0x00000067
<7>[  120.391709] hangcheck rcs0
<7>[  120.391740] hangcheck 	current seqno 9eb, last a1d, hangcheck 9eb [14016 ms]
<7>[  120.391745] hangcheck 	Reset count: 1 (global 0)
<7>[  120.391751] hangcheck 	Requests:
<7>[  120.391773] hangcheck 		first  a0c [27:1402] prio=2 @ 14797ms: pm_rps[1244]/0
<7>[  120.391781] hangcheck 		last   a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.391806] hangcheck 	RING_START: 0x0000b000
<7>[  120.391813] hangcheck 	RING_HEAD:  0x000000c8
<7>[  120.391820] hangcheck 	RING_TAIL:  0x00001b10
<7>[  120.391829] hangcheck 	RING_CTL:   0x00003001
<7>[  120.391838] hangcheck 	RING_MODE:  0x00000000
<7>[  120.391844] hangcheck 	RING_IMR: 00000000
<7>[  120.391855] hangcheck 	ACTHD:  0x00000005_443a9d90
<7>[  120.391866] hangcheck 	BBADDR: 0x00000005_443aec41
<7>[  120.391878] hangcheck 	DMA_FADDR: 0x00000005_443b3980
<7>[  120.391884] hangcheck 	IPEIR: 0x00000000
<7>[  120.391891] hangcheck 	IPEHR: 0x18800101
<7>[  120.391900] hangcheck 	Execlist status: 0x00202098 00000040
<7>[  120.391908] hangcheck 	Execlist CSB read 5, write 5 [mmio:5], tasklet queued? no (enabled)
<7>[  120.391918] hangcheck 		ELSP[0] count=1, ring:{start:0000b000, hwsp:fffee280}, rq: a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.391923] hangcheck 		ELSP[1] idle
<7>[  120.391927] hangcheck 		HW active? 0x5
<7>[  120.391983] hangcheck 		E a0c [27:1402] prio=2 @ 14797ms: pm_rps[1244]/0
<7>[  120.392047] hangcheck 		E a0d [27:1404] prio=1 @ 13793ms: pm_rps[1244]/0
<7>[  120.392054] hangcheck 		E a0e [27:1406] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392061] hangcheck 		E a0f [27:1408] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392068] hangcheck 		E a10 [27:140a] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392074] hangcheck 		E a11 [27:140c] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392081] hangcheck 		E a12 [27:140e] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392087] hangcheck 		...skipping 10 executing requests...
<7>[  120.392094] hangcheck 		E a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392098] hangcheck 		Queue priority hint: 1
<7>[  120.392102] hangcheck HWSP:
<7>[  120.392111] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392115] hangcheck *
<7>[  120.392123] hangcheck [0040] 10008002 00000040 10008002 00000040 10008002 00000040 10008002 00000040
<7>[  120.392130] hangcheck [0060] 10008002 00000040 10008002 00000040 00000000 00000000 00000000 00000000
<7>[  120.392137] hangcheck [0080] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392144] hangcheck [00a0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000005
<7>[  120.392151] hangcheck [00c0] 000009eb 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392158] hangcheck [00e0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392162] hangcheck *
<7>[  120.392167] hangcheck Idle? no
<7>[  120.392171] hangcheck Signals:
<7>[  120.392200] hangcheck 	[27:1424] @ 13792ms
<5>[  120.392420] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0

This is peculiar: our writes into the global HWSP simply vanish, and we quite rightly conclude that we are unable to recover. That error seems related to bug #109605.
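
For reference, hangcheck at this point amounts to sampling a seqno from the global HWSP and declaring the engine stuck when that value stops advancing while requests are still outstanding; a simplified sketch with illustrative names (not the actual driver code):

    /* Simplified sketch of the hangcheck sampling idea; names are
     * illustrative, not the actual i915 code. */
    struct hangcheck_sample {
            uint32_t seqno;         /* last value read from the global HWSP */
            unsigned long stall_ms; /* how long it has been unchanged */
    };

    static bool engine_seems_stuck(struct hangcheck_sample *hc,
                                   uint32_t hwsp_seqno, uint32_t last_submitted,
                                   unsigned long interval_ms)
    {
            if (hwsp_seqno != hc->seqno) {          /* forward progress */
                    hc->seqno = hwsp_seqno;
                    hc->stall_ms = 0;
                    return false;
            }
            if (hwsp_seqno == last_submitted)       /* idle, nothing pending */
                    return false;

            hc->stall_ms += interval_ms;
            return hc->stall_ms > 9000;     /* several seconds without progress */
    }

This matches the trace above: the sampled seqno sits at 9eb while the last submitted request is a1d, and after ~14 seconds of no progress rcs0 is reset.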
Comment 3 Chris Wilson 2019-02-20 21:49:45 UTC
Fwiw, this issue is fixed by removing the global_seqno itself, e.g. https://patchwork.freedesktop.org/patch/286898/
Comment 4 Chris Wilson 2019-02-26 10:43:02 UTC
commit 89531e7d8ee8602b2723431a581250d5d0ec2913
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 26 09:49:19 2019 +0000

    drm/i915: Replace global_seqno with a hangcheck heartbeat seqno
    
    To determine whether an engine is 'stuck', we simply check whether or
    not it is still on the same seqno for several seconds. To keep this simple
    mechanism intact over the loss of a global seqno, we can simply add a
    new global heartbeat seqno instead. As we cannot know the sequence in
    which requests will then be completed, we use a primitive random number
    generator instead (with a cycle long enough to not matter over an
    interval of a few thousand requests between hangcheck samples).
    
    The alternative to using a dedicated seqno on every request is to issue
    a heartbeat request and query its progress through the system. Sadly
    this requires us to reduce struct_mutex so that we can issue requests
    without requiring that bkl.
    
    v2: And without the extra CS_STALL for the hangcheck seqno -- we don't
    need strict serialisation with what comes later, we just need to be sure
    we don't write the hangcheck seqno before our batch is flushed.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190226094922.31617-1-chris@chris-wilson.co.uk
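
The "primitive random number generator" mentioned above need only guarantee a long cycle; a hypothetical sketch of the heartbeat idea (illustrative, not the actual kernel code):

    /* Hypothetical sketch of a hangcheck heartbeat seqno: each request
     * additionally writes the next LCG value into a fixed HWSP slot, so
     * hangcheck only needs to see the value change between samples. */
    static uint32_t next_heartbeat_seqno(uint32_t prev)
    {
            /* Standard 32-bit LCG constants (Numerical Recipes); the full
             * 2^32 period is far longer than the few thousand requests
             * between hangcheck samples, so consecutive samples never
             * collide by accident. */
            return prev * 1664525u + 1013904223u;
    }

With this, hangcheck only has to observe that the sampled value changes between polls; the order in which requests complete no longer matters.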
Comment 5 Martin Peres 2019-03-06 16:16:49 UTC
(In reply to Chris Wilson from comment #4)
> commit 89531e7d8ee8602b2723431a581250d5d0ec2913
>     drm/i915: Replace global_seqno with a hangcheck heartbeat seqno
> [...]

Fair enough! Thanks!
Comment 6 CI Bug Log 2019-03-06 16:17:00 UTC
The CI Bug Log issue associated with this bug has been archived.

New failures matching the above filters will no longer be associated with this bug.
