Bug 109606 - [CI][DRMTIP] igt@pm_rps@reset - dmesg-fail - Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0
Summary: [CI][DRMTIP] igt@pm_rps@reset - dmesg-fail - Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other
OS: All
Priority: medium
Severity: normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-11 11:30 UTC by Lakshmi
Modified: 2019-02-20 21:49 UTC
CC: 1 user

See Also:
i915 platform: ICL
i915 features:


Description Lakshmi 2019-02-11 11:30:02 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-icl-u3/igt@pm_rps@reset.html

Starting subtest: reset
(pm_rps:1245) ioctl_wrappers-CRITICAL: Test assertion failure function gem_execbuf_wr, file ../lib/ioctl_wrappers.c:641:
(pm_rps:1245) ioctl_wrappers-CRITICAL: Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0
(pm_rps:1245) ioctl_wrappers-CRITICAL: error: -5 != 0
Subtest reset failed.
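The error code in the failed assertion is worth decoding: the wrapper returned -5, which is -EIO, the errno i915 typically returns from the execbuf ioctl when the GPU is wedged and submission is refused. A minimal check using only the Python standard library:

```python
import errno
import os

# The assertion reports "error: -5 != 0": __gem_execbuf_wr returned -5,
# i.e. -EIO. i915 returns -EIO from execbuf when the GPU is wedged and
# cannot accept new work, which matches a failed reset in this subtest.
print(errno.errorcode[5])   # symbolic name of errno 5
print(os.strerror(5))       # human-readable description
```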
Comment 1 CI Bug Log 2019-02-11 11:33:59 UTC
The CI Bug Log issue associated with this bug has been updated.

### New filters associated

* ICL: igt@pm_rps@reset - dmesg-fail - Failed assertion: __gem_execbuf_wr(fd, execbuf) == 0\n[^\n]+error: -5 != 0
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-icl-u3/igt@pm_rps@reset.html
Comment 2 Chris Wilson 2019-02-11 13:12:32 UTC
<7>[  104.332734] [IGT] pm_rps: starting subtest reset
<5>[  104.333284] Setting dangerous option reset - tainting kernel
<6>[  105.594685] i915 0000:00:02.0: GPU HANG: ecode 11:0:0x00000000, Manually set wedged engine mask = ffffffffffffffff
<6>[  105.594790] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[  105.594793] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[  105.594795] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[  105.594797] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[  105.594800] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<5>[  105.594849] i915 0000:00:02.0: Resetting rcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596370] i915 0000:00:02.0: Resetting bcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596494] i915 0000:00:02.0: Resetting vcs0 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596627] i915 0000:00:02.0: Resetting vcs2 for Manually set wedged engine mask = ffffffffffffffff
<5>[  105.596756] i915 0000:00:02.0: Resetting vecs0 for Manually set wedged engine mask = ffffffffffffffff
<7>[  107.207564] [drm:edp_panel_vdd_off_sync [i915]] Turning eDP port A VDD off
<7>[  107.207783] [drm:edp_panel_vdd_off_sync [i915]] PP_STATUS: 0x80000008 PP_CONTROL: 0x00000067
<7>[  120.391709] hangcheck rcs0
<7>[  120.391740] hangcheck 	current seqno 9eb, last a1d, hangcheck 9eb [14016 ms]
<7>[  120.391745] hangcheck 	Reset count: 1 (global 0)
<7>[  120.391751] hangcheck 	Requests:
<7>[  120.391773] hangcheck 		first  a0c [27:1402] prio=2 @ 14797ms: pm_rps[1244]/0
<7>[  120.391781] hangcheck 		last   a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.391806] hangcheck 	RING_START: 0x0000b000
<7>[  120.391813] hangcheck 	RING_HEAD:  0x000000c8
<7>[  120.391820] hangcheck 	RING_TAIL:  0x00001b10
<7>[  120.391829] hangcheck 	RING_CTL:   0x00003001
<7>[  120.391838] hangcheck 	RING_MODE:  0x00000000
<7>[  120.391844] hangcheck 	RING_IMR: 00000000
<7>[  120.391855] hangcheck 	ACTHD:  0x00000005_443a9d90
<7>[  120.391866] hangcheck 	BBADDR: 0x00000005_443aec41
<7>[  120.391878] hangcheck 	DMA_FADDR: 0x00000005_443b3980
<7>[  120.391884] hangcheck 	IPEIR: 0x00000000
<7>[  120.391891] hangcheck 	IPEHR: 0x18800101
<7>[  120.391900] hangcheck 	Execlist status: 0x00202098 00000040
<7>[  120.391908] hangcheck 	Execlist CSB read 5, write 5 [mmio:5], tasklet queued? no (enabled)
<7>[  120.391918] hangcheck 		ELSP[0] count=1, ring:{start:0000b000, hwsp:fffee280}, rq: a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.391923] hangcheck 		ELSP[1] idle
<7>[  120.391927] hangcheck 		HW active? 0x5
<7>[  120.391983] hangcheck 		E a0c [27:1402] prio=2 @ 14797ms: pm_rps[1244]/0
<7>[  120.392047] hangcheck 		E a0d [27:1404] prio=1 @ 13793ms: pm_rps[1244]/0
<7>[  120.392054] hangcheck 		E a0e [27:1406] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392061] hangcheck 		E a0f [27:1408] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392068] hangcheck 		E a10 [27:140a] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392074] hangcheck 		E a11 [27:140c] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392081] hangcheck 		E a12 [27:140e] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392087] hangcheck 		...skipping 10 executing requests...
<7>[  120.392094] hangcheck 		E a1d+ [27:1424] prio=1 @ 13792ms: pm_rps[1244]/0
<7>[  120.392098] hangcheck 		Queue priority hint: 1
<7>[  120.392102] hangcheck HWSP:
<7>[  120.392111] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392115] hangcheck *
<7>[  120.392123] hangcheck [0040] 10008002 00000040 10008002 00000040 10008002 00000040 10008002 00000040
<7>[  120.392130] hangcheck [0060] 10008002 00000040 10008002 00000040 00000000 00000000 00000000 00000000
<7>[  120.392137] hangcheck [0080] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392144] hangcheck [00a0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000005
<7>[  120.392151] hangcheck [00c0] 000009eb 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392158] hangcheck [00e0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  120.392162] hangcheck *
<7>[  120.392167] hangcheck Idle? no
<7>[  120.392171] hangcheck Signals:
<7>[  120.392200] hangcheck 	[27:1424] @ 13792ms
<5>[  120.392420] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0

is peculiar, as our writes into the global HWSP simply vanish, and we quite rightly conclude that we are unable to recover. That error seems related to bug #109605.
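The dump above shows the evidence for that reading: the HWSP still reports 0x9eb as the last completed seqno (the `000009eb` dword in the HWSP hexdump), while requests up to 0xa1d have been submitted. A rough sketch of the comparison hangcheck is making, using the values from the log (a hypothetical reconstruction for illustration, not the actual i915 hangcheck code):

```python
# Values taken from the hangcheck dump above.
hwsp_seqno = 0x9eb   # "current seqno 9eb": what the HWSP says has completed
last_seqno = 0xa1d   # "last a1d": the most recently submitted request

def engine_stalled(current, last):
    # Hypothetical helper: if the HWSP seqno never catches up with the
    # last submitted request across hangcheck samples, the engine is
    # declared stuck and a per-engine reset is attempted.
    return current != last

print(engine_stalled(hwsp_seqno, last_seqno))  # True: the seqno never advances
```

Since the seqno writes never land in the HWSP, the engine appears permanently stalled, hence the "Resetting rcs0 for no progress on rcs0" line that follows.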
Comment 3 Chris Wilson 2019-02-20 21:49:45 UTC
Fwiw, this issue is fixed by removing the global_seqno itself, e.g. https://patchwork.freedesktop.org/patch/286898/
