Bug 106684

Summary:	[CI] igt@.* - dmesg-warn/dmesg-fail - ERROR rcs0: reset request timeout
Product:	DRI	Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs, petri.latvala, tomi.p.sarvela
Version:	XOrg git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	KBL	i915 features:	GEM/Other

Description Martin Peres 2018-05-28 11:51:38 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_38/fi-kbl-r/igt@kms_vblank@pipe-c-ts-continuation-suspend.html

[  164.306394] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  164.306417] ------------[ cut here ]------------
[  164.306423] WARN_ON(intel_gpu_reset(i915, (~0)))
[  164.306536] WARNING: CPU: 4 PID: 1370 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915]
[  164.306540] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core btusb btrtl asix btbcm usbnet btintel mii x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul bluetooth crc32_pclmul ghash_clmulni_intel snd_pcm e1000e ecdh_generic mei_me mei prime_numbers pinctrl_sunrisepoint pinctrl_intel
[  164.306653] CPU: 4 PID: 1370 Comm: kworker/u16:31 Tainted: G     U  W         4.17.0-rc4-gfe5bde58dca5-drmtip_38+ #1
[  164.306657] Hardware name: Intel Corporation Kabylake Client platform/Kabylake R DDR4 RVP, BIOS KBLSE2R1.R00.X078.P02.1703030515 03/03/2017
[  164.306665] Workqueue: events_unbound async_run_entry_fn
[  164.306734] RIP: 0010:i915_gem_sanitize+0x4d/0x80 [i915]
[  164.306738] RSP: 0018:ffffb658009d7cc8 EFLAGS: 00010286
[  164.306746] RAX: 0000000000000000 RBX: ffff99fb26300000 RCX: 0000000000000001
[  164.306750] RDX: 0000000080000001 RSI: ffffffff990fb2c9 RDI: 00000000ffffffff
[  164.306755] RBP: ffff99fb26300068 R08: 000000007b7b9455 R09: 0000000000000000
[  164.306759] R10: 0000000000000000 R11: 0000000000000000 R12: ffff99fb263089b0
[  164.306763] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  164.306768] FS:  0000000000000000(0000) GS:ffff99fb3ed00000(0000) knlGS:0000000000000000
[  164.306772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  164.306777] CR2: 000055eeee61bd80 CR3: 0000000218210005 CR4: 00000000003606e0
[  164.306781] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  164.306785] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  164.306788] Call Trace:
[  164.306857]  i915_gem_suspend+0xec/0x140 [i915]
[  164.306913]  i915_drm_suspend+0x5f/0x160 [i915]
[  164.306925]  pci_pm_suspend+0x7c/0x130
[  164.306934]  ? pci_pm_freeze+0xc0/0xc0
[  164.306941]  dpm_run_callback+0x5d/0x2f0
[  164.306953]  __device_suspend+0x11f/0x600
[  164.306965]  ? dpm_watchdog_set+0x60/0x60
[  164.306982]  async_suspend+0x15/0x90
[  164.306990]  async_run_entry_fn+0x34/0x160
[  164.306999]  process_one_work+0x229/0x6a0
[  164.307017]  worker_thread+0x35/0x380
[  164.307028]  ? process_one_work+0x6a0/0x6a0
[  164.307033]  kthread+0x119/0x130
[  164.307040]  ? _kthread_create_on_node+0x60/0x60
[  164.307052]  ret_from_fork+0x3a/0x50
[  164.307075] Code: e0 03 00 84 c0 74 f1 be ff ff ff ff 48 89 df e8 5a de 03 00 85 c0 74 e0 48 c7 c6 c0 13 84 c0 48 c7 c7 1d 90 82 c0 e8 93 c8 96 d7 <0f> 0b eb c9 48 8d 6f 68 31 f6 48 89 ef e8 81 88 22 d8 48 89 df 
[  164.307290] irq event stamp: 466
[  164.307297] hardirqs last  enabled at (465): [<ffffffff980fc757>] vprintk_emit+0x4b7/0x4d0
[  164.307304] hardirqs last disabled at (466): [<ffffffff98a0111c>] error_entry+0x7c/0x100
[  164.307310] softirqs last  enabled at (438): [<ffffffff98c0032b>] __do_softirq+0x32b/0x4e1
[  164.307317] softirqs last disabled at (419): [<ffffffff9808f6f4>] irq_exit+0xa4/0xb0
[  164.307382] WARNING: CPU: 4 PID: 1370 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915]
[  164.307387] ---[ end trace ea2cd5ae9e8d1a54 ]---
[  179.976608] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  179.976613] ------------[ cut here ]------------
[  179.976615] WARN_ON(intel_gpu_reset(i915, (~0)))
[  179.976652] WARNING: CPU: 5 PID: 1400 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915]
[  179.976654] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core btusb btrtl asix btbcm usbnet btintel mii x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul bluetooth crc32_pclmul ghash_clmulni_intel snd_pcm e1000e ecdh_generic mei_me mei prime_numbers pinctrl_sunrisepoint pinctrl_intel
[  179.976692] CPU: 5 PID: 1400 Comm: kworker/u16:61 Tainted: G     U  W         4.17.0-rc4-gfe5bde58dca5-drmtip_38+ #1
[  179.976693] Hardware name: Intel Corporation Kabylake Client platform/Kabylake R DDR4 RVP, BIOS KBLSE2R1.R00.X078.P02.1703030515 03/03/2017
[  179.976697] Workqueue: events_unbound async_run_entry_fn
[  179.976720] RIP: 0010:i915_gem_sanitize+0x4d/0x80 [i915]
[  179.976722] RSP: 0018:ffffb65800b67d90 EFLAGS: 00010282
[  179.976724] RAX: 0000000000000000 RBX: ffff99fb26300000 RCX: 0000000000000001
[  179.976726] RDX: 0000000080000001 RSI: ffffffff990fb2c9 RDI: 00000000ffffffff
[  179.976727] RBP: 0000000000000000 R08: 000000001362165f R09: 0000000000000000
[  179.976728] R10: 0000000000000000 R11: 0000000000000000 R12: ffff99fb34ce33f8
[  179.976730] R13: ffffffffc06b8e90 R14: 0000000000000000 R15: 0000000000000000
[  179.976731] FS:  0000000000000000(0000) GS:ffff99fb3ed40000(0000) knlGS:0000000000000000
[  179.976733] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  179.976734] CR2: 00007fddd6e93320 CR3: 0000000218210006 CR4: 00000000003606e0
[  179.976735] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  179.976737] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  179.976738] Call Trace:
[  179.976757]  i915_pm_resume_early+0xba/0x150 [i915]
[  179.976762]  dpm_run_callback+0x5d/0x2f0
[  179.976766]  device_resume_early+0xa6/0xe0
[  179.976769]  async_resume_early+0x14/0x40
[  179.976772]  async_run_entry_fn+0x34/0x160
[  179.976775]  process_one_work+0x229/0x6a0
[  179.976780]  worker_thread+0x35/0x380
[  179.976784]  ? process_one_work+0x6a0/0x6a0
[  179.976786]  kthread+0x119/0x130
[  179.976788]  ? _kthread_create_on_node+0x60/0x60
[  179.976793]  ret_from_fork+0x3a/0x50
[  179.976799] Code: e0 03 00 84 c0 74 f1 be ff ff ff ff 48 89 df e8 5a de 03 00 85 c0 74 e0 48 c7 c6 c0 13 84 c0 48 c7 c7 1d 90 82 c0 e8 93 c8 96 d7 <0f> 0b eb c9 48 8d 6f 68 31 f6 48 89 ef e8 81 88 22 d8 48 89 df 
[  179.976865] irq event stamp: 2034
[  179.976868] hardirqs last  enabled at (2033): [<ffffffff980fc757>] vprintk_emit+0x4b7/0x4d0
[  179.976871] hardirqs last disabled at (2034): [<ffffffff98a0111c>] error_entry+0x7c/0x100
[  179.976873] softirqs last  enabled at (1990): [<ffffffff98c0032b>] __do_softirq+0x32b/0x4e1
[  179.976876] softirqs last disabled at (1981): [<ffffffff9808f6f4>] irq_exit+0xa4/0xb0
[  179.976898] WARNING: CPU: 5 PID: 1400 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915]
[  179.976899] ---[ end trace ea2cd5ae9e8d1a55 ]---
[  180.308829] atkbd serio0: Failed to deactivate keyboard on isa0060/serio0
[  180.692902] atkbd serio0: Failed to enable keyboard on isa0060/serio0
[  184.174560] i915 0000:00:02.0: Resetting bcs0 for hang on bcs0, vcs0, vecs0
[  184.175977] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout
[  184.176204] i915 0000:00:02.0: Resetting vcs0 for hang on bcs0, vcs0, vecs0
[  184.177631] [drm:gen8_reset_engines [i915]] *ERROR* vcs0: reset request timeout
[  184.177877] i915 0000:00:02.0: Resetting vecs0 for hang on bcs0, vcs0, vecs0
[  184.179202] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
[  184.179643] i915 0000:00:02.0: Resetting chip for hang on bcs0, vcs0, vecs0
[  184.182779] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  184.287322] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  184.391364] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  184.491896] i915 0000:00:02.0: Failed to reset chip
[  184.498061] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 1 Chris Wilson 2018-05-29 19:36:01 UTC

The GPU totally died, and I mean totally! It's kind of the reason why we have:

taint:
        /*
         * History tells us that if we cannot reset the GPU now, we
         * never will. This then impacts everything that is run
         * subsequently. On failing the reset, we mark the driver
         * as wedged, preventing further execution on the GPU.
         * We also want to go one step further and add a taint to the
         * kernel so that any subsequent faults can be traced back to
         * this failure. This is important for CI, where if the
         * GPU/driver fails we would like to reboot and restart testing
         * rather than continue on into oblivion. For everyone else,
         * the system should still plod along, but they have been warned!
         */
        add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

for this condition with the expectation that CI was going to reboot the machine when it occurs.

Comment 2 Martin Peres 2018-05-29 20:05:19 UTC

(In reply to Chris Wilson from comment #1)
> The GPU totally died, and I mean totally! It's kind of the reason why we
> have:
> 
> taint:
>         /*
>          * History tells us that if we cannot reset the GPU now, we
>          * never will. This then impacts everything that is run
>          * subsequently. On failing the reset, we mark the driver
>          * as wedged, preventing further execution on the GPU.
>          * We also want to go one step further and add a taint to the
>          * kernel so that any subsequent faults can be traced back to
>          * this failure. This is important for CI, where if the
>          * GPU/driver fails we would like to reboot and restart testing
>          * rather than continue on into oblivion. For everyone else,
>          * the system should still plod along, but they have been warned!
>          */
>         add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> 
> for this condition with the expectation that CI was going to reboot the
> machine when it occurs.

OK! Adding Tomi and Petri to see what we can do here :)

Comment 3 Chris Wilson 2018-05-30 07:56:06 UTC

If doing a reboot cycle is still too slow, a full suspend/resume should be enough. And we can also check whether the device recovered immediately, going to a reboot afterwards as required.

Comment 4 Martin Peres 2018-07-12 12:00:47 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4426/shard-kbl7/igt@gem_eio@wait-wedge-immediate.html

[  331.691790] Setting dangerous option reset - tainting kernel
[  331.692131] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[  331.693980] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  331.801322] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  331.905263] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  332.007147] i915 0000:00:02.0: Failed to reset chip
[  332.011489] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 5 Chris Wilson 2018-08-13 15:51:11 UTC

The behaviour should have substantially changed with

commit f4e60c5cfbf217cc9faa3aeb63742860154fcfef (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Mon Aug 13 16:01:16 2018 +0300

    drm/i915: Force reset on unready engine
    
    If engine reports that it is not ready for reset, we
    give up. Evidence shows that forcing a per engine reset
    on an engine which is not reporting to be ready for reset,
    can bring it back into a working order. There is risk that
    we corrupt the context image currently executing on that
    engine. But that is a risk worth taking as if we unblock
    the engine, we prevent a whole device wedging in a case
    of full gpu reset.
    
    Reset individual engine even if it reports that it is not
    prepared for reset, but only if we aim for full gpu reset
    and not on first reset attempt.
    
    v2: force reset only on later attempts, readability (Chris)
    v3: simplify with adequate caffeine levels (Chris)
    v4: comment about risks and migitations (Chris)
    
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180813130116.7250-1-mika.kuoppala@linux.intel.com

Comment 6 Martin Peres 2018-09-28 13:41:56 UTC

(In reply to Chris Wilson from comment #5)
> The behaviour should have substantially changed with
> 
> commit f4e60c5cfbf217cc9faa3aeb63742860154fcfef (HEAD ->
> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Mon Aug 13 16:01:16 2018 +0300
> 
>     drm/i915: Force reset on unready engine
>     
>     If engine reports that it is not ready for reset, we
>     give up. Evidence shows that forcing a per engine reset
>     on an engine which is not reporting to be ready for reset,
>     can bring it back into a working order. There is risk that
>     we corrupt the context image currently executing on that
>     engine. But that is a risk worth taking as if we unblock
>     the engine, we prevent a whole device wedging in a case
>     of full gpu reset.
>     
>     Reset individual engine even if it reports that it is not
>     prepared for reset, but only if we aim for full gpu reset
>     and not on first reset attempt.
>     
>     v2: force reset only on later attempts, readability (Chris)
>     v3: simplify with adequate caffeine levels (Chris)
>     v4: comment about risks and migitations (Chris)
>     
>     Cc: Chris Wilson <chris@chris-wilson.co.uk>
>     Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/20180813130116.7250-1-mika.
> kuoppala@linux.intel.com

Seems like it fixed it. Thanks!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.