92774 – [BSW] GPU reset fails after GPU HANG: *ERROR* Failed to reset chip: -5

Summary: [BSW] GPU reset fails after GPU HANG: *ERROR* Failed to reset chip: -5

Status:	CLOSED WONTFIX

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-11-02 08:33 UTC by Tomi Sarvela
Modified:	2016-10-14 13:29 UTC
CC List:	1 user

See Also:
i915 platform:	BDW, BSW/CHT, SKL
i915 features:	GPU hang

Attachments
dmesg - [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5 (492.36 KB, text/plain) 2015-11-02 08:33 UTC, Tomi Sarvela	no flags
SKL dmesg with same problem (1000.48 KB, text/plain) 2015-11-02 10:49 UTC, Tomi Sarvela	no flags
drm/i915: Request for resets under forcewake (1.92 KB, patch) 2015-11-03 13:02 UTC, Mika Kuoppala	no flags

Description Tomi Sarvela 2015-11-02 08:33:27 UTC

Created attachment 119335 [details]
dmesg - [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

We're tracking drm-intel-nightly from freedesktop.org and running IGT (intel-gpu-tools/piglit) against each merge. A large number of debug options are enabled in this kernel.

On one machine the tests can trigger a GPU hang, after which the GPU reset fails and the rest of the tests then report the same error code. The interesting part of the log is below; the full dmesg is attached.

Hardware is Intel NUC5CPYH (Braswell Celeron N3050)

Kuoppala, Mika <mika.kuoppala@intel.com> knows about this issue.





[  206.105691] kms_pipe_crc_basic: executing
[  206.368453] kms_pipe_crc_basic: starting subtest hang-read-crc-pipe-A
[  211.785045] [drm] stuck on render ring
[  211.798824] [drm] GPU HANG: ecode 8:0:0xfffffffe, in kms_pipe_crc_ba [5601], reason: Ring hung, action: reset
[  211.799199] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  211.799209] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  211.799215] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  211.799221] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  211.799227] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  211.799639] kobject: 'card0' (ffff88007906a530): kobject_uevent_env
[  211.799791] kobject: 'card0' (ffff88007906a530): fill_kobj_path: path = '/devices/pci0000:00/0000:00:02.0/drm/card0'
[  211.802857] kobject: 'card0' (ffff88007906a530): kobject_uevent_env
[  211.803104] kobject: 'card0' (ffff88007906a530): fill_kobj_path: path = '/devices/pci0000:00/0000:00:02.0/drm/card0'
[  212.509176] [drm:gen8_do_reset [i915]] *ERROR* render ring: reset request timeout
[  212.509244] [drm] Simulated gpu hang, resetting stop_rings
[  212.509248] drm/i915: Resetting chip after gpu hang
[  212.509275] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
[  212.641248] kms_pipe_crc_basic: exiting, ret=0
[  212.656806] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5
[  212.853766] gem_ctx_param_basic: executing
[  212.857279] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5
[  212.861674] gem_ctx_param_basic: exiting, ret=99
[  213.050754] kms_addfb_basic: executing
[  213.053785] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5
[  213.061222] kms_addfb_basic: exiting, ret=99
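
For triaging long nightly logs, the pattern above (hang, per-ring reset timeout, failed chip reset) can be picked out mechanically. The following is a minimal sketch, not part of IGT or the kernel: `scan_dmesg` and the regular expressions are hypothetical helpers written against the message formats shown in this excerpt only. The kernel reports errors as negative errno values, so -5 decodes to EIO.

```python
import errno
import re

# Patterns written against the message formats in this report's dmesg excerpt.
HANG_RE = re.compile(
    r"\[drm\] GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)
TIMEOUT_RE = re.compile(r"\*ERROR\* (?P<ring>\w+ ring): reset request timeout")
RESET_FAIL_RE = re.compile(r"\*ERROR\* Failed to reset chip: (?P<code>-?\d+)")


def scan_dmesg(lines):
    """Collect GPU hangs, ring reset timeouts and failed chip resets."""
    report = {"hangs": [], "timed_out_rings": [], "reset_failures": []}
    for line in lines:
        m = HANG_RE.search(line)
        if m:
            report["hangs"].append((m.group("proc"), int(m.group("pid"))))
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            report["timed_out_rings"].append(m.group("ring"))
            continue
        m = RESET_FAIL_RE.search(line)
        if m:
            code = int(m.group("code"))
            # Kernel error codes are negative errno values: -5 -> EIO.
            name = errno.errorcode.get(-code, "unknown")
            report["reset_failures"].append((code, name))
    return report


# Three lines copied from the excerpt above.
SAMPLE_LOG = [
    "[  211.798824] [drm] GPU HANG: ecode 8:0:0xfffffffe, in kms_pipe_crc_ba [5601], reason: Ring hung, action: reset",
    "[  212.509176] [drm:gen8_do_reset [i915]] *ERROR* render ring: reset request timeout",
    "[  212.509275] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5",
]

if __name__ == "__main__":
    print(scan_dmesg(SAMPLE_LOG))
```

In practice the function would be fed the attached dmesg file line by line, e.g. `scan_dmesg(open(path))`, where `path` points at a saved log.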

Comment 1 Tomi Sarvela 2015-11-02 08:37:06 UTC

This has happened twice, about one week apart. The latest commit on which it happened:
86ba603f327626055fe1436112b3786eaaaf7fb1 2015-10-31_08-27-21 drm-intel-nightly: 2015y-10m-31d-08h-26m-39s UTC integration manifest

Comment 2 Jani Nikula 2015-11-02 09:12:48 UTC

http://patchwork.freedesktop.org/patch/msgid/1446216229-26474-1-git-send-email-mika.kuoppala@intel.com

Comment 3 Tomi Sarvela 2015-11-02 10:49:18 UTC

Created attachment 119340 [details]
SKL dmesg with same problem

Comment 4 Mika Kuoppala 2015-11-03 13:02:03 UTC

Created attachment 119376 [details] [review]
drm/i915: Request for resets under forcewake

Comment 5 Tomi Sarvela 2015-11-03 14:43:34 UTC

Comment on attachment 119376 [details] [review]
drm/i915: Request for resets under forcewake

Review of attachment 119376 [details] [review]:
-----------------------------------------------------------------

Tested on the BSW NUC hardware where the problem was easily reproduced.
With this patch applied, the test runs no longer triggered a GPU reset failure.

Tested-by: Tomi Sarvela <tomix.p.sarvela@intel.com>

Comment 6 Jani Nikula 2015-11-05 13:30:31 UTC

Fixed by

commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu Nov 5 13:11:38 2015 +0200

    drm/i915: Do graphics device reset under forcewake

in drm-intel-next-fixes.
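
As the commit title says, the fix performs the device reset while forcewake is held. The ordering can be sketched with a toy model. This is an illustration only: `ToyGpu`, `forcewake` and `reset_under_forcewake` are hypothetical names, not i915 functions, and the behaviour that a reset request issued without a wake reference fails with EIO is an assumption made for the model, not a statement about the hardware.

```python
import errno
from contextlib import contextmanager


class ToyGpu:
    """Hypothetical stand-in for the device; not the i915 driver."""

    def __init__(self):
        self.wake_refs = 0  # outstanding forcewake references

    @property
    def awake(self):
        return self.wake_refs > 0

    def request_reset(self):
        # Assumption for illustration: without a wake reference the
        # reset request is never acknowledged, modelled here as EIO.
        if not self.awake:
            raise OSError(errno.EIO, "reset request timeout")
        return 0


@contextmanager
def forcewake(gpu):
    """Hold a wake reference for the duration of the with-block."""
    gpu.wake_refs += 1
    try:
        yield gpu
    finally:
        gpu.wake_refs -= 1  # released even if the reset raises


def reset_under_forcewake(gpu):
    """Issue the reset only while the wake reference is held."""
    with forcewake(gpu):
        return gpu.request_reset()
```

The point of the context manager is that the reference is taken before the reset request and dropped only after it returns, whatever the outcome.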

Comment 7 Chris Wilson 2015-12-31 10:37:52 UTC

I've just seen

[  163.979728] drm/i915: Resetting chip after gpu hang
[  164.695335] [drm:gen8_do_reset] *ERROR* blitter ring: reset request timeout
[  164.695342] [drm:i915_reset] *ERROR* Failed to reset chip: -5

on bdw with

commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu Nov 5 13:11:38 2015 +0200

    drm/i915: Do graphics device reset under forcewake

applied. I do not expect it to be easily reproducible.

Comment 8 yann 2016-09-20 15:50:16 UTC

(In reply to Chris Wilson from comment #7)
> I've just seen
> 
> [  163.979728] drm/i915: Resetting chip after gpu hang
> [  164.695335] [drm:gen8_do_reset] *ERROR* blitter ring: reset request timeout
> [  164.695342] [drm:i915_reset] *ERROR* Failed to reset chip: -5
> 
> on bdw with
> 
> commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Thu Nov 5 13:11:38 2015 +0200
> 
>     drm/i915: Do graphics device reset under forcewake
> 
> applied. I do not expect it to be easily reproducible.

Any update on this? Given all the improvements made recently in the kernel, I propose closing this bug and filing a new one if the problem occurs again.

Comment 9 Chris Wilson 2016-10-14 08:50:51 UTC

It is not impossible for us to kill the GPU in such a way that recovery fails; that seems to be out of our control.
