97968 – [i915] GPU HANG: ecode 9:0:0xfffffffe (Team Fortress 2)

Status:	CLOSED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i915
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Ian Romanick
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-09-29 02:54 UTC by snrub
Modified:	2016-12-08 11:10 UTC
CC List:	1 user

See Also:
i915 platform:	SKL
i915 features:	GPU hang

Attachments
Output of /sys/class/drm/card0/error (63.74 KB, application/x-compressed-tar) 2016-09-29 02:54 UTC, snrub
Second crash dump (48.72 KB, application/x-bzip) 2016-09-29 15:24 UTC, snrub

Description snrub 2016-09-29 02:54:35 UTC

Created attachment 126840
Output of /sys/class/drm/card0/error

To reproduce:

1.  Start up Steam
2.  Play Team Fortress 2
3.  Start a practice run with some bots
4.  Game stutters and eventually crashes (within the first 2 seconds).

$ uname -a 
Linux desktop 4.7.5-040705-generic #201609240533 SMP Sat Sep 24 09:35:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial

Machine: NUC Skull Canyon, Model nuc6i7kyk

Connector: DisplayPort to DVI adapter.

$ dmesg 
....
[  195.680575] [drm] stuck on render ring
[  195.688034] [drm] GPU HANG: ecode 9:0:0xfffffffe, in MatQueue0 [4918], reason: Engine(s) hung, action: reset
[  195.688036] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  195.688037] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  195.688038] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  195.688039] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  195.688040] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  195.689701] drm/i915: Resetting chip after gpu hang
[  197.680299] [drm] RC6 on
[  205.632284] [drm] stuck on render ring
[  205.638688] [drm] GPU HANG: ecode 9:0:0x85dffffb, in MatQueue0 [4918], reason: Engine(s) hung, action: reset
[  205.640245] drm/i915: Resetting chip after gpu hang
[  206.652422] [drm] RC6 on


The output of /sys/class/drm/card0/error and the full dmesg output are both in the attachment, since I can't upload more than one file.
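For anyone triaging similar logs, the hang lines in the dmesg excerpt above can be pulled out with a short script. This is only a sketch: the regular expression is modelled on the two "GPU HANG" lines shown in this report, and the function name `parse_gpu_hangs` and the dictionary keys are made up for illustration. Other kernel versions may format the message differently.

```python
import re

# Pattern modelled on the "[drm] GPU HANG: ..." lines quoted in this report.
# Assumption: other kernels may word this message differently.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<ecode>[^,\s]+), "
    r"in (?P<proc>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in the given dmesg text."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match is None:
            continue  # not a hang line (e.g. "RC6 on", "Resetting chip")
        hangs.append({
            "time": float(match.group("time")),   # seconds since boot
            "ecode": match.group("ecode"),        # kept as the raw string
            "process": match.group("proc"),
            "pid": int(match.group("pid")),
            "reason": match.group("reason"),
            "action": match.group("action"),
        })
    return hangs
```

Feeding it the saved dmesg text (for example the contents of a file captured with `dmesg > dmesg.txt`) gives a list that makes it easy to see that both hangs in this report came from the same process with different ecode values.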

Comment 1 yann 2016-09-29 12:31:52 UTC

If you add i915.enable_rc6=0 to your kernel command line, does the GPU hang still happen?

Comment 2 yann 2016-09-29 12:33:35 UTC

(In reply to yann from comment #1)
> If you add i915.enable_rc6=0 to your kernel command line, does the GPU
> hang still happen?

If disabling RC6 makes the hang go away, this should be fixed by:

commit d528a6a0f3fd346bd7cc2de611a4149b6ebaab41
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Tue Apr 5 15:56:16 2016 +0300

drm/i915/skl: Fix rc6 based gpu/system hang

Comment 3 snrub 2016-09-29 15:19:59 UTC

Yes, the hang still happens with that option added to my command line.

$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.7.5-040705-generic root=/dev/mapper/ubuntu--vg-root ro quiet splash i915.enable_rc6=0 vt.handoff=7
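A quick way to confirm that the option actually reached the running kernel is to look it up in /proc/cmdline. The helper below is a minimal sketch; `kernel_param` is a hypothetical name, and it assumes simple space-separated tokens without quoting, with the last occurrence of a key taking precedence.

```python
def kernel_param(cmdline, name):
    """Look up a parameter in a kernel command-line string.

    Returns the value as a string, "" for a bare flag such as "quiet",
    or None if the parameter is absent. Assumes whitespace-separated
    tokens with no quoting; if a key repeats, the last one is returned.
    """
    result = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            result = value if sep else ""
    return result


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

On the machine in this report, `kernel_param(read_cmdline(), "i915.enable_rc6")` would be expected to return `"0"`, which matches the "RC6 off" messages in the later dmesg output.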

Comment 4 snrub 2016-09-29 15:23:43 UTC

Here is the dmesg / debug output (I am also uploading the new crash dump):

$ dmesg
....
[  180.764441] [drm] GPU HANG: ecode 9:0:0x85dffffb, in MatQueue0 [4771], reason: Engine(s) hung, action: reset
[  180.764443] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  180.764444] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  180.764445] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  180.764445] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  180.764446] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  180.765925] drm/i915: Resetting chip after gpu hang
[  181.793871] [drm] RC6 off
[  190.793791] [drm] stuck on render ring
[  190.799477] [drm] GPU HANG: ecode 9:0:0xfefffffe, in MatQueue0 [4771], reason: Engine(s) hung, action: reset
[  190.801365] drm/i915: Resetting chip after gpu hang
[  192.793828] [drm] RC6 off

Comment 5 snrub 2016-09-29 15:24:10 UTC

Created attachment 126873
Second crash dump

Comment 6 yann 2016-09-29 15:36:37 UTC

(In reply to snrub from comment #5)
> Created attachment 126873
> Second crash dump

Thanks for your quick feedback. It looks like we have a different issue from the original one. For the first one, either disable RC6 or update to a kernel that includes the commit from comment #2.

You may also try updating your Mesa version if you have not already done so, then capture a trace with apitrace (http://apitrace.github.io/) and attach it.

Regarding the last one, I am reassigning to Mesa (please let me know if I am mistaken about this GPU hang).


Kernel: 4.7.5-040705-generic
Platform: Skylake NUC Skull Canyon, Model nuc6i7kyk (pci id: 0x193b)
Mesa: [Please confirm your mesa version]

From this error dump, the hang is happening in the render ring batch, with the active head at 0xd4bb4594 and 0x7a000004 (PIPE_CONTROL) as IPEHR.

Batch extract (around 0xd4bb4594):

0xd4bb4548:      0x7b000005: 3DPRIMITIVE: fail sequential
0xd4bb454c:      0x00000104:    vertex count
0xd4bb4550:      0x0000000c:    start vertex
0xd4bb4554:      0x00001d3a:    instance count
0xd4bb4558:      0x00000001:    start instance
0xd4bb455c:      0x00000000:    index bias
0xd4bb4560:      0x00000000: MI_NOOP
Bad count in PIPE_CONTROL
0xd4bb4564:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0xd4bb4568:      0x0000a000:    destination address
0xd4bb456c:      0xddc6a008:    immediate dword low
0xd4bb4570:      0x00000000:    immediate dword high
Bad count in PIPE_CONTROL
0xd4bb457c:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0xd4bb4580:      0x00101001:    destination address
0xd4bb4584:      0x00000000:    immediate dword low
0xd4bb4588:      0x00000000:    immediate dword high
Bad count in PIPE_CONTROL
0xd4bb4594:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0xd4bb4598:      0x00000408:    destination address
0xd4bb459c:      0x00000000:    immediate dword low
0xd4bb45a0:      0x00000000:    immediate dword high
0xd4bb45ac:      0x78210000: 3D UNKNOWN: 3d_965 opcode = 0x7821
0xd4bb45b0:      0x00006680: MI_NOOP
0xd4bb45b4:      0x78240000: 3D UNKNOWN: 3d_965 opcode = 0x7824
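As a reading aid for the extract above, the command header dwords can be split into fields by hand. The sketch below assumes the usual Intel 3D command header layout (command type in bits 31:29, subtype in 28:27, opcode in 26:24, sub-opcode in 23:16, and a length field in bits 7:0 that holds the total dword count minus 2). That layout is not stated in this report, and `decode_3d_header` is a hypothetical helper, not part of any decoder tool.

```python
def decode_3d_header(dword):
    """Split a 3D pipeline command header dword into its fields.

    Assumed layout (not taken from this report):
      bits 31:29 command type, 28:27 subtype, 26:24 opcode,
      bits 23:16 sub-opcode, bits 7:0 length = total dwords - 2.
    """
    return {
        "command_type": (dword >> 29) & 0x7,
        "subtype": (dword >> 27) & 0x3,
        "opcode": (dword >> 24) & 0x7,
        "sub_opcode": (dword >> 16) & 0xFF,
        "total_dwords": (dword & 0xFF) + 2,
    }


# Header at the active head in the extract above.
pipe_control = decode_3d_header(0x7A000004)

# Distance between two consecutive PIPE_CONTROL headers in the extract,
# expressed in 4-byte dwords.
spacing_dwords = (0xD4BB457C - 0xD4BB4564) // 4
```

Under that assumption the header 0x7a000004 describes a 6-dword packet, which agrees with the 0x18-byte spacing between the PIPE_CONTROL headers in the dump. The "Bad count in PIPE_CONTROL" lines may therefore just mean that the decoder expected a shorter packet than this hardware generation uses, rather than indicating a malformed batch; that reading is a guess and does not by itself explain the hang.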

Comment 7 snrub 2016-09-30 01:16:40 UTC

Thanks for the pointer, yann. I updated Mesa from 11.2.2 to 12.1.0-devel (using ppa:oibaf/graphics-drivers).


Now it works fine!

Comment 8 randy 2016-12-08 11:10:51 UTC

May I ask how Mesa can cause a GPU hang?
