106443 – [skl] GPU HANG: ecode 9:0:0x859ffffb, in Xorg

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high major
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:	Triaged ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-05-08 18:24 UTC by Simeon Miteff
Modified:	2019-09-25 19:11 UTC
CC List:	2 users

See Also:
i915 platform:	SKL
i915 features:	GPU hang

Attachments
GPU crash dump (29.07 KB, text/plain) 2018-05-08 18:24 UTC, Simeon Miteff
GPU crash dump with 4.4.0 kernel (472.12 KB, text/plain) 2018-05-09 05:17 UTC, Simeon Miteff
dmesg with drm.debug=0x1e log_buf_len=4M (353.35 KB, text/plain) 2018-06-01 19:32 UTC, Simeon Miteff
Latest GPU crashdump (50.40 KB, text/plain) 2018-06-01 19:32 UTC, Simeon Miteff

Description Simeon Miteff 2018-05-08 18:24:13 UTC

Created attachment 139429
GPU crash dump

I get GPU hangs on Skylake integrated graphics unless I boot with either:

i915.modeset=0 
or
video=vesafb:off

The hardware is a Dell 7040 with Intel i7-6700 CPU.

Xorg does not crash, but everything except the cursor locks up each time the GPU resets. Here is the drm message:

[   41.823218] [drm] GPU HANG: ecode 9:0:0x859ffffb, in Xorg [2092], reason: hang on rcs0, action: reset
[   41.823219] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   41.823220] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   41.823220] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   41.823220] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   41.823221] [drm] GPU crash dump saved to /sys/class/drm/card0/error

Subsequent to this there are frequent repeated resets:

[   92.731613] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  100.731570] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  108.731500] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  116.731410] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  606.840675] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  617.816522] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  625.816467] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  633.816354] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
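
The spacing of these resets can be read straight off the dmesg timestamps. The following is a minimal sketch, not part of the original report: the helper name `reset_intervals` and the regular expression are illustrative assumptions, and the sample text is the excerpt above.

```python
import re

# Matches the dmesg timestamp on i915 engine-reset lines such as
# "[   92.731613] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0".
RESET_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+i915 \S+ Resetting \w+ for hang")

def reset_intervals(dmesg_text):
    """Return the gaps (in seconds) between consecutive reset messages."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.match(line.strip())
        if match:
            stamps.append(float(match.group(1)))
    # Pair each timestamp with its successor and round to milliseconds.
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]

SAMPLE = """\
[   92.731613] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  100.731570] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  108.731500] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  116.731410] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  606.840675] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  617.816522] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  625.816467] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  633.816354] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
"""

if __name__ == "__main__":
    print(reset_intervals(SAMPLE))
```

On this excerpt most gaps come out at about 8 seconds, with one long quiet period of roughly 490 seconds between the two bursts.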

Kernel: 4.17.0
Xorg server: 1.19.6
Intel driver: 2.99.917+git20171229-1
Mesa: 18.0.0
GuC firmware: i915/skl_guc_ver9_33.bin
HuC firmware: i915/skl_huc_ver01_07_1398.bin

I started out with a stock Ubuntu Xenial and progressively upgraded bits to no avail. Turning off modeset/vesafb was an acceptable solution until I needed dual display support.

GPU crash dump is attached
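
For anyone reproducing this, the dump has to be copied out of sysfs before it can be attached. Below is a hedged sketch of one way to do that; the function name `save_error_state` is hypothetical, the default path is the one printed in the kernel message above, and the check for a "No error state" placeholder is an assumption about what the file contains when no hang has been recorded. Reading the file typically requires root.

```python
from pathlib import Path

def save_error_state(dest, source="/sys/class/drm/card0/error"):
    """Copy the i915 error state to `dest`; return False if nothing was captured."""
    data = Path(source).read_bytes()
    # Assumption: an empty file or a "No error state ..." placeholder
    # means the kernel has not recorded a hang.
    if not data.strip() or data.startswith(b"No error state"):
        return False
    Path(dest).write_bytes(data)
    return True

if __name__ == "__main__":
    if save_error_state("gpu-crash-dump.txt"):
        print("saved gpu-crash-dump.txt")
    else:
        print("no error state to save")
```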

Comment 1 Chris Wilson 2018-05-08 20:31:47 UTC

Nothing stands out as being an old bug resurfaced; so time for some fresh debugging.

What was the old kernel/userspace this reproduced on? i.e. do you still have the original kernel you installed? Could you capture that error state for comparison?

Comment 2 Simeon Miteff 2018-05-09 05:17:26 UTC

Created attachment 139437
GPU crash dump with 4.4.0 kernel

Comment 3 Simeon Miteff 2018-05-09 05:18:20 UTC

The old kernel was Ubuntu's 4.4.0-122.146-generic.

Based on what is currently in Xenial, I believe the old userspace had:

Intel driver: 2.99.917+git20160325
Mesa: 11.2.0

Let me know if accuracy is critical (I think I can search /var/log/apt/ to find the exact versions).

I booted the old kernel (same upgraded userspace) and reproduced the error. This time Xorg crashed once after login and then started again, and the kernel messages were a little more colorful than before:

[   54.764325] [drm] stuck on render ring
[   54.764435] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [2102], reason: Engine(s) hung, action: reset
[   54.764436] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   54.764437] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   54.764438] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   54.764438] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   54.764439] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   54.766472] drm/i915: Resetting chip after gpu hang
[   56.764626] [drm] RC6 on
[   72.756913] [drm] stuck on render ring
[   72.757189] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [2102], reason: Engine(s) hung, action: reset
[   72.759030] drm/i915: Resetting chip after gpu hang
[   73.792203] [drm] RC6 on
[   75.850856] usb 1-8: reset high-speed USB device number 3 using xhci_hcd
[   84.749848] [drm] stuck on render ring
[   84.750103] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [2102], reason: Engine(s) hung, action: reset
[   84.751978] drm/i915: Resetting chip after gpu hang
[   86.748842] [drm] RC6 on
[  165.726069] [drm] stuck on render ring
[  165.726437] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [3066], reason: Engine(s) hung, action: reset
[  165.728202] drm/i915: Resetting chip after gpu hang
[  167.724909] [drm] RC6 on
[  185.713779] [drm] stuck on render ring
[  185.714098] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [3066], reason: Engine(s) hung, action: reset
[  185.715884] drm/i915: Resetting chip after gpu hang
[  187.712852] [drm] RC6 on

I uploaded the corresponding error state.
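
To compare hangs across kernels it can help to tally the error codes and the Xorg PIDs involved. The snippet below is only an illustrative sketch: `summarize_hangs` and the regular expression are assumptions made for this example, and the sample holds just the five GPU HANG lines from the log above.

```python
import re
from collections import Counter

# Captures the ecode, process name, and PID from "[drm] GPU HANG" lines.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)

def summarize_hangs(dmesg_text):
    """Count GPU hangs per ecode and per PID of the hanging process."""
    ecodes = Counter()
    pids = Counter()
    for match in HANG_RE.finditer(dmesg_text):
        ecodes[match.group("ecode")] += 1
        pids[int(match.group("pid"))] += 1
    return ecodes, pids

SAMPLE = """\
[   54.764435] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [2102], reason: Engine(s) hung, action: reset
[   72.757189] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [2102], reason: Engine(s) hung, action: reset
[   84.750103] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [2102], reason: Engine(s) hung, action: reset
[  165.726437] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [3066], reason: Engine(s) hung, action: reset
[  185.714098] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [3066], reason: Engine(s) hung, action: reset
"""

if __name__ == "__main__":
    ecodes, pids = summarize_hangs(SAMPLE)
    print(dict(ecodes))
    print(dict(pids))
```

For this log the tally shows two distinct ecodes and two Xorg PIDs, the second PID being the restarted server.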

Comment 4 Jani Saarinen 2018-05-09 05:45:01 UTC

Hi,
If Chris agrees and it makes sense, you could also try the latest drm-tip (https://cgit.freedesktop.org/drm-tip) and send the dmesg captured with drm.debug=0x1e log_buf_len=4M. Please also send a debug dmesg from the kernel you see the issue on.
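
As a side note on what drm.debug=0x1e enables: the value is a bitmask of DRM debug categories. The sketch below decodes it under the assumption that the low bits follow the kernel's usual category order (CORE, DRIVER, KMS, PRIME, ATOMIC, VBL); the table is written from memory, covers only those six bits, and should be checked against the drm_print.h of the kernel actually in use.

```python
# Assumed mapping of drm.debug bits to category names (low six bits only).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}

def decode_drm_debug(mask):
    """Return the names of the debug categories enabled in `mask`."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]

if __name__ == "__main__":
    print(decode_drm_debug(0x1E))
```

Under that assumption, 0x1e selects driver, KMS, PRIME, and atomic messages while leaving the noisier core and vblank output off.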

Comment 5 Chris Wilson 2018-05-09 07:50:08 UTC

(In reply to Simeon Miteff from comment #2)
> Created attachment 139437
> GPU crash dump with 4.4.0 kernel

Hmm, also switched to -modesetting. I would make sure that Mesa is up to date (18.0; at least 17.3, to be sure of having the majority of hang fixes for -modesetting on Skylake). But the switch defeats the purpose of testing the old kernel :)

Comment 6 Jani Saarinen 2018-05-11 05:01:08 UTC

So, maybe try drm-tip then?

Comment 7 Simeon Miteff 2018-05-11 05:15:45 UTC

No problem. I think I can try that tomorrow.

Comment 8 Jani Saarinen 2018-05-13 07:48:46 UTC

OK, thanks.

Comment 9 Jani Saarinen 2018-05-17 10:04:30 UTC

ping, testing drm-tip?

Comment 10 Jani Saarinen 2018-05-24 07:36:19 UTC

Any updates on testing drm-tip?

Comment 11 Jani Saarinen 2018-05-28 06:24:21 UTC

Reporter, any updates on this?

Comment 12 Simon Lee 2018-05-31 16:53:45 UTC

(In reply to Simeon Miteff from comment #7)
> No problem. I think I can try that tomorrow.

Hi, do you have an update?

Comment 13 Simeon Miteff 2018-06-01 19:31:10 UTC

Hi

So sorry for the delay. I retested with the 4.17 kernel built by Ubuntu from drm-tip as of yesterday, with drm.debug=0x1e log_buf_len=4M as requested.

I attach the new dmesg and crashdump.

Regards,
Simeon

Comment 14 Simeon Miteff 2018-06-01 19:32:07 UTC

Created attachment 139952
dmesg with drm.debug=0x1e log_buf_len=4M

Comment 15 Simeon Miteff 2018-06-01 19:32:25 UTC

Created attachment 139953
Latest GPU crashdump

Comment 16 Lakshmi 2018-09-06 12:38:35 UTC

Simeon, sorry for the delay.

Can you try to reproduce the issue using the latest drm-tip (https://cgit.freedesktop.org/drm-tip) with the kernel parameters drm.debug=0x1e log_buf_len=4M and, if the problem persists, attach the full dmesg from boot?
At this point, this will help us proceed with this bug.
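
Before capturing a log it is worth confirming that the requested parameters actually reached the kernel. This is a small sketch of such a check; the helper `missing_kernel_params` is hypothetical, and reading /proc/cmdline is the usual Linux way to see the booted command line.

```python
def missing_kernel_params(cmdline, required=("drm.debug=0x1e", "log_buf_len=4M")):
    """Return the required parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]

if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel on Linux.
    with open("/proc/cmdline") as fh:
        missing = missing_kernel_params(fh.read())
    print("missing:", missing if missing else "none")
```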

Comment 17 Chris Wilson 2018-09-06 18:32:59 UTC

The last dumps indicate the problem is in mesa, and many related bugs have been fixed, hopefully yours included.

Comment 18 Simeon Miteff 2018-09-06 22:36:37 UTC

Sorry guys, I have changed jobs and no longer have access to this machine.

If someone else has access to a Dell 7040 with the Intel i7-6700 CPU, maybe they can test?

Comment 19 GitLab Migration User 2019-09-25 19:11:10 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue on our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1722.