108717 – [drm] GPU HANG: ecode 9:0:0x85dffffd, in chrome [18418], reason: hang on rcs0, action: reset

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-11-12 21:49 UTC by muradm
Modified:	2019-09-25 19:15 UTC
CC List:	1 user

See Also:
i915 platform:	KBL
i915 features:

Attachments
cat /sys/class/drm/card0/error (3.24 KB, application/gzip) 2018-11-12 21:49 UTC, muradm
dmesg (106.73 KB, text/plain) 2018-11-12 21:50 UTC, muradm
glxinfo (29.12 KB, text/plain) 2018-11-12 21:54 UTC, muradm
vainfo (2.02 KB, text/plain) 2018-11-12 21:54 UTC, muradm
xrandr (312 bytes, text/plain) 2018-11-12 21:55 UTC, muradm
xrandr --verbose (1.38 KB, text/plain) 2018-11-12 22:12 UTC, muradm
cat /sys/class/drm/card0/error > drmi915kblcrash_no_guc.dump (103.48 KB, application/gzip) 2018-11-13 23:17 UTC, muradm
cat /sys/class/drm/card0/error > drmi915kblcrash_no_guc_xorg.dump (141.18 KB, application/gzip) 2018-11-19 03:41 UTC, muradm
/sys/class/drm/card0/error (32.98 KB, text/plain) 2018-12-15 12:34 UTC, tomaik

Description muradm 2018-11-12 21:49:04 UTC

Created attachment 142445
cat /sys/class/drm/card0/error

A month ago I moved to a ThinkPad X1 Carbon 6th Gen (20KH006MRT) with a fresh Arch Linux install. Since then I have been battling GPU hangs.

The GPU hangs periodically (at least once a day, sometimes more often) while Google Chrome is running with hardware acceleration. The result is one of the following, in no particular order:

1) Chrome's GPU process may crash on the first hang; GNOME then crashes anyway within a few hours.
2) GNOME may crash to a black text-mode screen, from which I can still switch to another terminal and reboot.
3) Everything crashes to a black screen (no text cursor) and the host stops responding to anything, including the network; a hard power cycle is then needed.

This happens whether or not an external monitor is attached over HDMI.

I think I have read every article and wiki page available on the subject, and I have tried many i915 configurations and other settings.

Yesterday I switched from the mainline 4.18 kernel to the testing 4.19 kernel to get the latest of everything. The same hang just happened again, as in case 1) above.

journalctl (omitting other errors) =>
========================================
Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU HANG: ecode 9:0:0x85dffffd, in chrome [18418], reason: hang on rcs0, action: reset
Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 13 01:15:22 muradm-aln1 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 13 01:15:22 muradm-aln1 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 13 01:15:22 muradm-aln1 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 13 01:15:22 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
========================================
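For triage across several reports, the fields of the hang message can be pulled out with a regular expression. The following is only a minimal sketch: the pattern is modelled on the single kernel line quoted above and may not match other kernel versions, and the names `HANG_RE` and `parse_gpu_hang` are hypothetical helpers, not part of any existing tool.

```python
import re

# Assumption: the message layout matches the kernel line quoted in this
# report; other kernel versions may word it differently.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fA-Fx:]+), "
    r"in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG log line, or None if it does not match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])  # the PID is the only numeric field
    return fields


if __name__ == "__main__":
    sample = (
        "Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU HANG: ecode "
        "9:0:0x85dffffd, in chrome [18418], reason: hang on rcs0, action: reset"
    )
    print(parse_gpu_hang(sample))
```

Lines that do not contain a hang message, such as the follow-up "GPU crash dump saved" line, simply yield None, so the function can be applied to a whole journal.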

Dump attached as well.

OS: Arch Linux x86_64
Kernel: 4.19.1-arch1-1-ARCH
Host: 20KH006MRT ThinkPad X1 Carbon 6th
DE: GNOME 3.30.1
CPU: Intel i7-8550U (8) @ 4.000GHz
GPU: Intel UHD Graphics 620

Some related packages:

local/libdrm 2.4.96-1
local/libva 2.3.0-1
local/libva-intel-driver 2.2.0-1
local/libva-utils 2.3.0-1
local/linux 4.19.1.arch1-1 (base)
local/linux-api-headers 4.17.11-1
local/linux-firmware 20181026.1cb4e51-1 (base)
local/mesa 18.2.4-1
local/mesa-demos 8.4.0-1
local/qt5-wayland 5.11.2-1 (qt qt5)
local/util-linux 2.33-2 (base base-devel)
local/vulkan-icd-loader 1.1.85+2969+5abee6173-1
local/vulkan-intel 18.2.4-1
local/wayland 1.16.0-1
local/wayland-protocols 1.16-1
local/xorg-bdftopcf 1.1-1 (xorg xorg-apps)
local/xorg-server 1.20.3-1 (xorg)
local/xorg-server-common 1.20.3-1 (xorg)
local/xorg-server-xwayland 1.20.3-1 (xorg)
local/xorgproto 2018.4-1

cat /etc/modprobe.d/i915.conf
options i915 modeset=1 enable_guc=3 enable_fbc=1 fastboot=1

dmesg | grep drm
== (up to the point of the hang) ==============
[    2.654949] fb: switching to inteldrmfb from EFI VGA
[    2.654994] [drm] Replacing VGA console driver
[    2.657309] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    2.657310] [drm] Driver supports precise vblank timestamp query.
[    2.659687] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[    2.666245] [drm] HuC: Loaded firmware i915/kbl_huc_ver02_00_1810.bin (version 2.0)
[    2.677443] [drm] GuC: Loaded firmware i915/kbl_guc_ver9_39.bin (version 9.39)
[    3.224056] [drm] Initialized i915 1.6.0 20180719 for 0000:00:02.0 on minor 0
[    3.674308] fbcon: inteldrmfb (fb0) is primary device
[    3.674318] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    4.145904] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[   31.447100] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3377.147569] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3389.843556] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3391.847593] [drm] HuC: Loaded firmware i915/kbl_huc_ver02_00_1810.bin (version 2.0)
[ 3391.858472] [drm] GuC: Loaded firmware i915/kbl_guc_ver9_39.bin (version 9.39)
[ 3392.079989] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3413.745747] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
==========================================

Comment 1 muradm 2018-11-12 21:50:25 UTC

Created attachment 142446
dmesg

Comment 2 muradm 2018-11-12 21:54:23 UTC

Created attachment 142447
glxinfo

Comment 3 muradm 2018-11-12 21:54:48 UTC

Created attachment 142448
vainfo

Comment 4 muradm 2018-11-12 21:55:47 UTC

Created attachment 142449
xrandr

Comment 5 muradm 2018-11-12 22:12:39 UTC

Created attachment 142450
xrandr --verbose

Comment 6 Chris Wilson 2018-11-13 09:23:05 UTC

While this time it appears to be a userspace issue, don't enable unsafe parameters such as enable_guc; it is not enabled by default because it has known unaddressed issues (such as causing GPU hangs).
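A quick way to spot such parameters before filing a report is to parse the `options` line from /etc/modprobe.d/i915.conf, as quoted in the description. This is a rough sketch only: `FLAGGED_IN_THREAD` contains just the one parameter named in this comment and is not an authoritative list of unsafe i915 options, and both function names are hypothetical.

```python
# Assumption: only enable_guc is listed because it is the only parameter
# called out in this thread; this is not a complete list of unsafe options.
FLAGGED_IN_THREAD = {"enable_guc"}


def parse_modprobe_options(line, module="i915"):
    """Parse 'options <module> key=value ...' into a dict of strings."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "options" or parts[1] != module:
        return {}
    params = {}
    for token in parts[2:]:
        key, sep, value = token.partition("=")
        # A bare token without '=' is kept with a value of None.
        params[key] = value if sep else None
    return params


def flagged_params(params):
    """Return the sorted parameter names that this thread warns about."""
    return sorted(name for name in params if name in FLAGGED_IN_THREAD)


if __name__ == "__main__":
    line = "options i915 modeset=1 enable_guc=3 enable_fbc=1 fastboot=1"
    print(flagged_params(parse_modprobe_options(line)))
```

Keeping the flagged names in a separate set means the list can be extended as maintainers name further parameters, without touching the parser.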

Comment 7 Lionel Landwerlin 2018-11-13 09:43:15 UTC

Is it a side effect of loading GuC that we're missing the batch in the error state?

Comment 8 Lionel Landwerlin 2018-11-13 09:45:28 UTC

(In reply to Lionel Landwerlin from comment #7)
> Is it a side effect of loading GuC that we're missing the batch in the error
> state?

Oops, running into a decompression issue or something...

Comment 9 muradm 2018-11-13 17:30:58 UTC

OK, I removed the GuC option. I am left with:

  options i915 modeset=1 enable_fbc=1 fastboot=1

Let's see if it repeats; if so, I will try rolling back to 4.18.
I will report back.

Comment 10 muradm 2018-11-13 23:16:34 UTC

It just happened again, after enable_guc was removed.

  kernel: [drm] GPU HANG: ecode 9:0:0x87f5fff9, in chrome [1909], reason: hang on rcs0, action: reset
  kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
  kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
  kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
  kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
  kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error

I'm also attaching drmi915kblcrash_no_guc.dump.

Comment 11 muradm 2018-11-13 23:17:18 UTC

Created attachment 142456
cat /sys/class/drm/card0/error > drmi915kblcrash_no_guc.dump

Comment 12 muradm 2018-11-14 02:31:11 UTC

In my last comment only Chrome's GPU process had crashed. I continued working, and now the host has crashed completely.

Nov 14 06:19:55 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/061955.007214:ERROR:sync_control_vsync_provider.cc(141)] Calculated bogus refresh interval=0.998911 s, last_timebase_=32267103603 bogo-microseconds, timebase=32268102514 bogo-microseconds, last_media_st>
Nov 14 06:19:55 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/061955.366345:ERROR:sync_control_vsync_provider.cc(141)] Calculated bogus refresh interval=0.971436 s, last_timebase_=32252803405 bogo-microseconds, timebase=32268346391 bogo-microseconds, last_media_st>
Nov 14 06:20:00 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/062000.007032:ERROR:gl_surface_presentation_helper.cc(237)] GetVSyncParametersIfAvailable() failed!
Nov 14 06:20:01 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/062001.241087:ERROR:sync_control_vsync_provider.cc(141)] Calculated bogus refresh interval=1.00253 s, last_timebase_=32273103226 bogo-microseconds, timebase=32274105752 bogo-microseconds, last_media_str>
Nov 14 06:20:17 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:17 muradm-aln1 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:18 muradm-aln1 kernel: i915 0000:00:02.0: Failed to reset chip
Nov 14 06:20:18 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:18 muradm-aln1 org.gnome.Shell.desktop[1445]: i965: Failed to submit batchbuffer: Input/output error
Nov 14 06:20:18 muradm-aln1 org.gnome.Shell.desktop[1445]: i965: Failed to submit batchbuffer: Input/output error
Nov 14 06:20:18 muradm-aln1 org.gnome.Shell.desktop[1445]: i965: Failed to submit batchbuffer: Input/output error
Nov 14 06:20:18 muradm-aln1 terminator[2758]: Error reading events from display: Broken pipe
Nov 14 06:20:18 muradm-aln1 evolution-alarm[1725]: Error reading events from display: Broken pipe
Nov 14 06:20:18 muradm-aln1 gitter.desktop[1445]: [12242:12242:1114/062018.137101:ERROR:x11_util.cc(90)] X IO error received (X server probably went away)
Nov 14 06:20:18 muradm-aln1 gitter.desktop[1445]: [12208:12208:1114/062018.137779:ERROR:chrome_browser_main_extra_parts_x11.cc(62)] X IO error received (X server probably went away)

Comment 13 muradm 2018-11-19 03:41:29 UTC

Created attachment 142510
cat /sys/class/drm/card0/error > drmi915kblcrash_no_guc_xorg.dump

Switching from Wayland to Xorg does not help; the GPU still hangs.

First hang, with crash dump report:
===================================
Nov 19 06:45:39 muradm-aln1 kernel: [drm] GPU HANG: ecode 9:0:0x87f5fef9, in chromium [1907], reason: hang on rcs0, action: reset
Nov 19 06:45:39 muradm-aln1 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 19 06:45:39 muradm-aln1 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 19 06:45:39 muradm-aln1 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 19 06:45:39 muradm-aln1 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 19 06:45:39 muradm-aln1 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 19 06:45:39 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
===================================

And then, 40 minutes later:
===================================
Nov 19 07:25:27 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: i915 0000:00:02.0: Failed to reset chip
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
===================================

This causes Xorg and GNOME to crash.
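The two excerpts differ in how far the recovery escalated: the first stops at an engine reset, while the second goes on to a full chip reset that then fails. A small sketch for telling these apart in a journal follows; the stage names and the `worst_stage` helper are hypothetical, and the patterns are taken only from the messages quoted in this report.

```python
import re

# Stages in increasing severity. Patterns are modelled on the log lines in
# this report and are an assumption for other kernel versions.
STAGES = [
    ("engine_reset", re.compile(r"Resetting (?!chip\b)\w+ for hang on \w+")),
    ("chip_reset", re.compile(r"Resetting chip for hang on \w+")),
    ("reset_failed", re.compile(r"Failed to reset chip")),
]


def worst_stage(lines):
    """Return the name of the most severe stage seen in lines, or None."""
    worst = -1
    for line in lines:
        for index, (_name, pattern) in enumerate(STAGES):
            if pattern.search(line):
                worst = max(worst, index)
    return STAGES[worst][0] if worst >= 0 else None


if __name__ == "__main__":
    second_excerpt = [
        "kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
        "kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout",
        "kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0",
        "kernel: i915 0000:00:02.0: Failed to reset chip",
    ]
    print(worst_stage(second_excerpt))
```

The negative lookahead keeps "Resetting chip" from also counting as an engine reset, so each message maps to exactly one stage.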

Comment 14 tomaik 2018-12-15 12:33:06 UTC

I have a similar issue.

Comment 15 tomaik 2018-12-15 12:34:01 UTC

Created attachment 142815
/sys/class/drm/card0/error

Comment 16 GitLab Migration User 2019-09-25 19:15:11 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1770.
