Created attachment 142445 [details]
cat /sys/class/drm/card0/error

A month ago I moved to a ThinkPad X1 Carbon 6th Gen (20KH006MRT) with a fresh Arch Linux install. Since then I have been battling with the GPU. Periodically (at least once a day, sometimes more often) the GPU hangs while Google Chrome is running (with hardware acceleration). As a result, in no particular order:

1) The GPU process of Chrome may crash on the first hang, and then within a few hours GNOME crashes anyway.
2) GNOME may crash to a black text-mode screen, where I am able to switch to another terminal to reboot.
3) Everything crashes to a black screen (no text cursor) and the host stops responding to anything (including network); a hard power cycle is then needed.

This happens regardless of whether an external monitor is attached to HDMI.

I think I have read every article / wiki available on the subject, and tried many configurations of i915 and other things. Yesterday I switched from mainline 4.18 to the testing 4.19 Linux kernel in order to get the latest of everything. Just now the same hang happened as per 1) above.

journalctl (omitting other errors) =>
========================================
Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU HANG: ecode 9:0:0x85dffffd, in chrome [18418], reason: hang on rcs0, action: reset
Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 13 01:15:22 muradm-aln1 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 13 01:15:22 muradm-aln1 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 13 01:15:22 muradm-aln1 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 13 01:15:22 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
========================================

Dump attached as well.

OS: Arch Linux x86_64
Kernel: 4.19.1-arch1-1-ARCH
Host: 20KH006MRT ThinkPad X1 Carbon 6th
DE: GNOME 3.30.1
CPU: Intel i7-8550U (8) @ 4.000GHz
GPU: Intel UHD Graphics 620

Some related packages:
local/libdrm 2.4.96-1
local/libva 2.3.0-1
local/libva-intel-driver 2.2.0-1
local/libva-utils 2.3.0-1
local/linux 4.19.1.arch1-1 (base)
local/linux-api-headers 4.17.11-1
local/linux-firmware 20181026.1cb4e51-1 (base)
local/mesa 18.2.4-1
local/mesa-demos 8.4.0-1
local/qt5-wayland 5.11.2-1 (qt qt5)
local/util-linux 2.33-2 (base base-devel)
local/vulkan-icd-loader 1.1.85+2969+5abee6173-1
local/vulkan-intel 18.2.4-1
local/wayland 1.16.0-1
local/wayland-protocols 1.16-1
local/xorg-bdftopcf 1.1-1 (xorg xorg-apps)
local/xorg-server 1.20.3-1 (xorg)
local/xorg-server-common 1.20.3-1 (xorg)
local/xorg-server-xwayland 1.20.3-1 (xorg)
local/xorgproto 2018.4-1

cat /etc/modprobe.d/i915.conf
options i915 modeset=1 enable_guc=3 enable_fbc=1 fastboot=1

dmesg | grep drm
== (up to a point of hang) ==============
[ 2.654949] fb: switching to inteldrmfb from EFI VGA
[ 2.654994] [drm] Replacing VGA console driver
[ 2.657309] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 2.657310] [drm] Driver supports precise vblank timestamp query.
[ 2.659687] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[ 2.666245] [drm] HuC: Loaded firmware i915/kbl_huc_ver02_00_1810.bin (version 2.0)
[ 2.677443] [drm] GuC: Loaded firmware i915/kbl_guc_ver9_39.bin (version 9.39)
[ 3.224056] [drm] Initialized i915 1.6.0 20180719 for 0000:00:02.0 on minor 0
[ 3.674308] fbcon: inteldrmfb (fb0) is primary device
[ 3.674318] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[ 4.145904] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 31.447100] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3377.147569] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3389.843556] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3391.847593] [drm] HuC: Loaded firmware i915/kbl_huc_ver02_00_1810.bin (version 2.0)
[ 3391.858472] [drm] GuC: Loaded firmware i915/kbl_guc_ver9_39.bin (version 9.39)
[ 3392.079989] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 3413.745747] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
==========================================
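For anyone comparing several hangs across journal excerpts like the one in this report, the following is a minimal sketch of pulling the fields out of a "GPU HANG" kernel line. The regular expression and the helper name parse_gpu_hang are assumptions made for illustration; they are based only on the message format quoted in this report and are not part of i915 or any existing tool.

```python
import re

# Assumed pattern, derived from the "GPU HANG" line quoted in this report.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,]+), in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG message as a dict, or None if the line has none."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


sample = (
    "Nov 13 01:15:22 muradm-aln1 kernel: [drm] GPU HANG: ecode 9:0:0x85dffffd, "
    "in chrome [18418], reason: hang on rcs0, action: reset"
)
hang = parse_gpu_hang(sample)
print(hang)
```

Lines that do not contain a hang message return None, so the helper can be applied to a whole journal dump and the None results filtered out.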
Created attachment 142446 [details] dmesg
Created attachment 142447 [details] glxinfo
Created attachment 142448 [details] vainfo
Created attachment 142449 [details] xrandr
Created attachment 142450 [details] xrandr --verbose
While this time it appears to be a userspace issue, don't enable unsafe parameters such as enable_guc; it is not enabled by default because it has known unaddressed issues (such as causing GPU hangs).
Is it a side effect of loading GuC that we're missing the batch in the error state?
(In reply to Lionel Landwerlin from comment #7)
> Is it a side effect of loading GuC that we're missing the batch in the error
> state?

Oops, running into a decompression issue or something...
OK, I removed the GuC option. Left with:

options i915 modeset=1 enable_fbc=1 fastboot=1

Let's see if it repeats; if it does, I will try rolling back to 4.18. Will report back.
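To double-check a modprobe configuration before and after a change like this one, here is a small sketch that splits an "options" line into its parameters and reports any that were called out as unsafe. The FLAGGED set is an assumption: it contains only enable_guc, because that is the one parameter named in this thread, and it is not an authoritative list of unsafe i915 parameters. The helper names are hypothetical.

```python
def parse_modprobe_options(line):
    """Split an 'options <module> key=value ...' line into (module, params)."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "options":
        raise ValueError("not a modprobe options line")
    module = tokens[1]
    # Each remaining token is expected to look like key=value.
    params = dict(token.split("=", 1) for token in tokens[2:])
    return module, params


# Assumption: only the parameter named in this thread is flagged.
FLAGGED = {"enable_guc"}


def flagged_params(params):
    """Return the sorted names of parameters that appear in FLAGGED."""
    return sorted(name for name in params if name in FLAGGED)


before = "options i915 modeset=1 enable_guc=3 enable_fbc=1 fastboot=1"
after = "options i915 modeset=1 enable_fbc=1 fastboot=1"

for line in (before, after):
    module, params = parse_modprobe_options(line)
    print(module, params, flagged_params(params))
```

The two sample lines are the configurations quoted in this report, with and without the GuC option.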
Just happened again, after enable_guc was removed.

kernel: [drm] GPU HANG: ecode 9:0:0x87f5fff9, in chrome [1909], reason: hang on rcs0, action: reset
kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error

I'm adding drmi915kblcrash_no_guc.dump also.
Created attachment 142456 [details] cat /sys/class/drm/card0/error > drmi915kblcrash_no_guc.dump
As of my last comment, only Chrome's GPU process had crashed. I continued working, and now the host has crashed completely.

Nov 14 06:19:55 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/061955.007214:ERROR:sync_control_vsync_provider.cc(141)] Calculated bogus refresh interval=0.998911 s, last_timebase_=32267103603 bogo-microseconds, timebase=32268102514 bogo-microseconds, last_media_st>
Nov 14 06:19:55 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/061955.366345:ERROR:sync_control_vsync_provider.cc(141)] Calculated bogus refresh interval=0.971436 s, last_timebase_=32252803405 bogo-microseconds, timebase=32268346391 bogo-microseconds, last_media_st>
Nov 14 06:20:00 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/062000.007032:ERROR:gl_surface_presentation_helper.cc(237)] GetVSyncParametersIfAvailable() failed!
Nov 14 06:20:01 muradm-aln1 org.gnome.Shell.desktop[1445]: [4380:4380:1114/062001.241087:ERROR:sync_control_vsync_provider.cc(141)] Calculated bogus refresh interval=1.00253 s, last_timebase_=32273103226 bogo-microseconds, timebase=32274105752 bogo-microseconds, last_media_str>
Nov 14 06:20:17 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:17 muradm-aln1 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:17 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:18 muradm-aln1 kernel: i915 0000:00:02.0: Failed to reset chip
Nov 14 06:20:18 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 14 06:20:18 muradm-aln1 org.gnome.Shell.desktop[1445]: i965: Failed to submit batchbuffer: Input/output error
Nov 14 06:20:18 muradm-aln1 org.gnome.Shell.desktop[1445]: i965: Failed to submit batchbuffer: Input/output error
Nov 14 06:20:18 muradm-aln1 org.gnome.Shell.desktop[1445]: i965: Failed to submit batchbuffer: Input/output error
Nov 14 06:20:18 muradm-aln1 terminator[2758]: Error reading events from display: Broken pipe
Nov 14 06:20:18 muradm-aln1 evolution-alarm[1725]: Error reading events from display: Broken pipe
Nov 14 06:20:18 muradm-aln1 gitter.desktop[1445]: [12242:12242:1114/062018.137101:ERROR:x11_util.cc(90)] X IO error received (X server probably went away)
Nov 14 06:20:18 muradm-aln1 gitter.desktop[1445]: [12208:12208:1114/062018.137779:ERROR:chrome_browser_main_extra_parts_x11.cc(62)] X IO error received (X server probably went away)
Created attachment 142510 [details]
cat /sys/class/drm/card0/error > drmi915kblcrash_no_guc_xorg.dump

Switching from Wayland to Xorg still results in a GPU hang.

First hang with crash dump report
===================================
Nov 19 06:45:39 muradm-aln1 kernel: [drm] GPU HANG: ecode 9:0:0x87f5fef9, in chromium [1907], reason: hang on rcs0, action: reset
Nov 19 06:45:39 muradm-aln1 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 19 06:45:39 muradm-aln1 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 19 06:45:39 muradm-aln1 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 19 06:45:39 muradm-aln1 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 19 06:45:39 muradm-aln1 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 19 06:45:39 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
===================================

And then, 40 minutes later
===================================
Nov 19 07:25:27 muradm-aln1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
Nov 19 07:25:27 muradm-aln1 kernel: i915 0000:00:02.0: Failed to reset chip
Nov 19 07:25:27 muradm-aln1 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
===================================
which causes Xorg and GNOME to crash.
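The two excerpts in this comment differ in how far the reset went: the first stops at an engine reset, the second goes on to a chip reset that fails. As a rough sketch for telling these cases apart when scanning a journal, the function below classifies a list of log lines. The function name reset_outcome, the outcome labels, and the string matching are assumptions based solely on the kernel messages quoted in this report.

```python
import re

# Assumed pattern for the "Resetting <target> for hang on <engine>" messages above.
RESET_RE = re.compile(r"Resetting (?P<target>\w+) for hang on (?P<engine>\w+)")


def reset_outcome(lines):
    """Classify how far GPU recovery went in the given kernel log lines."""
    outcome = "no reset"
    for line in lines:
        if "Failed to reset chip" in line:
            # Full chip reset was attempted and did not succeed.
            return "chip reset failed"
        match = RESET_RE.search(line)
        if match is None:
            continue
        if match.group("target") == "chip":
            outcome = "chip reset"
        elif outcome == "no reset":
            outcome = "engine reset"
    return outcome


first_hang = [
    "kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error",
    "kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
]
second_hang = [
    "kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout",
    "kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0",
    "kernel: i915 0000:00:02.0: Failed to reset chip",
]

print(reset_outcome(first_hang))
print(reset_outcome(second_hang))
```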
I have a similar issue.
Created attachment 142815 [details] /sys/class/drm/card0/error
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1770.