Created attachment 131693 [details]
dmesg

Getting this occasionally while zooming windows or moving graphic objects (drawing etc). It happens too rarely to be reproducible, but often enough to be annoying. The hang somehow makes X restart after repeated reset attempts.

[54034.511500] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [836], reason: Hang on render ring, action: reset
[54034.511509] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[54034.511513] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[54034.511517] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[54034.511520] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[54034.511524] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[54034.511622] drm/i915: Resetting chip after gpu hang
[54034.511708] [drm] RC6 on
[54034.527085] [drm] GuC firmware load skipped
[[54046.478860] drm/i915: Resetting chip after gpu hang
[54046.478993] [drm] RC6 on
[54046.497115] [drm] GuC firmware load skipped

System is a plain Debian stretch running on a Lenovo Thinkpad X1 Carbon 4th gen.

bjorn@miraculix:~$ uname -a
Linux miraculix 4.9.0-3-amd64 #1 SMP Debian 4.9.25-1 (2017-05-02) x86_64 GNU/Linux

bjorn@miraculix:~$ grep . /sys/class/dmi/id/{bios,board}* 2>/dev/null
/sys/class/dmi/id/bios_date:11/28/2016
/sys/class/dmi/id/bios_vendor:LENOVO
/sys/class/dmi/id/bios_version:N1FET47W (1.21 )
/sys/class/dmi/id/board_asset_tag:Not Available
/sys/class/dmi/id/board_name:20FB006AMN
/sys/class/dmi/id/board_vendor:LENOVO
/sys/class/dmi/id/board_version:SDK0J40697 WIN

root@miraculix:/tmp# lspci -vvvnns :2
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 520 [8086:1916] (rev 07) (prog-if 00 [VGA controller])
        Subsystem: Lenovo HD Graphics 520 [17aa:2238]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 131
        Region 0: Memory at e0000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at c0000000 (64-bit, prefetchable) [size=512M]
        Region 4: I/O ports at e000 [size=64]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee00018 Data: 0000
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Process Address Space ID (PASID)
                PASIDCap: Exec+ Priv-, Max PASID Width: 14
                PASIDCtl: Enable- Exec- Priv-
        Capabilities: [200 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable-, Smallest Translation Unit: 00
        Capabilities: [300 v1] Page Request Interface (PRI)
                PRICtl: Enable- Reset-
                PRISta: RF- UPRGI- Stopped-
                Page Request Capacity: 00008000, Page Request Allocation: 00000000
        Kernel driver in use: i915
        Kernel modules: i915

Using modesetting driver. Xorg.log before and after X restart is attached, as well as full dmesg and the /sys/class/drm/card0/error dump.
Created attachment 131694 [details] /sys/class/drm/card0/error
Created attachment 131695 [details] Xorg.log before GPU hang
Created attachment 131696 [details] Xorg.log after GPU hang
Hello Bjorn, this could be related to the recent hangs reported with some desktops and games. Could you try the latest Mesa release, 17.3.6, which includes some important fixes for those? Thank you.
(In reply to Elizabeth from comment #4)
> Hello Bjorn, this could be involved with the latest hangs reported with some
> desktops and games. Could you try latest mesa release 17.3.6 which includes
> some important fixes for those? Thank you.

Thanks. I am starting to hope you are correct. I have been running with the package list below for two weeks now without any hangs. So it does look good.

bjorn@miraculix:~$ dpkg -l|grep mesa
ii  libegl-mesa0:amd64          17.3.6-1   amd64  free implementation of the EGL API -- Mesa vendor library
ii  libegl1-mesa:amd64          17.3.6-1   amd64  transitional dummy package
ii  libgl1-mesa-dev:amd64       17.3.6-1   amd64  free implementation of the OpenGL API -- GLX development files
ii  libgl1-mesa-dri:amd64       17.3.6-1   amd64  free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-dri:i386        17.3.6-1   i386   free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-glx:amd64       17.3.6-1   amd64  transitional dummy package
ii  libgl1-mesa-glx:i386        17.3.6-1   i386   transitional dummy package
ii  libglapi-mesa:amd64         17.3.6-1   amd64  free implementation of the GL API -- shared library
ii  libglapi-mesa:i386          17.3.6-1   i386   free implementation of the GL API -- shared library
ii  libgles2-mesa:amd64         17.3.6-1   amd64  transitional dummy package
ii  libglu1-mesa:amd64          9.0.0-2.1  amd64  Mesa OpenGL utility library (GLU)
ii  libglu1-mesa:i386           9.0.0-2.1  i386   Mesa OpenGL utility library (GLU)
ii  libglu1-mesa-dev:amd64      9.0.0-2.1  amd64  Mesa OpenGL utility library -- development files
ii  libglx-mesa0:amd64          17.3.6-1   amd64  free implementation of the OpenGL API -- GLX vendor library
ii  libglx-mesa0:i386           17.3.6-1   i386   free implementation of the OpenGL API -- GLX vendor library
ii  libwayland-egl1-mesa:amd64  17.3.6-1   amd64  implementation of the Wayland EGL platform -- runtime
ii  mesa-common-dev:amd64       17.3.6-1   amd64  Developer documentation for Mesa
ii  mesa-utils                  8.4.0-1    amd64  Miscellaneous Mesa GL utilities
ii  mesa-va-drivers:amd64       17.3.6-1   amd64  Mesa VA-API video acceleration drivers
And then it happened again....

Mar 23 12:58:45 miraculix kernel: [20823.597291] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [875], reason: Hang on rcs0, action: reset
Mar 23 12:58:45 miraculix kernel: [20823.597294] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Mar 23 12:58:45 miraculix kernel: [20823.597295] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Mar 23 12:58:45 miraculix kernel: [20823.597296] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Mar 23 12:58:45 miraculix kernel: [20823.597296] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Mar 23 12:58:45 miraculix kernel: [20823.597298] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Mar 23 12:58:45 miraculix kernel: [20823.597303] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:58:53 miraculix kernel: [20831.588393] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:59:01 miraculix kernel: [20839.588307] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:59:09 miraculix kernel: [20847.588283] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:59:17 miraculix kernel: [20855.588213] i915 0000:00:02.0: Resetting rcs0 after gpu hang
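For comparing hang signatures between reports, the log lines above can be picked apart mechanically. The following is only a minimal sketch, not part of any i915 tooling: the regular expressions are inferred from the log lines quoted in this report, and the group names (`gen`, `engine`, `code`) as well as the helpers `parse_hang` and `reset_intervals` are hypothetical names chosen for illustration.

```python
import re

# Assumption: the "GPU HANG" line has the shape seen in this report,
# "ecode A:B:0xHASH, in COMM [PID], reason: ..., action: ...".
# The group names gen/engine/code are illustrative guesses, not official terms.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)

# Matches both "Resetting chip after gpu hang" and "Resetting rcs0 after gpu hang",
# and captures the kernel timestamp in square brackets that precedes the message.
RESET_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\].*Resetting \S+ after gpu hang")


def parse_hang(line):
    """Return the fields of a GPU HANG log line as a dict, or None if it does not match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    return info


def reset_intervals(lines):
    """Return the seconds elapsed between consecutive reset messages."""
    stamps = [float(m.group("ts")) for m in map(RESET_RE.search, lines) if m]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Feeding the five "Resetting rcs0" lines above into `reset_intervals` gives the spacing between the repeated reset attempts, and the `code` field from `parse_hang` can be compared across hangs to see whether two reports share a signature.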
Created attachment 138310 [details] /sys/class/drm/card0/error from last hang
So it seems the problem is a different one, and a way to easily reproduce the hang is still needed to find a root cause or a fix. Have you tried a different desktop environment or window manager? There is still a chance that it could be related to the desktop :/
Bjørn, your kernel/driver version still uses an older DMC fw version. Could you check whether the problem still occurs with the latest one? You need

commit 39ccc9852e2b46964c9c44eba52db57413ba6d27
Author: Anusha Srivatsa <anusha.srivatsa@intel.com>
Date:   Thu Nov 9 17:18:32 2017 -0800

    drm/i915/skl: DMC firmware for skylake v1.27

and

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915/skl_dmc_ver1_27.bin

copied to /lib/firmware/i915

Thanks.
Thanks. Testing now with a plain v4.16, which includes that commit AFAICS, and the latest i915/skl_dmc_ver1_27.bin from linux-firmware.git. Please let me know if I got anything wrong here:

root@miraculix:/home/bjorn# dmesg |grep drm
[ 4.624336] fb: switching to inteldrmfb from simple
[ 4.630129] [drm] Replacing VGA console driver
[ 4.633901] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 4.633924] [drm] Driver supports precise vblank timestamp query.
[ 4.637091] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)
[ 4.704297] [drm] Initialized i915 1.6.0 20171222 for 0000:00:02.0 on minor 0
[ 4.748850] fbcon: inteldrmfb (fb0) is primary device
[ 6.232211] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 6.275291] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device

Will be back with a report in a couple of weeks. The hangs are very infrequent, so I consider that a minimum test period for an acceptable confidence level. But it does look good so far ;-)
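The check done by eye in the dmesg output above (did the kernel load DMC firmware v1.27 or newer?) could also be scripted. This is a rough sketch that assumes the "Finished loading DMC firmware" message keeps the wording shown in this comment; `loaded_dmc_version` and `dmc_at_least` are hypothetical helper names, and the text to scan is expected to be supplied by the caller, for example from saved dmesg output.

```python
import re

# Assumption: the i915 message looks like the one quoted in this comment,
# "Finished loading DMC firmware <path> (v<major>.<minor>)".
DMC_RE = re.compile(
    r"Finished loading DMC firmware (?P<path>\S+) "
    r"\(v(?P<major>\d+)\.(?P<minor>\d+)\)"
)


def loaded_dmc_version(dmesg_text):
    """Return (firmware_path, (major, minor)) for the first match, or None if absent."""
    match = DMC_RE.search(dmesg_text)
    if match is None:
        return None
    version = (int(match.group("major")), int(match.group("minor")))
    return match.group("path"), version


def dmc_at_least(dmesg_text, wanted=(1, 27)):
    """True if a DMC firmware load was logged and its version is >= wanted."""
    found = loaded_dmc_version(dmesg_text)
    return found is not None and found[1] >= wanted
```

Comparing the version as an integer tuple rather than as a string avoids ordering "1.9" after "1.27". A missing message yields `False`, which in this sketch cannot be told apart from an old firmware, so `loaded_dmc_version` should be consulted when that distinction matters.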
(In reply to Bjørn Mork from comment #10)
> Thanks. testing now with a plain v4.16, which includes that commit AFAICS,
> and the latest i915/skl_dmc_ver1_27.bin from linux-firmware.git. Please let
> me know if I got anything wrong here:

Yes, this setup should show if the fix in the newer DMC fw version could apply to the bug you saw.

> root@miraculix:/home/bjorn# dmesg |grep drm
> [ 4.624336] fb: switching to inteldrmfb from simple
> [ 4.630129] [drm] Replacing VGA console driver
> [ 4.633901] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [ 4.633924] [drm] Driver supports precise vblank timestamp query.
> [ 4.637091] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin
> (v1.27)
> [ 4.704297] [drm] Initialized i915 1.6.0 20171222 for 0000:00:02.0 on
> minor 0
> [ 4.748850] fbcon: inteldrmfb (fb0) is primary device
> [ 6.232211] [drm] Reducing the compressed framebuffer size. This may lead
> to less power savings than a non-reduced-size. Try to increase stolen memory
> size if available in BIOS.
> [ 6.275291] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
>
>
> Will be back with a report in a couple of weeks. The hangs are very
> infrequent, so I consider that a minimum test period for an acceptable
> confidence level. But it does look good this far ;-)

Ok, thanks.
Created attachment 138770 [details]
/sys/class/drm/card0/error with Linux v4.16 and i915/skl_dmc_ver1_27.bin firmware

OK, so I just had another hang. This time with the i915/skl_dmc_ver1_27.bin firmware. The reset behaviour was a lot nicer than before, though. It actually worked without killing the X server.

The new /sys/class/drm/card0/error is attached. Relevant log messages:

[drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1152], reason: Hang on rcs0, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
i915 0000:00:02.0: Resetting rcs0 after gpu hang
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1599.