Bug 101288 - [skl] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [836], reason: Hang on render ring, action: reset
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-03 13:27 UTC by Bjørn Mork
Modified: 2019-09-25 19:02 UTC
CC List: 1 user

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
dmesg (105.23 KB, text/plain)
2017-06-03 13:27 UTC, Bjørn Mork
/sys/class/drm/card0/error (761.73 KB, text/plain)
2017-06-03 13:28 UTC, Bjørn Mork
Xorg.log before GPU hang (32.55 KB, text/plain)
2017-06-03 13:29 UTC, Bjørn Mork
Xorg.log after GPU hang (30.38 KB, text/plain)
2017-06-03 13:30 UTC, Bjørn Mork
/sys/class/drm/card0/error from last hang (39.56 KB, text/plain)
2018-03-23 12:03 UTC, Bjørn Mork
/sys/class/drm/card0/error with Linux v4.16 and i915/skl_dmc_ver1_27.bin firmware (47.59 KB, text/plain)
2018-04-11 20:55 UTC, Bjørn Mork

Description Bjørn Mork 2017-06-03 13:27:48 UTC
Created attachment 131693 [details]
dmesg

Getting this occasionally while zooming windows or moving graphic objects (drawing etc).  It happens too rarely to be reproducible, but often enough to be annoying.  The hang somehow makes X restart after repeated reset attempts.


[54034.511500] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [836], reason: Hang on render ring, action: reset
[54034.511509] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[54034.511513] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[54034.511517] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[54034.511520] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[54034.511524] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[54034.511622] drm/i915: Resetting chip after gpu hang
[54034.511708] [drm] RC6 on
[54034.527085] [drm] GuC firmware load skipped
[54046.478860] drm/i915: Resetting chip after gpu hang
[54046.478993] [drm] RC6 on
[54046.497115] [drm] GuC firmware load skipped


System is a plain Debian stretch running on a Lenovo Thinkpad X1 Carbon 4th gen.

bjorn@miraculix:~$ uname -a
Linux miraculix 4.9.0-3-amd64 #1 SMP Debian 4.9.25-1 (2017-05-02) x86_64 GNU/Linux

bjorn@miraculix:~$ grep . /sys/class/dmi/id/{bios,board}* 2>/dev/null 
/sys/class/dmi/id/bios_date:11/28/2016
/sys/class/dmi/id/bios_vendor:LENOVO
/sys/class/dmi/id/bios_version:N1FET47W (1.21 )
/sys/class/dmi/id/board_asset_tag:Not Available
/sys/class/dmi/id/board_name:20FB006AMN
/sys/class/dmi/id/board_vendor:LENOVO
/sys/class/dmi/id/board_version:SDK0J40697 WIN


root@miraculix:/tmp# lspci -vvvnns :2
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 520 [8086:1916] (rev 07) (prog-if 00 [VGA controller])
        Subsystem: Lenovo HD Graphics 520 [17aa:2238]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 131
        Region 0: Memory at e0000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at c0000000 (64-bit, prefetchable) [size=512M]
        Region 4: I/O ports at e000 [size=64]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee00018  Data: 0000
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Process Address Space ID (PASID)
                PASIDCap: Exec+ Priv-, Max PASID Width: 14
                PASIDCtl: Enable- Exec- Priv-
        Capabilities: [200 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable-, Smallest Translation Unit: 00
        Capabilities: [300 v1] Page Request Interface (PRI)
                PRICtl: Enable- Reset-
                PRISta: RF- UPRGI- Stopped-
                Page Request Capacity: 00008000, Page Request Allocation: 00000000
        Kernel driver in use: i915
        Kernel modules: i915



I am using the modesetting driver.  The Xorg.log from before and after the X restart is attached, as well as the full dmesg and the /sys/class/drm/card0/error dump.
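
To compare hangs across reports, it can help to reduce each "GPU HANG" line to its ecode, process, PID, and engine. The following is only a sketch: the function name gpu_hang_summary is hypothetical, and the pattern assumes the exact message format shown in the log excerpt above.

```shell
# Hypothetical helper (not part of any i915 tooling): reads kernel log text on
# stdin and prints "<ecode> <process> <pid> <engine>" for every GPU HANG line.
# Assumes the message format quoted in this report.
gpu_hang_summary() {
    sed -n 's/.*GPU HANG: ecode \([^,]*\), in \([^ ]*\) \[\([0-9]*\)\], reason: Hang on \(.*\), action: .*/\1 \2 \3 \4/p'
}

# Typical use (reading the kernel log may require root):
#   dmesg | gpu_hang_summary
```

Lines that do not match the pattern are dropped silently, so the helper can be fed a full dmesg or journal dump.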
Comment 1 Bjørn Mork 2017-06-03 13:28:37 UTC
Created attachment 131694 [details]
/sys/class/drm/card0/error
Comment 2 Bjørn Mork 2017-06-03 13:29:38 UTC
Created attachment 131695 [details]
Xorg.log before GPU hang
Comment 3 Bjørn Mork 2017-06-03 13:30:06 UTC
Created attachment 131696 [details]
Xorg.log after GPU hang
Comment 4 Elizabeth 2018-03-06 22:32:28 UTC
Hello Bjorn, this could be related to the recent hangs reported with some desktops and games. Could you try the latest Mesa release, 17.3.6, which includes some important fixes for those? Thank you.
Comment 5 Bjørn Mork 2018-03-21 12:12:37 UTC
(In reply to Elizabeth from comment #4)
> Hello Bjorn, this could be related to the recent hangs reported with some
> desktops and games. Could you try the latest Mesa release, 17.3.6, which
> includes some important fixes for those? Thank you.

Thanks.  I am starting to hope you are correct.  I have been running with the package list below for two weeks now without any hangs.  So it does look good.



bjorn@miraculix:~$  dpkg -l|grep mesa
ii  libegl-mesa0:amd64                    17.3.6-1                                    amd64        free implementation of the EGL API -- Mesa vendor library
ii  libegl1-mesa:amd64                    17.3.6-1                                    amd64        transitional dummy package
ii  libgl1-mesa-dev:amd64                 17.3.6-1                                    amd64        free implementation of the OpenGL API -- GLX development files
ii  libgl1-mesa-dri:amd64                 17.3.6-1                                    amd64        free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-dri:i386                  17.3.6-1                                    i386         free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-glx:amd64                 17.3.6-1                                    amd64        transitional dummy package
ii  libgl1-mesa-glx:i386                  17.3.6-1                                    i386         transitional dummy package
ii  libglapi-mesa:amd64                   17.3.6-1                                    amd64        free implementation of the GL API -- shared library
ii  libglapi-mesa:i386                    17.3.6-1                                    i386         free implementation of the GL API -- shared library
ii  libgles2-mesa:amd64                   17.3.6-1                                    amd64        transitional dummy package
ii  libglu1-mesa:amd64                    9.0.0-2.1                                   amd64        Mesa OpenGL utility library (GLU)
ii  libglu1-mesa:i386                     9.0.0-2.1                                   i386         Mesa OpenGL utility library (GLU)
ii  libglu1-mesa-dev:amd64                9.0.0-2.1                                   amd64        Mesa OpenGL utility library -- development files
ii  libglx-mesa0:amd64                    17.3.6-1                                    amd64        free implementation of the OpenGL API -- GLX vendor library
ii  libglx-mesa0:i386                     17.3.6-1                                    i386         free implementation of the OpenGL API -- GLX vendor library
ii  libwayland-egl1-mesa:amd64            17.3.6-1                                    amd64        implementation of the Wayland EGL platform -- runtime
ii  mesa-common-dev:amd64                 17.3.6-1                                    amd64        Developer documentation for Mesa
ii  mesa-utils                            8.4.0-1                                     amd64        Miscellaneous Mesa GL utilities
ii  mesa-va-drivers:amd64                 17.3.6-1                                    amd64        Mesa VA-API video acceleration drivers
Comment 6 Bjørn Mork 2018-03-23 12:02:29 UTC
And then it happened again....


Mar 23 12:58:45 miraculix kernel: [20823.597291] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [875], reason: Hang on rcs0, action: reset
Mar 23 12:58:45 miraculix kernel: [20823.597294] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Mar 23 12:58:45 miraculix kernel: [20823.597295] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Mar 23 12:58:45 miraculix kernel: [20823.597296] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Mar 23 12:58:45 miraculix kernel: [20823.597296] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Mar 23 12:58:45 miraculix kernel: [20823.597298] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Mar 23 12:58:45 miraculix kernel: [20823.597303] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:58:53 miraculix kernel: [20831.588393] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:59:01 miraculix kernel: [20839.588307] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:59:09 miraculix kernel: [20847.588283] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Mar 23 12:59:17 miraculix kernel: [20855.588213] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Comment 7 Bjørn Mork 2018-03-23 12:03:35 UTC
Created attachment 138310 [details]
/sys/class/drm/card0/error from last hang
Comment 8 Elizabeth 2018-03-23 15:24:24 UTC
So it seems the problem is a different one, and a way to easily reproduce the hang is still needed to find a root cause or a fix. Have you tried a different desktop environment or window manager? There is still a chance that it could be related to the desktop :/
Comment 9 Imre Deak 2018-03-26 11:23:41 UTC
Bjørn,

Your kernel/driver version still uses an older DMC firmware version. Could you check whether the problem still occurs with the latest one? You need

commit 39ccc9852e2b46964c9c44eba52db57413ba6d27
Author: Anusha Srivatsa <anusha.srivatsa@intel.com>
Date:   Thu Nov 9 17:18:32 2017 -0800

    drm/i915/skl: DMC firmware for skylake v1.27

and

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915/skl_dmc_ver1_27.bin

copied to /lib/firmware/i915

Thanks.
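
Once the firmware file is in place and the machine has rebooted, the loaded DMC version can be read back from the kernel log. This is a rough sketch under stated assumptions: the function name dmc_firmware_loaded is hypothetical, and the pattern assumes the "Finished loading DMC firmware" message format that i915 prints on this kernel (as quoted in comment 10).

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<firmware path> <version>" for each DMC firmware load message.
# Assumes the i915 message format quoted in comment 10 of this report.
dmc_firmware_loaded() {
    sed -n 's/.*Finished loading DMC firmware \([^ ]*\) (v\([0-9.]*\)).*/\1 \2/p'
}

# Typical use (reading the kernel log may require root):
#   dmesg | dmc_firmware_loaded
```

No output means no matching load message was found in the text supplied, which may indicate that the firmware file was missing or that the message format differs on that kernel.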
Comment 10 Bjørn Mork 2018-04-03 11:11:40 UTC
Thanks.  Testing now with a plain v4.16, which includes that commit AFAICS, and the latest i915/skl_dmc_ver1_27.bin from linux-firmware.git.  Please let me know if I got anything wrong here:

root@miraculix:/home/bjorn# dmesg |grep drm
[    4.624336] fb: switching to inteldrmfb from simple
[    4.630129] [drm] Replacing VGA console driver
[    4.633901] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    4.633924] [drm] Driver supports precise vblank timestamp query.
[    4.637091] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)
[    4.704297] [drm] Initialized i915 1.6.0 20171222 for 0000:00:02.0 on minor 0
[    4.748850] fbcon: inteldrmfb (fb0) is primary device
[    6.232211] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[    6.275291] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device


Will be back with a report in a couple of weeks.  The hangs are very infrequent, so I consider that a minimum test period for an acceptable confidence level.  But it does look good this far ;-)
Comment 11 Imre Deak 2018-04-04 11:37:22 UTC
(In reply to Bjørn Mork from comment #10)
> Thanks.  testing now with a plain v4.16, which includes that commit AFAICS,
> and the latest i915/skl_dmc_ver1_27.bin from linux-firmware.git.  Please let
> me know if I got anything wrong here:

Yes, this setup should show if the fix in the newer DMC fw version could apply to the bug you saw.

> root@miraculix:/home/bjorn# dmesg |grep drm
> [    4.624336] fb: switching to inteldrmfb from simple
> [    4.630129] [drm] Replacing VGA console driver
> [    4.633901] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [    4.633924] [drm] Driver supports precise vblank timestamp query.
> [    4.637091] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin
> (v1.27)
> [    4.704297] [drm] Initialized i915 1.6.0 20171222 for 0000:00:02.0 on
> minor 0
> [    4.748850] fbcon: inteldrmfb (fb0) is primary device
> [    6.232211] [drm] Reducing the compressed framebuffer size. This may lead
> to less power savings than a non-reduced-size. Try to increase stolen memory
> size if available in BIOS.
> [    6.275291] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
> 
> 
> Will be back with a report in a couple of weeks.  The hangs are very
> infrequent, so I consider that a minimum test period for an acceptable
> confidence level.  But it does look good this far ;-)

Ok, thanks.
Comment 12 Bjørn Mork 2018-04-11 20:55:53 UTC
Created attachment 138770 [details]
/sys/class/drm/card0/error with Linux v4.16 and i915/skl_dmc_ver1_27.bin firmware

OK, so I just had another hang.  This time with the i915/skl_dmc_ver1_27.bin firmware.  The reset behaviour was a lot nicer than before, though.  It actually worked without killing the X server.

The new /sys/class/drm/card0/error is attached. Relevant log messages:

[drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1152], reason: Hang on rcs0, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
i915 0000:00:02.0: Resetting rcs0 after gpu hang
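
One rough way to compare reset behaviour between hangs is to count the reset attempts logged for each one. The sketch below is an assumption-laden illustration: count_gpu_resets is a hypothetical name, and the pattern is written to match both reset message variants seen in this report ("Resetting chip after gpu hang" and "Resetting rcs0 after gpu hang").

```shell
# Hypothetical helper: reads kernel log text on stdin and prints how many
# GPU reset attempts it contains. Matches both message variants quoted in
# this report (full-chip reset and per-engine reset).
count_gpu_resets() {
    grep -c 'Resetting .* after gpu hang'
}

# Typical use (reading the kernel log may require root):
#   dmesg | count_gpu_resets
```

The count covers the whole input, so to compare individual hangs the log would first need to be cut down to the time window around each one.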
Comment 13 GitLab Migration User 2019-09-25 19:02:38 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1599.

