Bug 99582

Summary: [SKL] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [1085], reason: Hang on render ring, action: reset
Product: Mesa
Component: Drivers/DRI/i965
Status: RESOLVED FIXED
Severity: critical
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Reporter: Antoine Aubry <antoine>
Assignee: Antoine Aubry <antoine>
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
CC: intel-gfx-bugs, mark.a.janes
Whiteboard:
i915 platform:
i915 features:
Attachments: Content of /sys/class/drm/card0/error
             Sample kicad project that reproduces the problem

Description Antoine Aubry 2017-01-29 00:02:13 UTC
Created attachment 129205 [details]
Content of /sys/class/drm/card0/error

When using KiCad's Pcbnew program, after a while the screen freezes for a few seconds, then the window manager restarts. The output of dmesg instructs me to file a new bug, so here it is.

The output of dmesg shows the following:

[ 2359.835457] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [1085], reason: Hang on render ring, action: reset
[ 2359.835461] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2359.835463] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2359.835465] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2359.835467] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2359.835469] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2359.835576] drm/i915: Resetting chip after gpu hang
[ 2359.835665] [drm] RC6 on
[ 2359.855202] [drm] GuC firmware load skipped
[ 2371.867120] drm/i915: Resetting chip after gpu hang
[ 2371.867214] [drm] RC6 on
[ 2371.883179] [drm] GuC firmware load skipped

I am using Ubuntu 16.10 with kernel 4.9. This is the output of "uname -a":

Linux spectre 4.9.0-040900-generic #201612111631 SMP Sun Dec 11 21:33:00 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I have attached the GPU crash dump from /sys/class/drm/card0/error, and here is the output of lspci:

00:00.0 Host bridge: Intel Corporation Skylake Host Bridge/DRAM Registers (rev 08)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 520 (rev 07)
00:13.0 Non-VGA unclassified device: Intel Corporation Device 9d35 (rev 21)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 9d10 (rev f1)
00:1c.1 PCI bridge: Intel Corporation Device 9d11 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-LP LPC Controller (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
00:1f.3 Multimedia audio controller: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
01:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5227 PCI Express Card Reader (rev 01)
02:00.0 Network controller: Intel Corporation Wireless 7265 (rev 61)
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01)

Please let me know what other information I can provide to help with this.
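
The GPU HANG line in the dmesg output above can be split into its fields with a small shell helper, which may help when comparing several reports. This is only a sketch: `parse_gpu_hang` is a hypothetical name, and reading the three `ecode` fields as GPU generation, engine id and state hash is an assumption about the i915 message format, not something stated in this report.

```shell
# Hypothetical helper: split an i915 "GPU HANG" dmesg line into named fields.
# Assumption: the ecode triple is <GPU generation>:<engine id>:<state hash>.
parse_gpu_hang() {
    printf '%s\n' "$1" | sed -n \
        's/.*GPU HANG: ecode \([0-9]*\):\([0-9]*\):\(0x[0-9a-fA-F]*\), in \([^ ]*\) \[\([0-9]*\)], reason: \(.*\), action: \(.*\)$/gen=\1 engine=\2 hash=\3 process=\4 pid=\5 reason="\6" action=\7/p'
}

# The line reported in this bug:
line='[ 2359.835457] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [1085], reason: Hang on render ring, action: reset'
parse_gpu_hang "$line"
```

Lines that do not match the pattern produce no output, so the helper can be run over a whole kernel log one line at a time.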
Comment 1 Antoine Aubry 2017-01-30 22:21:22 UTC
I experimented with different versions of the kernel. With versions 4.9.0 and 4.9.6 the freeze is followed by every GUI process segfaulting. With kernel 4.8.0, there are still freezes, but after a while the applications become responsive again.
Comment 2 Antoine Aubry 2017-01-30 22:27:33 UTC
Also, when installing a kernel, update-initramfs generates the following warnings:

W: Possible missing firmware /lib/firmware/i915/kbl_guc_ver9_14.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_guc_ver8_7.bin for module i915

I don't know if they are related to the freezes, though.
Comment 3 Mark Janes 2017-01-31 00:57:03 UTC
The guc is not needed for current hardware -- you can ignore that error.
Comment 4 yann 2017-01-31 13:38:43 UTC
Can you try with the latest version of the kernel from drm-tip (https://cgit.freedesktop.org/drm/drm-tip/), xf86-video-intel (https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/) and the latest Mesa (https://mesa.freedesktop.org/archive/)?

Moreover, if this is still occurring, to confirm what is causing it, can you also try with modesetting driver (https://cgit.freedesktop.org/xorg/driver/xf86-video-modesetting) rather than Intel DDX (ie xf86-video-intel) and let us know the status? 

Provide your config and change current status to:
- RESOLVED/* if you cannot reproduce.
- REOPENED otherwise; attach fresh gpu error dump, kernel log & xorg log


* Details:
- Kernel: 4.9.0-040900-generic
- Platform: Skylake (PCI ID: 0x1916, PCI Revision: 0x07, PCI Subsystem: 103c:81a1)
- Mesa : [Please confirm your version]
- xf86-video-intel : [Please confirm your version]
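
The files requested above (GPU error dump, kernel log, Xorg log) could be gathered with something like the following. It is a sketch under assumptions: `collect_gpu_report` is a hypothetical helper, the default Xorg log path `/var/log/Xorg.0.log` varies between distributions, and the error state under /sys is usually readable only by root.

```shell
# Hypothetical helper: gather the files requested for a GPU hang report.
# Arguments: output directory, then optional paths to the error state and Xorg log.
collect_gpu_report() {
    out="$1"
    err="${2:-/sys/class/drm/card0/error}"
    xlog="${3:-/var/log/Xorg.0.log}"   # assumed default; differs per distribution
    mkdir -p "$out" || return 1
    # Copy the GPU error state if this user may read it (usually root only).
    if [ -r "$err" ]; then
        cat "$err" > "$out/gpu-error.txt"
    else
        echo "cannot read $err (try running as root)" >&2
    fi
    # Kernel log; reading it may also need elevated privileges.
    dmesg > "$out/dmesg.txt" 2>/dev/null || echo "could not read kernel log" >&2
    # Xorg log, if present at the given path.
    [ -r "$xlog" ] && cp "$xlog" "$out/Xorg.0.log"
    uname -a > "$out/uname.txt"
}

# Example: collect_gpu_report ./gpu-hang-report
```

The error state is cleared on reboot, so it is best copied right after the hang.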
Comment 5 Chris Wilson 2017-01-31 13:45:43 UTC
(In reply to yann from comment #4)
> Can you try with latest version of kernel from drm-tip
> (https://cgit.freedesktop.org/drm/drm-tip/), xf86-video-intel
> (https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/) and latest Mesa
> (https://mesa.freedesktop.org/archive/).
> 
> Moreover, if this is still occurring, to confirm what is causing it, can you
> also try with modesetting driver

Note that this was occurring with -modesetting.
Comment 6 yann 2017-01-31 13:54:23 UTC
(In reply to Chris Wilson from comment #5)
> (In reply to yann from comment #4)
> > Can you try with latest version of kernel from drm-tip
> > (https://cgit.freedesktop.org/drm/drm-tip/), xf86-video-intel
> > (https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/) and latest Mesa
> > (https://mesa.freedesktop.org/archive/).
> > 
> > Moreover, if this is still occurring, to confirm what is causing it, can you
> > also try with modesetting driver
> 
> Note that this was occurring with -modesetting.

Thanks, Chris.
Just for my knowledge/curiosity: is it because this is the default Ubuntu 16.10 configuration, or is this info available elsewhere here?
Comment 7 Chris Wilson 2017-01-31 14:07:14 UTC
It is the type of batch buffer in the error state.
Comment 8 yann 2017-01-31 14:09:08 UTC
(In reply to Chris Wilson from comment #7)
> It is the type of batch buffer in the error state.

Thanks so much :)
Comment 9 Antoine Aubry 2017-01-31 14:29:36 UTC
(In reply to yann from comment #4)
> Can you try with latest version of kernel from drm-tip
> (https://cgit.freedesktop.org/drm/drm-tip/), xf86-video-intel
> (https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/) and latest Mesa
> (https://mesa.freedesktop.org/archive/).
> 
> Moreover, if this is still occurring, to confirm what is causing it, can you
> also try with modesetting driver
> (https://cgit.freedesktop.org/xorg/driver/xf86-video-modesetting) rather
> than Intel DDX (ie xf86-video-intel) and let us know the status? 
> 
> Provide your config and change current status to:
> - RESOLVED/* if you cannot reproduce.
> - REOPENED otherwise; attach fresh gpu error dump, kernel log & xorg log
> 
> 
> * Details:
> - Kernel: 4.9.0-040900-generic
> - Platform: Skylake (PCI ID: 0x1916, PCI Revision: 0x07, PCI Subsystem:
> 103c:81a1)
> - Mesa : [Please confirm your version]
> - xf86-video-intel : [Please confirm your version]

I don't mind trying the latest versions of the components that you mentioned, but I need some help with that. I am a developer, so I am not completely clueless, but I know nothing about installing drivers that are not already packaged :)
However, I am willing to learn. Can you indicate where I can find information on how to compile and install these components?

Thanks
Comment 10 yann 2017-01-31 14:38:10 UTC
So at this stage, Antoine, the issue you are facing is probably linked to Mesa. You may want to start by updating your Mesa version to the latest: check http://www.mesa3d.org/download.html
Comment 11 Antoine Aubry 2017-01-31 14:58:04 UTC
(In reply to yann from comment #10)
> So at this stage Antoine, the issue you are facing is probably linked to
> Mesa. So you may start to update first your mesa version to the latest:
> check http://www.mesa3d.org/download.html

I'll try that and report back. Thanks
Comment 12 Antoine Aubry 2017-02-01 16:03:03 UTC
So, I tried compiling mesa and installing it, but I think that I used the wrong prefix on the configure script, because ldconfig still reported that it would resolve the libraries from a different location than /usr/local/lib.
Then I tried to install them using a ppa that supposedly offers the latest versions. After that, my desktop environment (unity) ceased to work. I fiddled a lot, reverted the packages from the ppa and uninstalled the libraries that I had compiled myself, but it was still broken. In the end, I reinstalled the OS :S

The good news is that since I did a fresh install, I won't lose anything more if I need to reinstall again :)

So, before I try again, I may need more information on how to install the libraries. On my setup - Ubuntu 16.10 - each of the libraries seems to be in a different directory. What arguments should I pass to ./configure to ensure that it installs properly ?
Also, after installing, how can I confirm that the libraries that are in use are indeed the ones that I compiled ?

Thanks
Comment 13 yann 2017-02-01 16:18:36 UTC
(In reply to Antoine Aubry from comment #12)
> So, I tried compiling mesa and installing it, but I think that I used the
> wrong prefix on the configure script, because ldconfig still reported that
> it would resolve the libraries from a different location than /usr/local/lib.
> Then I tried to install them using a ppa that supposedly offers the latest
> versions. After that, my desktop environment (unity) ceased to work. I
> fiddled a lot, reverted the packages from the ppa and uninstalled the
> libraries that I had compiled myself, but it was still broken. In the end, I
> reinstalled the OS :S
> 
> The good news is that since I did a fresh install, I won't lose anything
> more if I need to reinstall again :)
> 
> So, before I try again, I may need more information on how to install the
> libraries. On my setup - Ubuntu 16.10 - each of the libraries seems to be in
> a different directory. What arguments should I pass to ./configure to ensure
> that it installs properly ?

Please start with http://www.mesa3d.org/autoconf.html & https://01.org/linuxgraphics/documentation/build-guide-0

> Also, after installing, how can I confirm that the libraries that are in use
> are indeed the ones that I compiled ?
> 
> Thanks

For this, check glxinfo ;)
https://dri.freedesktop.org/wiki/glxinfo/
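
As a rough illustration of the glxinfo check, the Mesa version can be pulled out of its output with a short filter. `mesa_version` is a hypothetical helper, and the sample line fed to it is only an assumed example of what the "OpenGL version string" line looks like on a Mesa system.

```shell
# Hypothetical helper: extract the Mesa version from glxinfo output on stdin.
# Assumption: the "OpenGL version string" line has the form "... Mesa <version>".
mesa_version() {
    sed -n 's/^OpenGL version string:.* Mesa \([^ ]*\).*$/\1/p'
}

# Typical use in a running X session (needs the glxinfo tool installed):
#   glxinfo | mesa_version
# Illustrative sample line, not taken from this report:
printf 'OpenGL version string: 3.0 Mesa 13.0.3\n' | mesa_version
```

If the version printed matches the one that was just built, the self-compiled libraries are the ones in use. Running glxinfo with `LIBGL_DEBUG=verbose` set should additionally make libGL report which driver file it loads.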
Comment 14 Mark Janes 2017-02-01 16:27:12 UTC
Thank you for trying the latest mesa.  I'm sorry that it was so painful for you.

You can use mesa in the installed path (/usr/local/lib) by setting environment variables before launching the app from the command line:

$ export LD_LIBRARY_PATH=/usr/local/lib
$ export LIBGL_DRIVERS_PATH=/usr/local/lib/dri
$ pcbnew 

This application works fine for me on debian testing with mesa's tip and linux 4.9.  Is there some activity in the app that provokes your hang?
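
The environment variables above can also be set for a single command only, so that the rest of the shell session keeps using the system libraries. A minimal sketch, assuming Mesa was installed under /usr/local; `with_local_mesa` and `MESA_PREFIX` are hypothetical names used for illustration, not Mesa settings.

```shell
# Hypothetical wrapper: run one command against a Mesa build under a prefix
# without exporting the variables into the current shell session.
# MESA_PREFIX is an illustrative variable; it defaults to /usr/local.
with_local_mesa() {
    env \
        LD_LIBRARY_PATH="${MESA_PREFIX:-/usr/local}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
        LIBGL_DRIVERS_PATH="${MESA_PREFIX:-/usr/local}/lib/dri" \
        "$@"
}

# Example: with_local_mesa pcbnew
# Show what the wrapped command would see:
with_local_mesa sh -c 'echo "$LIBGL_DRIVERS_PATH"'
```

Because nothing is exported, closing the application is enough to return to the distribution's Mesa.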
Comment 15 Antoine Aubry 2017-02-01 16:35:01 UTC
Thanks for the information. I will try that.

The hangs seem to happen when some part of the schematic is repainted. They tend to happen more while scrolling the view. Their frequency also seems to increase with time. After a reboot, it usually takes 5 to 10 minutes for the screen to freeze. But after a few freezes, they start to become much more frequent. Maybe it is due to some kind of resource leak.
Comment 16 Mark Janes 2017-02-01 16:46:06 UTC
If you can provide a sample schematic, I can try to scroll around and make it happen.
Comment 17 Antoine Aubry 2017-02-01 23:16:25 UTC
Created attachment 129284 [details]
Sample kicad project that reproduces the problem

Includes the entire kicad project. The actual issue occurs when opening mainboard.kicad_pcb in pcbnew. The fastest way to cause a freeze is to zoom in and out repeatedly using the mouse wheel. After a few tens of zooms, I get freezes consistently.
Comment 18 Antoine Aubry 2017-02-01 23:17:50 UTC
I forgot to say in my previous comment that I found that zooming in and out is even more likely to cause the freeze.
Comment 19 Antoine Aubry 2017-02-01 23:28:16 UTC
I attempted again to build mesa 13.0.3, which I believe is the latest version. After installing the library with make install, I tried running pcbnew as instructed:

$ export LD_LIBRARY_PATH=/usr/local/lib
$ export LIBGL_DRIVERS_PATH=/usr/local/lib/dri
$ pcbnew 

I observed no changes; the GPU still hangs. Btw, this made my system unusable again. After logging off, the desktop manager would no longer start, just as before. I attempted to uninstall with "make uninstall", but that did not help. I had to reinstall the OS again to fix it. This is certainly an easily fixable problem, but I could not find out what the problem was.
Comment 20 Mark Janes 2017-02-03 04:03:42 UTC
I opened your file on my sklgt2, and was able to scroll/zoom for a *long* time with no gpu hangs.  I'm using debian testing with linux 4.9, modesetting, mesa 13.

There seems to be something else going on with your system, because installing mesa to /usr/local/ can not disable your system.  Encountering a GPU Hang can sometimes generate unrecoverable errors that require a reboot, but you should never need to reinstall.

It's troubling that this wouldn't work properly on Ubuntu with the oibaf ppa.  It would be good if someone else who uses Ubuntu could reproduce this with your kicad project.
Comment 21 Antoine Aubry 2017-02-03 10:21:13 UTC
Humm, I did not use the ppa that you mentioned. It was someone else's and maybe it was broken. I'll try with that one and report.
Thanks for testing with my files. In the configuration that you mention, you refer to "modesetting". What does this mean ?
Comment 22 Mark Janes 2017-02-03 17:52:50 UTC
There are a few ways to accelerate 2D graphics.  For Intel hardware, you can use:

 - SNA: xf86-video-intel, which accelerates the 2D api for Intel GPUs
 - Glamor/modesetting: implements the 2D api in OpenGL

Different distributions choose different defaults.  A GPU hang in Xorg can be caused by SNA or by Mesa's implementation of the GL commands sent by Glamor.

Some Mesa GPU hangs have been recently fixed by:
180653c357d19ca88f7895f59874a58fac99cc53
Author:     Topi Pohjolainen <topi.pohjolainen@intel.com>
i965/blorp: Make post draw flush more explicit

Some SNA GPU hangs have been recently fixed by:
4acd4a7d3d2f41227022fa7581cfb85a0b124eae
Author:     Chris Wilson <chris@chris-wilson.co.uk>
sna/gen9: Emit a dummy primitive between VertexElements

Make sure you use versions that have those patches.

For information on how to switch between sna and modesetting, see
https://bugs.freedesktop.org/show_bug.cgi?id=98999
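
To see which of the two 2D paths described above is active, the Xorg log can be inspected. The sketch below rests on an assumption: that the modesetting driver tags its log lines with `modeset(0)` and xf86-video-intel with `intel(0)`. `active_ddx` is a hypothetical helper and the sample log line is illustrative only.

```shell
# Hypothetical helper: guess the active 2D driver from an Xorg log on stdin.
# Assumption: modesetting logs as "modeset(0)", xf86-video-intel as "intel(0)".
active_ddx() {
    log=$(cat)
    case "$log" in
        *"modeset(0)"*) echo modesetting ;;
        *"intel(0)"*)   echo intel ;;
        *)              echo unknown ;;
    esac
}

# Typical use (the log path varies by distribution):
#   active_ddx < /var/log/Xorg.0.log
# Illustrative sample line:
printf '(II) modeset(0): glamor initialized\n' | active_ddx
```

For a Mesa or xf86-video-intel build made from a git checkout, `git merge-base --is-ancestor <commit> HEAD` exits with status 0 when the given commit is contained in the checked-out tree, which is one way to confirm that the fixes listed above are present.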
Comment 23 Antoine Aubry 2017-02-04 18:39:52 UTC
So I installed the drivers from the oibaf ppa, and since then I ceased experiencing hangs!

Thank you so much for the patience and helpfulness. I really appreciate the time that you all took to explain the concepts to me.
