96894 – stuck on render ring (v4.7-rc7 Sky Lake Integrated Graphics [8086:1916])

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Bjørn Mork
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-07-11 14:14 UTC by Bjørn Mork
Modified:	2019-09-25 18:57 UTC
CC List:	2 users

See Also:
i915 platform:	SKL
i915 features:	GPU hang

Attachments
cat /sys/class/drm/card0/error |bzip2 >/tmp/error.bz2 (22.77 KB, application/x-bzip2) 2016-07-11 14:14 UTC, Bjørn Mork
xorg log file from approximately when the reported error occurred (24.22 KB, text/plain) 2017-01-17 17:26 UTC, Bjørn Mork
/sys/class/drm/card0/error from GPU HANG with modeset, Linux v4.10-rc4 and Mesa 13.0.3 (27.29 KB, text/plain) 2017-01-17 20:26 UTC, Bjørn Mork
xorg.log from GPU HANG with modeset, Mesa 13.0.3 and Linux v4.10-rc4 (35.83 KB, text/plain) 2017-01-17 20:29 UTC, Bjørn Mork
dmesg, Xorg.log and /sys/class/drm/card0/error from drm-tip GPU hang and repeated resets (39.19 KB, application/octet-stream) 2017-01-19 17:27 UTC, Bjørn Mork
Correct /var/log/Xorg.0.log from the 20170119 hang+reset (32.17 KB, text/plain) 2017-01-19 17:31 UTC, Bjørn Mork

Description Bjørn Mork 2016-07-11 14:14:25 UTC

Created attachment 125010 [details]
cat /sys/class/drm/card0/error |bzip2 >/tmp/error.bz2
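
The attachment name is the shell command that produced it. As a minimal sketch, the same capture could be done from Python with the standard bz2 module. The function name save_error_state and its default paths are illustrative only, and reading the sysfs file may require root on some systems.

```python
import bz2
import shutil


def save_error_state(src="/sys/class/drm/card0/error", dst="/tmp/error.bz2"):
    """Copy the i915 error state to a bzip2-compressed file.

    Rough equivalent of: cat SRC | bzip2 > DST
    Both paths are parameters so the helper can be tried on any file.
    """
    with open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
        # Stream the content so a large dump is never held fully in memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling save_error_state() right after a hang, before the next reset or reboot, keeps the dump that the kernel message asks reporters to attach.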

Just got this.  Don't know how interesting it is, but filing bug as requested:

Jul 11 15:53:45 miraculix kernel: [ 3584.878743] [drm] stuck on render ring
Jul 11 15:53:45 miraculix kernel: [ 3584.879658] [drm] GPU HANG: ecode 9:0:0x84dffff8, in Xorg [2555], reason: Engine(s) hung, action: reset
Jul 11 15:53:45 miraculix kernel: [ 3584.879901] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jul 11 15:53:45 miraculix kernel: [ 3584.879906] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jul 11 15:53:45 miraculix kernel: [ 3584.879910] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jul 11 15:53:45 miraculix kernel: [ 3584.879914] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jul 11 15:53:45 miraculix kernel: [ 3584.879918] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jul 11 15:53:45 miraculix kernel: [ 3584.885867] drm/i915: Resetting chip after gpu hang
Jul 11 15:53:47 miraculix kernel: [ 3586.879631] [drm] RC6 on

Comment 1 yann 2016-08-30 14:37:28 UTC

Assigning to Mesa product.

From this error dump, the hang is happening in the render ring batch, with the active head at 0xfe3de644 and 0x7b000005 (3DPRIMITIVE) as IPEHR.

Batch extract (around 0xfe3de644):

0xfe3de624:      0x78090005: 3DSTATE_VERTEX_ELEMENTS
0xfe3de628:      0x02000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0xfe3de62c:      0x22220000:    (0.0, 0.0, 0.0, 0.0), dst offset 0x00 bytes
0xfe3de630:      0x02f60000:    buffer 0: invalid, type 0x00f6, src offset 0x0000 bytes
0xfe3de634:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0xfe3de638:      0x02f60004:    buffer 0: invalid, type 0x00f6, src offset 0x0004 bytes
0xfe3de63c:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
Bad length 7 in (null), expected 6-6
0xfe3de640:      0x7b000005: 3DPRIMITIVE: fail sequential
0xfe3de644:      0x00000000:    vertex count
0xfe3de648:      0x00000003:    start vertex
0xfe3de64c:      0x00000004:    instance count
0xfe3de650:      0x00000001:    start instance
0xfe3de654:      0x00000000:    index bias
0xfe3de658:      0x00000000: MI_NOOP
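
For readers decoding the dump by hand, here is a minimal sketch of splitting a command header DWord into its fields. It assumes the usual Intel render command header layout (command type in bits 31:29, sub-type in bits 28:27, opcode in bits 26:24, sub-opcode in bits 23:16, and a length field in bits 7:0 holding the total DWord count minus two). The function and field names are illustrative and not taken from any decoder tool.

```python
def decode_header(dword):
    """Split a 32-bit render command header into its bit fields.

    Assumed layout: type[31:29], subtype[28:27], opcode[26:24],
    subopcode[23:16], length[7:0] stored as (total DWords - 2).
    """
    return {
        "type": (dword >> 29) & 0x7,
        "subtype": (dword >> 27) & 0x3,
        "opcode": (dword >> 24) & 0x7,
        "subopcode": (dword >> 16) & 0xFF,
        "total_dwords": (dword & 0xFF) + 2,
    }
```

Under that assumption, the IPEHR value 0x7b000005 describes a command that is 7 DWords long, which is consistent with the "Bad length 7" complaint printed by the decoder in the extract above.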

Comment 2 Mark Janes 2017-01-17 16:50:57 UTC

Bjorn, please attach your Xorg log so we can understand how X was configured when it died.

Some gpu hangs in X have been resolved recently.  Can you please try the following and report your results:

 - update to public 4.9 kernel
 - update to Mesa 13.0.3
 - compile/install latest xf86-video-intel

Regardless of whether this fixes your issue, please also attempt to reproduce with the modesetting driver, and let us know if you encounter hangs in that configuration.

Let me know if you need more information on how to do any of this.

Comment 3 Bjørn Mork 2017-01-17 17:26:24 UTC

Created attachment 129003 [details]
xorg log file from approximately when the reported error occurred

This is the log from a later X session than the one in which I initially reported the GPU hang.  It should show the configuration at the time.  Was this what you needed?

Comment 4 Bjørn Mork 2017-01-17 17:35:29 UTC

(In reply to Mark Janes from comment #2)
> Bjorn, please attach your Xorg log so we can understand how X was configured
> when it died.
> 
> Some gpu hangs in X have been resolved recently.  Can you please try the
> following and report your results:
> 
>  - update to public 4.9 kernel

I am generally following the kernel development, so I've already been through a few 4.9-rcs and the 4.9 release by now.  I am currently testing the v4.10-rcs.  None of these kernel versions have changed the "stuck on render ring" issue.  It occurs with a frequency of once or twice a week, usually as a result of resizing an X client like an xterm.

The only noticeable effect of upgrading the kernel was that the driver error handling got significantly worse with v4.9-rcX and later.  It does not successfully reset the GPU anymore.  Instead it tries again and again, roughly every 20 seconds.

>  - update to Mesa 13.0.3
>  - compile/install latest xf86-video-intel


Will look into this.  FYI, I am usually following Debian testing/sid wrt userspace.

> Regardless of whether this fixes your issue, please also attempt to reproduce
> with the modesetting driver, and let us know if you encounter hangs in that
> configuration.

OK

> Let me know if you need more information on how to do any of this.

Hints are appreciated, but I guess I will figure it out by looking at how Debian packages these things.

Comment 5 Mark Janes 2017-01-17 19:50:40 UTC

Thanks for the log.  Your version of sna is before this patch:

https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=4acd4a7d3d2f41227022fa7581cfb85a0b124eae

which fixed at least one xorg gpu hang, as described in the bug referenced in the commit.  A duplicate of that bug describes switching to modesetting:

https://bugs.freedesktop.org/show_bug.cgi?id=99325

It will help us to improve the quality of Mesa to know:

 * do you encounter hangs with a newer sna?
 * do you encounter hangs with modesetting?

Comment 6 Bjørn Mork 2017-01-17 20:23:47 UTC

Looking over what I've actually got, it seems that I'm already testing the configurations you refer to.  Using Debian sid, this is what I currently have:

Mesa 13.0.3:

bjorn@miraculix:~$ dpkg -l libgl1-mesa\*|egrep ^ii
ii  libgl1-mesa-dev:amd64        13.0.3-1     amd64        free implementation of the OpenGL API -- GLX development files
ii  libgl1-mesa-dri:amd64        13.0.3-1     amd64        free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-glx:amd64        13.0.3-1     amd64        free implementation of the OpenGL API -- GLX runtime
ii  libgl1-mesa-glx:i386         13.0.3-1     i386         free implementation of the OpenGL API -- GLX runtime

A git snapshot of xserver-xorg-video-intel from 20161206:

bjorn@miraculix:~$ dpkg -l xserver-xorg-video-intel|egrep ^ii
ii  xserver-xorg-video-intel 2:2.99.917+git20161206-1 amd64        X.Org X server -- Intel i8xx, i9xx display driver

And I'm running a v4.10-rc4 kernel.

The Debian xorg installation also seems to prefer the modesetting driver by default, so it appears I'm currently using that.  That just tells you how much of this I'm normally aware of :)
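
One way to confirm which X driver is actually in use is to look at the message prefixes in the Xorg log. The sketch below assumes that the modesetting driver logs with a "modeset(N):" prefix and xf86-video-intel with an "intel(N):" prefix; that assumption, the regular expression and the name active_ddx are illustrative and should be checked against a real /var/log/Xorg.0.log.

```python
import re

# Assumed prefixes: "(II) modeset(0): ..." and "(II) intel(0): ..."
DRIVER_RE = re.compile(r"\((?:II|--|\*\*|==|WW|EE)\) (modeset|intel)\(\d+\):")


def active_ddx(log_text):
    """Guess the active X driver from the most frequent log prefix.

    Returns "modeset", "intel", or None if neither prefix is present.
    """
    counts = {"modeset": 0, "intel": 0}
    for match in DRIVER_RE.finditer(log_text):
        counts[match.group(1)] += 1
    if not any(counts.values()):
        return None
    return max(counts, key=counts.get)
```

Passing the text of the Xorg log to active_ddx() gives a quick answer without reading the whole file by eye.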


The bad news is that I still experience GPU HANGs with this configuration. I cannot tell if they are the same issue or something different, but I'll upload a new /sys/class/drm/card0/error with the matching xorg.log from the last incident.

Comment 7 Bjørn Mork 2017-01-17 20:26:10 UTC

Created attachment 129007 [details]
/sys/class/drm/card0/error from GPU HANG with modeset, Linux v4.10-rc4 and Mesa 13.0.3

Comment 8 Bjørn Mork 2017-01-17 20:29:16 UTC

Created attachment 129008 [details]
xorg.log from GPU HANG with modeset, Mesa 13.0.3 and Linux v4.10-rc4

This is the log with the actual GPU HANG event, as you can see by matching up the Modeline log lines with the timing of the GPU resets:


[19308.656674] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1171], reason: Hang on render ring, action: reset
[19308.656769] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[19308.656770] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[19308.656771] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[19308.656772] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[19308.656773] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[19308.657131] drm/i915: Resetting chip after gpu hang
[19308.657752] [drm] RC6 on
[19308.677139] [drm] GuC firmware load skipped
[19328.645312] drm/i915: Resetting chip after gpu hang
[19328.649380] [drm] RC6 on
[19328.668497] [drm] GuC firmware load skipped
[19348.612672] drm/i915: Resetting chip after gpu hang
[19348.613017] [drm] RC6 on
[19348.630830] [drm] GuC firmware load skipped
[19364.612475] drm/i915: Resetting chip after gpu hang
[19364.614544] [drm] RC6 on
[19364.629781] [drm] GuC firmware load skipped
[19382.660101] drm/i915: Resetting chip after gpu hang
[19382.660955] [drm] RC6 on
[19382.680661] [drm] GuC firmware load skipped
[19402.628876] drm/i915: Resetting chip after gpu hang
[19402.629229] [drm] RC6 on
[19402.643134] [drm] GuC firmware load skipped
[19422.660054] drm/i915: Resetting chip after gpu hang
[19422.660419] [drm] RC6 on
[19422.675415] [drm] GuC firmware load skipped
[19440.644097] drm/i915: Resetting chip after gpu hang
[19440.644558] [drm] RC6 on
[19440.663878] [drm] GuC firmware load skipped
[19458.627752] drm/i915: Resetting chip after gpu hang
[19458.634024] [drm] RC6 on
[19458.650700] [drm] GuC firmware load skipped
[19478.659877] drm/i915: Resetting chip after gpu hang
[19478.665303] [drm] RC6 on
[19478.684634] [drm] GuC firmware load skipped
[19498.627632] drm/i915: Resetting chip after gpu hang
[19498.634862] [drm] RC6 on
[19498.653638] [drm] GuC firmware load skipped
[19510.659670] drm/i915: Resetting chip after gpu hang
[19510.665894] [drm] RC6 on
[19510.680479] [drm] GuC firmware load skipped
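
The spacing between the repeated resets can be read off the kernel timestamps. As a minimal sketch, assuming dmesg lines in the format shown above (the name reset_intervals is illustrative):

```python
import re

# Captures the kernel timestamp of every reset attempt.
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] drm/i915: Resetting chip after gpu hang"
)


def reset_intervals(dmesg_text):
    """Return the seconds elapsed between consecutive reset attempts."""
    times = [float(t) for t in RESET_RE.findall(dmesg_text)]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

For the first three reset lines in the log above, the intervals come out at just under 20 seconds each.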

Comment 9 Mika Kuoppala 2017-01-19 15:02:06 UTC

Could you retry with git://anongit.freedesktop.org/drm-tip
if you can reproduce the reset loop?

Comment 10 Bjørn Mork 2017-01-19 16:55:01 UTC

(In reply to Mika Kuoppala from comment #9)
> Could you retry with git://anongit.freedesktop.org/drm-tip
> if you can reproduce the reset loop?

Will try.  But I am so far unsuccessful in provoking it, so I'll just have to wait and see.

Comment 11 Bjørn Mork 2017-01-19 17:27:12 UTC

Created attachment 129051 [details]
dmesg, Xorg.log and /sys/class/drm/card0/error from drm-tip GPU hang and repeated resets

I was lucky.  Just got a hang again running the current drm-tip. The resets were repeated multiple times, before the X session died and I was sent back to the xdm window.

The whole process was considerably quicker this time.  But having all X clients die is still a no-go...

Attaching a tar.tz with 3 files: dmesg output, /sys/class/drm/card0/error and /var/log/Xorg.0.log.

The kernel/driver/drm was built from commit 6b590a717c7f ("drm-tip: 2017y-01m-19d-15h-51m-19s UTC integration manifest")

Comment 12 Bjørn Mork 2017-01-19 17:31:37 UTC

Created attachment 129052 [details]
Correct /var/log/Xorg.0.log from the 20170119 hang+reset

Sorry, that was of course the Xorg.0.log from the new X session, started after the reset.  Attaching the correct log from the session which ended with hang+reset.

Comment 13 Bjørn Mork 2017-01-31 08:37:38 UTC

Issue still present in v4.10.0-rc6, with 24 failing attempts to reset the GPU causing the display to hang for 414 seconds in total:

[22578.675066] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1115], reason: Hang on render ring, action: reset
[22578.675071] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[22578.675072] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[22578.675073] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[22578.675074] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[22578.675075] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[22578.675117] drm/i915: Resetting chip after gpu hang
[22578.675204] [drm] RC6 on
[22578.690808] [drm] GuC firmware load skipped
[22594.632570] drm/i915: Resetting chip after gpu hang
[22594.632799] [drm] RC6 on
[22594.648499] [drm] GuC firmware load skipped
[22614.664380] drm/i915: Resetting chip after gpu hang
[22614.664571] [drm] RC6 on
[22614.683429] [drm] GuC firmware load skipped
[22634.632106] drm/i915: Resetting chip after gpu hang
[22634.632359] [drm] RC6 on
[22634.646027] [drm] GuC firmware load skipped
[22652.679799] drm/i915: Resetting chip after gpu hang
[22652.679989] [drm] RC6 on
[22652.696463] [drm] GuC firmware load skipped
[22670.663733] drm/i915: Resetting chip after gpu hang
[22670.668684] [drm] RC6 on
[22670.683865] [drm] GuC firmware load skipped
[22688.647556] drm/i915: Resetting chip after gpu hang
[22688.658581] [drm] RC6 on
[22688.681845] [drm] GuC firmware load skipped
[22706.631362] drm/i915: Resetting chip after gpu hang
[22706.637003] [drm] RC6 on
[22706.657547] [drm] GuC firmware load skipped
[22722.631107] drm/i915: Resetting chip after gpu hang
[22722.631170] [drm] RC6 on
[22722.645082] [drm] GuC firmware load skipped
[22742.663136] drm/i915: Resetting chip after gpu hang
[22742.665635] [drm] RC6 on
[22742.682193] [drm] GuC firmware load skipped
[22760.647001] drm/i915: Resetting chip after gpu hang
[22760.649805] [drm] RC6 on
[22760.668163] [drm] GuC firmware load skipped
[22780.614816] drm/i915: Resetting chip after gpu hang
[22780.619130] [drm] RC6 on
[22780.637673] [drm] GuC firmware load skipped
[22800.646570] drm/i915: Resetting chip after gpu hang
[22800.650751] [drm] RC6 on
[22800.669378] [drm] GuC firmware load skipped
[22820.614457] drm/i915: Resetting chip after gpu hang
[22820.622438] [drm] RC6 on
[22820.641270] [drm] GuC firmware load skipped
[22834.630392] drm/i915: Resetting chip after gpu hang
[22834.636328] [drm] RC6 on
[22834.654929] [drm] GuC firmware load skipped
[22850.630252] drm/i915: Resetting chip after gpu hang
[22850.630428] [drm] RC6 on
[22850.646390] [drm] GuC firmware load skipped
[22868.614075] drm/i915: Resetting chip after gpu hang
[22868.614264] [drm] RC6 on
[22868.633856] [drm] GuC firmware load skipped
[22886.661904] drm/i915: Resetting chip after gpu hang
[22886.668489] [drm] RC6 on
[22886.684943] [drm] GuC firmware load skipped
[22906.629771] drm/i915: Resetting chip after gpu hang
[22906.635558] [drm] RC6 on
[22906.651226] [drm] GuC firmware load skipped
[22927.685275] drm/i915: Resetting chip after gpu hang
[22927.685369] [drm] RC6 on
[22927.701798] [drm] GuC firmware load skipped
[22944.645560] drm/i915: Resetting chip after gpu hang
[22944.651301] [drm] RC6 on
[22944.667156] [drm] GuC firmware load skipped
[22962.629440] drm/i915: Resetting chip after gpu hang
[22962.635535] [drm] RC6 on
[22962.651782] [drm] GuC firmware load skipped
[22982.661283] drm/i915: Resetting chip after gpu hang
[22982.666887] [drm] RC6 on
[22982.687104] [drm] GuC firmware load skipped
[22992.645253] drm/i915: Resetting chip after gpu hang
[22992.651314] [drm] RC6 on
[22992.667404] [drm] GuC firmware load skipped



Would it be possible to at least make the GPU reset work, like it used to do in v4.8 and earlier?  The hang is of course annoying, but waiting for minutes and ending up with X restarting is so much worse than the earlier behaviour that I see it as a regression in itself.
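
The figures quoted at the top of this comment (24 reset attempts over about 414 seconds) can be reproduced from the dmesg lines. A minimal sketch, assuming the log format shown above; the name summarize_resets is illustrative:

```python
import re

# Captures the kernel timestamp of every reset attempt.
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] drm/i915: Resetting chip after gpu hang"
)


def summarize_resets(dmesg_text):
    """Return (number of reset attempts, seconds from first to last)."""
    times = [float(t) for t in RESET_RE.findall(dmesg_text)]
    if not times:
        return 0, 0.0
    return len(times), times[-1] - times[0]
```

Run over the full log in this comment, the first reset is at 22578.675117 and the last at 22992.645253, which gives the span of roughly 414 seconds.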

Comment 14 Mark Janes 2017-01-31 18:14:52 UTC

Can you describe what you are doing when the gpu hang occurs?  Are you using a standard debian sid desktop environment, or are you using XFCE/KDE?

Comment 15 Bjørn Mork 2017-01-31 18:27:55 UTC

No, my desktop environment is not exactly standard. I am using the LXDE lightdm and lxsession to fire up WindowMaker as my window manager.  Kind of prehistoric, I know :)

The hangs always happen when I resize an X client by dragging its frame.  The client is usually (always?) an xterm.  I don't know if it happens to other types of clients, since I rarely resize them.  But most of the time resizing will not cause a hang.  If I were to guess, I'd say that only one out of 20-100 such events ends up hanging the GPU.

I have had no success trying to trigger the bug by repeatedly resizing an xterm, making me guess that it's not the count but some other factor.  I have no idea what that could be.  I have not been able to correlate the hangs with anything else.

Comment 16 GitLab Migration User 2019-09-25 18:57:10 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1527.
