Summary: | stuck on render ring (v4.7-rc7 Sky Lake Integrated Graphics [8086:1916]) | ||
---|---|---|---|
Product: | Mesa | Reporter: | Bjørn Mork <bjorn> |
Component: | Drivers/DRI/i965 | Assignee: | Bjørn Mork <bjorn> |
Status: | RESOLVED MOVED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs, mark.a.janes |
Version: | git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | SKL | i915 features: | GPU hang |
Attachments: |
cat /sys/class/drm/card0/error |bzip2 >/tmp/error.bz2
xorg log file from approximately when the reported error occurred
/sys/class/drm/card0/error from GPU HANG with modeset, Linux v4.10-rc4 and Mesa 13.0.3
xorg.log from GPU HANG with modeset, Mesa 13.0.3 and Linux v4.10-rc4
dmesg, Xorg.log and /sys/class/drm/card0/error from drm-tip GPU hang and repeated resets
Correct /var/log/Xorg.0.log from the 20170119 hang+reset |
Assigning to Mesa product. From this error dump, the hang is happening in the render ring batch with active head at 0xfe3de644, with 0x7b000005 (3DPRIMITIVE) as IPEHR.
Batch extract (around 0xfe3de644):
0xfe3de624: 0x78090005: 3DSTATE_VERTEX_ELEMENTS
0xfe3de628: 0x02000000: buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0xfe3de62c: 0x22220000: (0.0, 0.0, 0.0, 0.0), dst offset 0x00 bytes
0xfe3de630: 0x02f60000: buffer 0: invalid, type 0x00f6, src offset 0x0000 bytes
0xfe3de634: 0x11230000: (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0xfe3de638: 0x02f60004: buffer 0: invalid, type 0x00f6, src offset 0x0004 bytes
0xfe3de63c: 0x11230000: (X, Y, 0.0, 1.0), dst offset 0x00 bytes
Bad length 7 in (null), expected 6-6
0xfe3de640: 0x7b000005: 3DPRIMITIVE: fail sequential
0xfe3de644: 0x00000000: vertex count
0xfe3de648: 0x00000003: start vertex
0xfe3de64c: 0x00000004: instance count
0xfe3de650: 0x00000001: start instance
0xfe3de654: 0x00000000: index bias
0xfe3de658: 0x00000000: MI_NOOP

Bjorn, please attach your Xorg log so we can understand how X was configured when it died.
Some gpu hangs in X have been resolved recently. Can you please try the following and report your results:
- update to public 4.9 kernel
- update to Mesa 13.0.3
- compile/install latest xf86-video-intel
Regardless of whether this fixes your issue, please also attempt to reproduce with the modesetting driver, and let us know if you encounter hangs in that configuration.
Let me know if you need more information on how to do any of this.

Created attachment 129003 [details]
xorg log file from approximately when the reported error occurred
This is the log from a later X session than the one in which I initially reported the GPU hang. It should show the configuration at the time. Was this what you needed?
(In reply to Mark Janes from comment #2)
> Bjorn, please attach your Xorg log so we can understand how X was configured
> when it died.
>
> Some gpu hangs in X have been resolved recently. Can you please try the
> following and report your results:
>
> - update to public 4.9 kernel
I am generally following the kernel development, so I've already been through a few 4.9-rcs and the 4.9 release by now. I am currently testing the v4.10-rcs. None of these kernel versions have changed the "stuck on render ring" issue. It occurs with a frequency of once or twice a week, usually as a result of resizing an X client like an xterm.
The only noticeable effect of upgrading the kernel was that the driver error handling got significantly worse with v4.9-rcX and later. It does not successfully reset the GPU anymore. Instead it tries again and again every 20 seconds.
> - update to Mesa 13.0.3
> - compile/install latest xf86-video-intel
Will look into this. FYI, I am usually following Debian testing/sid wrt userspace.
> Regardless of whether this fixes your issue, please also attempt to reproduce
> with the modesetting driver, and let us know if you encounter hangs in that
> configuration.
OK
> Let me know if you need more information on how to do any of this.
Hints are appreciated, but I guess I will figure it out by looking at how Debian packages these things.

Thanks for the log. Your version of sna is before this patch:
https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=4acd4a7d3d2f41227022fa7581cfb85a0b124eae
which fixed at least one xorg gpu hang, as described in the bug referenced in the commit. A duplicate of that bug describes switching to modesetting:
https://bugs.freedesktop.org/show_bug.cgi?id=99325
It will help us to improve the quality of Mesa to know:
* do you encounter hangs with a newer sna?
* do you encounter hangs with modesetting?

Looking over what I've actually got, it seems that I'm already testing the configurations you refer to.
Using Debian sid, this is what I currently have:
Mesa 13.0.3:
bjorn@miraculix:~$ dpkg -l libgl1-mesa\*|egrep ^ii
ii libgl1-mesa-dev:amd64 13.0.3-1 amd64 free implementation of the OpenGL API -- GLX development files
ii libgl1-mesa-dri:amd64 13.0.3-1 amd64 free implementation of the OpenGL API -- DRI modules
ii libgl1-mesa-glx:amd64 13.0.3-1 amd64 free implementation of the OpenGL API -- GLX runtime
ii libgl1-mesa-glx:i386 13.0.3-1 i386 free implementation of the OpenGL API -- GLX runtime
A git snapshot of xserver-xorg-video-intel from 20161206:
bjorn@miraculix:~$ dpkg -l xserver-xorg-video-intel|egrep ^ii
ii xserver-xorg-video-intel 2:2.99.917+git20161206-1 amd64 X.Org X server -- Intel i8xx, i9xx display driver
And I'm running a v4.10-rc4 kernel.
The Debian xorg installation also seems to prefer the modesetting driver by default, so it appears I'm currently using that. Just tells how much of this I'm normally aware of :)
The bad news is that I still experience GPU HANGs with this configuration. I cannot tell if they are the same issue or something different, but I'll upload a new /sys/class/drm/card0/error with the matching xorg.log from the last incident.

Created attachment 129007 [details]
/sys/class/drm/card0/error from GPU HANG with modeset, Linux v4.10-rc4 and Mesa 13.0.3
Created attachment 129008 [details]
xorg.log from GPU HANG with modeset, Mesa 13.0.3 and Linux v4.10-rc4
This is the log with the actual GPU HANG event, as you can see by matching up the Modeline log lines with the timing of the GPU resets:
[19308.656674] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1171], reason: Hang on render ring, action: reset
[19308.656769] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[19308.656770] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[19308.656771] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[19308.656772] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[19308.656773] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[19308.657131] drm/i915: Resetting chip after gpu hang
[19308.657752] [drm] RC6 on
[19308.677139] [drm] GuC firmware load skipped
[19328.645312] drm/i915: Resetting chip after gpu hang
[19328.649380] [drm] RC6 on
[19328.668497] [drm] GuC firmware load skipped
[19348.612672] drm/i915: Resetting chip after gpu hang
[19348.613017] [drm] RC6 on
[19348.630830] [drm] GuC firmware load skipped
[19364.612475] drm/i915: Resetting chip after gpu hang
[19364.614544] [drm] RC6 on
[19364.629781] [drm] GuC firmware load skipped
[19382.660101] drm/i915: Resetting chip after gpu hang
[19382.660955] [drm] RC6 on
[19382.680661] [drm] GuC firmware load skipped
[19402.628876] drm/i915: Resetting chip after gpu hang
[19402.629229] [drm] RC6 on
[19402.643134] [drm] GuC firmware load skipped
[19422.660054] drm/i915: Resetting chip after gpu hang
[19422.660419] [drm] RC6 on
[19422.675415] [drm] GuC firmware load skipped
[19440.644097] drm/i915: Resetting chip after gpu hang
[19440.644558] [drm] RC6 on
[19440.663878] [drm] GuC firmware load skipped
[19458.627752] drm/i915: Resetting chip after gpu hang
[19458.634024] [drm] RC6 on
[19458.650700] [drm] GuC firmware load skipped
[19478.659877] drm/i915: Resetting chip after gpu hang
[19478.665303] [drm] RC6 on
[19478.684634] [drm] GuC firmware load skipped
[19498.627632] drm/i915: Resetting chip after gpu hang
[19498.634862] [drm] RC6 on
[19498.653638] [drm] GuC firmware load skipped
[19510.659670] drm/i915: Resetting chip after gpu hang
[19510.665894] [drm] RC6 on
[19510.680479] [drm] GuC firmware load skipped
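The reset cadence in a log like the one above can be tallied mechanically rather than by eye. The following is a minimal sketch, not part of the original report: the helper names `reset_times` and `summarize` are hypothetical, and the regular expression assumes the exact message text `drm/i915: Resetting chip after gpu hang` with a bracketed uptime timestamp, as it appears in this thread.

```python
import re

# Matches dmesg lines such as:
#   [19308.657131] drm/i915: Resetting chip after gpu hang
# The message text is taken from the log in this thread; other kernel
# versions may word it differently (assumption).
RESET_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+drm/i915: Resetting chip after gpu hang"
)


def reset_times(dmesg_text):
    """Return the uptime timestamps (seconds) of every reset attempt."""
    times = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.match(line.strip())
        if match:
            times.append(float(match.group(1)))
    return times


def summarize(times):
    """Count the resets and measure the time between first and last one."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "resets": len(times),
        "span_s": times[-1] - times[0] if times else 0.0,
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# First lines of the log above, used as sample input.
SAMPLE = """\
[19308.657131] drm/i915: Resetting chip after gpu hang
[19308.657752] [drm] RC6 on
[19328.645312] drm/i915: Resetting chip after gpu hang
[19348.612672] drm/i915: Resetting chip after gpu hang
"""

if __name__ == "__main__":
    print(summarize(reset_times(SAMPLE)))
```

Feeding the full `dmesg` output to `reset_times` instead of `SAMPLE` gives the number of reset attempts and how long the display was stalled.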
Could you retry with git://anongit.freedesktop.org/drm-tip if you can reproduce the reset loop?

(In reply to Mika Kuoppala from comment #9)
> Could you retry with git://anongit.freedesktop.org/drm-tip
> if you can reproduce the reset loop?
Will try. But I am so far unsuccessful in provoking it, so I'll just have to wait and see.

Created attachment 129051 [details]
dmesg, Xorg.log and /sys/class/drm/card0/error from drm-tip GPU hang and repeated resets
I was lucky. Just got a hang again running the current drm-tip. The resets were repeated multiple times, before the X session died and I was sent back to the xdm window.
The whole process was considerably quicker this time. But having all X clients die is still a no-go...
Attaching a tar.tz with 3 files: dmesg output, /sys/class/drm/card0/error and /var/log/Xorg.0.log.
The kernel/driver/drm was built from commit 6b590a717c7f ("drm-tip: 2017y-01m-19d-15h-51m-19s UTC integration manifest")
Created attachment 129052 [details]
Correct /var/log/Xorg.0.log from the 20170119 hang+reset
Sorry, that was of course the Xorg.0.log from the new X session, started after the reset. Attaching the correct log from the session which ended with hang+reset.
Issue still present in v4.10.0-rc6, with 24 failing attempts to reset the GPU causing the display to hang for 414 seconds in total:
[22578.675066] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1115], reason: Hang on render ring, action: reset
[22578.675071] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[22578.675072] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[22578.675073] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[22578.675074] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[22578.675075] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[22578.675117] drm/i915: Resetting chip after gpu hang
[22578.675204] [drm] RC6 on
[22578.690808] [drm] GuC firmware load skipped
[22594.632570] drm/i915: Resetting chip after gpu hang
[22594.632799] [drm] RC6 on
[22594.648499] [drm] GuC firmware load skipped
[22614.664380] drm/i915: Resetting chip after gpu hang
[22614.664571] [drm] RC6 on
[22614.683429] [drm] GuC firmware load skipped
[22634.632106] drm/i915: Resetting chip after gpu hang
[22634.632359] [drm] RC6 on
[22634.646027] [drm] GuC firmware load skipped
[22652.679799] drm/i915: Resetting chip after gpu hang
[22652.679989] [drm] RC6 on
[22652.696463] [drm] GuC firmware load skipped
[22670.663733] drm/i915: Resetting chip after gpu hang
[22670.668684] [drm] RC6 on
[22670.683865] [drm] GuC firmware load skipped
[22688.647556] drm/i915: Resetting chip after gpu hang
[22688.658581] [drm] RC6 on
[22688.681845] [drm] GuC firmware load skipped
[22706.631362] drm/i915: Resetting chip after gpu hang
[22706.637003] [drm] RC6 on
[22706.657547] [drm] GuC firmware load skipped
[22722.631107] drm/i915: Resetting chip after gpu hang
[22722.631170] [drm] RC6 on
[22722.645082] [drm] GuC firmware load skipped
[22742.663136] drm/i915: Resetting chip after gpu hang
[22742.665635] [drm] RC6 on
[22742.682193] [drm] GuC firmware load skipped
[22760.647001] drm/i915: Resetting chip after gpu hang
[22760.649805] [drm] RC6 on
[22760.668163] [drm] GuC firmware load skipped
[22780.614816] drm/i915: Resetting chip after gpu hang
[22780.619130] [drm] RC6 on
[22780.637673] [drm] GuC firmware load skipped
[22800.646570] drm/i915: Resetting chip after gpu hang
[22800.650751] [drm] RC6 on
[22800.669378] [drm] GuC firmware load skipped
[22820.614457] drm/i915: Resetting chip after gpu hang
[22820.622438] [drm] RC6 on
[22820.641270] [drm] GuC firmware load skipped
[22834.630392] drm/i915: Resetting chip after gpu hang
[22834.636328] [drm] RC6 on
[22834.654929] [drm] GuC firmware load skipped
[22850.630252] drm/i915: Resetting chip after gpu hang
[22850.630428] [drm] RC6 on
[22850.646390] [drm] GuC firmware load skipped
[22868.614075] drm/i915: Resetting chip after gpu hang
[22868.614264] [drm] RC6 on
[22868.633856] [drm] GuC firmware load skipped
[22886.661904] drm/i915: Resetting chip after gpu hang
[22886.668489] [drm] RC6 on
[22886.684943] [drm] GuC firmware load skipped
[22906.629771] drm/i915: Resetting chip after gpu hang
[22906.635558] [drm] RC6 on
[22906.651226] [drm] GuC firmware load skipped
[22927.685275] drm/i915: Resetting chip after gpu hang
[22927.685369] [drm] RC6 on
[22927.701798] [drm] GuC firmware load skipped
[22944.645560] drm/i915: Resetting chip after gpu hang
[22944.651301] [drm] RC6 on
[22944.667156] [drm] GuC firmware load skipped
[22962.629440] drm/i915: Resetting chip after gpu hang
[22962.635535] [drm] RC6 on
[22962.651782] [drm] GuC firmware load skipped
[22982.661283] drm/i915: Resetting chip after gpu hang
[22982.666887] [drm] RC6 on
[22982.687104] [drm] GuC firmware load skipped
[22992.645253] drm/i915: Resetting chip after gpu hang
[22992.651314] [drm] RC6 on
[22992.667404] [drm] GuC firmware load skipped
Would it be possible to at least make the GPU reset work, like it used to do in v4.8 and earlier?
The hang is of course annoying, but waiting for minutes and ending up with X restarting is so much worse than the earlier behaviour that I see it as a regression in itself.

Can you describe what you are doing when the gpu hang occurs? Are you using a standard debian sid desktop environment, or are you using XFCE/KDE?

No, my desktop environment is not exactly standard. I am using the LXDE lightdm and lxsession to fire up WindowMaker as my window manager. Kind of prehistoric, I know :)
The hangs always happen when I resize an X client by dragging its frame. The client is usually (always?) an xterm. I don't know if it happens to other types of clients since I rarely resize them. But most of the time resizing will not cause a hang. If I were to guess, I'd say that only one out of 20-100 such events ends up hanging the GPU. I have had no success trying to trigger the bug by repeatedly resizing an xterm, which makes me guess that it's not the count but some other factor. I have no idea what that could be. I have not been able to correlate the hangs with anything else.

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1527.
Created attachment 125010 [details]
cat /sys/class/drm/card0/error |bzip2 >/tmp/error.bz2
Just got this. Don't know how interesting it is, but filing bug as requested:
Jul 11 15:53:45 miraculix kernel: [ 3584.878743] [drm] stuck on render ring
Jul 11 15:53:45 miraculix kernel: [ 3584.879658] [drm] GPU HANG: ecode 9:0:0x84dffff8, in Xorg [2555], reason: Engine(s) hung, action: reset
Jul 11 15:53:45 miraculix kernel: [ 3584.879901] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jul 11 15:53:45 miraculix kernel: [ 3584.879906] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jul 11 15:53:45 miraculix kernel: [ 3584.879910] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jul 11 15:53:45 miraculix kernel: [ 3584.879914] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jul 11 15:53:45 miraculix kernel: [ 3584.879918] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jul 11 15:53:45 miraculix kernel: [ 3584.885867] drm/i915: Resetting chip after gpu hang
Jul 11 15:53:47 miraculix kernel: [ 3586.879631] [drm] RC6 on
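When comparing several hang reports, it can help to pull the fields out of the `GPU HANG` line rather than reading them by eye. The sketch below is an illustration only: the function name `parse_gpu_hang` is hypothetical, the pattern is derived solely from the line format seen in this report, and the three `ecode` components are returned uninterpreted because the thread does not define them.

```python
import re

# Pattern derived from the "GPU HANG" line quoted in this report; the
# layout may differ on other kernel versions (assumption).
HANG_RE = re.compile(
    r"GPU HANG: ecode (\d+):(\d+):(0x[0-9a-fA-F]+), "
    r"in (\S+) \[(\d+)\], reason: (.*?), action: (\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG line as a dict, or None if absent."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    first, second, third, process, pid, reason, action = match.groups()
    return {
        # The three colon-separated ecode parts, kept as plain strings.
        "ecode": (first, second, third),
        "process": process,
        "pid": int(pid),
        "reason": reason,
        "action": action,
    }


# The hang line from this report, including its syslog prefix.
SAMPLE_LINE = (
    "Jul 11 15:53:45 miraculix kernel: [ 3584.879658] [drm] GPU HANG: "
    "ecode 9:0:0x84dffff8, in Xorg [2555], reason: Engine(s) hung, "
    "action: reset"
)

if __name__ == "__main__":
    print(parse_gpu_hang(SAMPLE_LINE))
```

Because the pattern uses `search`, it works on lines with or without the syslog prefix, so the same helper applies to raw `dmesg` output.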