Created attachment 111123 [details]
I'm using a clean OpenELEC 4.2.1 installation, and the system hangs for a couple of seconds when just browsing the menus.
[ 74.253274] [drm] stuck on render ring
[ 74.254915] [drm] GPU HANG: ecode 0:0x85df3c1d, in xbmc.bin , reason: Ring hung, action: reset
[ 74.254919] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 74.254922] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 74.254925] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 74.254928] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 74.254931] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 76.250758] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[ 80.245684] [drm] stuck on render ring
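When comparing logs from several kernels, it can help to pull just the hang records out of dmesg. The following is a minimal sketch in Python, assuming the message layout shown in the log above. The function name parse_gpu_hangs and the field names are made up for this illustration, and the number before the colon in the ecode is kept as an opaque prefix because the log does not say what it means.

```python
import re

# Pattern modelled on the "GPU HANG" lines quoted in this report.
# Field names are hypothetical; the ecode prefix is kept as-is.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<prefix>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+)\s*, "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_gpu_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in dmesg_text."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            record = match.groupdict()
            record["time"] = float(record["time"])
            hangs.append(record)
    return hangs


if __name__ == "__main__":
    sample = (
        "[ 74.253274] [drm] stuck on render ring\n"
        "[ 74.254915] [drm] GPU HANG: ecode 0:0x85df3c1d, in xbmc.bin , "
        "reason: Ring hung, action: reset\n"
    )
    for hang in parse_gpu_hangs(sample):
        print(hang["time"], hang["ecode"], hang["process"], hang["reason"])
```

Lines that only say "stuck on render ring" are skipped on purpose, since they carry no ecode or process name.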
Created attachment 111124 [details]
Created attachment 111125 [details]
The symptoms don't match the recently fixed HSW GT1 bug,
Author: Chris Wilson <firstname.lastname@example.org>
Date: Tue Dec 16 10:02:27 2014 +0000
drm/i915: Disable PSMI sleep messages on all rings around context switches
but it is worthwhile trying drm-intel-nightly just in case. I presume it is a different bug though...
I also tried the following kernels and got hangs with each of them:
- 3.13 ubuntu
- 3.15 ubuntu
- 3.15.0 ubuntu with i915.enable_rc6=0
- 3.17 OpenELEC RC3 which includes those gpu hang fixes:
I ran memtest once with no errors, and have not updated to the latest BIOS yet.
The system is:
4GB-Kit G-Skill PC3-12800U CL9
Intel Pentium G3220 Box
WD Purple WD10PURX 1TB
I can upload the logs for 3.15 and 3.17 later.
Created attachment 111149 [details]
Created attachment 111150 [details]
Created attachment 111151 [details]
Created attachment 111153 [details]
Created attachment 111154 [details]
Created attachment 111156 [details]
Created attachment 111158 [details]
Created attachment 111159 [details]
Created attachment 111161 [details]
Created attachment 111162 [details]
CC'ing Ben as he was recently looking at HSW GT1 hangs.
Hi Knut. Have you tried the latest drm-intel-nightly? A fix went in from Chris Wilson to address issues like this which were mostly on IVB. It did fix our local HSW GT1 hang though.
If that doesn't work, please try this patch:
Created attachment 111195 [details]
Comment on attachment 111195 [details]
This includes Ben's patch.
Created attachment 111227 [details]
Playing clannad.mkv with vaapi and tracing batch and dword
As requested on irc by bwidawks
Created attachment 111228 [details]
bw1 Error no video playback only browsing the menu
Created attachment 111229 [details]
no video playback only browsing the menu va-log1
Created attachment 111230 [details]
no video playback only browsing the menu va-log2
Created attachment 111233 [details] [review]
limit max PS threads for gt1
Created attachment 111235 [details] [review]
The equivalent mesa patch
Created attachment 111244 [details]
The limit_max_PS_threads patches seem to improve the situation. So far I experience far fewer hangs, and I haven't actually seen one yet while watching the video, so I assume they are very short.
Before, hangs would at least disturb playback and often freeze the picture for several seconds.
A new issue is that the error state is now empty.
it contains this warning:
[ 35.202724] WARNING: CPU: 0 PID: 1175 at drivers/gpu/drm/i915/i915_gem_execbuffer.c:126 eb_lookup_vmas.isra.15+0x373/0x410 [i915]()
(In reply to Knut Rupprecht from comment #26)
> The limit_max_PS_threads patches seem to improve the situation. So far I
> experience far less hangs and I didn't actually see one yet while watching
> the video. So I assume they are very short.
> Before hangs would at least disturb playback and often freeze the picture
> for several seconds.
Your errorstate still has max threads = 101 in the mesa batch that hung.
Please provide the output of dpkg -l | grep 10.1.3, so that I can see which packages were not updated after your build.
Created attachment 111271 [details]
dmesg including kernel, intel and mesa patch
This time Peter provided the patched mesa packages.
Created attachment 111272 [details]
Xorg.0.log including kernel, mesa and intel patches.
This crash was unrelated (fast forwarded too fast)
[ 5506.063941] show_signal_msg: 75 callbacks suppressed
[ 5506.063944] DVDPlayerVideo: segfault at 7fa9c09921e0 ip 00007fa9c09921e0 sp 00007fa9c6061ad8 error 15
Didn't the patches change max_threads from 102->70? I'm wondering how it could be at 101.
Could it be that gt1 isn't correctly identified somewhere so the wrong max_threads is used, or that max_threads is set somewhere else as well?
Comment on attachment 111244 [details]
This error state did not have the mesa patch in it. It is leading to confusion.
Comment on attachment 111271 [details]
dmesg including kernel, intel and mesa patch
We were unable to retrieve the error state from this for some reason:
100540 zeeeh │ [15:17:44] bwidawks, no flames so far :)
100540 fritsch │ [15:18:03] let's see if it survives the mesa compile
100540 bwidawks │ [15:56:11] fritsch, zeeeh no news is good news?
100540 zeeeh │ [15:58:05] bwidawks, No :/ I'm uploading. Fritsch is gone.
100540 zeeeh │ [16:02:21] bwidawks, I think I have to test again, the error file was empty
100540 bwidawks │ [16:02:51] zeeeh: can you pastebin dmesg?
100540 zeeeh │ [16:19:30] bwidawks, http://paste.ubuntu.com/9607552/
100540 bwidawks │ [16:20:34] zeeeh: still hanging in mesa... I really want the error state now :-)
100540 bwidawks │ [16:20:39] oh wait
100540 bwidawks │ [16:20:46] zeeeh: are you using SNA?
100540 bwidawks │ [16:21:10] i presume that may have the wrong threadcounts too
100540 zeeeh │ [16:21:22] uh whats sna?
100540 bwidawks │ [16:21:38] can you pastebin /var/log/Xorg.0.log?
100540 zeeeh │ [16:22:15] http://paste.ubuntu.com/9607575/
100540 bwidawks │ [16:22:15] zeeeh: hmm, you're actually hitting a kernel assertion here
100540 bwidawks │ [16:23:10] a strange warning too
100540 zeeeh │ [16:24:19] maybe I made an error installing the new mesa?
100540 bwidawks │ [16:25:06] zeeeh: well, the kernel issue is really strange, I don't want to look at that. If I had the error state, I could tell you
100540 zeeeh │ [16:27:22] bwidawks, I rebooted before this hang, but the error file again is empty
Knut, can you confirm once again the following:
1. Reboot the machine.
2. Using both the mesa and vaapi packages Peter built, reproduce the hang.
3. Try to read error state from sysfs.
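Since the error state has come back empty more than once, step 3 can be scripted so that an empty capture is noticed immediately. This is only a sketch: the sysfs path is the one printed in dmesg above, while the function name save_error_state and the output file name are assumptions made for this example.

```python
from pathlib import Path

# Path taken from the "GPU crash dump saved to ..." message in dmesg.
ERROR_STATE = Path("/sys/class/drm/card0/error")


def save_error_state(src=ERROR_STATE, dst="gpu-error-state.txt"):
    """Copy the error state to dst and return the number of bytes written.

    Raises RuntimeError if the source is empty, so that an unusable
    capture is noticed before the next reboot.
    """
    data = Path(src).read_bytes()
    if not data.strip():
        raise RuntimeError(f"{src} is empty; reproduce the hang and retry")
    Path(dst).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    # Reading the sysfs file may require root.
    print(save_error_state(), "bytes saved")
```

Running it right after the hang, before any reboot, keeps the capture tied to the hang that was just reproduced.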
Created attachment 111301 [details]
dump including kernel, intel and mesa patch
Created attachment 111302 [details]
dmesg including kernel, intel and mesa patch
Created attachment 111303 [details]
Xorg.log including kernel, intel and mesa patch
Created attachment 111305 [details] [review]
Same as previous, but for the kernel
The kernel sets up a default context with thread counts. Make sure we obey the PS thread count rules
Created attachment 111306 [details] [review]
Change PS thread count for null render context (kernel)
Created attachment 111307 [details]
This should include Ben's latest patch.
Just to recap on the status:
No patches: 2.5s to BSD hang
intel-vaapi: ??s to mesa hang
intel-vaapi + mesa patch: 3-10m to mesa hang
intel-vaapi + mesa patch + kernel patch: 20m to mesa hang
intel-vaapi + mesa + kernel + blt only: being tested
While it could be a false correlation, the patches so far seem to have improved the situation. The SNA fix seems to be in master as well, but Knut wasn't able to test that in the limited time he had, so we went with blt only instead.
Expecting an update from Knut on the blt only DDX. Chris, will blit only never emit 3d state?
(In reply to Ben Widawsky from comment #42)
> Expecting an update from Knut on the blt only DDX. Chris, will blit only
> never emit 3d state?
AccelMethod "BLT" will never emit any 3D commands.
(In reply to Knut Rupprecht from comment #41)
> Created attachment 111307 [details]
> This should include Bens latest patch.
It dies before the end of a reasonably long mesa batch, with multiple PS kernels (actually alternating between a pair of kernels) each only using 70 threads. That does imply that it is not the PS kernel itself, but it could still be the PS state hitting a corner condition for the first time.
Created attachment 111316 [details]
error dump with all patches and SNA using generic backend
I started testing at 02:00; hangs occurred at 02:09 and 02:10. It did not hang for the next 8 hours, although two other messages were logged:
[Do Dez 25 04:58:18 2014] [drm] HPD interrupt storm detected on connector HDMI-A-2: switching from hotplug detection to polling
[Do Dez 25 06:53:36 2014] perf interrupt took too long (2521 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Created attachment 111317 [details]
dmesg with all patches and SNA using generic backend
Created attachment 111318 [details]
Xorg-log with all patches and SNA using generic backend
Created attachment 111335 [details]
Error state with pipecontrol patch
Including the previous patches and this one:
It took 15 minutes until the crash.
The vaapi patch above was a write up by Ben. I only made it compile and exported it via git.
The error state is once again hanging in mesa. It looks to me like a failure in the SBE, after which the rest of the pipe dies.
I am not sure whether mesa also needs an extra pipe control like vaapi. I recommend trying this if possible (again, a long shot).
The bug is still here.
[drm] stuck on render ring
It happens deterministically when browsing to www.google.com/chrome with a freshly installed Firefox.
This is on a fresh, clean Linux Mint 17.1 install with a drm-intel-nightly kernel build (drm-intel-nightly: 2014y-12m-30d-13h-01m-34s). This kernel includes the patches mentioned above.
The machine is a ThinkPad X61 with an Intel GM965/GL960 Integrated Graphics Controller.
Dan, yours is almost certainly a different bug. See https://bugs.freedesktop.org/show_bug.cgi?id=80568 (among many others). As the subject states, this bug is specific to HSW GT1.
Created attachment 111582 [details]
Error State mesa 10.4.0
This is with mesa 10.4.0, drm-intel-nightly including all the patches from this bugtracker. Took about a minute to hang.
Created attachment 111583 [details]
dmesg mesa 10.4.0
Created attachment 111584 [details]
Xorg.log mesa 10.4.0
Created attachment 111596 [details] [review]
Always initialize streamout buffers
As the commit message says, there is garbage in the last error state. As an example:
0x001a4228: 0x79180002: 3DSTATE_SO_BUFFER
0x001a422c: 0x44954ffe: DWord 1:
SO Buffer Index: 2
SO Buffer Object Control State: 2
Surface Pitch: 4094
0x001a4230: 0x169a840f: DWord 2:
Surface Base Address: 0x169a840c
0x001a4234: 0xcde4100d: DWord 3:
Surface End Address: 0xcde4100c
This patch should initialize the SOL state regardless of whether we use xfb. It's probably something we want to put in the kernel null ctx setup, but for now we can just test it in mesa.
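For anyone who wants to check other DWords from an error state the same way, the decode above can be reproduced with a few shifts and masks. This is a sketch only: the bit positions are an assumption that was checked against nothing more than the values in this dump, and the function names are made up for the example.

```python
def decode_so_buffer_dw1(dword):
    """Split DWord 1 of 3DSTATE_SO_BUFFER into the fields shown in the dump.

    Bit positions are an assumption checked only against the values above.
    """
    return {
        "so_buffer_index": (dword >> 29) & 0x3,        # bits 30:29
        "object_control_state": (dword >> 25) & 0xF,   # bits 28:25
        "surface_pitch": dword & 0xFFF,                # bits 11:0
    }


def decode_address(dword):
    """Drop the two low bits, as the dump does for base and end addresses."""
    return dword & ~0x3


if __name__ == "__main__":
    print(decode_so_buffer_dw1(0x44954FFE))
    print(hex(decode_address(0x169A840F)), hex(decode_address(0xCDE4100D)))
```

With the DWords from the dump, this gives buffer index 2, object control state 2, and surface pitch 4094, plus the base and end addresses 0x169a840c and 0xcde4100c, matching the decoded output quoted above.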
Created attachment 111599 [details]
Error State incl. 'Always initialize streamout buffers' patch
Comment on attachment 111599 [details]
Error State incl. 'Always initialize streamout buffers' patch
17:45:31 zeeeh │ bwidawks, I made an error, last error doesn't have the new patch.
Created attachment 111600 [details]
Error dump for 'Always initialize streamout buffers'
Created attachment 111602 [details] [review]
Also initialize the streamout declaration list
Again I see garbage in the context state. Let's see if we can clean it up and uncover any real issues.
Created attachment 111604 [details]
Error State including 'initialize the streamout declaration list'
Created attachment 111605 [details]
dmesg including 'initialize the streamout declaration list'
The video ran 6 hours over night without a hang, but then it did hang.
Created attachment 111629 [details]
Only initialize, don't enable the SOL for null state
Created attachment 111632 [details]
Error State including 'vertex fetcher NULL state'
Including all patches:
It played straight for 4 hours without interaction, then crashed when I navigated the menus for about 2 minutes.
Created attachment 111637 [details]
We can't set non-zero streamout for inactive streams
Another screw up in the so_decl initialization patch
Created attachment 111650 [details]
Error State including 'fix the so_decl_list initialization'
Similar problem here running XBMC/KODI on an Intel NUC DN2820FYKH (http://ark.intel.com/de/products/78953/Intel-NUC-Kit-DN2820FYKH). I'm running XBMC 13.2 on Ubuntu 14.04 amd64 on two NUCs. There are two different versions of the DN2820FYKH with different CPUs: the older one contains an N2820 Celeron, the newer one an N2830 Celeron (http://ark.intel.com/de/compare/79052,81071).
Everything works fine on the N2820 model, but on the N2830 model I get these GPU HANGs. I also tried the current OpenELEC 5.0.0, same problem there.
[ 102.437084] [drm] stuck on render ring
[ 102.443487] [drm] GPU HANG: ecode 0:0x87f73c06, in kodi.bin , reason: Ring hung, action: reset
[ 102.443500] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 102.443504] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 102.443507] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 102.443510] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 102.443513] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 108.441778] [drm] stuck on render ring
[ 108.448158] [drm] GPU HANG: ecode 0:0x85fffffa, in kodi.bin , reason: Ring hung, action: reset
[ 108.481529] DVDPlayerVideo: segfault at 7f4c00000009 ip 00007f4c7435bb90 sp 00007f4c227fb830 error 4 in libc-2.20.so[7f4c742e3000+194000]
[ 21.360292] [drm] stuck on render ring
[ 21.368137] [drm] GPU HANG: ecode 0:0x87f73c1e, in kodi.bin , reason: Ring hung, action: reset
[ 21.368152] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 21.368162] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 21.368171] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 21.368180] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 21.368190] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 27.381520] [drm] stuck on render ring
[ 27.389472] [drm] GPU HANG: ecode 0:0x85fffffc, in kodi.bin , reason: Ring hung, action: reset
Created attachment 111699 [details] [review]
Kernel patch - single port dispatch
All the hangs that Knut has sent me are related to the PSD, or SBE interaction. Coincidentally, we use this on IVB GT1.
Created attachment 111700 [details] [review]
kernel patch - wait for SBE
Created attachment 111701 [details] [review]
kernel patch - scoreboard even on idle PSD
Adam, yours is a different bug. You have a Baytrail. Yours is likely:
Created attachment 111740 [details] [review]
enable batch buffer end workaround
Implements a workaround mandated by the spec (for all HSW).
Let's test this with all the thread count patches (mesa, kernel, vaapi)
1. Plus all the extra mesa patches for the streamout buffer and so_decl stuff (through https://bugs.freedesktop.org/attachment.cgi?id=111629)
2. Only the mesa thread count patch
I'll take error state by email to avoid clutter. If anything is interesting, I will post it.
I have a branch now with both workarounds:
We never heard back. Please re-open if there is still an issue.