Created attachment 59396 [details]
error state

When playing video through the XV extension for a while, I got a GPU hang that is not correctly recognized, and the GPU is not reset.

HW: Lenovo T420
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
    Subsystem: Lenovo Device 21d0
PCI ID: 8086:0126
X driver: git 6142232fa0feeb39412cda85ca727cc770eaa042
Kernel: 3.3.0
Created attachment 59397 [details]
dmesg log
Created attachment 59398 [details]
gpu top capture when hung

No error; it just displays nothing.
I put all the files at this location, as some of them are too large to attach:
http://undomiel.ics.muni.cz/tmp/hang/

Collected before running echo 1 > /sys/kernel/debug/dri/0/i915_wedged:
debug.dri.log    - grep . /sys/kernel/debug/dri/0
gpudump.log      - intel_gpu_dump output
regdump.log      - intel_reg_dump output
gpu_top.txt      - intel_gpu_top screen capture

Collected after the echo 1:
debug1.dri.log   - grep . /sys/kernel/debug/dri/0
dmesg1.txt       - dmesg output
i915_error_state - cp /sys/kernel/debug/dri/0/i915_error_state .

Hope all the files are clear ;) Is there anything more I could do?
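For reference, the collection sequence above as one script. This is only an illustrative sketch: it assumes debugfs is mounted at /sys/kernel/debug, the intel-gpu-tools binaries (intel_gpu_dump, intel_reg_dump) are installed, and the output directory /tmp/hang is arbitrary:

#!/bin/sh
# Illustrative consolidation of the collection steps above (run as root).
out=/tmp/hang
mkdir -p "$out"

# Snapshot all debugfs files for DRM device 0 before the forced reset
grep . /sys/kernel/debug/dri/0/* > "$out/debug.dri.log" 2>/dev/null
intel_gpu_dump > "$out/gpudump.log"
intel_reg_dump > "$out/regdump.log"

# Declare the GPU wedged; this triggers error capture and a reset attempt
echo 1 > /sys/kernel/debug/dri/0/i915_wedged

# Snapshot again after the reset
grep . /sys/kernel/debug/dri/0/* > "$out/debug1.dri.log" 2>/dev/null
dmesg > "$out/dmesg1.txt"
cp /sys/kernel/debug/dri/0/i915_error_state "$out/"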
This command unhung the GPU:
echo 1 > /sys/kernel/debug/dri/0/i915_wedged
However, before this echo, reading some files from /sys/kernel/debug/dri/0/* took a fair amount of time, and the same went for intel_gpu_dump.
The underlying bug is:

commit 5d031e5b633d910f35e6e0abce94d9d842390006
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 8 13:34:13 2012 +0000

    drm/i915: Remove use of the autoreported ringbuffer HEAD position

    This is a revert of 6aa56062eaba67adfb247cded244fd877329588d.

    This was originally introduced to work around reads of the ringbuffer
    registers returning 0 on SandyBridge, causing hangs due to ringbuffer
    overflow. The root cause here was that reads through the GT powerwell
    require the forcewake dance, something we only learnt of later. Now it
    appears that reading the reported head position from the HWS is
    returning garbage, leading once again to hangs.

However, as demonstrated, hangcheck cannot afford to fail. It seems we are stuck holding a mutex, so can you find the process in the 'D' state and cat /proc/$pid/stack? Failing that, grab the stacks of the likely suspects (X).
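A quick way to do that (a sketch; needs root, and /proc/<pid>/stack requires kernel stack-trace support):

# List processes in uninterruptible sleep ('D' state) and dump their
# kernel stacks
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== pid $pid ($(cat /proc/$pid/comm)) ==="
    cat /proc/$pid/stack
done

# Failing that, dump the X server's stack directly
cat /proc/$(pidof X)/stack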
What do I need to compile into the kernel so that I have /proc/$PID/stack?
/proc/$pid/stack is shorthand for something like cat /proc/`pidof X`/stack - i.e. you replace $pid with the pid of interest on your system at the time of failure. It will require a bit of investigation to find the right process, i.e. to see who is holding the i915 lock.
I know; the point is that I don't have the 'stack' entry in /proc:

/proc/3473# ls -1
auxv
cgroup
clear_refs
cmdline
comm
coredump_filter
cpuset
cwd
environ
exe
fd
fdinfo
io
limits
loginuid
maps
mem
mountinfo
mounts
mountstats
net
ns
oom_adj
oom_score
oom_score_adj
pagemap
personality
root
sessionid
smaps
stat
statm
status
syscall
task
wchan
My misunderstanding; CONFIG_STACKTRACE is the key ingredient.
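To check whether the running kernel already has it (a sketch; config file locations vary by distro):

# Most distros ship the config alongside the kernel image
grep CONFIG_STACKTRACE /boot/config-$(uname -r)

# Or, if the kernel was built with CONFIG_IKCONFIG_PROC
zcat /proc/config.gz | grep CONFIG_STACKTRACE

# When rebuilding, set CONFIG_STACKTRACE=y (under "Kernel hacking")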
Xorg is not in the D state (no process is in the D state). This is the stack trace of the X server:

[<ffffffffa031651c>] i915_wait_request+0x3dc/0x460 [i915]
[<ffffffff81209f31>] timerqueue_add+0x61/0xb0
[<ffffffff8104d970>] autoremove_wake_function+0x0/0x30
[<ffffffffa0317ae8>] i915_gem_object_set_to_gtt_domain+0xa8/0xe0 [i915]
[<ffffffffa0317c15>] i915_gem_set_domain_ioctl+0xf5/0x120 [i915]
[<ffffffffa01f427c>] drm_ioctl+0x40c/0x4b0 [drm]
[<ffffffff8103ecd2>] recalc_sigpending+0x12/0x50
[<ffffffffa0317b20>] i915_gem_set_domain_ioctl+0x0/0x120 [i915]
[<ffffffff8103ebdb>] __dequeue_signal+0xfb/0x1e0
[<ffffffff81041970>] block_sigmask+0x40/0x50
[<ffffffff810022c5>] do_signal+0x1d5/0x730
[<ffffffff8110d0de>] do_vfs_ioctl+0x8e/0x4f0
[<ffffffff81002bb3>] sys_rt_sigreturn+0x203/0x220
[<ffffffff8110d589>] sys_ioctl+0x49/0x80
[<ffffffff8149fae2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Darn, should have read the dmesg. OK, we just wait endlessly upon the next seqno, which can never arrive due to the ringbuffer overflow. So we need to blow up if we are waiting for a future request while idle.
Created attachment 59445 [details] [review]
Trigger hangcheck for repeated "IRQ misses"
(In reply to comment #12)
> Created attachment 59445 [details] [review]
> Trigger hangcheck for repeated "IRQ misses"

Looks like it works, but GPU recovery does not work completely. The X session did not recover even after killing mplayer. But this time I tried vaapi output, as I was unable to reproduce the hang with just XV output.

[427334.174185] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 14298556, at 14298555], missed IRQ?
[427335.675554] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 14298556, at 14298555], missed IRQ?
[427337.177001] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 14298556, at 14298555], missed IRQ?
[427337.177020] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[427337.177027] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[427337.201453] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 14298556 at 14298555, next 14298557)
[427337.203071] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
[427372.032146] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[427372.032228] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 14298565 at 14298555, next 14298566)
[427372.032921] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
Well, I managed to reproduce the hang with ordinary video playback through XV. After the second reset, it recovered. It seems that the X server didn't even notice that the GPU hung. Good, at least for me, as it did not disable acceleration.

[56464.394807] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 1418567, at 1418566], missed IRQ?
[56465.896203] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 1418567, at 1418566], missed IRQ?
[56467.397595] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 1418567, at 1418566], missed IRQ?
[56467.397616] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[56467.397623] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[56467.421850] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 1418567 at 1418566, next 1418568)
[56467.422668] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
[56473.729375] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[56473.729427] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 1418574 at 1418454, next 1418575)
Created attachment 59678 [details]
error state

Am I hitting the same bug?
(In reply to comment #15)
> Created attachment 59678 [details]
> error state
>
> Am I hitting the same bug?

Not quite. And how Daniel will laugh, this is:

commit 53d227f282eb9fa4c7cdbfd691fa372b7ca8c4c3
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jan 25 16:32:49 2012 +0100

    drm/i915: fixup seqno allocation logic for lazy_request

    Currently we reserve seqnos only when we emit the request to the ring
    (by bumping dev_priv->next_seqno), but start using it much earlier for
    ring->outstanding_lazy_request. When 2 threads compete for the gpu and
    run on two different rings (e.g. ddx on blitter vs. compositor),
    hilarity ensued, especially when we get constantly interrupted while
    reserving buffers.

    [snip]

Note the seqno confusion between the render/blt rings.
(In reply to comment #16)
> (In reply to comment #15)
> > Created attachment 59678 [details]
> > error state
> >
> > Am I hitting the same bug?
>
> Not quite. And how Daniel will laugh, this is:
>
> commit 53d227f282eb9fa4c7cdbfd691fa372b7ca8c4c3
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed Jan 25 16:32:49 2012 +0100

Should this commit fix that bug?
I believe so. In short, just update to drm-intel-next (as some fixes depend upon new infrastructure) for a round-up of all known bug fixes ;-)
(In reply to comment #18)
> I believe so. In short, just update to drm-intel-next (as some fixes depend
> upon new infrastructure) for a round-up of all known bug fixes ;-)

Is there an easy way to fetch and merge drm-intel-next into my local clone of the kernel git tree (currently at 3.3.0)? And what is the current location of drm-intel-next? This one, git://people.freedesktop.org/~keithp/linux, seems to be out of date.
Daniel is maintaining the -next trees (and doing the donkey work of tracking patches for -fixes), so please merge in the drm-intel-next-queued branch of git://people.freedesktop.org/~danvet/drm-intel
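From an existing clone, something like this should work (the remote name drm-intel is arbitrary; URL and branch as above):

# Add Daniel's tree as a remote and merge the queued -next branch
git remote add drm-intel git://people.freedesktop.org/~danvet/drm-intel
git fetch drm-intel
git merge drm-intel/drm-intel-next-queued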
I believe the root causes of the hangs are fixed, and this should address the failed hangcheck:

commit d1e61e7fc4456c4cb9a33ed182edf40e34ddedea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 10 17:00:41 2012 +0100

    drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ

    On the first instance we just wish to kick the waiters and see if that
    terminates the wait conditions. If it does not, then we do not want to
    keep retrying without ever making any forward progress and becoming
    stuck in a hangcheck loop.

    Reported-and-tested-by: Lukas Hejtmanek <xhejtman@fi.muni.cz>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48209
    Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>