Bug 48209 - Hangcheck does not work on SNB
Summary: Hangcheck does not work on SNB
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Daniel Vetter
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-02 11:46 UTC by Lukas Hejtmanek
Modified: 2017-07-24 23:02 UTC
CC List: 4 users

See Also:
i915 platform:
i915 features:


Attachments
error state (1.97 MB, application/octet-stream) - 2012-04-02 11:46 UTC, Lukas Hejtmanek
dmesg log (121.62 KB, text/plain) - 2012-04-02 11:47 UTC, Lukas Hejtmanek
gpu top capture when hung (1.08 KB, text/plain) - 2012-04-02 11:48 UTC, Lukas Hejtmanek
Trigger hangcheck for repeated "IRQ misses" (3.24 KB, patch) - 2012-04-03 11:36 UTC, Chris Wilson
error state (1.97 MB, text/plain) - 2012-04-09 06:41 UTC, Lukas Hejtmanek

Description Lukas Hejtmanek 2012-04-02 11:46:17 UTC
Created attachment 59396 [details]
error state

When playing video through the XV extension for a while, I got a GPU hang that is not correctly recognized, and the GPU is not reset.

HW: Lenovo T420

00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Device 21d0
8086:0126

X driver git 6142232fa0feeb39412cda85ca727cc770eaa042

Kernel 3.3.0
Comment 1 Lukas Hejtmanek 2012-04-02 11:47:09 UTC
Created attachment 59397 [details]
dmesg log
Comment 2 Lukas Hejtmanek 2012-04-02 11:48:12 UTC
Created attachment 59398 [details]
gpu top capture when hung

No error; it just displays nothing.
Comment 3 Lukas Hejtmanek 2012-04-02 11:54:13 UTC
I put all the files into this location, as some of them are too large:

http://undomiel.ics.muni.cz/tmp/hang/

Before I did echo 1 > /sys/kernel/debug/dri/0/i915_wedged:

grep . /sys/kernel/debug/dri/0/* -> debug.dri.log
intel_gpu_dump > gpudump.log
intel_reg_dump > regdump.log
gpu_top.txt - intel_gpu_top screen capture

After the echo 1:

grep . /sys/kernel/debug/dri/0/* -> debug1.dri.log
dmesg > dmesg1.txt
cp /sys/kernel/debug/dri/0/i915_error_state .

Hope all files are clear ;)

Is there anything more I could do?
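
For reference, the capture sequence above boils down to the following sketch; it assumes debugfs is mounted at /sys/kernel/debug and the device is dri/0:

cd /sys/kernel/debug/dri/0

# Pre-reset state.
grep . * > /tmp/debug.dri.log 2>/dev/null
intel_gpu_dump > /tmp/gpudump.log   # from intel-gpu-tools
intel_reg_dump > /tmp/regdump.log

# Declare the GPU wedged, forcing a reset.
echo 1 > i915_wedged

# Post-reset state.
grep . * > /tmp/debug1.dri.log 2>/dev/null
dmesg > /tmp/dmesg1.txt
cp i915_error_state /tmp/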
Comment 4 Lukas Hejtmanek 2012-04-02 11:55:56 UTC
This command unhung GPU:

echo 1 > /sys/kernel/debug/dri/0/i915_wedged

However, before this echo, reading some files from /sys/kernel/debug/dri/0/* took a fair amount of time; the same goes for intel_gpu_dump.
Comment 5 Chris Wilson 2012-04-02 13:30:58 UTC
The underlying bug is

commit 5d031e5b633d910f35e6e0abce94d9d842390006
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 8 13:34:13 2012 +0000

    drm/i915: Remove use of the autoreported ringbuffer HEAD position
    
    This is a revert of 6aa56062eaba67adfb247cded244fd877329588d.
    
    This was originally introduced to workaround reads of the ringbuffer
    registers returning 0 on SandyBridge causing hangs due to ringbuffer
    overflow. The root cause here was reads through the GT powerwell require
    the forcewake dance, something we only learnt of later. Now it appears
    that reading the reported head position from the HWS is returning
    garbage, leading once again to hangs.


However, as demonstrated, hangcheck cannot afford to fail. It seems we are stuck holding a mutex, so can you find the process in the 'D' state and cat /proc/$pid/stack? Failing that, grab the stacks of the likely suspects (X).
Comment 6 Lukas Hejtmanek 2012-04-02 14:37:00 UTC
What do I need to compile into the kernel so that I have /proc/$PID/stack?
Comment 7 Chris Wilson 2012-04-02 14:43:16 UTC
/proc/$pid/stack is shorthand for something like cat /proc/`pidof X`/stack - i.e. you replace $pid with the value of interest for your system at the time of failure. It will require a bit of investigation to find the right process to see who is holding the i915 lock.
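
For example, a minimal sketch of the hunt (assuming the X server binary is simply named X):

# List processes stuck in uninterruptible sleep ('D' state).
ps -eo pid,stat,comm | awk '$2 ~ /^D/'

# Dump the kernel stack of a likely suspect, e.g. the X server.
cat /proc/$(pidof X)/stack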
Comment 8 Lukas Hejtmanek 2012-04-02 14:51:33 UTC
I know; the point is that I don't have the 'stack' entry in /proc:

/proc/3473# ls -1
auxv
cgroup
clear_refs
cmdline
comm
coredump_filter
cpuset
cwd
environ
exe
fd
fdinfo
io
limits
loginuid
maps
mem
mountinfo
mounts
mountstats
net
ns
oom_adj
oom_score
oom_score_adj
pagemap
personality
root
sessionid
smaps
stat
statm
status
syscall
task
wchan
Comment 9 Chris Wilson 2012-04-02 15:00:59 UTC
My misunderstanding, CONFIG_STACKTRACE is the key ingredient.
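
A quick way to check, as a sketch (the /proc/config.gz variant assumes CONFIG_IKCONFIG_PROC is enabled):

# In the kernel build tree:
grep CONFIG_STACKTRACE .config

# Or against the running kernel:
zgrep CONFIG_STACKTRACE /proc/config.gz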
Comment 10 Lukas Hejtmanek 2012-04-03 11:23:45 UTC
Xorg is not in the D state (no process is in the D state). This is the stack trace of the X server:
[<ffffffffa031651c>] i915_wait_request+0x3dc/0x460 [i915]
[<ffffffff81209f31>] timerqueue_add+0x61/0xb0
[<ffffffff8104d970>] autoremove_wake_function+0x0/0x30
[<ffffffffa0317ae8>] i915_gem_object_set_to_gtt_domain+0xa8/0xe0 [i915]
[<ffffffffa0317c15>] i915_gem_set_domain_ioctl+0xf5/0x120 [i915]
[<ffffffffa01f427c>] drm_ioctl+0x40c/0x4b0 [drm]
[<ffffffff8103ecd2>] recalc_sigpending+0x12/0x50
[<ffffffffa0317b20>] i915_gem_set_domain_ioctl+0x0/0x120 [i915]
[<ffffffff8103ebdb>] __dequeue_signal+0xfb/0x1e0
[<ffffffff81041970>] block_sigmask+0x40/0x50
[<ffffffff810022c5>] do_signal+0x1d5/0x730
[<ffffffff8110d0de>] do_vfs_ioctl+0x8e/0x4f0
[<ffffffff81002bb3>] sys_rt_sigreturn+0x203/0x220
[<ffffffff8110d589>] sys_ioctl+0x49/0x80
[<ffffffff8149fae2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Comment 11 Chris Wilson 2012-04-03 11:28:27 UTC
Darn, should have read the dmesg.

OK, we just wait endlessly upon the next seqno, which can never arrive due to the ringbuffer overflow. So we need to blow up if we are waiting for a future request while idle.
Comment 12 Chris Wilson 2012-04-03 11:36:59 UTC
Created attachment 59445 [details] [review]
Trigger hangcheck for repeated "IRQ misses"
Comment 13 Lukas Hejtmanek 2012-04-08 14:58:14 UTC
(In reply to comment #12)
> Created attachment 59445 [details] [review]
> Trigger hangcheck for repeated "IRQ misses"


Looks like it works, but GPU recovery does not work completely. The X session did not recover even after killing mplayer. But this time I tried VA-API output, as I was unable to reproduce the hang with just XV output.

[427334.174185] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 14298556, at 14298555], missed IRQ?
[427335.675554] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 14298556, at 14298555], missed IRQ?
[427337.177001] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 14298556, at 14298555], missed IRQ?
[427337.177020] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[427337.177027] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[427337.201453] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 14298556 at 14298555, next 14298557)
[427337.203071] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
[427372.032146] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[427372.032228] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 14298565 at 14298555, next 14298566)
[427372.032921] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
Comment 14 Lukas Hejtmanek 2012-04-09 06:40:03 UTC
Well, I managed to reproduce the hang with ordinary video playback through XV.
After the second reset, it recovered. It seems that the X server didn't even notice that the GPU hung. Good, at least for me, as it did not disable acceleration.
 
[56464.394807] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 1418567, at 1418566], missed IRQ?
[56465.896203] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 1418567, at 1418566], missed IRQ?
[56467.397595] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 1418567, at 1418566], missed IRQ?
[56467.397616] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[56467.397623] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[56467.421850] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 1418567 at 1418566, next 1418568)
[56467.422668] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
[56473.729375] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[56473.729427] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 1418574 at 1418454, next 1418575)
Comment 15 Lukas Hejtmanek 2012-04-09 06:41:26 UTC
Created attachment 59678 [details]
error state

Am I hitting the same bug?
Comment 16 Chris Wilson 2012-04-09 07:00:30 UTC
(In reply to comment #15)
> Created attachment 59678 [details]
> error state
> 
> Am I hitting the same bug?

Not quite. And how Daniel will laugh, this is:

commit 53d227f282eb9fa4c7cdbfd691fa372b7ca8c4c3
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jan 25 16:32:49 2012 +0100

    drm/i915: fixup seqno allocation logic for lazy_request
    
    Currently we reserve seqnos only when we emit the request to the ring
    (by bumping dev_priv->next_seqno), but start using it much earlier for
    ring->oustanding_lazy_request. When 2 threads compete for the gpu and
    run on two different rings (e.g. ddx on blitter vs. compositor)
    hilarity ensued, especially when we get constantly interrupted while
    reserving buffers. [snip]

Note the seqno confusion between the render/blt rings.
Comment 17 Lukas Hejtmanek 2012-04-09 07:05:00 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > Created attachment 59678 [details]
> > error state
> > 
> > Am I hitting the same bug?
> 
> Not quite. And how Daniel will laugh, this is:
> 
> commit 53d227f282eb9fa4c7cdbfd691fa372b7ca8c4c3
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed Jan 25 16:32:49 2012 +0100

Should this commit fix that bug?
Comment 18 Chris Wilson 2012-04-09 07:17:56 UTC
I believe so. In short, just update to drm-intel-next (as some fixes depend upon new infrastructure) for a round-up of all known bug fixes ;-)
Comment 19 Lukas Hejtmanek 2012-04-09 12:01:36 UTC
(In reply to comment #18)
> I believe so. In short, just update to drm-intel-next (as some fixes depend
> upon new infrastructure) for a round-up of all known bug fixes ;-)

Is there an easy way to fetch and merge drm-intel-next into my local clone of the kernel git tree (currently at 3.3.0)?

And where is the current location of drm-intel-next? This one, git://people.freedesktop.org/~keithp/linux, seems to be out of date.
Comment 20 Chris Wilson 2012-04-09 12:35:56 UTC
Daniel is maintaining the -next trees (and doing the donkey work of tracking patches for -fixes), so please merge in the drm-intel-next-queued branch of git://people.freedesktop.org/~danvet/drm-intel.
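
In practice that merge amounts to something like the following sketch, run from the local kernel clone:

# Add Daniel's tree as a remote and merge the next-queued branch.
git remote add danvet git://people.freedesktop.org/~danvet/drm-intel
git fetch danvet
git merge danvet/drm-intel-next-queued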
Comment 21 Chris Wilson 2012-04-16 04:45:57 UTC
I believe the root causes of the hangs are fixed, and this should address the failed hangcheck:

commit d1e61e7fc4456c4cb9a33ed182edf40e34ddedea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 10 17:00:41 2012 +0100

    drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ
    
    On the first instance we just wish to kick the waiters and see if that
    terminates the wait conditions. If it does not, then we do not want to
    keep retrying without ever making any forward progress and becoming
    stuck in a hangcheck loop.
    
    Reported-and-tested-by: Lukas Hejtmanek <xhejtman@fi.muni.cz>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48209
    Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

