Bug 95312 - 4.6-rc5: [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
Summary: 4.6-rc5: [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-07 20:49 UTC by Martin Mokrejs
Modified: 2017-04-11 15:29 UTC
CC List: 1 user

See Also:
i915 platform: SNB
i915 features: power/Other


Attachments
/sys/class/drm/card0/error.bz2 (216.51 KB, text/plain)
2016-05-07 20:49 UTC, Martin Mokrejs
.config.gz (24.58 KB, application/octet-stream)
2016-05-07 20:59 UTC, Martin Mokrejs
dmesg (67.59 KB, text/plain)
2016-05-07 20:59 UTC, Martin Mokrejs
/sys/kernel/debug/dri/0/i915_error_state.bz2 (216.51 KB, application/octet-stream)
2016-05-07 21:32 UTC, Martin Mokrejs
Xorg.0.log (for currently running kernel) (104.84 KB, text/plain)
2016-05-08 06:10 UTC, Martin Mokrejs

Description Martin Mokrejs 2016-05-07 20:49:31 UTC
Created attachment 123539 [details]
/sys/class/drm/card0/error.bz2

This could be a dupe of bug #93710 (same hardware, same kernel .config and without explicit intel_iommu=off).

[798005.164877] snd_hda_intel 0000:00:1b.0: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj.
[800008.420528] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[800008.420592] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[800008.420593] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[800008.420594] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[800008.420595] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[800008.420596] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[848558.214001] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
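
For anyone grepping logs for similar reports, the hang line above can be split into its fields with a small script. This is only a sketch based on the single line shown in this report; the regular expression and the function name parse_gpu_hang are illustrative assumptions, not part of any i915 tooling, and other kernel versions may format the line differently.

```python
import re

# Pattern derived from the one "GPU HANG" line quoted in this report (assumption:
# other kernels may print the fields differently).
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<ecode>\S+), reason: (?P<reason>.*?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict with ts, ecode, reason and action, or None if no match."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["ts"] = float(fields["ts"])  # seconds as printed by dmesg
    return fields


line = ("[800008.420528] [drm] GPU HANG: ecode 6:-1:0x00000000, "
        "reason: Kicking stuck wait on render ring, action: continue")
print(parse_gpu_hang(line))
```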
Comment 1 Martin Mokrejs 2016-05-07 20:58:43 UTC
# uname -a
Linux vostro 4.6.0-rc5-default-pciehp #3 SMP Wed Apr 27 16:57:10 CEST 2016 x86_64 Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz GenuineIntel GNU/Linux
# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
stepping        : 7
microcode       : 0x1b
cpu MHz         : 3299.999
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
bugs            :
bogomips        : 5586.99
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

...

This is a Dell Vostro 3550 laptop with BIOS A12 (Sandy Bridge). An external LCD is connected via HDMI; the internal LCD panel is turned off.

What could trigger the issue? Both CPU cores had been fully loaded for a week or so, and from time to time my external screen goes blank after I leave the computer for a while. The hang could have happened when I came back and woke up the X11 screen, or when I switched the external HDMI display back on with its power button. Both scenarios happen regularly, and I do not know what I was doing around that time. Any of these could be related to the issue, but that is a wild guess.
Comment 2 Martin Mokrejs 2016-05-07 20:59:25 UTC
Created attachment 123540 [details]
.config.gz
Comment 3 Martin Mokrejs 2016-05-07 20:59:58 UTC
Created attachment 123541 [details]
dmesg
Comment 4 Martin Mokrejs 2016-05-07 21:32:22 UTC
Created attachment 123542 [details]
/sys/kernel/debug/dri/0/i915_error_state.bz2

Interestingly, I see I have another error state as well?!

# ls -latr /sys/kernel/debug/dri/0/i915_error_state
-rw-r--r-- 1 root root 0 Apr 27 19:05 /sys/kernel/debug/dri/0/i915_error_state
# ls -latr /sys/class/drm/card0/error 
-rw------- 1 root root 0 May  7 22:45 /sys/class/drm/card0/error
#
# w
 23:13:14 up 10 days,  6:07, 32 users,  load average: 2.97, 3.22, 3.29
...
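
Note that both files list as 0 bytes in the ls output above even though the attached dumps are about 216 KB compressed: these are kernel virtual files whose content is generated on read, so the dump has to be read out rather than judged by its listed size. A minimal sketch of saving a compressed copy follows; the function name save_error_state and the default output name are illustrative assumptions, and reading the real paths needs root.

```python
import bz2
import shutil


def save_error_state(src="/sys/class/drm/card0/error", dst="error.bz2"):
    """Stream src into a bzip2-compressed file at dst and return dst.

    Reads until EOF instead of trusting the size reported by ls/stat,
    which is 0 for these virtual files.
    """
    with open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Example (as root), using the two paths from this report:
#   save_error_state("/sys/class/drm/card0/error", "error.bz2")
#   save_error_state("/sys/kernel/debug/dri/0/i915_error_state",
#                    "i915_error_state.bz2")
```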

Is the timestamp on /sys/kernel/debug/dri/0/i915_error_state set at bootup?

No, not really:

# grep -a 'syslog-ng starting up' /var/log/messages 
Apr 27 16:11:54 vostro syslog-ng[3641]: syslog-ng starting up; version='3.6.2'
Apr 27 16:21:46 vostro syslog-ng[3301]: syslog-ng starting up; version='3.6.2'
Apr 27 16:27:58 vostro syslog-ng[3328]: syslog-ng starting up; version='3.6.2'
Apr 27 16:31:54 vostro syslog-ng[3639]: syslog-ng starting up; version='3.6.2'
Apr 27 17:06:08 vostro syslog-ng[3642]: syslog-ng starting up; version='3.6.2'
#
# grep -a '^Apr 27 19:0' /var/log/messages
... [ gives me no clue why i915_error_state was created, syslogd was running but that is all I can say ]
#
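
As a cross-check, the boot time can be computed from the `w` output above (23:13:14, up 10 days 6:07). The sketch below assumes the date was 2016-05-07 local time, taken from the error file's timestamp, since `w` prints only the time of day.

```python
from datetime import datetime, timedelta

# Assumption: local date 2016-05-07, taken from the ls timestamp of
# /sys/class/drm/card0/error; `w` itself only prints 23:13:14.
now = datetime(2016, 5, 7, 23, 13, 14)
uptime = timedelta(days=10, hours=6, minutes=7)

boot = now - uptime
print(boot)  # 2016-04-27 17:06:14

# i915_error_state is stamped Apr 27 19:05, i.e. well after boot.
error_state_stamp = datetime(2016, 4, 27, 19, 5)
print(error_state_stamp - boot)  # 1:58:46
```

The result lines up with the last syslog-ng start at 17:06:08, so the 19:05 timestamp falls roughly two hours into the current boot rather than at bootup.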


But from /var/log/messages I see that during a previous boot I had a different issue:

Apr 27 16:12:46 vostro kernel: [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe A FIFO underrun
Apr 27 16:12:46 vostro kernel: [drm:intel_set_pch_fifo_underrun_reporting] *ERROR* uncleared pch fifo underrun on pch transcoder A
Apr 27 16:12:46 vostro kernel: [drm:intel_pch_fifo_underrun_irq_handler] *ERROR* PCH transcoder A FIFO underrun
Apr 27 16:12:46 vostro kernel: [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe B FIFO underrun
Apr 27 16:12:46 vostro kernel: [drm:intel_check_pch_fifo_underruns] *ERROR* pch fifo underrun on pch transcoder B


but I was inserting and ejecting my ExpressCards a few seconds after these messages were logged, so the two are probably related.
Comment 5 Martin Mokrejs 2016-05-08 06:10:43 UTC
Created attachment 123544 [details]
Xorg.0.log (for currently running kernel)
Comment 6 yann 2016-09-02 15:56:41 UTC
Both GPU crash dumps describe the same issue.

From the error dump, there is no hang in the render ring batch; the active head is
at 0x0006a028, with 0x01800100 (MI_WAIT_FOR_EVENT) as IPEHR (could this be an issue linked to DPMS?)

Moreover we can note that we have ERROR: 0x00000012
    Context page GTT translation generated a fault (GTT entry not valid)
    TLB page VTD translation generated an error
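
As a quick sanity check, 0x00000012 has exactly two bits set, which matches the two decoded messages above. The helper below is only an illustrative sketch (set_bits is a made-up name); it lists the set bits and makes no claim about which bit maps to which message.

```python
def set_bits(value, width=32):
    """Return the positions of the bits that are set in value."""
    return [bit for bit in range(width) if (value >> bit) & 1]


print(set_bits(0x00000012))  # [1, 4]
```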

For reference, batch extract (around 0x0006a028):

0x0006a018:      0x11000001: MI_LOAD_REGISTER_IMM
0x0006a01c:      0x00002050:    dword 1
0x0006a020:      0x00010001:    dword 2
0x0006a024:      0x01800100: MI_WAIT_FOR_EVENT, plane B scan line wait
0x0006a028:      0x11000001: MI_LOAD_REGISTER_IMM
0x0006a02c:      0x00002050:    dword 1
0x0006a030:      0x00010000:    dword 2
0x0006a034:      0x11000001: MI_LOAD_REGISTER_IMM
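
The relation between the active head and IPEHR can be checked mechanically against this extract. The sketch below parses the lines exactly as pasted and looks up the dword immediately before the head; the helper names (parse_batch, dword_before) are hypothetical, and the 4-byte stride is simply the spacing of the addresses in the extract.

```python
import re

EXTRACT = """\
0x0006a018:      0x11000001: MI_LOAD_REGISTER_IMM
0x0006a01c:      0x00002050:    dword 1
0x0006a020:      0x00010001:    dword 2
0x0006a024:      0x01800100: MI_WAIT_FOR_EVENT, plane B scan line wait
0x0006a028:      0x11000001: MI_LOAD_REGISTER_IMM
0x0006a02c:      0x00002050:    dword 1
0x0006a030:      0x00010000:    dword 2
0x0006a034:      0x11000001: MI_LOAD_REGISTER_IMM
"""

LINE_RE = re.compile(r"^(0x[0-9a-fA-F]+):\s+(0x[0-9a-fA-F]+):\s+(.*)$")


def parse_batch(text):
    """Map each address to (dword value, decoded text)."""
    entries = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            entries[int(m.group(1), 16)] = (int(m.group(2), 16), m.group(3).strip())
    return entries


def dword_before(entries, head, stride=4):
    """Return the entry one dword before head, or None if it is not in the extract."""
    return entries.get(head - stride)


ACTIVE_HEAD = 0x0006A028  # from the error dump
IPEHR = 0x01800100        # from the error dump

batch = parse_batch(EXTRACT)
value, text = dword_before(batch, ACTIVE_HEAD)
print(hex(value), value == IPEHR, text)
```

In this extract the dword just before the active head is the MI_WAIT_FOR_EVENT at 0x0006a024, and its value equals the reported IPEHR.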
Comment 7 Jani Saarinen 2016-12-09 11:04:38 UTC
Reporter, is this still valid with latest kernel?
Comment 8 Martin Mokrejs 2017-04-11 12:31:02 UTC
I have only just managed to connect to Bugzilla. I baked my CPU for a day or so (the issue seemed to be associated with high CPU load) and did not hit it anymore, so I conclude it is fixed in 4.10.8.
Comment 9 yann 2017-04-11 15:29:05 UTC
(In reply to Martin Mokrejs from comment #8)
> I have only just managed to connect to Bugzilla. I baked my CPU for a day
> or so (the issue seemed to be associated with high CPU load) and did not hit
> it anymore, so I conclude it is fixed in 4.10.8.

Thanks Martin

