93710 – 4.4.0-rc7: [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-01-14 11:36 UTC by Martin Mokrejs
Modified:	2017-04-11 15:30 UTC
CC List:	1 user

See Also:
i915 platform:	SNB
i915 features:	GPU hang

Attachments
/sys/class/drm/card0/error (2.04 MB, text/plain) 2016-01-14 11:36 UTC, Martin Mokrejs
dmesg.txt (66.45 KB, text/plain) 2016-01-14 11:42 UTC, Martin Mokrejs
/sys/class/drm/card0/error (2.04 MB, text/plain) 2016-01-21 09:03 UTC, Martin Mokrejs

Description Martin Mokrejs 2016-01-14 11:36:37 UTC

Created attachment 121025
/sys/class/drm/card0/error

[269775.753427] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[269775.753506] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[269775.753507] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[269775.753508] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[269775.753509] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[269775.753510] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[269915.727405] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[269959.728650] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[269963.728783] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[269967.728893] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[269971.729018] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270009.730095] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270015.730245] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270076.762027] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270249.766961] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270255.767150] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270259.767439] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270273.767813] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270329.769279] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270379.740681] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270387.770929] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270391.771459] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270430.742148] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[270434.742281] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[271625.776388] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[271830.812272] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[271834.812396] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[271838.812515] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
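The cadence of the hangs in the log above is easier to judge once the timestamps are pulled out. A minimal Python sketch, assuming the kernel log has been saved as text in the format shown; the names `hang_times` and `gaps` are illustrative only and do not belong to any existing tool:

```python
import re

# Matches lines such as:
# [269775.753427] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: ...
HANG_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\[drm\] GPU HANG:")

def hang_times(log_text):
    """Return the timestamps (seconds since boot) of every GPU HANG line."""
    times = []
    for line in log_text.splitlines():
        m = HANG_RE.match(line.strip())
        if m:
            times.append(float(m.group(1)))
    return times

def gaps(times):
    """Return the seconds elapsed between consecutive hangs."""
    return [round(b - a, 3) for a, b in zip(times, times[1:])]

# First few lines of the log from this report.
sample = """\
[269775.753427] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[269775.753506] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[269915.727405] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[269959.728650] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
"""

times = hang_times(sample)
print(len(times), "hangs, gaps in seconds:", gaps(times))
```

Run against the full dmesg.txt, this would show whether the hangs cluster in bursts a few seconds apart or are spread evenly over the uptime.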

Comment 1 Martin Mokrejs 2016-01-14 11:42:11 UTC

# uname -a
Linux foo 4.4.0-rc7-default-pciehp #1 SMP Mon Dec 28 18:30:44 CET 2015 x86_64 Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz GenuineIntel GNU/Linux
#

This is a Dell Vostro 3550 laptop with no additional graphics chip. BIOS A12.

Comment 2 Martin Mokrejs 2016-01-14 11:42:27 UTC

Created attachment 121026
dmesg.txt

Comment 3 Chris Wilson 2016-01-15 10:25:26 UTC

Note to self: the wait is referencing an active pipe, with a range within the plane. DERRMR seems set correctly.

Hmm, the vsync region is right up to the last scanline (give or take the granularity). Do you see the error if you move the window off the bottom of the screen?

Also, the error state is reporting GPU page faults from the context and VT-d. A recent drm-intel-nightly kernel should fix the context fault. Can you also try intel_iommu=off?

Do you see the same error on older kernels?

Comment 4 Martin Mokrejs 2016-01-15 10:36:44 UTC

(In reply to Chris Wilson from comment #3)

> Hmm, the vsync region is right up to the last scanline (give or take the
> granularity). Do you see the error if you move the window off the bottom of
> the screen?

I do not understand. I have just one screen, an external display connected via HDMI; the laptop's internal LVDS display is disabled. I can only add that when I leave my computer and turn off the external LCD, then on returning and powering the external LCD back on, it has no signal and the laptop's LCD is enabled instead. I have to use arandr to swap the outputs. I don't remember having to use arandr for a while. Maybe some code is too eager to disable the HDMI output? I would guess that this need to use arandr to re-enable the HDMI output is a consequence of the GPU HANG, because I noticed it only after the GPU HANG.

> 
> Also the error state is reporting GPU page faults from the context and VT'd.
> A recent drm-intel-nightly kernel should fix the context fault, and can you
> try intel_iommu=off?

I can try that after my next reboot, in a few days.

> 
> Do you see the same error on older kernels?

You can search Bugzilla for my previous reports of GPU HANGs; I have no idea which of them have the same traces. But I did not see it with 4.3.0, 4.3.3 or 3.18.xx.

Comment 5 Martin Mokrejs 2016-01-20 09:56:52 UTC

(In reply to Martin Mokrejs from comment #4)
> (In reply to Chris Wilson from comment #3)
> 
> > Hmm, the vsync region is right up to the last scanline (give or take the
> > granularity). Do you see the error if you move the window off the bottom of
> > the screen?
> 
> I do not understand. I have just one screen, via HDMI to an external
> display. The internal LVDS display of the laptop is disabled. I can only add
> that when I leave my computer and turn off the external LCD, then after
> returning back and turning on the external LCD power it has no signal and
> the laptop's LCD is enabled. I have to use arandr to swap the outputs. I
> don't remember having to use arandr for a while. Maybe some code is too
> eager to disable the HDMI output? I would dare to guess that this (need to
> use arandr to re-enable HDMI output) is a consequence of the GPU HANG,
> because I realized this only after the GPU HANG.

The need to use arandr/xrandr to re-enable the external HDMI output and to disable the internal LVDS is not related to the GPU hang. After a few more boots with the same kernel and runtime options, I conclude this is a separate new issue elsewhere in the kernel. It happens whenever I power off my external LCD (not if it just falls asleep and gets woken up).


> > Also the error state is reporting GPU page faults from the context and VT'd.
> > A recent drm-intel-nightly kernel should fix the context fault, and can you
> > try intel_iommu=off?
> 
> I can after I reboot after some days.

I have booted up several times, but so far I have not hit the GPU HANG again. Therefore, I don't think trying intel_iommu=off would let us reach any conclusion. It just does not manifest often enough.

Comment 6 Martin Mokrejs 2016-01-21 09:03:17 UTC

Created attachment 121172
/sys/class/drm/card0/error

Another occurrence, with the same settings, so IOMMU is still turned on.

[195975.826841] [drm] GPU HANG: ecode 6:-1:0x00000000, reason: Kicking stuck wait on render ring, action: continue
[195975.826906] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[195975.826907] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[195975.826908] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[195975.826909] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[195975.826910] [drm] GPU crash dump saved to /sys/class/drm/card0/error
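Since the dump named in the last log line is around 2 MB, it can help to compress it before attaching. A rough Python sketch, assuming the sysfs path printed by the driver above and read permission on it (typically root); the function name `save_error_state` and the default output file name are made up for illustration:

```python
import gzip
import shutil

def save_error_state(src="/sys/class/drm/card0/error",
                     dest="i915_error.txt.gz"):
    """Copy the i915 error state to a gzip-compressed file and return dest."""
    # Stream the file in chunks instead of trusting the size sysfs reports.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling `save_error_state()` right after a hang would leave a compressed copy in the current directory.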

Comment 7 Martin Mokrejs 2016-02-06 23:21:27 UTC

With the same kernel and intel_iommu=off I did not hit this issue (yet).
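When comparing runs with and without the option, it is worth confirming that the running kernel was really booted with it. A small Python sketch, assuming the boot parameters are taken from `/proc/cmdline`; the helper name `iommu_values` and the sample command line are hypothetical:

```python
def iommu_values(cmdline):
    """Return every value given to intel_iommu= on a kernel command line."""
    values = []
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            values.append(token.split("=", 1)[1])
    return values

# Hypothetical command line, not taken from the reporter's machine.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 intel_iommu=off quiet"
print(iommu_values(sample))
```

On a live system the input would be `open("/proc/cmdline").read()`; an empty list means the option was not passed at all.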

Comment 8 yann 2017-03-16 13:09:14 UTC

(In reply to Martin Mokrejs from comment #7)
> With the same kernel and intel_iommu=off I did not hit this issue (yet).

We seem to have neglected the bug a bit, apologies.

Martin Mokrejs, improvements have since been pushed to the kernel that should benefit your system, so please re-test with the latest kernel and mark the bug as REOPENED if you can reproduce it (attaching a fresh GPU error dump and kernel log), or as RESOLVED/* if you cannot.

Comment 9 yann 2017-04-11 12:26:21 UTC

(In reply to yann from comment #8)
> (In reply to Martin Mokrejs from comment #7)
> > With the same kernel and intel_iommu=off I did not hit this issue (yet).
> 
> We seem to have neglected the bug a bit, apologies.
> 
> Martin Mokrejs, since There were improvements pushed in kernel that will
> benefit to your system, so please re-test with latest kernel and mark as
> REOPENED if you can reproduce (and attach fresh gpu error dump & kernel log)
> and RESOLVED/* if you cannot reproduce.

Timeout - assuming resolved+fixed.

If the problem still persists with the latest kernels (preferably drm-tip from git://anongit.freedesktop.org/git/drm-tip), reopen this bug with the latest logs as attachments.

Comment 10 Martin Mokrejs 2017-04-11 12:29:46 UTC

I was just trying to connect to Bugzilla now. I baked my CPU for a day or so (the hang seemed to be associated with high CPU load) and did not hit this issue anymore, so I conclude it is fixed in 4.10.8.

Comment 11 yann 2017-04-11 15:30:11 UTC

(In reply to Martin Mokrejs from comment #10)
> I was just trying to connect to bugzilla now. After I baked my CPU for a day
> or so (seemed to be associated with high CPU load) and I did not hit this
> issue anymore, I conclude 4.10.8 is fixed.

Thanks, Martin, for this confirmation.
