Bug 97396

Summary: [drm] stuck on bsd ring (bisected)
Product: DRI
Reporter: Rainer Fiebig <mymailclone>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: minor
Priority: medium
CC: intel-gfx-bugs
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: ILK
i915 features: GPU hang

Attachments:

Description	Flags
/sys/class/drm/card0/error	none
dmesg (part of)	none
dmesg_4.4.18_problem	none
dmesg_4.8-rc3_part	none

Description Rainer Fiebig 2016-08-18 16:16:31 UTC

Since kernel 4.3 and up to 4.7.1, dmesg shows the following or similar messages after resuming from hibernation, here for 4.7.1:

[drm] stuck on bsd ring
[drm] GPU HANG: ecode 5:2:0x01000000, reason: Engine(s) hung, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang

Interestingly, with 4.4.18 I only see this:

[drm] stuck on bsd ring
[drm] GPU HANG: ecode 5:1:0x01000000, reason: Ring hung, action: reset
drm/i915: Resetting chip after gpu hang

Other than the messages, I haven't noticed any negative impact on my system (Core i3). But this may be different for other setups (see https://bugs.freedesktop.org/show_bug.cgi?id=94203). 
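
For anyone comparing these messages across kernels, here is a minimal sketch that pulls the fields out of a "GPU HANG" line. The group names gen, ring and hash are assumptions about what the three ecode fields mean (the report itself does not say), and parse_gpu_hang is a hypothetical helper, not part of any i915 tooling. The pattern also accepts an optional "in <process> [<pid>]" part, which some hang lines carry.

```python
import re
from typing import Optional

# Pattern for i915 hang lines of the form quoted in this report.
# The names gen/ring/hash are an assumed reading of the ecode fields.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<hash>0x[0-9a-fA-F]+)"
    r"(?:, in (?P<process>\S+) \[(?P<pid>\d+)\])?"
    r", reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line: str) -> Optional[dict]:
    """Return the fields of a GPU HANG line, or None if the line is not one."""
    match = HANG_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for sample in (
        "[drm] GPU HANG: ecode 5:2:0x01000000, reason: Engine(s) hung, action: reset",
        "[drm] GPU HANG: ecode 5:1:0x01000000, reason: Ring hung, action: reset",
    ):
        print(parse_gpu_hang(sample))
```

Run against the two lines above, the only differences it reports are the second ecode field and the reason string.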

I bisected between 4.2 and 4.3, "bad/good" strictly according to whether "[drm] stuck on bsd ring" occurred or not - disregarding all the other numerous warnings in connection with i915. The result this produced was:

> git bisect good
ba01cc9346bce45a8861f36bce2c4c5d44b800b2 is the first bad commit
commit ba01cc9346bce45a8861f36bce2c4c5d44b800b2
Author: John Harrison <John.C.Harrison@Intel.com>
Date:   Fri May 29 17:43:41 2015 +0100

    drm/i915: Update i915_switch_context() to take a request structure
    
    Now that the request is guaranteed to specify the context, it is possible to
    update the context switch code to use requests rather than ring and context
    pairs. This patch updates i915_switch_context() accordingly.
    
    Also removed the warning that the request's context must match the last context
    switch's context. As the context switch now gets the context object from the
    request structure, there is no longer any scope for the two to become out of
    step.
    
    For: VIZ-5115
    Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
    Reviewed-by: Tomas Elf <tomas.elf@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 3717675a773aafd62ff883278f0a9dd0b4747af9 0f32cac1cb40d7d1f757495a85e77bf39863862a M     drivers
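
The good/bad rule used for this bisect can be written down as a small check. This is only a sketch of the criterion described above: bisect_verdict is a hypothetical helper, and it assumes the dmesg output captured after resume has been saved as text. Every other i915 warning is ignored on purpose.

```python
# The single message that decided "bad" during the bisect.
MARKER = "[drm] stuck on bsd ring"


def bisect_verdict(dmesg_text: str) -> str:
    """Return 'bad' if the marker occurs in any line of the log, else 'good'."""
    hit = any(MARKER in line for line in dmesg_text.splitlines())
    return "bad" if hit else "good"


if __name__ == "__main__":
    log = "[drm] stuck on bsd ring\ndrm/i915: Resetting chip after gpu hang\n"
    print(bisect_verdict(log))
```

Because each step needs a kernel build, a boot and a hibernate/resume cycle, the verdict still has to be passed to git by hand with "git bisect good" or "git bisect bad".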
---

Comment 1 yann 2016-08-23 14:01:16 UTC

Jay, can you attach the GPU crash dump saved to /sys/class/drm/card0/error as well as the kernel log (i.e. dmesg)?
thanks

Comment 2 Rainer Fiebig 2016-08-23 19:58:33 UTC

Created attachment 125985 [details]
/sys/class/drm/card0/error

Comment 3 Rainer Fiebig 2016-08-23 19:59:16 UTC

Created attachment 125986 [details]
dmesg (part of)

Comment 4 Rainer Fiebig 2016-08-23 20:01:01 UTC

(In reply to yann from comment #1)
> Jay, can you attach the GPU crash dump saved to /sys/class/drm/card0/error as
> well as the kernel log (i.e. dmesg)?
> thanks

"error" and "dmesg" (part of) attached, kernel was 4.7.1.

It seems quite a few changes were made to the i915 driver between kernels 4.2 and 4.3, so I guess it won't be easy to figure out where the problem really is. At least the bisected commit offers a starting point.

Comment 5 Chris Wilson 2016-08-24 09:56:50 UTC

845g with a bsd ring is impressive ;)

Comment 6 Chris Wilson 2016-08-24 10:05:05 UTC

This should be fixed on drm-intel-nightly already, don't know if that made it into the 4.8 cut.

Comment 7 yann 2016-08-24 10:06:37 UTC

(In reply to Chris Wilson from comment #5)
> 845g with a bsd ring is impressive ;)

Oops, you are right; not sure why I clicked on I845G, since the PCI ID is 0x0042 (i.e. Ironlake (Clarkdale))... my bad

Thanks Chris for correcting it :)

Comment 8 Rainer Fiebig 2016-08-24 10:56:59 UTC

(In reply to Chris Wilson from comment #6)
> This should be fixed on drm-intel-nightly already, don't know if that made
> it into the 4.8 cut.

Sounds good. Did it have something to do with the bisected commit, or could I have wasted my time in a better way? ;)

Comment 9 Chris Wilson 2016-08-24 11:03:10 UTC

There were certainly bugs in how requests + contexts operated on Ironlake that required fixing. But there have also been changes to make hibernation more reliable (hopefully!). And since everything touching the GPU has a request, everything may be related back to the bisect result ;)

Comment 10 Rainer Fiebig 2016-08-24 11:15:20 UTC

(In reply to Chris Wilson from comment #9)
> There were certainly bugs in how requests + contexts operated on Ironlake
> that required fixing. But there have also been changes to make hibernation
> more reliable (hopefully!). And since everything touching the GPU has a
> request, everything may be related back to the bisect result ;)

Thanks! I read this as: "Time perhaps not completely wasted." 

Alright with me. ;)

Comment 11 Rainer Fiebig 2016-08-24 18:42:26 UTC

I just experienced a problem after resuming from s2disk: frozen windows, missing title bars, only partly rendered windows, keyboard not working. I could solve it by logging out and back in, but had to kill a VM. Kernel was 4.4.18.

Among the new messages in dmesg (see new attachment):

...
[drm] stuck on render ring
[drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg [1183], reason: Ring hung, action: reset
i915 0000:00:02.0: GEM idle failed, resume might fail
pci_pm_freeze(): i915_pm_suspend+0x0/0x40 [i915] returns -11
dpm_run_callback(): pci_pm_freeze+0x0/0xe0 returns -11
PM: Device 0000:00:02.0 failed to freeze async: error -11
drm/i915: Resetting chip after gpu hang
...

Seems more serious than I first thought. 
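
The "-11" in the freeze messages is a negated errno value, as is usual for kernel return codes; on Linux, errno 11 is EAGAIN. A small sketch for decoding such lines follows. describe_return is a hypothetical helper, and the name it prints comes from the errno table of the machine running the script, so it is only meaningful on Linux.

```python
import errno
import os
import re
from typing import Optional, Tuple

# Matches the "returns -11" and "error -11" forms seen in the lines above.
RET_RE = re.compile(r"(?:returns|error) (-\d+)")


def describe_return(line: str) -> Optional[Tuple[int, str, str]]:
    """Return (errno number, symbolic name, description) for a log line."""
    match = RET_RE.search(line)
    if match is None:
        return None
    code = -int(match.group(1))  # kernel code -11 becomes errno 11
    name = errno.errorcode.get(code, "unknown")
    return code, name, os.strerror(code)


if __name__ == "__main__":
    print(describe_return(
        "pci_pm_freeze(): i915_pm_suspend+0x0/0x40 [i915] returns -11"
    ))
```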

@Chris: Could you give me a hint concerning the patch that you mentioned in your post? I could give it a try. Or is it a whole bunch? In that case I'd rather wait until they are integrated in 4.8 (in git I have 4.8-rc2 and that doesn't seem to have the patches for this problem).

Comment 12 Rainer Fiebig 2016-08-24 18:43:59 UTC

Created attachment 126016 [details]
dmesg_4.4.18_problem

Comment 13 Chris Wilson 2016-08-24 18:54:27 UTC

I realise I was thinking of some context bugs that don't affect you, since Ironlake hasn't enabled HW contexts. Hmm, the hibernate bug I was thinking of is also related to using HW contexts. Double red herring, sorry.

Gut feeling from the error state is that the GPU is being clobbered by the framebuffer upon resume. Could I just tempt you into trying an rc or nightly? Even if just to grab an error state? :)

Comment 14 Rainer Fiebig 2016-08-24 19:06:57 UTC

(In reply to Chris Wilson from comment #13)
...
> thinking of is also related to using HW contexts. Double red herring, sorry.
Eat it! ;)

> 
> Gut feeling from the error state is that the GPU is being clobbered by the
> framebuffer upon resume. Could I just tempt you into trying an rc or nightly?
> Even if just to grab an error state? :)

I'll try 4.8-rc3 and report back.

Comment 15 Rainer Fiebig 2016-08-24 20:34:50 UTC

How was your red herring? ;)

4.8-rc3 looks promising so far: the messages are gone, /sys/class/drm/card0/error was clean.

Did several s2disk/resume cycles and all were good, with no problems afterwards.

If it stays like this, perhaps you should suggest "Double Red Herring" as the name for 4.8? ;)

Comment 16 Rainer Fiebig 2016-08-24 20:35:58 UTC

Created attachment 126019 [details]
dmesg_4.8-rc3_part

Comment 17 Rainer Fiebig 2016-08-25 18:29:13 UTC

I've used 4.8-rc3 for some more hours, doing several s2disk and s2both resume cycles, and everything was fine, as described in my previous post. So I think the problem is indeed solved in 4.8.

What should perhaps be done now is to identify the relevant commits (if you do not know them already) so that they can soon be backported to the longterm kernel 4.4 and to 4.7. If you then need help with testing, let me know.

Thanks and bye for now!

Comment 18 yann 2016-08-26 09:47:03 UTC

Thanks Jay, closing this bug then.
