Since kernel 4.3 and up to 4.7.1, dmesg shows the following or similar messages after resuming from hibernation, here for 4.7.1:

[drm] stuck on bsd ring
[drm] GPU HANG: ecode 5:2:0x01000000, reason: Engine(s) hung, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang

Interestingly, with 4.4.18 I only see this:

[drm] stuck on bsd ring
[drm] GPU HANG: ecode 5:1:0x01000000, reason: Ring hung, action: reset
drm/i915: Resetting chip after gpu hang

Other than the messages, I haven't noticed any negative impact on my system (Core i3). But this may be different for other setups (see https://bugs.freedesktop.org/show_bug.cgi?id=94203).

I bisected between 4.2 and 4.3, "bad/good" strictly according to whether "[drm] stuck on bsd ring" occurred or not - disregarding all the other numerous warnings in connection with i915. The result this produced was:

> git bisect good
ba01cc9346bce45a8861f36bce2c4c5d44b800b2 is the first bad commit
commit ba01cc9346bce45a8861f36bce2c4c5d44b800b2
Author: John Harrison <John.C.Harrison@Intel.com>
Date: Fri May 29 17:43:41 2015 +0100

    drm/i915: Update i915_switch_context() to take a request structure

    Now that the request is guaranteed to specify the context, it is possible to update the context switch code to use requests rather than ring and context pairs. This patch updates i915_switch_context() accordingly.

    Also removed the warning that the request's context must match the last context switch's context. As the context switch now gets the context object from the request structure, there is no longer any scope for the two to become out of step.

    For: VIZ-5115
    Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
    Reviewed-by: Tomas Elf <tomas.elf@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 3717675a773aafd62ff883278f0a9dd0b4747af9 0f32cac1cb40d7d1f757495a85e77bf39863862a M drivers
---
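The good/bad rule used for the bisect above (only the "[drm] stuck on bsd ring" line counts, every other i915 warning is ignored) can be written down as a small check. This is only a sketch of that rule, not the procedure actually used in this report; the names `MARKER` and `classify_dmesg` are made up for illustration, and the log text is assumed to have been captured from dmesg after a resume.

```python
# Hypothetical helper: the names below are illustrative, not from the report.
MARKER = "[drm] stuck on bsd ring"


def classify_dmesg(log_text: str) -> str:
    """Return 'bad' if the marker line occurs anywhere in the log, else 'good'.

    A substring match is used so that lines carrying a dmesg timestamp
    prefix are still recognised. All other warnings are ignored on purpose.
    """
    for line in log_text.splitlines():
        if MARKER in line:
            return "bad"
    return "good"
```

The returned string corresponds to what one would pass to `git bisect good` or `git bisect bad` for the kernel under test.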
Jay, can you attach GPU crash dump saved to /sys/class/drm/card0/error as well as kernel log (ie dmesg) ? thanks
Created attachment 125985 [details] /sys/class/drm/card0/error
Created attachment 125986 [details] dmesg (part of)
(In reply to yann from comment #1)
> Jay, can you attach GPU crash dump saved to /sys/class/drm/card0/error as
> well as kernel log (ie dmesg) ?
> thanks

"error" and "dmesg" (part of) attached, kernel was 4.7.1.

It seems quite a few changes were made to the i915 driver between kernels 4.2 and 4.3. So I guess it won't be too easy to figure out where the problem really is. At least the bisected commit offers a starting point.
845g with a bsd ring is impressive ;)
This should be fixed on drm-intel-nightly already, don't know if that made it into the 4.8 cut.
(In reply to Chris Wilson from comment #5)
> 845g with a bsd ring is impressive ;)

Oops, you are right, not sure why I clicked on I854G since pci id is 0x0042 (ie Ironlake (Clarkdale))... my bad.
Thanks Chris for correcting it :)
(In reply to Chris Wilson from comment #6)
> This should be fixed on drm-intel-nightly already, don't know if that made
> it into the 4.8 cut.

Sounds good. Had it something to do with the incriminated commit or could I have wasted my time in a better way? ;)
There were certainly bugs in how requests + contexts operated on Ironlake that required fixing. But there have also been changes to make hibernation more reliable (hopefully!). And since everything touching the GPU has a request, everything may be related back to the bisect result ;)
(In reply to Chris Wilson from comment #9)
> There were certainly bugs in how requests + contexts operated on Ironlake
> that required fixing. But there have also been changes to make hibernation
> more reliable (hopefully!). And since everything touching the GPU has a
> request, everything may be related back to the bisect result ;)

Thanks! I read this as: "Time perhaps not completely wasted." Alright with me. ;)
I just experienced a problem after resume from s2disk: frozen windows, missing titlebars, only partly rendered windows, keyboard not working. I could solve it by logging out and in again, but had to shoot a VM. Kernel was 4.4.18.

Among some of the new messages in dmesg (see new attachment):

...
[drm] stuck on render ring
[drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg [1183], reason: Ring hung, action: reset
i915 0000:00:02.0: GEM idle failed, resume might fail
pci_pm_freeze(): i915_pm_suspend+0x0/0x40 [i915] returns -11
dpm_run_callback(): pci_pm_freeze+0x0/0xe0 returns -11
PM: Device 0000:00:02.0 failed to freeze async: error -11
drm/i915: Resetting chip after gpu hang
...

Seems more serious than I first thought.

@Chris: Could you give me a hint concerning the patch that you mentioned in your post? I could give it a try. Or is it a whole bunch? In that case I'd rather wait until they are integrated in 4.8 (in git I have 4.8-rc2 and that doesn't seem to have the patches for this problem).
Created attachment 126016 [details] dmesg_4.4.18_problem
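To compare the hang reports quoted in this thread across kernels (which ring, whether a process is named, which reason), the "GPU HANG" lines can be pulled apart with a regular expression. The sketch below is based only on the three lines quoted in this report, so the pattern is an assumption about the message layout rather than the driver's definitive format; `GpuHang` and `parse_gpu_hangs` are hypothetical names, and the ecode is kept as an opaque string because its fields are not explained in this thread.

```python
import re
from typing import List, NamedTuple, Optional


class GpuHang(NamedTuple):
    ecode: str               # kept verbatim, e.g. "5:2:0x01000000"
    process: Optional[str]   # only present in some messages, e.g. "Xorg"
    pid: Optional[int]
    reason: str
    action: str


# Pattern derived from the lines quoted in this report (an assumption,
# not taken from the i915 source).
_HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9A-Fa-fx:]+)"
    r"(?:, in (?P<process>\S+) \[(?P<pid>\d+)\])?"
    r", reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_gpu_hangs(log_text: str) -> List[GpuHang]:
    """Return one GpuHang per matching line, in log order."""
    hangs = []
    for match in _HANG_RE.finditer(log_text):
        pid = match.group("pid")
        hangs.append(
            GpuHang(
                ecode=match.group("ecode"),
                process=match.group("process"),
                pid=int(pid) if pid is not None else None,
                reason=match.group("reason").strip(),
                action=match.group("action"),
            )
        )
    return hangs
```

Feeding it the attached dmesg excerpts makes the difference between the 4.7.1 report ("Engine(s) hung") and the 4.4.18 reports ("Ring hung", once with Xorg named) easy to tabulate.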
I realise I was thinking of some context bugs that don't affect you since Ironlake hasn't enabled HW contexts. Hmm, the hibernate bug I was thinking of was also related to using HW contexts. Double red herring, sorry.

Gut feeling from the error state is that the GPU is being clobbered by the framebuffer upon resume. Could I just tempt you into trying an rc or nightly? Even if just to grab an error state? :)
(In reply to Chris Wilson from comment #13)
...
> thinking of also related to using HW context. Double red herring, sorry.

Eat it! ;)

> Gut feeling from the error state is that the GPU is being clobbered by the
> framebuffer upon resume. Could I just tempt you into trying an rc or nightly?
> Even if just to grab an error state? :)

I'll try 4.8-rc3 and report back.
How was your red herring? ;) 4.8-rc3 looks promising so far: the messages are gone, /sys/class/drm/card0/error was clean. Did several s2disk/resume and all were good, no problems thereafter. If it stays like this, perhaps you should suggest "Double Red Herring" as the name for 4.8? ;)
Created attachment 126019 [details] dmesg_4.8-rc3_part
I've used 4.8-rc3 for some more hours, doing several s2disk/s2both resumes, and everything was fine, as described in my previous post. So I think the problem is indeed solved in 4.8.

What should perhaps be done now is to identify the relevant commits (if you do not know them already) so that they can soon be backported to longterm kernel 4.4 and to 4.7. If you then need help with testing, let me know.

Thanks and bye for now!
Thanks Jay, closing this bug then.