Bug 24884

Summary: MI_WAIT_FOR_EVENT hangs upon resume
Product: xorg
Component: Driver/intel
Version: unspecified
Hardware: Other
OS: All
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Reporter: Linus Torvalds <torvalds>
Assignee: Jesse Barnes <jbarnes>
QA Contact: Xorg Project Team <xorg-team>
CC: axet, eric, jbarnes, keithp
Whiteboard:
i915 platform:
i915 features:
Attachments:
    Description                                                       Flags
    Failing Xorg.0.log                                                none
    dmesg of two successful suspends, followed by two failing ones    none
    gzipped intel_gpu_dump *before* lid close                         none
    gzipped intel_gpu_dump *after* lid open when gpu is hung          none

Description Linus Torvalds 2009-11-03 12:20:22 UTC
Created attachment 30944: Failing Xorg.0.log

When suspending and then resuming a compiz desktop by closing and opening the lid of my Dell Inspiron 11z laptop, X dies. That logs me out, and X then respawns with the login screen.

X itself dies with this message:

    Fatal server error:
    Failed to map batchbuffer: Input/output error

in the Xorg.0.log file, and the kernel has various messages that seem to boil down to the GPU being hung:

    [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

I'm attaching the full Xorg.0.log and more complete kernel messages.
Comment 1 Linus Torvalds 2009-11-03 12:30:38 UTC
Created attachment 30945: dmesg of two successful suspends, followed by two failing ones

Note how the successful suspend/resume cycle has just a single

   [drm] LVDS-8: set mode 1366x768 c

message at resume time, while the failing ones have two:

   [drm] LVDS-8: set mode 1366x768 c
   [drm] LVDS-8: set mode <corrupt> 1d

and the failing ones eventually then result in a

   Aborting core
   [drm] LVDS-8: set mode 1366x768 c

which is apparently X killing itself off, and then the new X starting (successfully).

So that dmesg is all from one single boot - the machine stayed up, and was usable, but in the failure case the session had been killed on resume.
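The difference between the good and bad resume cycles described above can be checked mechanically. What follows is a minimal sketch, not a tool used in this report: it assumes the kernel lines keep the shape quoted in this comment (`[drm] <output>: set mode <name> <id>`), and the function name `suspicious_mode_sets` is hypothetical. It flags any mode name that does not look like `WIDTHxHEIGHT`.

```python
import re

# Assumed line shape, taken from the excerpts in this comment:
#   [drm] LVDS-8: set mode 1366x768 c
# The trailing token is treated as an opaque identifier.
SET_MODE = re.compile(
    r"\[drm\] (?P<output>\S+): set mode (?P<name>.+) (?P<ident>\S+)$"
)
RESOLUTION = re.compile(r"^\d+x\d+$")


def suspicious_mode_sets(lines):
    """Return (line_number, output, mode_name) for mode names that are not WxH."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = SET_MODE.search(line.rstrip())
        if match and not RESOLUTION.match(match.group("name")):
            hits.append((number, match.group("output"), match.group("name")))
    return hits


sample = [
    "[drm] LVDS-8: set mode 1366x768 c",
    "[drm] LVDS-8: set mode <corrupt> 1d",
]
print(suspicious_mode_sets(sample))
```

Feeding it the lines of a saved dmesg (for example `open("dmesg.txt")`, a hypothetical file name) would list every mode set whose name is not a plain resolution.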
Comment 2 Carl Worth 2009-11-06 12:23:49 UTC
Hi Linus,

Thanks for your bug report.

There seems to have been a recent spate of Intel driver problems with resume
(though most are things like a missing backlight; yours is the first I've seen
resulting in X server death).

I'm doing my best to get minions to bisect things. We'll see what turns up.

Eric, Keith, and Jesse, any other immediate thoughts?

Thanks,

-Carl
Comment 3 Linus Torvalds 2009-11-14 14:06:25 UTC
Created attachment 31206: gzipped intel_gpu_dump *before* lid close

This is intel_gpu_dump before the lid closed.

I had to gzip it to make it fit the bugzilla limits.
Comment 4 Linus Torvalds 2009-11-14 14:07:44 UTC
Created attachment 31207: gzipped intel_gpu_dump *after* lid open when gpu is hung

I _think_ I caught the actual "hung" case rather than the case that happens a bit afterwards when hangcheck timers eventually force the GPU reset.
Comment 5 Linus Torvalds 2009-11-14 14:09:22 UTC
Hopefully the above two attachments make sense to somebody.

I seem to be able to trigger the hang without actually suspending or resuming the machine at all, which made debugging much easier. I just need to close and open the lid a few times, and it will hang after a couple of those events.

If intel_gpu_dump isn't the right tool, then please point me to something better.
Comment 6 Linus Torvalds 2009-11-14 15:34:26 UTC
Some additional rumblings and thoughts about this:

 - I seem to be able to close and open the lid as much as I want, if I first just make sure that X and all other applications are stopped. IOW, I logged in from the network, and did a simple

    kill -STOP -1
    killall -STOP Xorg

   and then I close and open the lid repeatedly, and nothing bad happens. The kernel catches the lid event, and my /var/log/messages looks like this:

  Nov 14 15:22:01 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:22:25 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:22:36 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:22:46 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:22:51 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:22:56 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:23:52 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:23:57 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:24:06 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:24:18 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:24:22 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:24:31 localhost kernel: [drm] LVDS-8: set mode 1366x768 c

In fact, I seem to be able to do that even with X itself not stopped, but if I revive all the actual drawing programs, then the lid open/close will end up resulting in a GPU hang very soon:

  Nov 14 15:28:51 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:29:06 localhost kernel: [drm] LVDS-8: set mode 1366x768 c
  Nov 14 15:29:07 localhost kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
  Nov 14 15:29:07 localhost kernel: render error detected, EIR: 0x00000000
  Nov 14 15:29:07 localhost kernel: i915: Waking up sleeping processes
  ...

so there is definitely some interaction with the lid open/close code and the actual direct rendering. Which explains why everything works fine if I'm at the GDM login screen or if I don't have compiz enabled, but quickly goes to hell if I'm using desktop effects.
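The pattern in the second log excerpt, a mode set followed almost immediately by a hangcheck error, can be searched for in a longer /var/log/messages. This is only a sketch under stated assumptions: the helper names and the 5-second window are hypothetical, timestamps are assumed to be the syslog `HH:MM:SS` form shown above, and logs that cross midnight are not handled.

```python
import re

STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})")


def seconds_of_day(line):
    """Return the first HH:MM:SS stamp in the line as seconds, or None."""
    match = STAMP.search(line)
    if not match:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def hangs_after_mode_set(lines, window=5):
    """Return (mode_set_line, hang_line) pairs at most `window` seconds apart."""
    pairs = []
    last_mode_set = None
    for line in lines:
        stamp = seconds_of_day(line)
        if stamp is None:
            continue
        if "set mode" in line:
            # Remember the most recent mode set and when it happened.
            last_mode_set = (stamp, line)
        elif "GPU hung" in line and last_mode_set is not None:
            if 0 <= stamp - last_mode_set[0] <= window:
                pairs.append((last_mode_set[1], line))
    return pairs


sample = [
    "Nov 14 15:28:51 localhost kernel: [drm] LVDS-8: set mode 1366x768 c",
    "Nov 14 15:29:06 localhost kernel: [drm] LVDS-8: set mode 1366x768 c",
    "Nov 14 15:29:07 localhost kernel: [drm:i915_hangcheck_elapsed] *ERROR* "
    "Hangcheck timer elapsed... GPU hung",
]
for mode_set, hang in hangs_after_mode_set(sample):
    print(mode_set)
    print(hang)
```

With the sample lines, only the 15:29:06 mode set is paired with the hang one second later; the earlier one falls outside the window.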
Comment 7 Chris Wilson 2010-05-11 10:30:38 UTC
Similarities with bug 27922 and bug 27285? In particular the xrandr/modechange during 3D activity.
Comment 8 Chris Wilson 2010-05-31 15:44:50 UTC
after.lid.gz:

  IPEHR: 0x01800020

I now recognise this little tell-tale: it's a MI_WAIT_FOR_EVENT. The GPU is waiting for a scanline on a disabled pipe. Given that the GPU is idled prior to suspend (we wait for the completion of all batch buffers in the pipeline), this instruction should not be executing at the time of suspend/resume. The only source of this in the ringbuffer (apart from the old UMS path) is the new overlay code, which is unlikely to be the cause given compiz/GL rendering during suspend.

Similar to bug 27146.
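For anyone reading a dump the same way, here is a rough sketch of how the IPEHR value breaks down. It assumes the MI command layout as I understand it from the kernel's `MI_INSTR(opcode, flags)` macro (client in bits 31:29, opcode in bits 28:23, with opcode 0x03 being MI_WAIT_FOR_EVENT); the function names are hypothetical, and the low flag bits are reported raw rather than interpreted.

```python
import re

IPEHR_LINE = re.compile(r"IPEHR:\s*(0x[0-9a-fA-F]+)")


def decode_mi_header(dword):
    """Split a command header into (client, opcode, flags); layout is assumed."""
    client = (dword >> 29) & 0x7    # 0 is assumed to mean an MI command
    opcode = (dword >> 23) & 0x3F   # MI opcode field
    flags = dword & 0x7FFFFF        # remaining low bits, left uninterpreted
    return client, opcode, flags


def describe_ipehr(text):
    """Find an 'IPEHR: 0x...' line in dump text and name the instruction."""
    match = IPEHR_LINE.search(text)
    if not match:
        return None
    client, opcode, flags = decode_mi_header(int(match.group(1), 16))
    name = "MI_WAIT_FOR_EVENT" if (client, opcode) == (0, 0x03) else "other"
    return name, hex(flags)


print(describe_ipehr("  IPEHR: 0x01800020"))
```

For the value quoted above this reports opcode 0x03 with flag bits 0x20, which matches the reading that the ring is stuck on a wait instruction.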
Comment 9 Chris Wilson 2010-07-09 03:14:01 UTC
Pinging Jesse, as I think he knows best whether we are now safe from WAIT_FOR_EVENT + modeswitch.
Comment 10 Jesse Barnes 2010-07-09 08:06:21 UTC
I hope this is a dupe of one of the earlier ones, because I don't see how we could emit a wait_for_event after the display was off in current code.

Linus, do you still see this with the latest 2D driver?
Comment 11 Jesse Barnes 2010-07-09 08:26:45 UTC
Ah, and if you do see it, I was just reminded about an old patch that might fix it. I don't know why it's not upstream though; I'll check on that.

https://patchwork.kernel.org/patch/80474/
Comment 12 Linus Torvalds 2010-07-09 09:18:15 UTC
On Fri, Jul 9, 2010 at 8:06 AM,  <bugzilla-daemon@freedesktop.org> wrote:
>
> Linus, do you still see this with the latest 2D driver?

This is my daughter's laptop, and I haven't checked lately if it's
still flaky. She's not been complaining, but at one point the
work-around was to log out before suspending (which gets rid of
compiz), so it's possible that it's still there. But at this point,
you might as well close the bugzilla, especially if you think it's a
dup.

                    Linus
Comment 13 Jesse Barnes 2010-07-09 09:27:27 UTC
Ok, thanks.  I'll close it out and get the other potential fix upstream just in case.
