Bug 27111 - black screen after several S4 cycles
Summary: black screen after several S4 cycles
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: 7.5 (2009.10)
Hardware: Other All
Importance: medium major
Assignee: Wang Zhenyu
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords: NEEDINFO
Depends on:
Blocks:
 
Reported: 2010-03-16 10:15 UTC by Matthias Hopf
Modified: 2010-09-14 05:35 UTC (History)
5 users

See Also:
i915 platform:
i915 features:


Attachments
intel_gpu_dump.txt.gz (80.75 KB, application/octet-stream)
2010-03-16 10:15 UTC, Matthias Hopf
intel_reg_dumper output in broken state (8.82 KB, application/octet-stream)
2010-04-27 07:55 UTC, Matthias Hopf
intel_reg_read -f on the broken machine (123.67 KB, application/octet-stream)
2010-04-27 08:04 UTC, Matthias Hopf
Diff of intel_reg_read -f between broken and working state (17.64 KB, text/plain)
2010-04-27 08:06 UTC, Matthias Hopf

Description Matthias Hopf 2010-03-16 10:15:15 UTC
Created attachment 34122 [details]
intel_gpu_dump.txt.gz

After a number of S4 (hibernate) cycles (the number varies and is pretty high, about 600), the machine resumes with only a black screen (the cursor is visible).

I have kernel messages, but unfortunately no X backtrace (no debuginfo packages installed). The machine is equipped with an IGDNG_M_G. The kernel is 2.6.32.9, the intel driver is 2.10.0, and libdrm is 2.4.18. I am currently testing with kernel 2.6.33 to see whether the bug still occurs.

The bug is persistent across reboots. A power cycle is required to get the chip into a working state again.


Basically, directly after resume, the DRM module spits out:

Mar 13 02:00:29 linux-nc5s kernel: [    7.744398] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 13 02:00:29 linux-nc5s kernel: [    7.744407] render error detected, EIR: 0x00000000
Mar 13 02:00:29 linux-nc5s kernel: [    7.744411] i915: Waking up sleeping processes
[repeated 2x]
Mar 13 02:00:29 linux-nc5s kernel: [    8.059984] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[repeated 10x]

Then both message types appear intermixed (several hundred times), about two hangcheck timer elapses per second. EIR is always 0.


After several hundred of these messages, I get kernel errors:

Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736874] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736880] render error detected, EIR: 0x00000000
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736883] i915: Waking up sleeping processes
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736895] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 181915 at 172405)
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.737988] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739393] vmap allocation failed - use vmalloc=<size> to increase size.
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739399] vmalloc size=6000 start=f77fe000 end=feffe000 node=-1 gfp=80d2
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739405] Pid: 2046, comm: X Tainted: P          NX 2.6.32.5-0.1.1.1026.0.PTF-pae #1
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739408] Call Trace:
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739427]  [<c0206921>] try_stack_unwind+0x1b1/0x1f0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739435]  [<c020589f>] dump_trace+0x3f/0xe0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739442]  [<c020652b>] show_trace_log_lvl+0x4b/0x60
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739449]  [<c0206558>] show_trace+0x18/0x20
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739457]  [<c056a3c9>] dump_stack+0x6d/0x74
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739466]  [<c02e3df9>] alloc_vmap_area+0x2d9/0x2f0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739474]  [<c02e3f11>] __get_vm_area_node+0x101/0x1c0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739481]  [<c02e493e>] __vmalloc_node+0x9e/0xe0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739487]  [<c02e4b76>] __vmalloc+0x36/0x50
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739514]  [<f82f4817>] i915_gem_execbuffer+0x247/0xe40 [i915]
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739549]  [<f81d568c>] drm_ioctl+0x15c/0x340 [drm]
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739561]  [<c030e1e8>] vfs_ioctl+0x78/0x90
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739567]  [<c030e663>] do_vfs_ioctl+0x373/0x3f0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739573]  [<c030e78a>] sys_ioctl+0xaa/0xb0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739579]  [<c02030a4>] sysenter_do_call+0x12/0x22
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739598]  [<ffffe424>] 0xffffe424


Again, about two per second, intermixed with (fewer) hangcheck timer messages.
The attached intel_gpu_dump shows that there is no batch buffer (?), only a ring buffer with a few commands in it. As those commands refer to a batch buffer, this looks odd, but it may be related to the kernel errors (allocation failed).
Comment 1 Carl Worth 2010-03-22 14:43:02 UTC
Jesse,

Care to take a look at this one?

-Carl
Comment 2 Jesse Barnes 2010-04-06 11:28:20 UTC
Hm, no error reported sounds like our hangcheck timer might be buggy.  Maybe hibernate needs some special hangcheck handling.  Can you instrument the i915 irq handler to see if we're getting a spurious user interrupt at resume?  If so, we might need an if (!dev_priv->mm.suspended) in there somewhere.
Comment 3 Matthias Hopf 2010-04-08 03:26:35 UTC
As the issue persists across reboots, this cannot be just the hangcheck timer. Some state must not be getting initialized correctly after a reset.

Just trying to reproduce on the machine I last encountered it. This will take some time.
Comment 4 Matthias Hopf 2010-04-12 09:50:20 UTC
I was just able to reproduce it after 1975 hibernate cycles. The effect is persistent across reboots, *and* across hibernates. Initially I thought it would not survive hibernation.

This indicates that some state is not initialized correctly, but saved & restored over hibernate. Weird.


Jesse, as this only occurs after oh so many hibernate cycles (there are supposed to be machines that exhibit this issue earlier, but none I have access to), can you elaborate a bit on what information would be helpful? Is there any information that could be extracted from the broken state before I power cycle the machine?
Comment 5 Jesse Barnes 2010-04-12 10:26:57 UTC
Yuck.  It could also be that we're initializing some chip state out of order on resume, and get lucky and avoid a hardware race most of the time.  We could also be hitting one of the bugs fixed since libdrm 2.4.18, have you tested with 2.4.20?
Comment 6 Matthias Hopf 2010-04-12 12:14:00 UTC
Haven't tested an updated libdrm yet (will do), but if that fixes the issue for good, drm is still doing something wrong (userspace should never be able to mess things up like that).

In the meantime, any ideas for reasonable post-mortem analysis? The gpu dump is already attached to this bug from an earlier breakage (same machine).


I have to correct myself - it seems this issue isn't seen on many more machines. Is it plausible that this could be a single-unit failure, i.e. a hardware issue?

OTOH the machine regularly works fine, and the persistence speaks against this theory.
Comment 7 Jesse Barnes 2010-04-12 12:27:56 UTC
It's possible it's a hw problem on this specific machine, but I'd be more inclined to believe it's a sw problem that just doesn't trigger very often.

As for debugging, there may be more state that changes across suspend/resume than is tracked by our reg dumper.  You could capture the whole register map from sysfs before and after (you may need to write a simple mmap + dump program for this).  I know there are MCHBAR regs we don't bother with that we probably should, but it would be interesting to see exactly what's changed.
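The "simple mmap + dump program" Jesse mentions could be sketched along these lines (a sketch only, in Python rather than C for brevity; the sysfs path in the note below is an assumption, and reading live registers can have side effects on some hardware):

```python
import mmap
import os
import struct

def dump_words(path, count, offset=0):
    """mmap a file read-only and return `count` 32-bit little-endian
    words starting at byte `offset`.  Pointed at a PCI BAR exposed by
    sysfs, this reads the raw register window."""
    fd = os.open(path, os.O_RDONLY)
    try:
        mem = mmap.mmap(fd, offset + count * 4, prot=mmap.PROT_READ)
        try:
            return [struct.unpack_from("<I", mem, offset + i * 4)[0]
                    for i in range(count)]
        finally:
            mem.close()
    finally:
        os.close(fd)
```

Against real hardware this would be pointed at something like /sys/bus/pci/devices/0000:00:02.0/resource0 (the device address is an assumption); diffing the word lists from the broken and working states would then expose every register that changed, including ranges the reg dumper skips.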
Comment 8 Matthias Hopf 2010-04-27 07:49:24 UTC
I managed to copy the installed image of the broken-state machine onto a second partition (with separate swap), so I can now resume the machine in either the broken or the working state as I wish.

I'm analyzing the machine right now.

intel_stepping says Device 0x0046, Revision: 0x12 (??)

There are a few differences; I'll post them here. Beware that the two states come from separate boots (the image was cloned, though, so same driver version etc.) - quite a few of the differences could be irrelevant.
Comment 9 Matthias Hopf 2010-04-27 07:52:46 UTC
Ok, the gpu dump in the attachment isn't telling us much - a new gpu dump shows a different story:

broken state:

ACTHD: 0x00000000
EIR: 0x00000000
EMR: 0xffffff3f
ESR: 0x00000000
PGTBL_ER: 0x00000000
IPEHR: 0x00000000
IPEIR: 0x00000000
INSTDONE: 0xfffffffe
INSTDONE1: 0xffffffff
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
0x00000000: HEAD 0x00000000: MI_NOOP
0x00000004:      0x00000000: MI_NOOP
[all MI_NOOP up to the end]


working state:

ACTHD: 0x00003e88
EIR: 0x00000000
EMR: 0xffffff3f
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x01000000
IPEIR: 0x00000000
INSTDONE: 0xfffffffe
INSTDONE1: 0xffffffff
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
0x00000000:      0x02000000: MI_FLUSH
0x00000004:      0x00000000: MI_NOOP
0x00000008:      0x18800180: MI_BATCH_BUFFER_START
0x0000000c:      0x0243c000:    dword 1
0x00000010:      0x02000004: MI_FLUSH
0x00000014:      0x00000000: MI_NOOP
0x00000018:      0x10800001: MI_STORE_DATA_INDEX
0x0000001c:      0x00000080:    dword 1
0x00000020:      0x00000001:    dword 2
0x00000024:      0x01000000: MI_USER_INTERRUPT
[etc. pp until 0x00003e84:, then MI_NOOP at HEAD]


I don't see a TAIL in either of those dumps, so either I don't understand something, or the tool still has a bug here.
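For context on the reminder printed in both dumps ("head pointer is GPU read, tail pointer is CPU write"), here is a toy model of such a command ring - a sketch, not the actual i915 implementation: the CPU emits commands at TAIL, the hardware consumes them at HEAD, and HEAD == TAIL means the ring is idle.

```python
class Ring:
    """Toy model of a command ring: CPU appends at TAIL, GPU reads at HEAD."""

    def __init__(self, size):
        self.buf = [0] * size   # 0 happens to decode as MI_NOOP
        self.head = 0           # GPU read pointer
        self.tail = 0           # CPU write pointer

    def emit(self, dword):
        """CPU side: write one command dword and advance TAIL (wrapping)."""
        nxt = (self.tail + 1) % len(self.buf)
        assert nxt != self.head, "ring full"
        self.buf[self.tail] = dword
        self.tail = nxt

    def consume(self):
        """GPU side: read one command dword and advance HEAD."""
        assert self.head != self.tail, "ring idle"
        dword = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return dword
```

In these terms, the working dump appears consistent with a ring whose HEAD has caught up with TAIL after consuming the emitted commands, while the broken dump shows HEAD stuck at 0 over a ring full of MI_NOOPs, as if the engine never started consuming.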
Comment 10 Matthias Hopf 2010-04-27 07:53:09 UTC
Comment on attachment 34122 [details]
intel_gpu_dump.txt.gz

Old dump is obsolete.
Comment 11 Matthias Hopf 2010-04-27 07:55:44 UTC
Created attachment 35309 [details]
intel_reg_dumper output in broken state

Only difference to working state:

--- broken.reg_dumper   2010-04-27 16:38:03.000000000 +0200
+++ works.reg_dumper    2010-04-27 16:35:10.000000000 +0200
@@ -117,1 +117,1 @@
-   TRANSC_DP_LINK_N2: 0x00ffffff (val 0xffffff 16777215)
+   TRANSC_DP_LINK_N2: 0x00000000 (val 0x0 0)
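As an aside, a diff like the one above can be generated directly from two intel_reg_dumper outputs with a small hypothetical helper; it assumes lines of the form "NAME: 0xVALUE ...", as in the quoted dumps:

```python
import re

# Matches a register line such as "   TRANSC_DP_LINK_N2: 0x00ffffff (...)"
LINE_RE = re.compile(r"^\s*(\S+):\s*(0x[0-9a-fA-F]+)")

def parse_dump(text):
    """Extract {register_name: value} from a reg-dumper-style text dump."""
    regs = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            regs[m.group(1)] = int(m.group(2), 16)
    return regs

def diff_dumps(broken_text, working_text):
    """Return {name: (broken_value, working_value)} for registers that
    appear in both dumps but hold different values."""
    broken = parse_dump(broken_text)
    working = parse_dump(working_text)
    return {name: (broken[name], working[name])
            for name in sorted(broken.keys() & working.keys())
            if broken[name] != working[name]}
```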
Comment 12 Matthias Hopf 2010-04-27 08:04:54 UTC
Created attachment 35310 [details]
intel_reg_read -f  on the broken machine
Comment 13 Matthias Hopf 2010-04-27 08:06:08 UTC
Created attachment 35311 [details]
Diff of  intel_reg_read -f  between broken and working state
Comment 14 Jesse Barnes 2010-06-01 12:12:38 UTC
Sorry for the neglect, Matthias. This looks like another ILK mode setting problem. Maybe Zhenyu has an idea.
Comment 15 Matthias Hopf 2010-06-02 03:24:53 UTC
Given that drm:i915_gem_execbuffer fails while the screen is black, I doubt this is a mode setting issue; it looks more like the rendering engine is stalled.
Comment 16 Chris Wilson 2010-07-24 04:41:22 UTC
(In reply to comment #15)
> Given that drm:i915_gem_execbuffer fails when the screen is black, I doubt this
> is a mode setting issue, but rather that the rendering engine is stalled.

fails how? EIO or EBUSY? If EIO, please attach the i915_error_state.
Comment 17 Matthias Hopf 2010-07-26 03:52:10 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > Given that drm:i915_gem_execbuffer fails when the screen is black, I doubt this
> > is a mode setting issue, but rather that the rendering engine is stalled.
> fails how? EIO or EBUSY? If EIO, please attach the i915_error_state.

With fails I mean

Mar 13 02:00:29 linux-nc5s kernel: [    8.059984] [drm:i915_gem_execbuffer]
*ERROR* Execbuf while wedged

which I conclude is neither.
Comment 18 Chris Wilson 2010-07-26 04:03:31 UTC
That's EIO, but also implies that your kernel is too old to have a meaningful i915_error_state. :(
Comment 19 Matthias Hopf 2010-07-26 04:08:21 UTC
I can *try* to update the kernel - the issue is persistent over reboots. OTOH it is difficult to reproduce. I'll post when I have new results.
Comment 20 Wang Zhenyu 2010-08-15 20:00:11 UTC
Is this still there with the recent fix from Linus's tree?
Comment 21 Chris Wilson 2010-09-11 01:30:57 UTC
Matthias, this is most likely the memory corruption bug, so I am marking it fixed; please reopen if it reoccurs. Thanks for the report.
Comment 22 Matthias Hopf 2010-09-14 05:35:25 UTC
Will do. Sorry for not having time to re-test this lately.

