Bug 26974 - [i915] OOPS in intel_crt_detect() after suspend/hibernate
Status: RESOLVED FIXED
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: git
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assigned To: Carl Worth
QA Contact: Xorg Project Team
Reported: 2010-03-09 02:41 UTC by Milan Bouchet-Valat
Modified: 2010-08-01 02:22 UTC

Attachments
Xorg.0.log (16.07 KB, text/x-log)
2010-03-09 02:41 UTC, Milan Bouchet-Valat
:0.log (12.84 KB, text/x-log)
2010-03-09 02:45 UTC, Milan Bouchet-Valat
Xorg.0.log.old (62.02 KB, text/plain)
2010-03-09 02:59 UTC, Milan Bouchet-Valat
gdb trace of the SIGPIPE (1.68 KB, text/plain)
2010-03-23 04:03 UTC, Milan Bouchet-Valat
second gdb trace of the crash, SIGPIPE handling disabled (1.60 KB, text/plain)
2010-03-24 06:24 UTC, Milan Bouchet-Valat
output of intel_reg_dumper before suspending (8.43 KB, text/plain)
2010-03-24 06:25 UTC, Milan Bouchet-Valat
output of intel_reg_dumper after returning from suspend (8.35 KB, text/plain)
2010-03-24 06:26 UTC, Milan Bouchet-Valat
output of intel_reg_dumper after lock (during gdb interruption) (8.42 KB, text/plain)
2010-03-24 06:29 UTC, Milan Bouchet-Valat

Description Milan Bouchet-Valat 2010-03-09 02:41:04 UTC
Created attachment 33881 [details]
Xorg.0.log

I've been experiencing freezes after suspend and hibernate for a long time (see Bug 15187). The current drm-intel-next branch (2.6.33-997) has fixed the warning messages, but the freeze now seems to have become a crash.

I'm attaching Xorg.0.log and :0.log. A particularly interesting line in :0.log is:
X: ../../src/i830_batchbuffer.h:79: intel_batch_emit_dword: Assertion `pI830->batch_ptr != ((void *)0)' failed.


Excerpt from lspci -vnn:
00:02.1 Display controller [0380]: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller [8086:2792] (rev 03)
        Subsystem: Toshiba America Info Systems Device [1179:ff00]
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 0: Memory at 64000000 (32-bit, non-prefetchable) [disabled] [size=512K]
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-


Please just ask if you want more information.
Comment 1 Milan Bouchet-Valat 2010-03-09 02:45:43 UTC
Created attachment 33884 [details]
:0.log
Comment 2 Milan Bouchet-Valat 2010-03-09 02:59:36 UTC
Created attachment 33887 [details]
Xorg.0.log.old

To be clear, I must add that those log files come from the failsafe X session that Ubuntu started after the crash. So they may not contain information about the crash itself, just about the fact that X is not able to restart correctly after the crash has occurred (I couldn't even switch to the consoles).
Comment 3 Milan Bouchet-Valat 2010-03-19 01:08:22 UTC
Still happening with xserver-video-intel 2.10.902+git20100317.31d5f84b. Is there anything I can do to help with debugging?
Comment 4 Milan Bouchet-Valat 2010-03-23 04:03:29 UTC
Created attachment 34354 [details]
gdb trace of the SIGPIPE

Here's a gdb trace I could get of the crash, with kernel vmlinuz-2.6.33-997-generic (drm-intel-next), X.Org server 1.6.4, Intel driver 2.9.0. (Note this happens with more recent versions too, it's just that I'm using these as they don't suffer from other small issues.)

A funny thing is that while attached to gdb, X doesn't actually crash: I only get a SIGPIPE signal, and everything just works if I type 'continue'. When I detached Xorg, the signal killed the server (I guess that's intended).

Note that at the top of the trace, the call
__libc_writev (fd=-1218895884, vector=0xbfd61438, count=1)
is always exactly the same across different SIGPIPE signals. The parameter values were identical the first and the second time I received that signal (without restarting X).

Hope this helps. Please ask if you need more details!
Comment 5 Chris Wilson 2010-03-23 04:16:08 UTC
This is a gpu hang, so the most interesting information would be i915_error_state and register dumps [as suspend and resume is complicit we need to ensure we are restoring the gpu state correctly].

In terms of driver packages, the most important one to make sure is up-to-date is perhaps libdrm, preferably from drm.git but 2.4.19 at a minimum.
Comment 6 Milan Bouchet-Valat 2010-03-23 04:36:13 UTC
Ah, thanks for the feedback! When should I get the GPU dump? Right after the crash occurred, while gdb is blocking Xorg?
Comment 7 Chris Wilson 2010-03-23 04:49:01 UTC
Grab an intel-gpu-tools/tools/intel_reg_dump before suspending and again after resume. If xorg-edgers is recent enough, the GPU dump will also be in /sys/kernel/debug/dri/0/i915_error_state following a hang.
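
Comparing the register dump taken before suspend with the one taken after resume is easier with a small script. The following is a minimal Python sketch, not part of intel-gpu-tools. It assumes each dump line has the shape `NAME: 0xVALUE` followed by optional decoded text; the function names and the `REG_A`/`REG_B` placeholders in the usage note are hypothetical.

```python
import re

# Assumed line shape: "<REGISTER_NAME>: 0x<hex value> (optional decoded text)".
# This is an assumption about the dump format, not a documented guarantee.
REG_LINE = re.compile(r"^\s*([A-Za-z0-9_]+):\s*(0x[0-9a-fA-F]+)")


def parse_reg_dump(text):
    """Return a dict mapping register name to integer value."""
    regs = {}
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def diff_reg_dumps(before, after):
    """Return {name: (value_before, value_after)} for registers that differ.

    A register missing from one dump is reported with None on that side.
    """
    b = parse_reg_dump(before)
    a = parse_reg_dump(after)
    return {
        name: (b.get(name), a.get(name))
        for name in sorted(set(b) | set(a))
        if b.get(name) != a.get(name)
    }
```

Calling `diff_reg_dumps(open("before.txt").read(), open("after.txt").read())` on two saved dumps (file names are placeholders) lists only the registers whose value changed across the suspend cycle, which is where an incomplete state restore would show up.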
Comment 8 Julien Cristau 2010-03-23 04:53:06 UTC
> --- Comment #4 from Milan Bouchet-Valat <nalimilan@club.fr>  2010-03-23 04:03:29 PST ---
> A funny thing is that while attached to gdb, X doesn't actually crash: I only
> get a SIGPIPE signal, and everything just works if I type 'continue'. When I
> detached Xorg, the signal killed the server (I guess that's intended).
> 
X ignores SIGPIPE, you need to do the same in gdb.  'handle SIGPIPE
noprint nostop' at the gdb prompt should do the trick.
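
For readers unfamiliar with why an ignored SIGPIPE is harmless to the process, here is a small Python sketch of the general mechanism (it is not X server code, and the function name is hypothetical). With SIGPIPE ignored, writing to a pipe whose reader has gone away fails with EPIPE instead of terminating the writer. gdb stops on the signal by default regardless of how the program handles it, which is why it has to be told to stay out of the way.

```python
import errno
import os
import signal


def write_to_closed_pipe():
    """Write to a pipe with no reader and return the resulting errno."""
    # Ignore SIGPIPE explicitly, mirroring what Comment 8 says X does.
    # (CPython already ignores SIGPIPE at startup; this makes it visible.)
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)

    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # no reader left: the next write would raise SIGPIPE
    try:
        os.write(write_fd, b"x")
    except OSError as exc:
        # The write fails with an error code instead of killing the process.
        return exc.errno
    finally:
        os.close(write_fd)
    return None


if __name__ == "__main__":
    print(write_to_closed_pipe() == errno.EPIPE)
```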
Comment 9 Milan Bouchet-Valat 2010-03-24 06:24:41 UTC
Created attachment 34403 [details]
second gdb trace of the crash, SIGPIPE handling disabled

So here's a new gdb trace with SIGPIPE handling disabled, as asked above.

The screen turned uniformly orange-pink, and typing 'continue' did not change it. Hitting Ctrl+Alt+F[1-8] caused another interruption in gdb, and continuing triggered nothing new. Hitting Ctrl+Alt+F[1-8] again had no effect, but going back to Ctrl+Alt+F7 caused an interruption in gdb without changing the screen state.

Software versions:
xserver-xorg-core 1.6.5+git20091107+server-1.6-branch.2dbcb06a
xserver-xorg-video-intel 2.10.902+git20100317.31d5f84b
libdrm-intel1 2.4.19+git20100318.56712821
kernel drm-intel-next 2.6.33-997
Comment 10 Milan Bouchet-Valat 2010-03-24 06:25:21 UTC
Created attachment 34404 [details]
output of intel_reg_dumper before suspending
Comment 11 Milan Bouchet-Valat 2010-03-24 06:26:07 UTC
Created attachment 34405 [details]
output of intel_reg_dumper after returning from suspend
Comment 12 Milan Bouchet-Valat 2010-03-24 06:29:47 UTC
Created attachment 34406 [details]
output of intel_reg_dumper after lock (during gdb interruption)

Here are the GPU dumps. I hope this is what you need; I didn't completely understand your comment about grabbing 'intel-gpu-tools/tools/intel_reg_dump'.

/sys/kernel/debug/dri/0/i915_error_state always said there was no error to report, at each of the three stages where I checked it.

Does that help debugging?
Comment 13 Milan Bouchet-Valat 2010-03-30 10:37:43 UTC
I've just found out that a report in Ubuntu's Launchpad has 165 people marked as affected, with about 30 duplicate reports. I think that deserves a higher priority: suspend has been broken on i915 chips for more than a year!

See https://bugs.launchpad.net/ubuntu/lucid/+source/xserver-xorg-video-intel/+bug/447159, where more similar stacktraces are available.
Comment 14 Milan Bouchet-Valat 2010-04-01 03:26:42 UTC
Thanks to the new report mechanism in Ubuntu 10.04, I've been able to get traces for the kernel oops that occurs before the X crash, and for that X crash at the same time. Do you think I should open a bug in bugzilla.kernel.org instead?

See 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/553176
https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/553174

Particularly interesting is:
http://launchpadlibrarian.net/42769271/OopsText.txt

In which we can see the trace leading to the oops:
 [<f857cac9>] ? intel_crt_detect+0x69/0xe0 [i915]
 [<f80ceeee>] ? drm_helper_probe_single_connector_modes+0x26e/0x300 [drm_kms_helper]
 [<f8368d5e>] ? drm_mode_object_find+0x4e/0x70 [drm]
 [<f8369b7f>] ? drm_mode_getconnector+0x2df/0x380 [drm]
 [<c0589b59>] ? mutex_lock+0x19/0x40
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<f835e7cd>] ? drm_ioctl+0x25d/0x3e0 [drm]
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<f83698a0>] ? drm_mode_getconnector+0x0/0x380 [drm]
 [<f835e570>] ? drm_ioctl+0x0/0x3e0 [drm]
 [<c0215f71>] ? vfs_ioctl+0x21/0x90
 [<c0216259>] ? do_vfs_ioctl+0x79/0x310
 [<c058d210>] ? do_page_fault+0x160/0x3a0
 [<c0216557>] ? sys_ioctl+0x67/0x80
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<c01033ec>] ? syscall_call+0x7/0xb
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
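
A trace like the one above can be broken into its parts with a short script. The Python sketch below is only an illustration: the function names are hypothetical, and labelling frames that carry no `[module]` suffix as `vmlinux` is a choice made here for readability. In kernel traces, a `?` before the symbol marks an address that was found on the stack but not confirmed by the unwinder, so those frames (every frame in this trace, including the repeated `ethtool_get_drvinfo` entries) should be read with caution.

```python
import re
from collections import Counter

# Matches frames such as:
#   [<f857cac9>] ? intel_crt_detect+0x69/0xe0 [i915]
FRAME = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\?\s+)?([A-Za-z0-9_.]+)"
    r"\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)


def parse_frames(trace):
    """Return one dict per stack frame found in the trace text."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if not match:
            continue
        frames.append({
            "address": int(match.group(1), 16),
            "unreliable": match.group(2) is not None,  # '?' marker present
            "symbol": match.group(3),
            "offset": int(match.group(4), 16),   # offset into the function
            "size": int(match.group(5), 16),     # total function size
            "module": match.group(6) or "vmlinux",  # label chosen here
        })
    return frames


def modules_in_trace(trace):
    """Count how many frames belong to each module."""
    return Counter(frame["module"] for frame in parse_frames(trace))
```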
Comment 15 Chris Wilson 2010-04-01 03:41:03 UTC
Ah, an OOPS! That makes a little more sense.

The userspace stacktraces are irrelevant, as any GPU hang or OOPS may trigger such a trace. That one identical symptom may result from any number of bugs, i.e. the duplicates are not necessarily all the same bug.
Comment 16 Milan Bouchet-Valat 2010-04-23 14:32:21 UTC
Can I do anything to ease debugging of this bug? I'd really like to help get this fixed. It is quite annoying, and it seems to affect many users, judging by the number of duplicates in Ubuntu alone.
Comment 17 Milan Bouchet-Valat 2010-05-23 03:09:44 UTC
Still happening with 2.6.34:
[ 2329.012081] PM: resume of devices complete after 1945.462 msecs
[ 2329.012241] PM: resume devices took 1.948 seconds
[ 2329.012273] PM: Finishing wakeup.
[ 2329.012276] Restarting tasks ... done.
[ 2329.050531] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[wait one hour or so]
[ 3529.748173] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 3529.748409] render error detected, EIR: 0x00000000
[ 3529.748455] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 527114 at 527113)
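
To quantify how long after resume the hang is reported, the bracketed kernel timestamps can be compared directly. This is a minimal Python sketch with a hypothetical helper name; it assumes the usual `[ seconds.micros] message` prefix seen in the excerpt above and skips any line that lacks it.

```python
import re

# Kernel log prefix: "[ 2329.012276] message text"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def seconds_between(log, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the next
    line containing end_marker, or None if either is missing."""
    start = None
    for line in log.splitlines():
        match = STAMP.match(line)
        if not match:
            continue  # e.g. a hand-written note between log lines
        stamp, message = float(match.group(1)), match.group(2)
        if start is None:
            if start_marker in message:
                start = stamp
        elif end_marker in message:
            return stamp - start
    return None
```

Applied to the excerpt above with the markers `"Restarting tasks"` and `"GPU hung"`, the two timestamps are about 1200 seconds apart.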
Comment 18 Chris Wilson 2010-07-10 07:14:14 UTC
Massive memory corruption following hibernation should be fixed with:

commit 985b823b919273fe1327d56d2196b4f92e5d0fae
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Jul 2 10:04:42 2010 +1000

    drm/i915: fix hibernation since i915 self-reclaim fixes
    
    Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915:
    Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the
    i915 page allocator where we weren't before due to some over-eager
    removal of the page mapping gfp_flags games the code used to play.
    
    This caused hibernate on Intel hardware to result in a lot of memory
    corruptions on resume.  See for example
    
      http://bugzilla.kernel.org/show_bug.cgi?id=13811

I suspect that it is the memory corruption that is the root cause here.
Comment 19 Chris Wilson 2010-07-24 04:36:45 UTC
2.6.35-rc6 has a further fix for corruption on hibernation which nobody has been able to break (so far).
Comment 20 Milan Bouchet-Valat 2010-07-29 09:06:41 UTC
Sorry, it's still here, though in a different form (apparently no oops):
$ uname -r
2.6.35-020635rc6-generic

/var/log/kern.log:
[ 1467.408347] PM: Finishing wakeup.
[ 1467.408350] Restarting tasks ... done.
[ 1467.434616] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[ 1467.747233] sky2 0000:02:00.0: eth0: enabling interface [...]
[ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1512.205452] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 11072 at 11071)

At this point, the X server is killed, and won't restart:
Fatal server error:
Failed to submit batchbuffer: Input/output error

Should I try with a more recent X version? Seems to me that the bug is still in the kernel itself, so it may not change anything.

As always, please just ask if you need more testing. This bug is really a bitch...
Comment 21 Chris Wilson 2010-07-29 11:52:40 UTC
That doesn't appear to be the same bug. And I should have pointed that out in comment 17...

The original bug with the OOPS could only be the result of memory corruption. The invalid framebuffer id could have been a symptom of the same memory corruption but now appears to be a more subtle issue.

Milan, please open a fresh bug report that focuses on the framebuffer id error. This is simply to try and keep the report coherent and so easier to review. [Otherwise when developers read the first few comments to familiarise themselves with the bug, then skip to the end to catch the new updates, the report no longer makes any sense.]
Comment 22 Milan Bouchet-Valat 2010-08-01 01:57:36 UTC
Filed as https://bugzilla.kernel.org/show_bug.cgi?id=16488

Sorry, I know mixing problems on a single report is messy, but I've already tracked this bug using about five different reports, which were closed, and I'm losing track myself... ;-)
Comment 23 Chris Wilson 2010-08-01 02:22:32 UTC
> --- Comment #22 from Milan Bouchet-Valat <nalimilan@club.fr> 2010-08-01 01:57:36 PDT ---
> Filed as https://bugzilla.kernel.org/show_bug.cgi?id=16488
> 
> Sorry, I know mixing problems on a single report is messy, but I've already
> tracked this bug using about five different reports, which were closed, and I'm
> losing track myself... ;-)

Thanks. We are getting closer, it looks like there may be a few related
bugs across the components that are complicating the issue,
e.g. bug 29320.