Summary: | [i915] OOPS in intel_crt_detect() after suspend/hibernate | ||
---|---|---|---|
Product: | xorg | Reporter: | Milan Bouchet-Valat <nalimilan> |
Component: | Driver/intel | Assignee: | Carl Worth <cworth> |
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> |
Severity: | major | ||
Priority: | medium | CC: | jbarnes, mat |
Version: | git | ||
Hardware: | x86 (IA32) | ||
OS: | Linux (All) | ||
Description
Milan Bouchet-Valat
2010-03-09 02:41:04 UTC
Created attachment 33884 [details]
:0.log
Created attachment 33887 [details]
Xorg.0.log.old
To be clear, I must add that those log files come from the failsafe X session that Ubuntu started after the crash. So they may not contain information about the crash itself, just about the fact that X is not able to restart correctly after the crash has occurred (I couldn't even switch to the consoles).
Still happening with xserver-video-intel 2.10.902+git20100317.31d5f84b. Anything I can do to help debugging?
Created attachment 34354 [details]
gdb trace of the SIGPIPE
Here's a gdb trace I could get of the crash, with kernel vmlinuz-2.6.33-997-generic (drm-intel-next), X.Org server 1.6.4, Intel driver 2.9.0. (Note this happens with more recent versions too, it's just that I'm using these as they don't suffer from other small issues.)
A funny thing is that while attached to gdb, X doesn't actually crash: I only get a SIGPIPE signal, and everything just works if I type 'continue'. When I detached Xorg, the signal killed the server (I guess that's intended).
Note that at the top of the trace, the call
__libc_writev (fd=-1218895884, vector=0xbfd61438, count=1)
is always exactly the same across different SIGPIPEs. Values of the parameters were the same the first time and the second time I received that signal (without restarting X).
Hope this helps, please, please ask if you need more details!
This is a gpu hang, so the most interesting information would be i915_error_state and register dumps [as suspend and resume is complicit we need to ensure we are restoring the gpu state correctly]. In terms of driver packages, the most important one to make sure is up-to-date is perhaps libdrm, preferably from drm.git but 2.4.19 at a minimum.
Ah, thanks for the feedback! When should I get the GPU dump? Right after the crash occurred, while gdb is blocking Xorg?
If you grab a intel-gpu-tools/tools/intel_reg_dump before suspending and after resume, and if xorg-edgers is recent enough, then the gpu dump will be in /sys/kernel/debug/dri/0/i915_error_state following a hang.
> --- Comment #4 from Milan Bouchet-Valat <nalimilan@club.fr> 2010-03-23 04:03:29 PST ---
> A funny thing is that while attached to gdb, X doesn't actually crash: I only
> get a SIGPIPE signal, and everything just works if I type 'continue'. When I
> detached Xorg, the signal killed the server (I guess that's intended).
>
X ignores SIGPIPE, you need to do the same in gdb. 'handle SIGPIPE
noprint nostop' at the gdb prompt should do the trick.
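As a side note on why ignoring SIGPIPE matters here: a process that ignores the signal sees a failed write (errno EPIPE) instead of being killed, which is presumably why the server keeps running under gdb once the signal is passed through silently. The following is a minimal Python sketch of that behaviour on Linux, not code from the X server; the function name `write_to_closed_pipe` is made up for illustration.

```python
import errno
import os
import signal


def write_to_closed_pipe():
    """Write to a pipe whose read end is closed, with SIGPIPE ignored.

    Returns the errno of the failed write, or None if the write succeeded.
    """
    # Ignore SIGPIPE so the write reports an error instead of killing us.
    previous = signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # nobody will ever read from this pipe
    try:
        os.write(write_fd, b"x")
    except OSError as exc:
        return exc.errno
    finally:
        os.close(write_fd)
        # Restore whatever SIGPIPE disposition was in place before.
        signal.signal(signal.SIGPIPE, previous)
    return None


if __name__ == "__main__":
    print(errno.errorcode.get(write_to_closed_pipe()))
```

With the default disposition (SIG_DFL) the same write would terminate the process, which matches what was observed after detaching gdb.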
Created attachment 34403 [details]
second gdb trace of the crash, SIGPIPE handling disabled
So here's a new gdb trace with SIGPIPE handling disabled, as asked above.
The screen turned uniformly orange-pink, and typing 'continue' didn't change anything to it. Hitting Ctrl+Alt+F[1-8] provoked another interruption in gdb, continuing didn't trigger anything new; hitting Ctrl+Alt+F[1-8] again had no effect, but going back to Ctrl+Alt+F7 provoked interruption in gdb, without changing the screen state.
Software versions:
xserver-xorg-core 1.6.5+git20091107+server-1.6-branch.2dbcb06a
xserver-xorg-video-intel 2.10.902+git20100317.31d5f84b
libdrm-intel1 2.4.19+git20100318.56712821
kernel drm-intel-next 2.6.33-997
Created attachment 34404 [details]
output of intel_reg_dumper before suspending
Created attachment 34405 [details]
output of intel_reg_dumper after returning from suspend
Created attachment 34406 [details]
output of intel_reg_dumper after lock (during gdb interruption)
Here are the GPU dumps. Hope this is what you need, I didn't completely understand your comment about grabbing 'a intel-gpu-tools/tools/intel_reg_dump'.
/sys/kernel/debug/dri/0/i915_error_state always said there was no error to report, at all of the 3 stages I checked it.
Does that help debugging?
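For anyone comparing the before/after register dumps attached above, a small script can list the registers whose values changed across suspend. This is only a sketch: it assumes each dump line looks like `NAME: 0xVALUE` optionally followed by decoded text, which may not match every line the tool prints, and the helper names `parse_reg_dump` and `diff_reg_dumps` are hypothetical.

```python
import re

# Assumed line shape: "<REGISTER>: 0x<hex> (optional decoded text)".
REG_LINE = re.compile(r"^\s*(?P<name>[A-Za-z0-9_]+):\s*(?P<value>0x[0-9a-fA-F]+)")


def parse_reg_dump(text):
    """Return {register name: integer value} for every line that matches."""
    regs = {}
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            regs[match.group("name")] = int(match.group("value"), 16)
    return regs


def diff_reg_dumps(before, after):
    """Return {name: (old, new)} for registers present in both dumps that differ."""
    old = parse_reg_dump(before)
    new = parse_reg_dump(after)
    return {
        name: (old[name], new[name])
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    }


if __name__ == "__main__":
    # Placeholder register names and values, not taken from the attachments.
    before = "REG_A: 0x80000000 (enabled)\nREG_B: 0x00000001\n"
    after = "REG_A: 0x00000000 (disabled)\nREG_B: 0x00000001\n"
    for name, (was, now) in diff_reg_dumps(before, after).items():
        print(f"{name}: {was:#010x} -> {now:#010x}")
```

Registers that appear in only one of the two dumps are ignored here; reporting them separately would be a reasonable extension.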
I've just found out that a report in Ubuntu's Launchpad has 165 people marked as affected, with about 30 duplicate reports. I think that deserves a higher priority - it's been more than a year suspend is broken on i915 chips! See https://bugs.launchpad.net/ubuntu/lucid/+source/xserver-xorg-video-intel/+bug/447159, where more similar stacktraces are available.
Thanks to the new report mechanism in Ubuntu 10.04, I've been able to get traces for the kernel oops that occurs before the X crash, and of that X crash at the same time. Do you think I should open a bug in bugzilla.kernel.org rather? See
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/553176
https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/553174
Particularly interesting is:
http://launchpadlibrarian.net/42769271/OopsText.txt
In which we can see the trace leading to the oops:
[<f857cac9>] ? intel_crt_detect+0x69/0xe0 [i915]
[<f80ceeee>] ? drm_helper_probe_single_connector_modes+0x26e/0x300 [drm_kms_helper]
[<f8368d5e>] ? drm_mode_object_find+0x4e/0x70 [drm]
[<f8369b7f>] ? drm_mode_getconnector+0x2df/0x380 [drm]
[<c0589b59>] ? mutex_lock+0x19/0x40
[<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
[<f835e7cd>] ? drm_ioctl+0x25d/0x3e0 [drm]
[<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
[<f83698a0>] ? drm_mode_getconnector+0x0/0x380 [drm]
[<f835e570>] ? drm_ioctl+0x0/0x3e0 [drm]
[<c0215f71>] ? vfs_ioctl+0x21/0x90
[<c0216259>] ? do_vfs_ioctl+0x79/0x310
[<c058d210>] ? do_page_fault+0x160/0x3a0
[<c0216557>] ? sys_ioctl+0x67/0x80
[<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
[<c01033ec>] ? syscall_call+0x7/0xb
[<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
[<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
Ah, an OOPS! That makes a little more sense. The userspace stacktraces are irrelevant, as any GPU hang or OOPS may trigger such a trace -- that one identical symptom may imply any number of bugs, i.e. all the duplicates are not necessarily duplicate bugs.
Can I do anything to ease debugging on this bug?
I'd really like to help and get this fixed, this is quite annoying, and it seems to affect many users, seeing the number of duplicates only in Ubuntu.
Still happening with 2.6.34:
[ 2329.012081] PM: resume of devices complete after 1945.462 msecs
[ 2329.012241] PM: resume devices took 1.948 seconds
[ 2329.012273] PM: Finishing wakeup.
[ 2329.012276] Restarting tasks ... done.
[ 2329.050531] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[wait one hour or so]
[ 3529.748173] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 3529.748409] render error detected, EIR: 0x00000000
[ 3529.748455] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 527114 at 527113)
Massive memory corruption following hibernation should be fixed with:
commit 985b823b919273fe1327d56d2196b4f92e5d0fae
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri Jul 2 10:04:42 2010 +1000
drm/i915: fix hibernation since i915 self-reclaim fixes
Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915: Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the i915 page allocator where we weren't before due to some over-eager removal of the page mapping gfp_flags games the code used to play.
This caused hibernate on Intel hardware to result in a lot of memory corruptions on resume. See for example
http://bugzilla.kernel.org/show_bug.cgi?id=13811
I suspect that it is the memory corruption that is the root cause here.
2.6.35-rc6 has a further fix for corruption on hibernation which nobody has been able to break (so far).
Sorry, but it's still here, but in a different form (apparently no oops):
$ uname -r
2.6.35-020635rc6-generic
/var/log/kern.log:
[ 1467.408347] PM: Finishing wakeup.
[ 1467.408350] Restarting tasks ... done.
[ 1467.434616] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[ 1467.747233] sky2 0000:02:00.0: eth0: enabling interface
[...]
[ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1512.205452] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 11072 at 11071)
At this point, the X server is killed, and won't restart:
Fatal server error: Failed to submit batchbuffer: Input/output error
Should I try with a more recent X version? Seems to me that the bug is still in the kernel itself, so it may not change anything. As always, please just ask if you need more testing. This bug is really a bitch...
That doesn't appear to be the same bug. And I should have pointed that out in comment 17... The original bug with the OOPs could only be the result of memory corruption. The invalid framebuffer id could have been a symptom of the same memory corruption but now appears to be a more subtle issue.
Milan, please open a fresh bug report that focuses on the framebuffer id error. This is simply to try and keep the report coherent and so easier to review. [Otherwise when developers read the first few comments to familiarise themselves with the bug, then skip to the end to catch the new updates, the report no longer makes any sense.]
Filed as https://bugzilla.kernel.org/show_bug.cgi?id=16488
Sorry, I know mixing problems on a single report is messy, but I've already tracked this bug using about five different reports, which were closed, and I'm losing track myself... ;-)
> --- Comment #22 from Milan Bouchet-Valat <nalimilan@club.fr> 2010-08-01 01:57:36 PDT ---
> Filed as https://bugzilla.kernel.org/show_bug.cgi?id=16488
>
> Sorry, I know mixing problems on a single report is messy, but I've already
> tracked this bug using about five different reports, which were closed, and I'm
> losing track myself... ;-)
Thanks. We are getting closer, it looks like there may be a few related bugs across the components that are complicating the issue, e.g. bug 29320.
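For readers collecting similar reports, the `[drm:...] *ERROR*` lines quoted in this thread can be pulled out of a kernel log with a short script, which makes it easy to see how long after resume the hangcheck fires. This is a rough sketch that assumes the log line shape shown in the comments above (a bracketed timestamp followed by `[drm:function] *ERROR* message`); the function name `drm_errors` is made up for illustration.

```python
import re

# Line shape as quoted in this report, e.g.:
# "[ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung"
DRM_ERROR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:(?P<func>\w+)\]\s+\*ERROR\*\s+(?P<msg>.*)"
)


def drm_errors(log_text):
    """Return (timestamp_seconds, function, message) for each drm *ERROR* line."""
    events = []
    for line in log_text.splitlines():
        match = DRM_ERROR.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("func"), match.group("msg").strip())
            )
    return events


if __name__ == "__main__":
    sample = """\
[ 1467.408350] Restarting tasks ... done.
[ 1467.434616] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[ 1467.747233] sky2 0000:02:00.0: eth0: enabling interface
[ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
"""
    events = drm_errors(sample)
    for ts, func, msg in events:
        print(f"{ts:12.6f}  {func}: {msg}")
    print(f"seconds between first and last error: {events[-1][0] - events[0][0]:.1f}")
```

Lines without the `[drm:...]` prefix, such as the `render error detected` message, are deliberately skipped by this pattern.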