Summary: | [845G] Xserver cannot be started at all ("Aborted") | |
---|---|---|---
Product: | xorg | Reporter: | Stefan Dirsch <sndirsch>
Component: | Driver/intel | Assignee: | Jesse Barnes <jbarnes>
Status: | RESOLVED DUPLICATE | QA Contact: | Xorg Project Team <xorg-team>
Severity: | critical | |
Priority: | highest | CC: | antoni, eich, jmdorfman, kent.liu, libv, mat, mrmazda, neogw, quanxian.wang, zhenyu.z.wang
Version: | 7.4 (2008.09) | Keywords: | NEEDINFO
Hardware: | Other | |
OS: | All | |
Whiteboard: | | |
i915 platform: | | i915 features: |

Description
Stefan Dirsch
2008-10-28 08:10:41 UTC
Created attachment 19909 [details]
Xorg.0.log
Initially reported for an i845 Fujitsu Siemens machine.

I see exactly the same issue on an i845 Dell Optiplex 260 machine. I'll attach this logfile as well.

Created attachment 19911 [details]
Xorg.0.log.dell
Looks like we're destroying an I2C bus we never created. Can you get a backtrace of the problem? Just use gdb on the X server from an ssh session or something...

Thanks,
Jesse

Not sure why, but after rebooting the machine I'm now stumbling across a slightly different issue.

[...]
Could not init font path element /usr/share/fonts/OTF, removing from list!
intel_bufmgr_fake.c:392: Error waiting for fence: Device or resource busy.
Aborted

I'll attach the new logfile.

Created attachment 20065 [details]
Xorg.0.log
The second time you start the Xserver it gets even worse.

[...]
00000020: 00000000 MI_NOOP 1 Ring end space: 130996 wanted 131064
(II) intel(0): [drm] removed 1 reserved context for kernel
(II) intel(0): [drm] unmapping 8192 bytes of SAREA 0xe08f5000 at 0xb7a7a000
(II) intel(0): [drm] Closed DRM master.

Fatal server error:
lockup
_fence_emit_internal: drm_i915_irq_emit: -9
Aborted

I'll attach the logfile from second run.

Created attachment 20066 [details]
Xorg.0.log.2nd_run
# gdb /usr/bin/Xorg
[...]
(gdb) handle SIGUSR1 nostop
Signal        Stop  Print  Pass to program  Description
SIGUSR1       No    Yes    Yes              User defined signal 1
(gdb) run
[...]
Could not init font path element /usr/share/fonts/TTF/, removing from list!
Could not init font path element /usr/share/fonts/OTF, removing from list!

Program received signal SIGABRT, Aborted.
0xffffe430 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb7c979b0 in raise () from /lib/libc.so.6
#2  0xb7c992e8 in abort () from /lib/libc.so.6
#3  0xb7add38d in ?? () from /usr/lib/libdrm_intel.so.1
#4  0xb7ade315 in ?? () from /usr/lib/libdrm_intel.so.1
#5  0xb7ade64e in ?? () from /usr/lib/libdrm_intel.so.1
#6  0xb7adc2be in dri_bo_exec () from /usr/lib/libdrm_intel.so.1
#7  0xb7b0cd84 in intel_batch_flush () from /usr/lib/xorg/modules//drivers/intel_drv.so
#8  0xb7b370a0 in ?? () from /usr/lib/xorg/modules//drivers/intel_drv.so
#9  0xb7a1158f in exaFillRegionTiled () from /usr/lib/xorg/modules//libexa.so
#10 0xb7a11ab8 in ?? () from /usr/lib/xorg/modules//libexa.so
#11 0x08173824 in ?? ()
#12 0x0810f411 in miPaintWindow ()
#13 0x0810f782 in miWindowExposures ()
#14 0xb8091097 in DRIWindowExposures () from /usr/lib/xorg/modules//extensions/libdri.so
#15 0x080d615f in ?? ()
#16 0x08076503 in MapWindow ()
#17 0x080766a0 in InitRootWindow ()
#18 0x08070ce6 in main ()
(gdb)

Created attachment 20067 [details]
debug.output
This time with -debuginfo,-debugsource packages installed.
Since Kent asked me about the severity/priority.

Jesse, do you think comment #9 is something you were looking for a clue?

(In reply to comment #12)
> Jesse, do you think comment #9 is something you were looking for a clue?

I would prefer comment #10. :-)

Looks like there are two potential problems here: one that causes the crash that was initially reported (looks like we're using some bogus buffer during DDC probing), and another that appears to be an IRQ related failure that prevents us from seeing completed fences. Does the fence problem happen if you use a GEM enabled kernel? Preferably one from Eric's drm-intel tree (drm-intel-next branch)?

Jesse, could we concentrate on the Abort issue first? I would assume that you don't run into the fence issue without the Abort issue, right?

Which abort issue (they both have "aborted" in the crash message)? :) If you mean the one in comment #0 that opened the bug, I'm hoping that will be pretty easy to track down. It looks like the kind of problem that a gdb backtrace or stepping of the code could track down pretty easily.

If you can't reproduce that one anymore you should try the drm-intel-next branch; it has several IRQ fixes that I hope will get you going again well enough to solve the first crash. Though what's weird is that the first crash doesn't seem to happen every time... Not sure what might have changed in your config to hide it...

Jesse, the gdb backtrace in comment #10 is for the abort in the initial comment. Indeed there is an abort() in _fence_wait_internal(). I can reproduce the first crash (abort()) after each reboot.

What about with the most recent kernel bits from the for-airlied branch of the drm-intel tree? Some fixes for interrupt handling went in recently, and afaik 855 users see things working now, so I'm hoping 845 is better now too...

Jesse, didn't you state in comment #16 that the initial abort should be easy to track down with a gdb backtrace? Now there is such a gdb backtrace in comment #10, but you're not interested in looking at it?

And what do you mean with drm-intel tree/drm-intel-next branch? I couldn't find such branches in drm git. Are you talking of such special drm git trees and special branches in them?

> --- Comment #21 from Stefan Dirsch <sndirsch@suse.de> 2008-11-13 15:46:45 PST ---
> And what do you mean with drm-intel tree/drm-intel-next branch? I couldn't find
> such branches in drm git. Are you talking of such special drm git trees
> and special branches in them?
>
he's talking about
git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel

It didn't look to me like the backtrace in comment #10 was at all related to the initial crash reported in this bug. If it really is the same crash, then we're not looking at two bugs like I thought, just one. And it seems to be related to buffer object command completion. That's why I asked you to test the latest kernel bits, which have several fixes for 8xx series chips, among others.

Kernel DRM bits are queued up in Dave Airlie's tree these days, at git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git. Before that, Intel specific bits are queued in Eric's staging tree at git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git in the drm-intel-next (for the next kernel), for-airlied (critical fixes for the current kernel) and for-review (in progress fixes worth testing) branches.

If possible, I'd like you to try the for-review branch. If that fixes things for you, there are probably some IRQ related fixes that need backporting to your kernel.

Stefan, any results after trying that?

Honestly, I don't understand why it is required to use some random kernel git tree to possibly get this old Intel i845 hardware working again.

I think Jesse believes that tree may contain a fix for this bug. If it works, the fix is in there and can then be backported.

(In reply to comment #25)
> Honestly, I don't understand why it is required to use some random kernel git
> tree to possibly get this old Intel i845 hardware working again.

Stefan, drm-intel is not a random kernel git tree; it is the only development tree for the Intel graphics driver now. It contains all patches pending for Dave Airlie's tree.

Ok. I'll try to build and test with such a kernel.

With the for-review kernel I see a few lines of the usual X pattern. The rest of the screen remains black. Mouse cursor is viewable and movable. And when moving the cursor you see the following in the Xserver's log.

[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[...]

Attached gdb shows you:

0xffffe430 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb7bdd279 in ioctl () from /lib/libc.so.6
#2  0xb7aa944f in ?? () from /usr/lib/libdrm.so.2
#3  0xb7aa959a in drmCommandNone () from /usr/lib/libdrm.so.2
#4  0xb79bd380 in ?? () from /usr/lib/xorg/modules//drivers/intel_drv.so
#5  0x081666cf in ?? ()
#6  0x081401da in ?? ()
#7  0x0808f0c8 in BlockHandler ()
#8  0x0812cc7d in WaitForSomething ()
#9  0x0808b1de in Dispatch ()
#10 0x08070d4d in main ()
(gdb)

I think that's not really an improvement. :-(

Would it help to send an affected i845 machine to Intel? The issue is that the vesa driver also has issues with this hardware. Switching to Linux console results in a blank screen and switching back to X doesn't change this. :-(

Jesse, do you think the result of comment #29 is still related to buffer objects command completion, or IRQ stuff? Do you have further instructions for Stefan? And would it be better if Stefan ships one i845G to you for checking?

Ugg, yeah that sounds like even worse behavior, but thanks for testing, Stefan. Getting the hardware would help reduce the turnaround times for testing fixes, but atm I don't have any theories about what might be causing this problem; I'll look into the code some more and see where we might be missing stuff for 845.

Some background on this issue:

It looks like the problem is related to fence waiting. Ultimately, the I915_IRQ_WAIT ioctl is timing out, which means we didn't see the sequence number we were waiting for before the timeout occurred (timeout is 3 seconds). This could mean that the chip is hung for some reason (this would happen if we sent down bogus commands), or that our sequence numbers aren't being incremented properly.

The IRQ wait ioctl depends on the result of READ_BREADCRUMB(), which in turn depends on the value in the hardware status page. The status page should be set up by i915_initialize in the i915 drm code on this machine (since it doesn't need a status page in the GTT). You can confirm the addresses you're getting by instrumenting that function. The values are actually written by the command streamer when a CMD_STORE_DWORD_IDX command comes along (see i915_emit_irq and i915_emit_breadcrumb).

So assuming your hardware status page is set up correctly, you should see those values increment on every command or batch buffer submission and on every IRQ emit call. Which makes me think you're running into a more general chip hang (one of your logs points this way too, with a timeout in I830WaitLpRing).

We often see lockups due to render acceleration; does the hang occur if you set ExaNoComposite to true in the intel section of your xorg.conf?

*** Bug 18657 has been marked as a duplicate of this bug. ***

Stefan, as Jesse asked in comment 33, have you tried that?
"Does the hang occur if you set ExaNoComposite to true in the intel section of your xorg.conf?"
Thanks

Wow! Xserver starts fine with 'Option "ExaNoComposite" "true"'. That's good news.

*** This bug has been marked as a duplicate of bug 17713 ***

Created attachment 21144 [details]
Xorg.0.log and xorg.conf

We have a bug reported as Novell Bugzilla #442416 that points to this Intel bug as the upstream resolution.

Problem description: Not able to start X. The problem exists in SLES11 beta 1/2/3.
Error message: Not able to start X, with the error message "maximum number of X display failures reached".

We have tried comment #33. It does not resolve our problem. Attached is the log collected from trying comment #33.
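
For readers following the fence-wait explanation in comment #33, here is a minimal Python sketch of the idea, not the actual i915 code. It only models the behaviour described in that comment: a waiter polls a breadcrumb value until it reaches the wanted sequence number or a 3-second timeout expires. `StatusPage`, `wait_for_seq` and `simulate` are hypothetical names invented for this illustration, and the plain numeric comparison and the fake clock are assumptions of the sketch.

```python
import time


class StatusPage:
    """Hypothetical stand-in for the hardware status page.

    `breadcrumb` plays the role of the value READ_BREADCRUMB() reads.
    """

    def __init__(self):
        self.breadcrumb = 0


def wait_for_seq(page, wanted, timeout_s=3.0, clock=time.monotonic, poll=None):
    """Return True once page.breadcrumb reaches `wanted`, False on timeout.

    The 3-second default matches the timeout quoted in comment #33; the
    plain numeric comparison is an assumption of this sketch.
    """
    deadline = clock() + timeout_s
    while page.breadcrumb < wanted:
        if clock() >= deadline:
            return False  # the case reported here: the wait times out
        if poll is not None:
            poll()
    return True


def simulate(chip_hung, wanted=5):
    """Run one wait against a fake clock so the demo finishes instantly."""
    page = StatusPage()
    now = [0.0]

    def clock():
        return now[0]

    def poll():
        now[0] += 0.5  # pretend half a second passes between checks
        if not chip_hung:
            page.breadcrumb += 1  # a healthy chip keeps writing new values

    return wait_for_seq(page, wanted, clock=clock, poll=poll)


if __name__ == "__main__":
    print("healthy chip:", simulate(chip_hung=False))
    print("hung chip:", simulate(chip_hung=True))
```

In this model a hung chip and a status page that never gets updated look identical to the waiter: both end in a timeout. That is why comment #33 suggests instrumenting the status page setup separately from testing for a general chip hang.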