Bug 18270

Summary: [845G] Xserver cannot be started at all ("Aborted")
Product: xorg
Component: Driver/intel
Status: RESOLVED DUPLICATE
Severity: critical
Priority: highest
Version: 7.4 (2008.09)
Hardware: Other
OS: All
Reporter: Stefan Dirsch <sndirsch>
Assignee: Jesse Barnes <jbarnes>
QA Contact: Xorg Project Team <xorg-team>
CC: antoni, eich, jmdorfman, kent.liu, libv, mat, mrmazda, neogw, quanxian.wang, zhenyu.z.wang
Keywords: NEEDINFO
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                 Flags
Xorg.0.log                  none
Xorg.0.log.dell             none
Xorg.0.log                  none
Xorg.0.log.2nd_run          none
debug.output                none
Xorg.0.log and xorg.conf    none

Description Stefan Dirsch 2008-10-28 08:10:41 UTC
Xserver cannot be started at all on i845. :-(

[...]
(II) intel(0): I2C bus "CRTDDC_A" removed.
(II) I2C bus "�qŷ�qŷ�?�?`;`;�ŷ�ŷ�ŷ�ŷ�ŷ�ŷ�ŷ�ŷ�)�)�ŷ�ŷ" removed.
Aborted

I'll attach logfile. Components are Intel Q3 release:

- xorg-server 1.5.2
- xf86-video-intel 2.5.0
- libdrm > 2.4.0 (commit a59ea02)
- Mesa 'intel-2008-q3' branch, commit 46921a5
Comment 1 Stefan Dirsch 2008-10-28 08:12:07 UTC
Created attachment 19909 [details]
Xorg.0.log
Comment 2 Stefan Dirsch 2008-10-28 09:59:40 UTC
This was initially reported for an i845 Fujitsu Siemens machine; I see exactly the same issue on an i845 Dell Optiplex 260 machine. I'll attach that logfile as well.
Comment 3 Stefan Dirsch 2008-10-28 10:00:11 UTC
Created attachment 19911 [details]
Xorg.0.log.dell
Comment 4 Jesse Barnes 2008-11-04 17:13:53 UTC
Looks like we're destroying an I2C bus we never created.  Can you get a backtrace of the problem?  Just use gdb on the X server from an ssh session or something...

Thanks,
Jesse
Comment 5 Stefan Dirsch 2008-11-05 02:37:57 UTC
Not sure why, but after rebooting the machine I'm now stumbling across a slightly different issue.

[...]
Could not init font path element /usr/share/fonts/OTF, removing from list!
intel_bufmgr_fake.c:392: Error waiting for fence: Device or resource busy.
Aborted

I'll attach the new logfile.
Comment 6 Stefan Dirsch 2008-11-05 02:38:56 UTC
Created attachment 20065 [details]
Xorg.0.log
Comment 7 Stefan Dirsch 2008-11-05 02:40:39 UTC
The second time you start the Xserver it gets even worse.

[...]
        00000020: 00000000      MI_NOOP                                  1
Ring end
space: 130996 wanted 131064
(II) intel(0): [drm] removed 1 reserved context for kernel
(II) intel(0): [drm] unmapping 8192 bytes of SAREA 0xe08f5000 at 0xb7a7a000
(II) intel(0): [drm] Closed DRM master.

Fatal server error:
lockup

_fence_emit_internal: drm_i915_irq_emit: -9
Aborted

I'll attach the logfile from second run.
Comment 8 Stefan Dirsch 2008-11-05 02:41:50 UTC
Created attachment 20066 [details]
Xorg.0.log.2nd_run
Comment 9 Stefan Dirsch 2008-11-05 02:53:58 UTC
# gdb /usr/bin/Xorg 
[...]
(gdb) handle SIGUSR1 nostop
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        Yes     Yes             User defined signal 1
(gdb) run
[...]
Could not init font path element /usr/share/fonts/TTF/, removing from list!
Could not init font path element /usr/share/fonts/OTF, removing from list!

Program received signal SIGABRT, Aborted.
0xffffe430 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb7c979b0 in raise () from /lib/libc.so.6
#2  0xb7c992e8 in abort () from /lib/libc.so.6
#3  0xb7add38d in ?? () from /usr/lib/libdrm_intel.so.1
#4  0xb7ade315 in ?? () from /usr/lib/libdrm_intel.so.1
#5  0xb7ade64e in ?? () from /usr/lib/libdrm_intel.so.1
#6  0xb7adc2be in dri_bo_exec () from /usr/lib/libdrm_intel.so.1
#7  0xb7b0cd84 in intel_batch_flush ()
   from /usr/lib/xorg/modules//drivers/intel_drv.so
#8  0xb7b370a0 in ?? () from /usr/lib/xorg/modules//drivers/intel_drv.so
#9  0xb7a1158f in exaFillRegionTiled () from /usr/lib/xorg/modules//libexa.so
#10 0xb7a11ab8 in ?? () from /usr/lib/xorg/modules//libexa.so
#11 0x08173824 in ?? ()
#12 0x0810f411 in miPaintWindow ()
#13 0x0810f782 in miWindowExposures ()
#14 0xb8091097 in DRIWindowExposures ()
   from /usr/lib/xorg/modules//extensions/libdri.so
#15 0x080d615f in ?? ()
#16 0x08076503 in MapWindow ()
#17 0x080766a0 in InitRootWindow ()
#18 0x08070ce6 in main ()
(gdb) 
Comment 10 Stefan Dirsch 2008-11-05 03:06:08 UTC
Created attachment 20067 [details]
debug.output

This time with -debuginfo,-debugsource packages installed.
Comment 11 Stefan Dirsch 2008-11-05 22:21:36 UTC
Since Kent asked me about the severity/priority.
Comment 12 Kent Liu 2008-11-10 23:54:30 UTC
Jesse, do you think comment #9 is something you were looking for a clue?
Comment 13 Stefan Dirsch 2008-11-11 00:10:22 UTC
(In reply to comment #12)
> Jesse, do you think comment #9 is something you were looking for a clue?

I would prefer comment #10. :-)

Comment 14 Jesse Barnes 2008-11-11 10:27:34 UTC
Looks like there are two potential problems here: one that causes the crash that was initially reported (looks like we're using some bogus buffer during DDC probing), and another that appears to be an IRQ related failure that prevents us from seeing completed fences.  Does the fence problem happen if you use a GEM enabled kernel?  Preferably one from Eric's drm-intel tree (drm-intel-next branch)?
Comment 15 Stefan Dirsch 2008-11-11 11:11:14 UTC
Jesse, could we concentrate on the Abort issue first? I would assume that you don't run into the fence issue without the Abort issue, right?
Comment 16 Jesse Barnes 2008-11-11 11:48:20 UTC
Which abort issue (they both have "aborted" in the crash message)? :)

If you mean the one in comment #0 that opened the bug, I'm hoping that will be pretty easy to track down.  It looks like the kind of problem that a gdb backtrace or stepping of the code could track down pretty easily.

If you can't reproduce that one anymore you should try the drm-intel-next branch; it has several IRQ fixes that I hope will get you going again well enough to solve the first crash.
Comment 17 Jesse Barnes 2008-11-11 11:49:18 UTC
Though what's weird is that the first crash doesn't seem to happen every time...  Not sure what might have changed in your config to hide it...
Comment 18 Stefan Dirsch 2008-11-11 12:14:15 UTC
Jesse, the gdb backtrace in comment #10 is for the abort in the initial comment. Indeed there is an abort() in _fence_wait_internal().
Comment 19 Stefan Dirsch 2008-11-11 12:16:31 UTC
I can reproduce the first crash (abort()) after each reboot.
Comment 20 Jesse Barnes 2008-11-13 13:18:19 UTC
What about with the most recent kernel bits from the for-airlied branch of the drm-intel tree?  Some fixes for interrupt handling went in recently, and afaik 855 users see things working now, so I'm hoping 845 is better now too...
Comment 21 Stefan Dirsch 2008-11-13 15:46:45 UTC
Jesse, didn't you state in comment #16 that the initial abort should be easy to track down with a gdb backtrace? Now there is such a backtrace in comment #10, but you're not interested in looking at it?

And what do you mean with drm-intel tree/drm-intel-next branch? I couldn't find such branches in drm git. Are you talking of such special drm git trees
and special branches in them?
Comment 22 Julien Cristau 2008-11-13 16:24:36 UTC
> --- Comment #21 from Stefan Dirsch <sndirsch@suse.de>  2008-11-13 15:46:45 PST ---
> And what do you mean with drm-intel tree/drm-intel-next branch? I couldn't find
> such branches in drm git. Are you talking of such special drm git trees
> and special branches in them?
> 
he's talking about
git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
Comment 23 Jesse Barnes 2008-11-13 16:28:27 UTC
It didn't look to me like the backtrace in comment #10 was at all related to
the initial crash reported in this bug.  If it really is the same crash, then
we're not looking at two bugs like I thought, just one.  And it seems to be
related to buffer object command completion.  That's why I asked you to test
the latest kernel bits, which have several fixes for 8xx series chips, among
others.

Kernel DRM bits are queued up in Dave Airlie's tree these days, at
git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git.  Before
that, Intel specific bits are queued in Eric's staging tree at
git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git in the
drm-intel-next (for the next kernel), for-airlied (critical fixes for the
current kernel) and for-review (in progress fixes worth testing) branches.

If possible, I'd like you to try the for-review branch.  If that fixes things
for you, there are probably some IRQ related fixes that need backporting to
your kernel.
Comment 24 qwang13 2008-11-17 01:57:16 UTC
Stefan,
Any results after trying that?
Comment 25 Stefan Dirsch 2008-11-17 02:46:14 UTC
Honestly, I don't understand why it is required to use some random kernel git tree to possibly get his old Intel i845 hardware working again.
Comment 26 qwang13 2008-11-17 02:57:37 UTC
I think Jesse believes a fix for this bug may already be in that tree. If it works, then the fix is in there and can be backported.
Comment 27 Kent Liu 2008-11-17 06:04:49 UTC
(In reply to comment #25)
> Honestly, I don't understand why it is required to use some random kernel git
> tree to possibly get his old Intel i845 hardware working again.

Stefan, drm-intel is not a random kernel git tree; it is now the only development tree for the Intel graphics driver. It contains all patches pending for Dave Airlie's tree.
Comment 28 Stefan Dirsch 2008-11-18 14:05:18 UTC
Ok. I'll try to build and test with such a kernel.
Comment 29 Stefan Dirsch 2008-11-20 08:51:18 UTC
With the for-review kernel I see a few lines of the usual X pattern. The rest of the screen remains black. Mouse cursor is viewable and movable. And when moving the cursor you see the following in the Xserver's log.

[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[...]

Attached gdb shows you:

0xffffe430 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb7bdd279 in ioctl () from /lib/libc.so.6
#2  0xb7aa944f in ?? () from /usr/lib/libdrm.so.2
#3  0xb7aa959a in drmCommandNone () from /usr/lib/libdrm.so.2
#4  0xb79bd380 in ?? () from /usr/lib/xorg/modules//drivers/intel_drv.so
#5  0x081666cf in ?? ()
#6  0x081401da in ?? ()
#7  0x0808f0c8 in BlockHandler ()
#8  0x0812cc7d in WaitForSomething ()
#9  0x0808b1de in Dispatch ()
#10 0x08070d4d in main ()
(gdb) 

I think that's not really an improvement. :-(
Comment 30 Stefan Dirsch 2008-11-22 02:33:16 UTC
Would it help to send an affected i845 machine to Intel? The issue is that the vesa driver also has issues with this hardware. Switching to Linux console results in a blank screen and switching back to X doesn't change this. :-(
Comment 31 Kent Liu 2008-11-23 18:31:01 UTC
Jesse, do you think the result of comment #29 is still related to buffer objects command completion, or IRQ stuff? Do you have further instructions for Stefan?

And would it be better if Stefan shipped an i845G machine to you for checking?
Comment 32 Jesse Barnes 2008-11-24 12:28:13 UTC
Ugg, yeah that sounds like even worse behavior, but thanks for testing, Stefan.

Getting the hardware would help reduce the turnaround times for testing fixes, but atm I don't have any theories about what might be causing this problem; I'll look into the code some more and see where we might be missing stuff for 845.
Comment 33 Jesse Barnes 2008-11-24 15:24:43 UTC
Some background on this issue:

It looks like the problem is related to fence waiting.  Ultimately, the I915_IRQ_WAIT ioctl is timing out, which means we didn't see the sequence number we were waiting for before the timeout occurred (timeout is 3 seconds).

This could mean that the chip is hung for some reason (this would happen if we sent down bogus commands), or that our sequence numbers aren't being incremented properly.

The IRQ wait ioctl depends on the result of READ_BREADCRUMB(), which in turn depends on the value in the hardware status page.  The status page should be set up by i915_initialize in the i915 drm code on this machine (since it doesn't need a status page in the GTT).  You can confirm the addresses you're getting by instrumenting that function.

The values are actually written by the command streamer when a CMD_STORE_DWORD_IDX command comes along (see i915_emit_irq and i915_emit_breadcrumb).

So assuming your hardware status page is set up correctly, you should see those values increment on every command or batch buffer submission and on every IRQ emit call.  Which makes me think you're running into a more general chip hang (one of your logs points this way too, with a timeout in I830WaitLpRing).
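
The wait described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the i915 code: FakeGpu, tick() and irq_wait() are hypothetical names, time is counted in abstract ticks rather than the 3 second timeout, and sequence number wraparound is ignored. It shows why a hung chip and a breadcrumb that never advances look the same to the waiter: in both cases the wait expires and reports "busy", which is consistent with the "Device or resource busy" message in comment #5.

```python
import errno


class FakeGpu:
    """Hypothetical stand-in for the chip plus its hardware status page."""

    def __init__(self, hung=False):
        self.breadcrumb = 0  # value the command streamer would store
        self.hung = hung

    def tick(self):
        # A healthy chip retires one command per tick; a hung chip retires none.
        if not self.hung:
            self.breadcrumb += 1


def irq_wait(gpu, wanted_seq, timeout_ticks):
    """Return 0 once the breadcrumb reaches wanted_seq, or -EBUSY on timeout."""
    for _ in range(timeout_ticks + 1):
        if gpu.breadcrumb >= wanted_seq:
            return 0
        gpu.tick()
    return -errno.EBUSY


if __name__ == "__main__":
    print("healthy:", irq_wait(FakeGpu(), wanted_seq=3, timeout_ticks=10))
    print("hung:   ", irq_wait(FakeGpu(hung=True), wanted_seq=3, timeout_ticks=10))
```

In this model the only way to tell the two failure causes apart is to inspect the breadcrumb value directly, which is what instrumenting the status page setup would allow.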

We often see lockups due to render acceleration; does the hang occur if you set ExaNoComposite to true in the intel section of your xorg.conf?
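
For reference, driver options like this one go in the Device section that loads the intel driver. A minimal sketch, assuming an otherwise default xorg.conf; the Identifier value is a placeholder and should match whatever the existing configuration uses:

```
Section "Device"
    Identifier "Intel 845G"    # placeholder name, an assumption
    Driver     "intel"
    Option     "ExaNoComposite" "true"
EndSection
```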
Comment 34 Gordon Jin 2008-11-25 17:33:50 UTC
*** Bug 18657 has been marked as a duplicate of this bug. ***
Comment 35 qwang13 2008-11-26 06:23:33 UTC
Stefan,
Have you tried what Jesse suggested in comment 33?

"
does the hang occur if you set
ExaNoComposite to true in the intel section of your xorg.conf?
"

Thanks
Comment 36 Stefan Dirsch 2008-11-27 10:05:49 UTC
Wow! Xserver starts fine with 'Option "ExaNoComposite" "true"'. That's good news.

Comment 37 Michael Fu 2008-12-14 20:54:06 UTC

*** This bug has been marked as a duplicate of bug 17713 ***
Comment 38 Guek Wu Neo 2008-12-14 22:03:19 UTC
Created attachment 21144 [details]
Xorg.0.log and xorg.conf

We have a bug reported in Novell Bugzilla (#442416) that points to this Intel bug as its upstream resolution.

Problem description: Not able to start X. The problem exists in SLES11 beta 1/2/3.
Error message: "maximum number of X display failures reached".


We have tried the suggestion in comment #33. It does not resolve our problem. Attached is the log collected while trying it.
