Bug 18922

Summary: [G45] hang in driWaitForVblank with compiz
Product: xorg
Component: Driver/intel
Status: RESOLVED FIXED
Severity: critical
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Reporter: martin <mnemo>
Assignee: Jesse Barnes <jbarnes>
QA Contact: Xorg Project Team <xorg-team>
CC: khashayar.lists, mozilla_bugs, nemesis
Keywords: NEEDINFO
Attachments (description, flags):
backtrace from hung xserver (flags: none)
xorg_log with "x has gone into inf loop" backtrace (related to EQ overflow) (flags: none)
xorg_log taken at similar freeze (identical stack in gdb) but this time no "EQ overflow" in xorg_log (flags: none)
dmesg (saved from ssh while freeze was still in effect, but the gksu segv probably has nothing to do with it, right?) (flags: none)
archive containing Xorg log + the output of everything under /proc/dri/0/ (flags: none)
xorg_log, dmesg, gdb, uname and 2 snapshots of /proc/dri/0/i915_gem_interrupt (flags: none)
Xorg log from infinite loop causing mouse non-responsiveness (flags: none)

Description martin 2008-12-07 06:42:25 UTC
Created attachment 20866
backtrace from hung xserver

Just now I started Firefox and my xserver froze. I grabbed a backtrace from gdb, and X was stuck waiting on some ioctl(). I also took a copy of xorg.log, which contains a "server is probably stuck in an infinite loop" backtrace, as follows:

[mi] EQ overflowing. The server is probably stuck in an infinite loop.

Backtrace:
0: /usr/X11R6/bin/X(xorg_backtrace+0x26) [0x4ee1a6]
1: /usr/X11R6/bin/X(mieqEnqueue+0x291) [0x4cebd1]
2: /usr/X11R6/bin/X(xf86PostMotionEventP+0xc4) [0x498554]
3: /usr/X11R6/bin/X(xf86PostMotionEvent+0xb1) [0x498731]
4: /usr/lib/xorg/modules/input//evdev_drv.so [0x7f26a0a559b2]
5: /usr/X11R6/bin/X [0x481625]
6: /usr/X11R6/bin/X [0x472147]
7: /lib/libpthread.so.0 [0x7f26b9bbf080]
8: /lib/libc.so.6(ioctl+0x7) [0x7f26b8221d87]
9: /usr/lib/libdrm.so.2 [0x7f26b6dfe8d3]
10: /usr/lib/libdrm.so.2(drmWaitVBlank+0x20) [0x7f26b6dfed70]
11: /usr/lib/dri/i965_dri.so [0x7f26a5ccb85e]
12: /usr/lib/dri/i965_dri.so(driWaitForVBlank+0x110) [0x7f26a5ccbaf0]
13: /usr/lib/dri/i965_dri.so(intelSwapBuffers+0xe5) [0x7f26a5cd53d5]
14: /usr/lib/dri/i965_dri.so [0x7f26a5cccdef]
15: /usr/lib/xorg/modules/extensions//libglx.so [0x7f26b7659b5f]
16: /usr/lib/xorg/modules/extensions//libglx.so [0x7f26b764d936]
17: /usr/lib/xorg/modules/extensions//libglx.so [0x7f26b7650bd2]
18: /usr/X11R6/bin/X(Dispatch+0x364) [0x44d754]
19: /usr/X11R6/bin/X(main+0x45d) [0x43376d]
20: /lib/libc.so.6(__libc_start_main+0xe6) [0x7f26b8162586]
21: /usr/X11R6/bin/X [0x432b49]
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
Comment 1 martin 2008-12-07 06:43:30 UTC
Created attachment 20867
xorg_log with "x has gone into inf loop" backtrace (related to EQ overflow)
Comment 2 martin 2008-12-07 06:50:21 UTC
Previously, when I used the 2.6.27 kernel with intel 2.4.1, I never saw this particular freeze at all.

When I upgraded to jaunty I got the 2.6.28 kernel and intel 2.5.1, and I started running into this problem.

This bug is not reproducible by specific steps, though; it just happens at random times, out of the blue. If you need additional info, I can write it down on a post-it next to my machine and try to collect that data the next time the bug is triggered.

My hardware is an x64 box with an Intel G45 (some lines from "lspci -nn" below):
00:00.0 Host bridge [0600]: Intel Corporation 4 Series Chipset DRAM Controller [8086:2e20] (rev 03)
00:02.0 VGA compatible controller [0300]: Intel Corporation 4 Series Chipset Integrated Graphics Controller [8086:2e22] (rev 03)
00:02.1 Display controller [0380]: Intel Corporation 4 Series Chipset Integrated Graphics Controller [8086:2e23] (rev 03)

I have "current jaunty versions" which right now means:
ii  libdrm-intel1                             2.4.1-0ubuntu7
ii  xserver-xorg-video-intel                  2:2.5.1-1ubuntu5
ii  linux-image-2.6.28-2-generic              2.6.28-2.3
ii  xserver-xorg                              1:7.4~5ubuntu5
ii  libgl1-mesa-dri                           7.2-1ubuntu2

Comment 3 martin 2008-12-07 12:09:30 UTC
I just hit this exact same freeze again, that's two times today so far. Seems like a pretty hard hitting bug.
Comment 4 Gordon Jin 2008-12-07 21:51:06 UTC
This looks similar to http://bugzilla.kernel.org/show_bug.cgi?id=12166. Are you including vesafb in kernel too? If so, try removing it.
Comment 5 martin 2008-12-08 10:20:19 UTC
It appears that I do _NOT_ have VESA compiled into the kernel (I'm running the stock ubuntu kernel). In fact I don't have anything containing "fb" compiled into the kernel. For details, look below:

mnemo@kingfish:~$ cat /boot/config-2.6.28-2-generic | grep -i vesa
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_UVESA=m
CONFIG_FB_VESA=m
mnemo@kingfish:~$ uname -a
Linux kingfish 2.6.28-2-generic #3-Ubuntu SMP Thu Dec 4 21:49:26 UTC 2008 x86_64 GNU/Linux
mnemo@kingfish:~$ cat /boot/config-2.6.28-2-generic | grep -i vesa
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_UVESA=m
CONFIG_FB_VESA=m
mnemo@kingfish:~$ cat /boot/config-2.6.28-2-generic | grep -i fb | grep -v m
mnemo@kingfish:~$ 
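(A stricter form of the last check, as a sketch only: `grep -v m` drops every line that contains a lowercase "m" anywhere, so it is a loose filter. Assuming the usual `CONFIG_NAME=y` layout of the config file, the helper below prints only framebuffer options that are built in, skipping `=m` modules and "is not set" lines. The name `builtin_fb_options` is hypothetical and not part of any tool. Fed only the three VESA lines shown above, it would print `CONFIG_FB_BOOT_VESA_SUPPORT=y`.)

```shell
#!/bin/bash
# builtin_fb_options: hypothetical helper name, used here for illustration.
# Reads a kernel config on stdin and prints only the options whose name
# contains "FB" and that are set to "y" (built in). Lines set to "m"
# (module) and commented "is not set" lines do not match.
builtin_fb_options() {
    grep -E '^CONFIG_[A-Z0-9_]*FB[A-Z0-9_]*=y$'
}

# Usage (path taken from the transcript above):
#   builtin_fb_options < /boot/config-2.6.28-2-generic
```

An empty result from this helper would mean no framebuffer option is built in; any `=y` line it prints is compiled into the kernel image rather than loaded as a module.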
Comment 6 martin 2008-12-09 10:27:02 UTC
Created attachment 20959
xorg_log taken at similar freeze (identical stack in gdb) but this time no "EQ overflow" in xorg_log

Today I found my jaunty xorg hung again, and I saw the exact same stack in gdb as the one I reported above. However, this time the xorg_log did not mention "EQ overflow". This could be another bug, or it may give some clue about this particular bug. I'm attaching the xorg_log from today, which doesn't show any "EQ overflow" reference.
Comment 7 Eric Anholt 2008-12-10 09:59:30 UTC
The following kernel commit may fix things.

Also, please include dmesg with bug reports.

commit 52440211dcdc52c0b757f8b34d122e11b12cdd50
Author: Keith Packard <keithp@keithp.com>
Date:   Tue Nov 18 09:30:25 2008 -0800

    drm: move drm vblank initialization/cleanup to driver load/unload
    
    drm vblank initialization keeps track of the changes in driver-supplied
    frame counts across vt switch and mode setting, but only if you let it by
    not tearing down the drm vblank structure.
    
    Signed-off-by: Keith Packard <keithp@keithp.com>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
Comment 8 martin 2008-12-10 14:56:32 UTC
I walked through that commit as specified here:
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff_plain;h=52440211dcdc52c0b757f8b34d122e11b12cdd50;hp=6133047aa64d2fd5b3b79dff74f696ded45615b2

I already have all of those changes in my current Ubuntu jaunty kernel, so this bug was not fixed by that commit. I had another two instances of this bug today with this particular fix included.
Comment 9 martin 2008-12-12 11:36:06 UTC
Created attachment 21100
dmesg (saved from ssh while freeze was still in effect, but the gksu segv probably has nothing to do with it, right?)
Comment 10 martin 2008-12-15 13:50:35 UTC
Some more info on my kernel config:

mnemo@kingfish:~/src/libexif_apt/libexif-0.6.16$ grep MTRR /boot/config-2.6.28-2-generic 
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
mnemo@kingfish:~/src/libexif_apt/libexif-0.6.16$ grep PREEMPT /boot/config-2.6.28-2-generic 
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y

This bug seems very similar to this kernel/DRI bug:
http://bugzilla.kernel.org/show_bug.cgi?id=12166

(I will try to collect the output of /proc/dri/0/i915_gem_interrupt the next time this bug repro's)
Comment 11 Jesse Barnes 2008-12-17 16:23:04 UTC
Yeah, it would be interesting to know if you're still getting interrupts at the point where things fail.  That will tell us where to look in the interrupt handler...
Comment 12 Khashayar Naderehvandi 2008-12-22 16:47:31 UTC
I'm having this extremely annoying issue too.

Often, the backtrace is similar to what's been reported here. Today, I had this backtrace:

Backtrace:
0: /usr/X11R6/bin/X(xorg_backtrace+0x3b) [0x813161b]
1: /usr/X11R6/bin/X(xf86SigHandler+0x55) [0x80cb635]
2: [0xb809a400]
3: /usr/lib/xorg/modules//libexa.so [0xb7b65f23]
4: /usr/lib/xorg/modules//libexa.so [0xb7b675e2]
5: /usr/X11R6/bin/X [0x8178334]
6: /usr/X11R6/bin/X(miPaintWindow+0x231) [0x8110fb1]
7: /usr/X11R6/bin/X(miWindowExposures+0x142) [0x8111322]
8: /usr/lib/xorg/modules/extensions//libdri.so(DRIWindowExposures+0x97) [0xb7b79e17]
9: /usr/X11R6/bin/X [0x80c11af]
10: /usr/X11R6/bin/X(miHandleValidateExposures+0x74) [0x81290f4]
11: /usr/X11R6/bin/X(UnmapWindow+0x1f8) [0x8076e78]
12: /usr/X11R6/bin/X(DeleteWindow+0x36) [0x807a926]
13: /usr/X11R6/bin/X(FreeClientResources+0xe6) [0x80741e6]
14: /usr/X11R6/bin/X(CloseDownClient+0x6f) [0x808690f]
15: /usr/X11R6/bin/X(Dispatch+0x3e8) [0x808c988]
16: /usr/X11R6/bin/X(main+0x47d) [0x8071d6d]
17: /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7c9a685]
18: /usr/X11R6/bin/X [0x8071151]

If this isn't something completely different, then perhaps it can prove useful.

@Jesse: I don't know if this is what you want, but a "cat /proc/dri/0/i915_gem_interrupt" gave me this:

Interrupt enable:    00000053
Interrupt identity:  00000000
Interrupt mask:      fffedfae
Pipe A stat:         00000000
Pipe B stat:         00400206
Interrupts received: 1145771
Current sequence:    1550525
Waiter sequence:     0
IRQ sequence:        1550525

This data was collected after one of these crashes (although not the one with the above backtrace).

FYI, crashes here don't happen during VT switches and I have no fb drivers in use. They always happen (as far as I can tell, anyway) when I use compiz, or if there's an OpenGL window (like a screensaver, Google Earth, or even glxgears running). I'm using a GEM kernel (2.6.28-rc9), and the chip is a G45.

Regards.
Comment 13 Eric Anholt 2008-12-22 18:16:50 UTC
Khashayar, you have a completely different bug (crash, not a hang, and a different backtrace).  Report your own bug if you want anything to happen.
Comment 14 Khashayar Naderehvandi 2008-12-25 12:52:40 UTC
(In reply to comment #13)
> Khashayar, you have a completely different bug (crash, not a hang, and a
> different backtrace).  Report your own bug if you want anything to happen.
> 

Thanks for confirming that. I thought so, but wasn't completely sure. I'll file a new report about that if I see it again.

Just to make it clear, I'm _also_ having the problem posted by the OP. That is, I've also had backtraces looking similar to the one in the first post here. Like this one, for instance:


[mi] EQ overflowing. The server is probably stuck in an infinite loop.

Backtrace:
0: /usr/X11R6/bin/X(xorg_backtrace+0x3b) [0x813161b]
1: /usr/X11R6/bin/X(mieqEnqueue+0x289) [0x8110bf9]
2: /usr/X11R6/bin/X(xf86PostMotionEventP+0xc2) [0x80ce702]
3: /usr/X11R6/bin/X(xf86PostMotionEvent+0x68) [0x80ce868]
4: /usr/lib/xorg/modules/input//synaptics_drv.so [0xa3b95426]
5: /usr/lib/xorg/modules/input//synaptics_drv.so [0xa3b97ae9]
6: /usr/X11R6/bin/X [0x80cb7c7]
7: /usr/X11R6/bin/X [0x80b133c]
8: [0xb7fd6400]
9: /usr/lib/libdrm.so.2(drmWaitVBlank+0x28) [0xb7a96718]
10: /usr/lib/dri/i965_dri.so [0xa7527ffd]
11: /usr/lib/dri/i965_dri.so(driWaitForVBlank+0xfb) [0xa752828b]
12: /usr/lib/dri/i965_dri.so(intelSwapBuffers+0xc7) [0xa7532597]
13: /usr/lib/dri/i965_dri.so [0xa75295a7]
14: /usr/lib/xorg/modules/extensions//libglx.so [0xb7afeb74]
15: /usr/lib/xorg/modules/extensions//libglx.so [0xb7af12ce]
16: /usr/lib/xorg/modules/extensions//libglx.so [0xb7af4c0a]
17: /usr/X11R6/bin/X(Dispatch+0x34f) [0x808c8ef]
18: /usr/X11R6/bin/X(main+0x47d) [0x8071d6d]
19: /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7bd6685]
20: /usr/X11R6/bin/X [0x8071151]
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.

...and so on
Comment 15 Khashayar Naderehvandi 2008-12-25 14:20:41 UTC
I created this nifty little script and let it run at boot time.

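#!/bin/bash
# Poll the X log until the EQ-overflow spam shows up, then save DRI state and reboot.
until [ "$(tail -n 1 /var/log/Xorg.0.log)" == "[mi] mieqEnequeue: out-of-order valuator event; dropping." ]; do
        sleep 30s
done
# -p: don't fail if the directory is left over from an earlier hang
mkdir -p /root/X-bugs
# Glob instead of parsing ls output; one .output file per /proc/dri/0 entry
for f in /proc/dri/0/*; do
        cat "$f" > "/root/X-bugs/$(basename "$f").output"
done
cp /var/log/Xorg.0.log /root/X-bugs/
reboot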


Then, I just waited for this to re-occur. I'll now attach an archive containing all those files. I hope it helps. Let me know if there's anything else I can do.

Comment 16 Khashayar Naderehvandi 2008-12-25 14:21:39 UTC
Created attachment 21477
archive containing Xorg log + the output of everything under /proc/dri/0/
Comment 17 Eric Anholt 2008-12-29 12:13:26 UTC
Please retest with new mesa:
commit 6c01500228014a6cfa133b5dbba8c6d024833e84
Author: Eric Anholt <eric@anholt.net>
Date:   Tue Dec 23 16:08:40 2008 -0800

    dri: Fix driWaitForMSC32 when divisor >= 2 and msc < 0.
Comment 18 Khashayar Naderehvandi 2008-12-29 15:24:49 UTC
Eric,

I compiled mesa 7.2 with that patch applied but after a short while I was hit by this bug again. Are there other post-7.2 commits that I might need? That is, should I give mesa-git a whirl instead of a patched 7.2?

Do you want me to attach the log and output of /proc/dri/0/* this time?
Comment 19 Jesse Barnes 2008-12-30 19:29:43 UTC
Can you check out the patch in 18879?
Comment 20 Khashayar Naderehvandi 2008-12-31 01:11:57 UTC
(In reply to comment #19)
> Can you check out the patch in 18879?
> 

I'll build the drm modules in kernel 2.6.28 with that patch applied and see where that gets me. If you'd rather want me to try the modules from git, let me know. (My stack is basically the latest stable release of everything).
Comment 21 Khashayar Naderehvandi 2008-12-31 03:10:16 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > Can you check out the patch in 18879?
> > 
> 
> I'll build the drm modules in kernel 2.6.28 with that patch applied and see
> where that gets me. If you'd rather want me to try the modules from git, let me
> know. (My stack is basically the latest stable release of everything).
> 

That didn't work. In fact, X hung as soon as GDM started (but it wasn't this particular bug that triggered it). Let me know if you want me to try to catch some logs, but that should be for another bug report, I guess.
Comment 22 Khashayar Naderehvandi 2008-12-31 03:45:02 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #19)
> > > Can you check out the patch in 18879?
> > > 
> > 
> > I'll build the drm modules in kernel 2.6.28 with that patch applied and see
> > where that gets me. If you'd rather want me to try the modules from git, let me
> > know. (My stack is basically the latest stable release of everything).
> > 
> 
> That didn't work. In fact, X hung as soon as GDM started (but it wasn't this
> particular bug that triggered it). Let me know if you want me to try to catch
> some logs, but that should be for another bug report, I guess.
> 

After reading the comments in #18879, I see the patch has been reported not to work with the 2.6.28 drm modules. I'll try drm-intel-next. Will report back here.

Comment 23 Jesse Barnes 2008-12-31 11:31:56 UTC
There's another patch in 18041 that might help too, in case this is a different problem.  Also I just updated the one 18879, so you might try that against the 2.6.28 branch again.
Comment 24 Khashayar Naderehvandi 2009-01-01 06:42:50 UTC
(In reply to comment #23)
> There's another patch in 18041 that might help too, in case this is a different
> problem.  Also I just updated the one 18879, so you might try that against the
> 2.6.28 branch again.
> 

I tried the updated patch. It didn't crash X, but didn't solve the problem either.
I've had some problems with git. Is 'master' the one to go for nowadays, or a branch? Is it possible to use drm-git while the rest of the stack is the latest released versions?

I'll try the libdrm patch later.
Comment 25 Jesse Barnes 2009-01-09 17:20:48 UTC
Yeah two new commits in drm-intel-next and drm-intel-2.6.28 might help:

commit e1a6fcee467556a7e955fe1f7ccc134dd2f974e7
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Tue Jan 6 10:21:24 2009 -0800

    drm/i915: set vblank enabled flag correctly across IRQ install/uninstall

commit 9f4f07ceb1716d8796089fcef91621c5f07c872a
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Thu Jan 8 10:42:15 2009 -0800

    drm/i915: don't enable vblanks on disabled pipes

along with libdrm:
commit f4f76a6894b40abd77f0ffbf52972127608b9bca
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Wed Jan 7 10:18:08 2009 -0800

    libdrm: add timeout handling to drmWaitVBlank

Please confirm and close this out if things look good for you now.
Comment 26 Khashayar Naderehvandi 2009-01-10 01:50:26 UTC
Would that be these patches along with both of the patches you mention in comment #23?
Comment 27 martin 2009-01-10 05:58:48 UTC
Created attachment 21864
xorg_log, dmesg, gdb, uname and 2 snapshots of /proc/dri/0/i915_gem_interrupt

Today my X hung again with the same gdb stack (but no EQ overflow spam in xorg_log). I had fresh jaunty packages installed, which means:

libdrm-intel1 			2.4.1-0ubuntu9
xserver-video-intel 		2.5.1-1ubuntu7
libgl1-mesa-dri                 7.2+git20081209.a0d5c3cf-0ubuntu4
Linux kingfish 2.6.28-4-generic #9-Ubuntu SMP Tue Jan 6 19:33:48 UTC 2009 x86_64 GNU/Linux

This time I sampled /proc/dri/0/i915_gem_interrupt with a couple of seconds in between and I saw that the "Interrupts received" was still being incremented even though X was hung.
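For anyone who wants to repeat that check over ssh, here is a minimal sketch. It assumes the file has the "Interrupts received:" line in the format quoted in comment 12; the helper names `interrupts_received` and `compare_samples` are hypothetical and made up for this example.

```shell
#!/bin/bash
# Hypothetical helpers; the field name comes from the
# /proc/dri/0/i915_gem_interrupt output quoted in comment 12.

# Print the numeric value of the "Interrupts received:" line read from stdin.
interrupts_received() {
    awk -F': *' '/^Interrupts received:/ { print $2 }'
}

# Compare two samples of the counter: prints "advancing" if the second
# sample is larger than the first, otherwise "stalled".
compare_samples() {
    local before=$1 after=$2
    if [ "$after" -gt "$before" ]; then
        echo advancing
    else
        echo stalled
    fi
}

# Usage while X is hung:
#   a=$(interrupts_received < /proc/dri/0/i915_gem_interrupt)
#   sleep 5
#   b=$(interrupts_received < /proc/dri/0/i915_gem_interrupt)
#   compare_samples "$a" "$b"
```

"advancing" only says that the counter is still moving, which matches what I saw; it does not say which interrupt source is firing.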

This bug happens much more rarely now on ubuntu jaunty and I don't have specific repro steps so I can't really test patches effectively.

Is there anything in general I can do inside Ubuntu to stress test "vblanking"? Maybe I can run some special part of x11perf, or use some screensaver or something else that increases the probability that I will hit this bug? Does anyone have any ideas?
Comment 28 Khashayar Naderehvandi 2009-01-10 06:38:50 UTC
@martin: Did you apply the patches Jesse referenced and rebuild the affected packages?

@jesse: I've applied the patches and have had no hang so far, a couple of hours of normal usage. If there's no problem during the next 24 hours, I'd feel safe saying the patches have solved the problem. Expect a comment about that no later than tomorrow about this time.
Comment 29 martin 2009-01-10 07:16:16 UTC
No I don't have Jesse's patches yet. Currently this bug repros like once a week for me so I would like to find a better way to repro this bug before I try patches. I really want this bug gone by Jaunty release in April, but to get drm patches backported I need a solid repro.
Comment 30 Khashayar Naderehvandi 2009-01-11 04:38:12 UTC
Alright, as far as I'm concerned, this bug can be closed. These patches solve the issue for me. 

Thank you very much!
Comment 31 Jesse Barnes 2009-01-11 11:37:00 UTC
Thanks for confirming!
Comment 32 Chris Miller 2009-01-13 19:07:07 UTC
I've been seeing a similar problem under Ubuntu Intrepid x86_64 (kernel 2.6.27-9-generic)

I don't get a complete X crash, but firefox hangs, and then my mouse stops responding to button events.  Keyboard shortcuts work, mouse movement occurs, but mouse clicks don't register.   

Full Xorg log attached, but the relevant line seems to be the same as above problems:

[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
[mi] EQ overflowing. The server is probably stuck in an infinite loop.
[mi] mieqEnequeue: out-of-order valuator event; dropping.
. . .
(repeated ad nauseam)


Can someone point me to the patches referenced above, (and perhaps some compilation instructions) so that I can test the fix?  Thanks.
Comment 33 Chris Miller 2009-01-13 19:08:54 UTC
Created attachment 21962
Xorg log from infinite loop causing mouse non-responsiveness
