Bug 16664

Summary: start-and-kill glxgears repeatedly freezes X
Product: xorg
Reporter: Zdenek Kabelac <zdenek.kabelac>
Component: Driver/intel
Assignee: haihao <haihao.xiang>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
Version: unspecified
Hardware: Other
OS: All
Attachments:
- Log from boot to GPU crash with udelayed driver
- Xorg log

Description Zdenek Kabelac 2008-07-10 11:59:52 UTC
A 3D application seems to freeze the X server, as can be seen in the gdb backtrace below.

The X server stops and cannot be killed normally; after killing it with -9, it is not possible to restart it without rebooting my T61 laptop with Intel graphics.
(vbetool post doesn't help, as may be seen in my other report.)

#0  0x00000038452e0b57 in ioctl () from /lib64/libc.so.6
#1  0x000000385680349d in drmCommandWrite () from /usr/lib64/libdrm.so.2
#2  0x00007fd1436b4297 in intelWaitIrq (intel=0xba41e0, seq=12900) at intel_ioctl.c:93
#3  0x00007fd1436b04fe in intel_fence_wait (private=0xd, cookie=1074029637) at intel_context.c:440
#4  0x00007fd1436abda2 in _fence_wait_internal (bufmgr_fake=0xbe09d0, cookie=1074029637)
    at ../common/dri_bufmgr_fake.c:215
#5  0x00007fd1436b03c3 in intelFinish (ctx=0xba41e0) at intel_context.c:381
#6  0x00007fd1436b9a00 in intelSpanRenderStart (ctx=0xd) at intel_span.c:313
#7  0x00007fd1433cc2be in _swrast_ReadPixels (ctx=0xba41e0, x=0, y=188, width=512, height=512, format=6406,
    type=5121, packing=0x40046445, pixels=0x5b150d0) at swrast/s_readpix.c:562
#8  0x00007fd14330fd6e in _mesa_ReadPixels (x=0, y=188, width=512, height=512, format=6406, type=5121,
    pixels=0x5b150d0) at main/drawpix.c:336
#9  0x00007fd155a6a810 in __glXDisp_ReadPixels (cl=<value optimized out>, pc=0x8c82a04 "\233o\t") at singlepix.c:82
#10 0x00007fd155a64af2 in __glXDispatch (client=0xb8b0c0) at glxext.c:492
#11 0x00000000004467d4 in Dispatch () at dispatch.c:454
#12 0x000000000042cc4d in main (argc=3, argv=0x7fff5e515328, envp=<value optimized out>) at main.c:441


Let me know if any more details are needed.

I'm using Fedora Rawhide:
mesa-libOSMesa-7.1-0.37.fc10.x86_64
xorg-x11-server-Xorg-1.4.99.905-2.20080701.fc10.x86_64
Comment 1 Michael Fu 2008-07-10 22:22:06 UTC
What 3D app(s) did you use to reproduce this kind of deadlock?
Comment 2 Zdenek Kabelac 2008-07-11 00:40:35 UTC
Linden's SL viewer. Try the patched version from http://sldev.free.fr/; of course you will need to register an avatar for yourself (it's free).

It's a 32-bit application running in a 64-bit environment, if that could make a difference. BTW, it is also very slow... but I'd like to see a stable version first :)  So far I have not tried to build myself a 64-bit version.
The viewer is open source.

Comment 3 Gordon Jin 2008-07-11 21:17:52 UTC
Please attach your full Xorg.0.log. So, as far as you've found, the SL viewer is the only app triggering this?
Comment 4 Zdenek Kabelac 2008-07-12 02:28:28 UTC
Check the attachments from bug #16648:

http://bugs.freedesktop.org/attachment.cgi?id=17601

I do not have any other 3D app on my disk except glxgears ;) so I can't tell. But as I went through the bug list, I noticed there are at least 5 other people who are now most probably experiencing the same bug with their Intel driver (I've added comments to the bugs I suspect; just check the list for my recent posts).

I think I have fairly good debugging skills, so if you tell me which debug traces I should provide, there is a fairly high chance you will get them :)

I should add that with exactly the same 64-bit kernel (2.6.26-rc9) I can run 32-bit Debian unstable on this machine (last updated in January, with Xorg 7.3), and it runs quite fast and without any deadlock in SL.

So the problem is not in the kernel, and not in the HW.

I could probably iterate slowly, swapping libraries to see where it breaks, but I assume checking the source code for deadlock conditions is a better way.
Comment 5 Zdenek Kabelac 2008-07-18 06:45:50 UTC
What is the progress on this bug?

I think you could safely rename it to 'glxgears freezes X', as I'm experiencing the exact same issue with glxgears, though it takes a few runs of glxgears and X server restarts, I guess.

Comment 6 haihao 2008-07-18 07:39:48 UTC
Could you detail how to reproduce it with glxgears?
Comment 7 Zdenek Kabelac 2008-07-18 08:56:36 UTC
Here is a brief howto:

From one console I start:   Xorg :0 -ac
From another terminal on that machine I run:

export DISPLAY=:0
xterm &
gdb /usr/bin/glxgears
  run

and keep interrupting, exiting and restarting glxgears until it fails.
Sometimes I had to restart Xorg, or eventually reboot the whole machine,
if it stayed stable for too long ;)

But after some time it happens: glxgears doesn't show a running window,
and I can then break into it in gdb inside the ioctl call.
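For what it's worth, the manual start-and-kill cycle above can be scripted. This is only a sketch; the app name, iteration count and delay are placeholder defaults, not values from the report:

```shell
#!/bin/sh
# Sketch: repeatedly start and hard-kill a 3D client until the server wedges.
# APP, ITERATIONS and SLEEP are illustrative placeholders.
APP=${APP:-glxgears}
ITERATIONS=${ITERATIONS:-3}
SLEEP=${SLEEP:-1}

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    $APP &                      # start the 3D client in the background
    pid=$!
    sleep "$SLEEP"              # let it render for a while
    kill -9 "$pid" 2>/dev/null  # kill it hard, as in the manual repro
    wait "$pid" 2>/dev/null
    i=$((i + 1))
done
echo "completed $i start/kill cycles"
```

Run it with DISPLAY pointing at the test server started above, and with a much larger ITERATIONS value for an actual repro attempt.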

I have now tested the drm debug=1 flag. The problem was that logging to the messages file was way too slow to actually catch anything, and the whole output was a complete mess anyway.

Storing dmesg periodically was no use either; simply way too many messages were flowing in.

So I made the following change in the drm/drm_drv.c driver:

--- drm_drv.c	2008-07-18 17:29:40.000000000 +0200
+++ drm_drv.c.test	2008-07-18 17:48:38.000000000 +0200
@@ -515,6 +515,7 @@
 	}
 
       err_i1:
+        udelay(10000);
 	if (kdata)
 		kfree(kdata);
 	atomic_dec(&dev->ioctl_count);



This makes 3D very slow (gears drops from 630 fps to 4 fps), but the deadlock happened even in this 'slow' case. I'll attach the log to this bugzilla; it's not too long, <1 MB (full boot).

You may notice a couple of glxgears restarts, visible from the drm device being reopened.

I'm not 100% sure it's exactly the same problem that led to opening this bugzilla originally, but it behaves similarly, i.e. the machine has to be restarted to get 3D operations back.

I hope this helps somehow :) as it took me quite some time...

I'll also attach the Xorg log (hopefully from the same run).
Comment 8 Zdenek Kabelac 2008-07-18 08:58:25 UTC
Created attachment 17748
Log from boot to GPU crash with udelayed driver
Comment 9 Zdenek Kabelac 2008-07-18 08:59:07 UTC
Created attachment 17749
Xorg log
Comment 10 haihao 2008-07-21 01:04:52 UTC
(In reply to comment #8)
> Created an attachment (id=17748)
> Log from boot to GPU crash with udelayed driver

Do you mean a GPU crash? It seems X exits normally according to your Xorg log attached in comment #9.

Comment 11 Zdenek Kabelac 2008-07-28 07:40:53 UTC
Yes, as I said, the GPU gets stuck. This freeze is slightly different from the one I'm experiencing with SL: with this freeze, 2D Xorg is still fully operational, so I can open a terminal and use Firefox; I just cannot open any 3D application. With the SL crash the whole Xorg is frozen and only the mouse can move, but that might be the result of badly kept focus on the frozen SL application; when I kill SL it usually leads to a whole Xorg crash because of some bad deallocation of the drm context (as can be seen in my other bugzilla).

Anyway, I hope to be able to try this patch soon:
http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=commitdiff;h=97988ccc46c0248177cd71937021ca8cc2a7452b

It looks like this could be the source of my troubles.
I'll add more info once I'm able to try Mesa with this patch.
Comment 12 haihao 2008-08-04 20:30:56 UTC
Does glxgears work well for you with the latest mesa?
Comment 13 Zdenek Kabelac 2008-08-05 14:28:00 UTC
Unfortunately mesa from git doesn't work for me.

I compiled drm & mesa and inserted the drm & i915 modules.
For a regular user Mesa gives me just ~280 FPS
(most probably because the kernel is kept busy logging debug traces).
For the root user it prints ~900 FPS, but with:
Failed to initialize TTM buffer manager.  Falling back to classic.


My kernel prints a debug trace about wrong unlocking:
 SPIN IRQ ALREADY ENABLED
 Pid: 0, comm: swapper Tainted: G        W 2.6.27-rc1 #56
 
 Call Trace:
  <IRQ>  [<ffffffff81315dfd>] _spin_unlock_irqrestore+0x7d/0x90
  [<ffffffffa026449d>] drm_locked_tasklet_func+0x8d/0xb0 [drm]
  [<ffffffff810450dc>] tasklet_hi_action+0xbc/0x160
  [<ffffffff81045c24>] __do_softirq+0x74/0xf0
  [<ffffffff8100d9cc>] call_softirq+0x1c/0x30
  [<ffffffff8100efb5>] do_softirq+0x75/0xb0
  [<ffffffff810456d5>] irq_exit+0xa5/0xb0
  [<ffffffff8100f27c>] do_IRQ+0xac/0x130
  [<ffffffff8106646d>] ? trace_hardirqs_off+0xd/0x10
  [<ffffffff8100cbc6>] ret_from_intr+0x0/0xf
  <EOI>  [<ffffffff8106646d>] ? trace_hardirqs_off+0xd/0x10
  [<ffffffff811e8c1c>] ? acpi_idle_enter_bm+0x2ce/0x35d
  [<ffffffff810677dd>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff811e8c26>] ? acpi_idle_enter_bm+0x2d8/0x35d
  [<ffffffff811e8c1c>] ? acpi_idle_enter_bm+0x2ce/0x35d
  [<ffffffff8127c7b0>] ? menu_select+0xd0/0xf0
  [<ffffffff8127b519>] ? cpuidle_idle_call+0xb9/0x100
  [<ffffffff8100b4a6>] ? cpu_idle+0x66/0x100
  [<ffffffff8130e30e>] ? start_secondary+0x134/0x196


So, sorry, I'm unable to verify the functionality of the patch yet.
I haven't yet convinced the rpm maintainers to release updated mesa libs
with this patch...


BTW
[drm] Initialized drm 1.1.0 20060810
[drm] Initialized i915 1.13.0 20080312 on minor 0

(Is the version 1.1.0 from 20060810 correct?)


And here is another, longer trace
(split into the appropriate bugs).

Let me know once these issues are resolved; so far, sorry, but I consider the git version of mesa unusable on my T61...

My kernel is today's 2.6.27-rc1 git compilation with my home-made patch for checking spin-lock restoration. It shows that in your drm_locked_tasklet_func you are actually restoring an already-enabled IRQ state, which is definitely wrong...


[drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0

======================================================
[ INFO: hard-safe -> hard-unsafe lock order detected ]
2.6.27-rc1 #57
------------------------------------------------------
swapper/0 [HC0[0]:SC1[2]:HE0:SE0] is trying to acquire:
 (&dev->lock.spinlock){-+..}, at: [<ffffffffa0364baa>] drm_lock_take+0x2a/0xf0 [drm]

and this task is already holding:
 (&dev->tasklet_lock){++..}, at: [<ffffffffa03638ad>] drm_locked_tasklet_func+0x2d/0xb0 [drm]
which would create a new lock dependency:
 (&dev->tasklet_lock){++..} -> (&dev->lock.spinlock){-+..}

but this new dependency connects a hard-irq-safe lock:
 (&dev->tasklet_lock){++..}
... which became hard-irq-safe at:
  [<ffffffffffffffff>] 0xffffffffffffffff

to a hard-irq-unsafe lock:
 (&dev->lock.spinlock){-+..}
... which became hard-irq-unsafe at:
...  [<ffffffff81069c6f>] __lock_acquire+0x40f/0xd10
  [<ffffffff8106a608>] lock_acquire+0x98/0xe0
  [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
  [<ffffffffa0364e76>] drm_lock+0xa6/0x380 [drm]
  [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
  [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
  [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
  [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
  [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
  [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
  [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

1 lock held by swapper/0:
 #0:  (&dev->tasklet_lock){++..}, at: [<ffffffffa03638ad>] drm_locked_tasklet_func+0x2d/0xb0 [drm]

the hard-irq-safe lock's dependencies:
-> (&dev->tasklet_lock){++..} ops: 2423 {
   initial-use  at:
                                       [<ffffffff81069989>] __lock_acquire+0x129/0xd10
                                       [<ffffffff8106a608>] lock_acquire+0x98/0xe0
                                       [<ffffffff813151c7>] _spin_lock_irqsave+0x47/0x80
                                       [<ffffffffa0364d2a>] drm_unlock+0x3a/0xe0 [drm]
                                       [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
                                       [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
                                       [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
                                       [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
                                       [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
                                       [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
                                       [<ffffffffffffffff>] 0xffffffffffffffff
   in-hardirq-W at:
                                       [<ffffffffffffffff>] 0xffffffffffffffff
   in-softirq-W at:
                                       [<ffffffffffffffff>] 0xffffffffffffffff
 }
 ... key      at: [<ffffffffa038b1b0>] __key.25146+0x0/0xfffffffffffea773 [drm]

the hard-irq-unsafe lock's dependencies:
-> (&dev->lock.spinlock){-+..} ops: 34303 {
   initial-use  at:
                                       [<ffffffff81069989>] __lock_acquire+0x129/0xd10
                                       [<ffffffff8106a608>] lock_acquire+0x98/0xe0
                                       [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
                                       [<ffffffffa0364e76>] drm_lock+0xa6/0x380 [drm]
                                       [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
                                       [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
                                       [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
                                       [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
                                       [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
                                       [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
                                       [<ffffffffffffffff>] 0xffffffffffffffff
   in-softirq-W at:
                                       [<ffffffffffffffff>] 0xffffffffffffffff
   hardirq-on-W at:
                                       [<ffffffff81069c6f>] __lock_acquire+0x40f/0xd10
                                       [<ffffffff8106a608>] lock_acquire+0x98/0xe0
                                       [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
                                       [<ffffffffa0364e76>] drm_lock+0xa6/0x380 [drm]
                                       [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
                                       [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
                                       [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
                                       [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
                                       [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
                                       [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
                                       [<ffffffffffffffff>] 0xffffffffffffffff
 }
 ... key      at: [<ffffffffa038b1a8>] __key.25147+0x0/0xfffffffffffea77b [drm]

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.27-rc1 #57

Call Trace:
 <IRQ>  [<ffffffff8106861f>] check_usage+0x29f/0x2b0
 [<ffffffff8106a35b>] __lock_acquire+0xafb/0xd10
 [<ffffffffa0364baa>] ? drm_lock_take+0x2a/0xf0 [drm]
 [<ffffffff8106a608>] lock_acquire+0x98/0xe0
 [<ffffffffa0364baa>] ? drm_lock_take+0x2a/0xf0 [drm]
 [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
 [<ffffffffa0364baa>] drm_lock_take+0x2a/0xf0 [drm]
 [<ffffffffa03638cb>] drm_locked_tasklet_func+0x4b/0xb0 [drm]
 [<ffffffff8104616c>] tasklet_hi_action+0xbc/0x160
 [<ffffffff81046c84>] __do_softirq+0x74/0xf0
 [<ffffffff8100dafc>] call_softirq+0x1c/0x30
 [<ffffffff8100f1e5>] do_softirq+0x75/0xb0
 [<ffffffff81046415>] irq_exit+0xa5/0xb0
 [<ffffffff8100f4df>] do_IRQ+0xdf/0x1e0
 [<ffffffff8100ccf6>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff811e8e8f>] ? acpi_idle_enter_bm+0x27f/0x2c7
 [<ffffffff811e8e85>] ? acpi_idle_enter_bm+0x275/0x2c7
 [<ffffffff8127cd20>] ? menu_select+0xd0/0xf0
 [<ffffffff8127ba89>] ? cpuidle_idle_call+0xb9/0x100
 [<ffffffff8100ac56>] ? cpu_idle+0x66/0x100
 [<ffffffff81301420>] ? rest_init+0x70/0x80

------------[ cut here ]------------
WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0xc1/0x110()
Modules linked in: i915 drm bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipt_REJECT xt_tcpudp xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables binfmt_misc dm_mirror dm_log dm_mod uinput snd_hda_intel snd_seq_oss mmc_block snd_seq_midi_event snd_seq arc4 video ecb snd_seq_device snd_pcm_oss crypto_blkcipher thinkpad_acpi led_class snd_mixer_oss snd_pcm cryptomgr rtc_cmos sdhci_pci sdhci mmc_core backlight rtc_core crypto_algapi nvram iwl3945 rfkill mac80211 sr_mod cdrom i2c_i801 iTCO_wdt output snd_timer psmouse serio_raw i2c_core ac evdev cfg80211 iTCO_vendor_support e1000e snd soundcore snd_page_alloc intel_agp battery rtc_lib button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.27-rc1 #57

Call Trace:
 <IRQ>  [<ffffffff810409ff>] warn_on_slowpath+0x5f/0x90
 [<ffffffff81068655>] ? print_lock_contention_bug+0x25/0x120
 [<ffffffff8106a35b>] ? __lock_acquire+0xafb/0xd10
 [<ffffffffa0364c09>] ? drm_lock_take+0x89/0xf0 [drm]
 [<ffffffff81047031>] local_bh_enable_ip+0xc1/0x110
 [<ffffffff81314f19>] _spin_unlock_bh+0x39/0x40
 [<ffffffffa0364c09>] drm_lock_take+0x89/0xf0 [drm]
 [<ffffffffa03638cb>] drm_locked_tasklet_func+0x4b/0xb0 [drm]
 [<ffffffff8104616c>] tasklet_hi_action+0xbc/0x160
 [<ffffffff81046c84>] __do_softirq+0x74/0xf0
 [<ffffffff8100dafc>] call_softirq+0x1c/0x30
 [<ffffffff8100f1e5>] do_softirq+0x75/0xb0
 [<ffffffff81046415>] irq_exit+0xa5/0xb0
 [<ffffffff8100f4df>] do_IRQ+0xdf/0x1e0
 [<ffffffff8100ccf6>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff811e8e8f>] ? acpi_idle_enter_bm+0x27f/0x2c7
 [<ffffffff811e8e85>] ? acpi_idle_enter_bm+0x275/0x2c7
 [<ffffffff8127cd20>] ? menu_select+0xd0/0xf0
 [<ffffffff8127ba89>] ? cpuidle_idle_call+0xb9/0x100
 [<ffffffff8100ac56>] ? cpu_idle+0x66/0x100
 [<ffffffff81301420>] ? rest_init+0x70/0x80

Comment 14 Zdenek Kabelac 2008-08-20 01:21:57 UTC
Any progress with this bug? I've noticed another bug which is probably the same: bug #15541.

Maybe I should go through the list and merge them all together?

Is there any Intel developer actually interested in solving this bug with the EXA driver, or is the UXA/GEM driver now the only way to go, and this bug will not be fixed in the current XAA/EXA driver?
Comment 15 haihao 2008-12-09 18:14:07 UTC
Does this issue still exist for you with current bits (kernel, X server, mesa and 2D driver)?
Comment 16 haihao 2008-12-18 18:58:50 UTC
I ran the following script on 965GM with current code and couldn't reproduce it.

#! /bin/sh

for ii in `seq 1000`
do 
    glxgears &
    sleep 10
    killall -9 glxgears
done

I've marked it as fixed. If you can reproduce it with current code, feel free to reopen it.
