| Summary: | start-and-kill glxgears repeatedly freezes X |
| --- | --- |
| Product: | xorg |
| Component: | Driver/intel |
| Status: | RESOLVED FIXED |
| Severity: | normal |
| Priority: | medium |
| Version: | unspecified |
| Hardware: | Other |
| OS: | All |
| Reporter: | Zdenek Kabelac <zdenek.kabelac> |
| Assignee: | haihao <haihao.xiang> |
| QA Contact: | Xorg Project Team <xorg-team> |
Description
Zdenek Kabelac
2008-07-10 11:59:52 UTC
What 3D app(s) did you use to reproduce this kind of deadlock?

Linden's SL viewer. Try the patched version from this site: http://sldev.free.fr/ (of course you will need to register an avatar for yourself, which is free). It is a 32-bit application running in a 64-bit environment, if that could make a difference. BTW, it is also very slow, but I'd like to see a stable version first :) So far I have not tried to build a 64-bit version myself. The viewer is open source.

Please attach your full Xorg.0.log. So the SL viewer is the only app triggering this, as far as you have found?

Check the attachments from bug #16648: http://bugs.freedesktop.org/attachment.cgi?id=17601. I do not have any other 3D app on my disk except glxgears ;) so I can't tell. But as I went through the bug list, I noticed there are at least five other people who are now most probably experiencing the same bug with their intel driver (I have added comments to those bugs; just check the list for my recent posts to see which ones I suspect). I think I have reasonably good debugging skills, so if you tell me which debug traces I should collect, there is a fairly high chance you will get them :)

I should add that with exactly the same 64-bit kernel 2.6.26-rc9 I can run 32-bit Debian unstable on my machine (last updated in January, with Xorg 7.3), and SL seems to run quite fast there without any deadlock. So the problem is not in the kernel, and not in the hardware. I could slowly iterate through library versions to see where it gets broken, but I assume that checking the source code for deadlock conditions is a better way.

What is the progress with this bug? I think you could safely rename it to 'glxgears freezes X', as I am experiencing the exact same issue with glxgears, though it takes a few runs of glxgears and X server restarts, I guess.

Could you detail how to reproduce it with glxgears?
Here is a brief howto. From a second machine I start:

    Xorg :0 -ac

From another terminal on that machine I run:

    export DISPLAY=:0
    xterm &
    gdb /usr/bin/glxgears

then `run` in gdb, breaking/exiting and restarting glxgears again until it fails. Sometimes I had to restart Xorg, or eventually reboot the whole machine if it stayed stable for too long ;) But after some time it happens: glxgears no longer shows a running window, and I can then break it in gdb inside the ioctl call.

I have now tested the drm debug=1 flag. The problem was that logging to the system messages was far too slow to actually catch anything, and the whole output was a complete mess anyway. Trying to store dmesg periodically was no use either; simply way too many messages were flowing in. So I made the following change in the drm/drm_drv.c driver:

    --- drm_drv.c	2008-07-18 17:29:40.000000000 +0200
    +++ drm_drv.c.test	2008-07-18 17:48:38.000000000 +0200
    @@ -515,6 +515,7 @@
     	}
     err_i1:
    +	udelay(10000);
     	if (kdata)
     		kfree(kdata);
     	atomic_dec(&dev->ioctl_count);

This makes 3D very slow (gears drops from 630 fps to 4 fps), but the deadlock happened even in this slow case. I will attach the log to this bugzilla; it is not too long, under 1 MB for a full boot. You might notice a couple of glxgears restarts, visible as reopenings of the drm device. I am not 100% sure it is exactly the same problem that led to the initial bugzilla entry, but it behaves similarly, i.e. the machine has to be restarted to get 3D operations back. I hope this helps somehow :) as it took me quite some time... I will also attach the Xorg log (hopefully from the same run).

Created attachment 17748 [details]
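The manual start/break/restart cycle described above can also be automated. This is a minimal sketch under the same assumptions (a test server on display :0, glxgears in the PATH); the iteration count and delays are arbitrary choices, not taken from the report.

```shell
#!/bin/sh
# Automated version of the manual repro: start glxgears against the
# test server, let it run briefly, kill it, and repeat until the GPU
# wedges. DISPLAY, the cycle count, and the delays are assumptions.
export DISPLAY=:0

start_kill_cycle() {
    # $1 = client command, $2 = seconds to let it run before killing it
    "$1" >/dev/null 2>&1 &
    pid=$!
    sleep "$2"
    kill -9 "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
    echo "$pid"    # report which pid was cycled, for the log
}

# Only loop when glxgears is actually installed.
if command -v glxgears >/dev/null 2>&1; then
    for ii in $(seq 1 100); do
        start_kill_cycle glxgears 2 >/dev/null
    done
fi
```

Once glxgears stops bringing up its window, the loop keeps spawning clients that hang in the DRM ioctl, which matches the failure mode described above.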
Log from boot to GPU crash with udelayed driver
Created attachment 17749 [details]
Xorg log
(In reply to comment #8)
> Created an attachment (id=17748) [details]
> Log from boot to GPU crash with udelayed driver

Do you mean a GPU crash? It seems X exits normally according to your Xorg log attached in comment #9.

Yes, as I said, the GPU gets stuck. This freeze is actually slightly different from the one I am experiencing with SL: with these freezes 2D Xorg is still fully operational, so I can open a terminal and use firefox, I just cannot start any 3D application. With the SL crash the whole Xorg is frozen and only the mouse can move, but that might be the result of badly kept focus of the frozen SL application; when I kill SL it usually brings the whole Xorg down because of a badly deallocated drm context (as can be seen in my other bugzilla). Anyway, I hope I will soon be able to try this patch:

http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=commitdiff;h=97988ccc46c0248177cd71937021ca8cc2a7452b

It looks like this could be the source of my troubles. I will add more info once I am able to try Mesa with this patch.

Does glxgears work well for you with the latest mesa?

Unfortunately mesa from git does not work for me. I compiled drm & mesa and inserted the drm & i915 modules. Mesa for a regular user gives me just ~280 FPS (most probably because the kernel is kept busy logging debug traces). For the root user it prints ~900 FPS, but with:

    Failed to initialize TTM buffer manager. Falling back to classic.

My kernel prints a debug trace about wrong unlocking:

    SPIN IRQ ALREADY ENABLED
    Pid: 0, comm: swapper Tainted: G W 2.6.27-rc1 #56
    Call Trace:
     <IRQ> [<ffffffff81315dfd>] _spin_unlock_irqrestore+0x7d/0x90
     [<ffffffffa026449d>] drm_locked_tasklet_func+0x8d/0xb0 [drm]
     [<ffffffff810450dc>] tasklet_hi_action+0xbc/0x160
     [<ffffffff81045c24>] __do_softirq+0x74/0xf0
     [<ffffffff8100d9cc>] call_softirq+0x1c/0x30
     [<ffffffff8100efb5>] do_softirq+0x75/0xb0
     [<ffffffff810456d5>] irq_exit+0xa5/0xb0
     [<ffffffff8100f27c>] do_IRQ+0xac/0x130
     [<ffffffff8106646d>] ? trace_hardirqs_off+0xd/0x10
     [<ffffffff8100cbc6>] ret_from_intr+0x0/0xf
     <EOI> [<ffffffff8106646d>] ? trace_hardirqs_off+0xd/0x10
     [<ffffffff811e8c1c>] ? acpi_idle_enter_bm+0x2ce/0x35d
     [<ffffffff810677dd>] ? trace_hardirqs_on+0xd/0x10
     [<ffffffff811e8c26>] ? acpi_idle_enter_bm+0x2d8/0x35d
     [<ffffffff811e8c1c>] ? acpi_idle_enter_bm+0x2ce/0x35d
     [<ffffffff8127c7b0>] ? menu_select+0xd0/0xf0
     [<ffffffff8127b519>] ? cpuidle_idle_call+0xb9/0x100
     [<ffffffff8100b4a6>] ? cpu_idle+0x66/0x100
     [<ffffffff8130e30e>] ? start_secondary+0x134/0x196

so I am sorry, I am unable to check the functionality of the patch yet. I have not yet convinced the rpm maintainers to release updated mesa libs with this patch...

BTW:

    [drm] Initialized drm 1.1.0 20060810
    [drm] Initialized i915 1.13.0 20080312 on minor 0

(is the version 1.1.0 from 20060810 correct?)

And here is another, longer trace (split into the appropriate bugs). Let me know once these issues are resolved; so far, I am sorry, but I consider the git version of mesa unusable on my T61. My kernel is today's 2.6.27-rc1 git compilation with my home-made patch for checking spin-lock restoration; it shows that in your drm_locked_tasklet_func you are actually restoring an already enabled IRQ, which is definitely wrong.

    [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0
    ======================================================
    [ INFO: hard-safe -> hard-unsafe lock order detected ]
    2.6.27-rc1 #57
    ------------------------------------------------------
    swapper/0 [HC0[0]:SC1[2]:HE0:SE0] is trying to acquire:
     (&dev->lock.spinlock){-+..}, at: [<ffffffffa0364baa>] drm_lock_take+0x2a/0xf0 [drm]
    and this task is already holding:
     (&dev->tasklet_lock){++..}, at: [<ffffffffa03638ad>] drm_locked_tasklet_func+0x2d/0xb0 [drm]
    which would create a new lock dependency:
     (&dev->tasklet_lock){++..} -> (&dev->lock.spinlock){-+..}
    but this new dependency connects a hard-irq-safe lock:
     (&dev->tasklet_lock){++..}
    ... which became hard-irq-safe at:
      [<ffffffffffffffff>] 0xffffffffffffffff
    to a hard-irq-unsafe lock:
     (&dev->lock.spinlock){-+..}
    ... which became hard-irq-unsafe at:
      [<ffffffff81069c6f>] __lock_acquire+0x40f/0xd10
      [<ffffffff8106a608>] lock_acquire+0x98/0xe0
      [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
      [<ffffffffa0364e76>] drm_lock+0xa6/0x380 [drm]
      [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
      [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
      [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
      [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
      [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
      [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff

    other info that might help us debug this:

    1 lock held by swapper/0:
     #0: (&dev->tasklet_lock){++..}, at: [<ffffffffa03638ad>] drm_locked_tasklet_func+0x2d/0xb0 [drm]

    the hard-irq-safe lock's dependencies:
    -> (&dev->tasklet_lock){++..} ops: 2423 {
       initial-use at:
        [<ffffffff81069989>] __lock_acquire+0x129/0xd10
        [<ffffffff8106a608>] lock_acquire+0x98/0xe0
        [<ffffffff813151c7>] _spin_lock_irqsave+0x47/0x80
        [<ffffffffa0364d2a>] drm_unlock+0x3a/0xe0 [drm]
        [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
        [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
        [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
        [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
        [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
        [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
        [<ffffffffffffffff>] 0xffffffffffffffff
       in-hardirq-W at:
        [<ffffffffffffffff>] 0xffffffffffffffff
       in-softirq-W at:
        [<ffffffffffffffff>] 0xffffffffffffffff
     }
     ... key at: [<ffffffffa038b1b0>] __key.25146+0x0/0xfffffffffffea773 [drm]

    the hard-irq-unsafe lock's dependencies:
    -> (&dev->lock.spinlock){-+..} ops: 34303 {
       initial-use at:
        [<ffffffff81069989>] __lock_acquire+0x129/0xd10
        [<ffffffff8106a608>] lock_acquire+0x98/0xe0
        [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
        [<ffffffffa0364e76>] drm_lock+0xa6/0x380 [drm]
        [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
        [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
        [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
        [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
        [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
        [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
        [<ffffffffffffffff>] 0xffffffffffffffff
       in-softirq-W at:
        [<ffffffffffffffff>] 0xffffffffffffffff
       hardirq-on-W at:
        [<ffffffff81069c6f>] __lock_acquire+0x40f/0xd10
        [<ffffffff8106a608>] lock_acquire+0x98/0xe0
        [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
        [<ffffffffa0364e76>] drm_lock+0xa6/0x380 [drm]
        [<ffffffffa0361894>] drm_unlocked_ioctl+0x214/0x380 [drm]
        [<ffffffffa0361a11>] drm_ioctl+0x11/0x20 [drm]
        [<ffffffff810dd51d>] vfs_ioctl+0x7d/0xa0
        [<ffffffff810dd7cb>] do_vfs_ioctl+0x28b/0x2f0
        [<ffffffff810dd8c9>] sys_ioctl+0x99/0xa0
        [<ffffffff8100c6eb>] system_call_fastpath+0x16/0x1b
        [<ffffffffffffffff>] 0xffffffffffffffff
     }
     ... key at: [<ffffffffa038b1a8>] __key.25147+0x0/0xfffffffffffea77b [drm]

    stack backtrace:
    Pid: 0, comm: swapper Not tainted 2.6.27-rc1 #57
    Call Trace:
     <IRQ> [<ffffffff8106861f>] check_usage+0x29f/0x2b0
     [<ffffffff8106a35b>] __lock_acquire+0xafb/0xd10
     [<ffffffffa0364baa>] ? drm_lock_take+0x2a/0xf0 [drm]
     [<ffffffff8106a608>] lock_acquire+0x98/0xe0
     [<ffffffffa0364baa>] ? drm_lock_take+0x2a/0xf0 [drm]
     [<ffffffff813150c6>] _spin_lock_bh+0x36/0x70
     [<ffffffffa0364baa>] drm_lock_take+0x2a/0xf0 [drm]
     [<ffffffffa03638cb>] drm_locked_tasklet_func+0x4b/0xb0 [drm]
     [<ffffffff8104616c>] tasklet_hi_action+0xbc/0x160
     [<ffffffff81046c84>] __do_softirq+0x74/0xf0
     [<ffffffff8100dafc>] call_softirq+0x1c/0x30
     [<ffffffff8100f1e5>] do_softirq+0x75/0xb0
     [<ffffffff81046415>] irq_exit+0xa5/0xb0
     [<ffffffff8100f4df>] do_IRQ+0xdf/0x1e0
     [<ffffffff8100ccf6>] ret_from_intr+0x0/0xf
     <EOI> [<ffffffff811e8e8f>] ? acpi_idle_enter_bm+0x27f/0x2c7
     [<ffffffff811e8e85>] ? acpi_idle_enter_bm+0x275/0x2c7
     [<ffffffff8127cd20>] ? menu_select+0xd0/0xf0
     [<ffffffff8127ba89>] ? cpuidle_idle_call+0xb9/0x100
     [<ffffffff8100ac56>] ? cpu_idle+0x66/0x100
     [<ffffffff81301420>] ? rest_init+0x70/0x80
    ------------[ cut here ]------------
    WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0xc1/0x110()
    Modules linked in: i915 drm bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipt_REJECT xt_tcpudp xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables binfmt_misc dm_mirror dm_log dm_mod uinput snd_hda_intel snd_seq_oss mmc_block snd_seq_midi_event snd_seq arc4 video ecb snd_seq_device snd_pcm_oss crypto_blkcipher thinkpad_acpi led_class snd_mixer_oss snd_pcm cryptomgr rtc_cmos sdhci_pci sdhci mmc_core backlight rtc_core crypto_algapi nvram iwl3945 rfkill mac80211 sr_mod cdrom i2c_i801 iTCO_wdt output snd_timer psmouse serio_raw i2c_core ac evdev cfg80211 iTCO_vendor_support e1000e snd soundcore snd_page_alloc intel_agp battery rtc_lib button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: microcode]
    Pid: 0, comm: swapper Not tainted 2.6.27-rc1 #57
    Call Trace:
     <IRQ> [<ffffffff810409ff>] warn_on_slowpath+0x5f/0x90
     [<ffffffff81068655>] ? print_lock_contention_bug+0x25/0x120
     [<ffffffff8106a35b>] ? __lock_acquire+0xafb/0xd10
     [<ffffffffa0364c09>] ? drm_lock_take+0x89/0xf0 [drm]
     [<ffffffff81047031>] local_bh_enable_ip+0xc1/0x110
     [<ffffffff81314f19>] _spin_unlock_bh+0x39/0x40
     [<ffffffffa0364c09>] drm_lock_take+0x89/0xf0 [drm]
     [<ffffffffa03638cb>] drm_locked_tasklet_func+0x4b/0xb0 [drm]
     [<ffffffff8104616c>] tasklet_hi_action+0xbc/0x160
     [<ffffffff81046c84>] __do_softirq+0x74/0xf0
     [<ffffffff8100dafc>] call_softirq+0x1c/0x30
     [<ffffffff8100f1e5>] do_softirq+0x75/0xb0
     [<ffffffff81046415>] irq_exit+0xa5/0xb0
     [<ffffffff8100f4df>] do_IRQ+0xdf/0x1e0
     [<ffffffff8100ccf6>] ret_from_intr+0x0/0xf
     <EOI> [<ffffffff811e8e8f>] ? acpi_idle_enter_bm+0x27f/0x2c7
     [<ffffffff811e8e85>] ? acpi_idle_enter_bm+0x275/0x2c7
     [<ffffffff8127cd20>] ? menu_select+0xd0/0xf0
     [<ffffffff8127ba89>] ? cpuidle_idle_call+0xb9/0x100
     [<ffffffff8100ac56>] ? cpu_idle+0x66/0x100
     [<ffffffff81301420>] ? rest_init+0x70/0x80

Any progress with this bug? I have noticed another bug which is probably the same: bug #15541. Maybe I should go through the list and merge them all together? Is there any Intel developer actually interested in solving this bug with the EXA driver, or is the UXA/GEM driver now the only way to go, so that this bug will not be fixed in the current XAA/EXA driver?

Does this issue still exist with current bits (kernel, xserver, mesa and 2D driver) for you? I ran the following script on 965GM with current code and couldn't reproduce it:

    #! /bin/sh
    for ii in `seq 1000`
    do
        glxgears &
        sleep 10
        killall -9 glxgears
    done

I marked it as fixed. If you can reproduce it with current code, feel free to reopen it.
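The "break it in gdb inside the ioctl" step from the reproduction steps can also be done non-interactively once a client is stuck. A small sketch, assuming `gdb` and `pidof` are available and glxgears is the hung client; the helper name `bt_of` is made up here:

```shell
#!/bin/sh
# Print a backtrace of an already-running (possibly stuck) process
# without an interactive gdb session. On the hangs described above the
# client is typically blocked in an ioctl() on the drm device, which
# the backtrace makes visible. gdb and pidof are assumed present.
bt_of() {
    # $1 = process name; fails if no such process is running
    pid=$(pidof "$1") || return 1
    gdb -batch -p "${pid%% *}" -ex bt -ex detach 2>/dev/null
}

bt_of glxgears || echo "glxgears is not running" >&2
```

Attaching requires ptrace permission, so run it as root or as the owner of the stuck process.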