[ 81.615971] WARNING: at lib/kref.c:34 kref_get+0x21/0x30() [ 81.615973] Hardware name: [ 81.615976] Pid: 817, comm: lt-cairo-perf-t Not tainted 2.6.36-rc5+ #64 [ 81.615978] Call Trace: [ 81.615983] [<c12edbd6>] ? printk+0x18/0x1a [ 81.615986] [<c1030178>] warn_slowpath_common+0x68/0xa0 [ 81.615989] [<c114f3f1>] ? kref_get+0x21/0x30 [ 81.616016] [<c114f3f1>] ? kref_get+0x21/0x30 [ 81.616019] [<c10301cd>] warn_slowpath_null+0x1d/0x20 [ 81.616031] [<c114f3f1>] kref_get+0x21/0x30 [ 81.616041] [<c11ea27d>] mark_free+0x1d/0x30 [ 81.616050] [<c11ea33f>] i915_gem_evict_something+0xaf/0x430 [ 81.616062] [<c11e507a>] i915_gem_object_bind_to_gtt+0x14a/0x2c0 [ 81.616073] [<c11e5f71>] i915_gem_object_pin+0x151/0x190 [ 81.616083] [<c11e8369>] i915_gem_do_execbuffer+0x499/0x1040 [ 81.616094] [<c108d58f>] ? check_object+0x12f/0x200 [ 81.616103] [<c11e8f55>] ? i915_gem_execbuffer2+0x45/0x1b0 [ 81.616114] [<c1155f98>] ? _copy_from_user+0x38/0x130 [ 81.616124] [<c11e8f87>] i915_gem_execbuffer2+0x77/0x1b0 [ 81.616134] [<c11c8add>] drm_ioctl+0x1cd/0x420 [ 81.616143] [<c11e8f10>] ? i915_gem_execbuffer2+0x0/0x1b0 [ 81.616152] [<c11c8910>] ? drm_ioctl+0x0/0x420 [ 81.616160] [<c10a03bd>] do_vfs_ioctl+0x7d/0x610 [ 81.616170] [<c12661a6>] ? sys_recv+0x36/0x40 [ 81.616180] [<c1267320>] ? sys_socketcall+0x190/0x270 [ 81.616189] [<c10a0989>] sys_ioctl+0x39/0x60 [ 81.616198] [<c1002a10>] sysenter_do_call+0x12/0x26 [ 81.616207] ---[ end trace 9a773246c0cdb582 ]--- [ 81.642630] BUG: unable to handle kernel paging request at 6b6b6d8b [ 81.642686] IP: [<c11e3979>] i915_gem_object_unpin+0x19/0xb0 [ 81.642718] *pde = 00000000 [ 81.642736] Oops: 0000 [#1] SMP [ 81.642758] last sysfs file: /sys/devices/virtual/vc/vcsa2/uevent [ 81.642788] [ 81.642799] Pid: 820, comm: lt-cairo-perf-t Tainted: G W 2.6.36-rc5+ #64 DQ35MP/ [ 81.642841] EIP: 0060:[<c11e3979>] EFLAGS: 00010296 CPU: 0 [ 81.642869] EIP is at i915_gem_object_unpin+0x19/0xb0 [ 81.642894] EAX: f4c875a0 EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: 6b6b6b6b [ 81.642924] ESI: 6b6b6b6b EDI: f712bab4 EBP: f65d8df4 ESP: f65d8de8 [ 81.642954] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 81.642980] Process lt-cairo-perf-t (pid: 820, ti=f65d8000 task=f60518f0 task.ti=f65d8000) [ 81.643019] Stack: [ 81.643031] f4c875a0 6b6b6b6b f712bab4 f65d8e04 c11e5ccf f4c875a0 f712baa0 f65d8e18 [ 81.643081] <0> c11c9be2 f4c875a0 c11c9b80 00000051 f65d8e28 c114f39d f64d4370 f4c875a0 [ 81.643135] <0> f65d8e40 c11c9f27 f64d4394 f64d4370 f712baa0 40086409 f65d8f18 c11c8add [ 81.643190] Call Trace: [ 81.643205] [<c11e5ccf>] ? i915_gem_free_object+0x1f/0x50 [ 81.643233] [<c11c9be2>] ? drm_gem_object_free_unlocked+0x62/0x80 [ 81.643264] [<c11c9b80>] ? drm_gem_object_free_unlocked+0x0/0x80 [ 81.643294] [<c114f39d>] ? kref_put+0x2d/0x60 [ 81.643317] [<c11c9f27>] ? drm_gem_close_ioctl+0x77/0xa0 [ 81.643344] [<c11c8add>] ? drm_ioctl+0x1cd/0x420 [ 81.643368] [<c11c9eb0>] ? drm_gem_close_ioctl+0x0/0xa0 [ 81.643397] [<c104c39e>] ? sched_clock_local+0xbe/0x1a0 [ 81.643424] [<c11c8910>] ? drm_ioctl+0x0/0x420 [ 81.643447] [<c10a03bd>] ? do_vfs_ioctl+0x7d/0x610 [ 81.643471] [<c1001593>] ? __switch_to+0x113/0x1a0 [ 81.643496] [<c1028d13>] ? finish_task_switch+0x33/0x80 [ 81.643522] [<c12edefa>] ? schedule+0x1ba/0x680 [ 81.643547] [<c107fa0f>] ? remove_vma+0x3f/0x60 [ 81.643571] [<c1080fea>] ? do_munmap+0x20a/0x2f0 [ 81.643595] [<c10a0989>] ? sys_ioctl+0x39/0x60 [ 81.643617] [<c1002a10>] ? sysenter_do_call+0x12/0x26 [ 81.643642] Code: 66 90 e9 37 ff ff ff 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 83 ec 0c 89 1c 24 89 74 24 04 89 7c 24 08 8b 50 70 8b 48 08 89 d3 <8b> b1 20 02 00 00 c1 eb 0e 81 e2 ff 3f fc ff 83 c3 0f 83 e3 0f [ 81.643867] EIP: [<c11e3979>] i915_gem_object_unpin+0x19/0xb0 SS:ESP 068:f65d8de8 [ 81.643908] CR2: 000000006b6b6d8b running stress: for i in `seq 1 10`; do DISPLAY=:0 CAIRO_TEST_TARGET=xlib ./cairo-perf-trace -i3 /usr/src/cairo-traces/full/firefox-36-20090609.trace >/dev/null 2>/dev/null & done for i in `seq 1 10`; do DISPLAY=:0 CAIRO_TEST_TARGET=gl ./cairo-perf-trace -i3 /usr/src/cairo-traces/full/firefox-36-20090609.trace >/dev/null 2>/dev/null & done
Jumping back to v2.6.34 shows no sign of the kref badness.
I have first seen this behaviour after 2.6.36-rc4 (which was rather stable for me). Sometimes only the graphics seems to be hung, and the machine is reachable over ssh, sometimes it shows a kernel oops on the console, or even hard freezes. Most crashes appeared after running heavy 3D scenes in OpenSceneGraph derived applications. This is the most recent backtrace: ------------[ cut here ]------------ WARNING: at lib/kref.c:34 kref_get+0x23/0x2c() Hardware name: 208252G Modules linked in: act_police sch_hfsc cls_fw cls_u32 sch_htb sch_ingress xt_realm sch_sfq iptable_raw ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ip6t_LOG ipt_ah nf_nat_tftp xt_connlimit nf_nat_snmp_basic ip6table_raw nf_nat_sip xt_comment nf_nat_pptp nf_nat_proto_gre ip6t_REJECT nf_nat_irc ip6table_mangle nf_nat_h323 nf_nat_ftp nf_conntrack_ipv6 nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc ts_kmp nf_conntrack_amanda nf_conntrack_h323 xt_time nf_conntrack_ftp xt_TCPMSS xt_tcpmss xt_policy xt_NFLOG nfnetlink_log xt_pkttype xt_mac xt_limit xt_length ipt_LOG xt_helper iptable_nat xt_dccp nf_nat xt_CLASSIFY nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp iptable_mangle xt_state nfnetlink iptable_filter ipt_addrtype xt_DSCP xt_dscp ip6table_filter xt_string xt_NFQUEUE xt_multiport xt_iprange xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack ip_tables ip6_tables x_tables tun ipv6 snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device acpi_cpufreq mperf arc4 ecb iwlagn iwlcore mac80211 sdhci_pci sdhci mmc_core cfg80211 pcmcia joydev cdc_acm cdc_wdm i2c_i801 btusb yenta_socket pcmcia_rsrc cdc_ether e1000e bluetooth usbnet tpm_tis pcmcia_core mii pcspkr scsi_transport_iscsi fuse nfs auth_rpcgss nfs_acl lockd sunrpc dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan sbp2 ohci1394 ieee1394 usbhid usb_storage scsi_transport_fc scsi_tgt sg [last unloaded: microcode] Pid: 12736, comm: kwin Not tainted 2.6.36-rc5-00226-g050026f #1 Call Trace: [<ffffffff810390ed>] warn_slowpath_common+0x85/0x9d [<ffffffff8103911f>] warn_slowpath_null+0x1a/0x1c [<ffffffff811e926f>] kref_get+0x23/0x2c [<ffffffff81301e38>] i915_gem_evict_something+0xc4/0x545 [<ffffffff8146235e>] ? _raw_spin_lock+0xe/0x10 [<ffffffff812fd289>] i915_gem_object_bind_to_gtt+0x1d3/0x289 [<ffffffff812ff264>] i915_gem_object_pin+0xa3/0x148 [<ffffffff812ff914>] i915_gem_do_execbuffer+0x500/0xe2c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff8102b77d>] ? should_resched+0xe/0x2e [<ffffffff81460e0a>] ? _cond_resched+0xe/0x22 [<ffffffff8130030e>] i915_gem_execbuffer2+0xce/0x12c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff81279e0b>] drm_ioctl+0x2a8/0x385 [<ffffffff81300240>] ? i915_gem_execbuffer2+0x0/0x12c [<ffffffff810bee78>] ? free_pages_and_swap_cache+0x26/0x80 [<ffffffff811e6a85>] ? cpumask_any_but+0x29/0x3c [<ffffffff810d70e2>] do_vfs_ioctl+0x4b0/0x4f1 [<ffffffff810b6575>] ? remove_vma+0x65/0x6d [<ffffffff810cafea>] ? fput+0x22/0x1b0 [<ffffffff810d716a>] sys_ioctl+0x47/0x6d [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b ---[ end trace 275b898f5f5f3900 ]--- general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/status CPU 0 Modules linked in: act_police sch_hfsc cls_fw cls_u32 sch_htb sch_ingress xt_realm sch_sfq iptable_raw ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ip6t_LOG ipt_ah nf_nat_tftp xt_connlimit nf_nat_snmp_basic ip6table_raw nf_nat_sip xt_comment nf_nat_pptp nf_nat_proto_gre ip6t_REJECT nf_nat_irc ip6table_mangle nf_nat_h323 nf_nat_ftp nf_conntrack_ipv6 nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc ts_kmp nf_conntrack_amanda nf_conntrack_h323 xt_time nf_conntrack_ftp xt_TCPMSS xt_tcpmss xt_policy xt_NFLOG nfnetlink_log xt_pkttype xt_mac xt_limit xt_length ipt_LOG xt_helper iptable_nat xt_dccp nf_nat xt_CLASSIFY nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp iptable_mangle xt_state nfnetlink iptable_filter ipt_addrtype xt_DSCP xt_dscp ip6table_filter xt_string xt_NFQUEUE xt_multiport xt_iprange xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack ip_tables ip6_tables x_tables tun ipv6 snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device acpi_cpufreq mperf arc4 ecb iwlagn iwlcore mac80211 sdhci_pci sdhci mmc_core cfg80211 pcmcia joydev cdc_acm cdc_wdm i2c_i801 btusb yenta_socket pcmcia_rsrc cdc_ether e1000e bluetooth usbnet tpm_tis pcmcia_core mii pcspkr scsi_transport_iscsi fuse nfs auth_rpcgss nfs_acl lockd sunrpc dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan sbp2 ohci1394 ieee1394 usbhid usb_storage scsi_transport_fc scsi_tgt sg [last unloaded: microcode] Pid: 12736, comm: kwin Tainted: G W 2.6.36-rc5-00226-g050026f #1 208252G/208252G RIP: 0010:[<ffffffff812828fa>] [<ffffffff812828fa>] drm_mm_scan_remove_block+0x2b/0xd0 RSP: 0018:ffff880111ce1b18 EFLAGS: 00010286 RAX: dead000000200200 RBX: ffff880052d76c00 RCX: ffff88013b403300 RDX: ffff8800438ecd40 RSI: 0000000000000037 RDI: ffff8800438ec640 RBP: ffff880111ce1b18 R08: ffff880111e4b7d0 R09: ffff880111e4bb90 R10: ffff88001eec6008 R11: ffff880111ce1938 R12: ffff88008bd71000 R13: 0000000000800000 R14: 0000000000001000 R15: ffff88013b403300 FS: 00007fe6b3609780(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffebb8e0a0 CR3: 000000012d655000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kwin (pid: 12736, threadinfo ffff880111ce0000, task ffff8800bb67b800) Stack: ffff880111ce1bb8 ffffffff813021cb ffff880111ce1b38 ffff880111ce1b68 <0> ffff880111ce1b68 ffff880111ce1b68 ffff880111ce1b78 ffff880111ce1b68 <0> ffff880052d76cb0 ffff88013b4033a0 ffff88004734c6b0 ffff880052fce6b0 Call Trace: [<ffffffff813021cb>] i915_gem_evict_something+0x457/0x545 [<ffffffff812fd289>] i915_gem_object_bind_to_gtt+0x1d3/0x289 [<ffffffff812ff264>] i915_gem_object_pin+0xa3/0x148 [<ffffffff812ff914>] i915_gem_do_execbuffer+0x500/0xe2c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff8102b77d>] ? should_resched+0xe/0x2e [<ffffffff81460e0a>] ? _cond_resched+0xe/0x22 [<ffffffff8130030e>] i915_gem_execbuffer2+0xce/0x12c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff81279e0b>] drm_ioctl+0x2a8/0x385 [<ffffffff81300240>] ? i915_gem_execbuffer2+0x0/0x12c [<ffffffff810bee78>] ? free_pages_and_swap_cache+0x26/0x80 [<ffffffff811e6a85>] ? cpumask_any_but+0x29/0x3c [<ffffffff810d70e2>] do_vfs_ioctl+0x4b0/0x4f1 [<ffffffff810b6575>] ? remove_vma+0x65/0x6d [<ffffffff810cafea>] ? fput+0x22/0x1b0 [<ffffffff810d716a>] sys_ioctl+0x47/0x6d [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b Code: 55 48 89 e5 0f 1f 44 00 00 48 8b 4f 38 ff 49 54 8a 47 20 a8 02 75 02 0f 0b 83 e0 fc 48 8b 17 88 47 20 48 8b 47 08 48 85 c0 74 42 <40> 8a 70 20 40 f6 c6 04 75 02 0f 0b 4c 8d 40 10 83 e6 fb 4c 8d RIP [<ffffffff812828fa>] drm_mm_scan_remove_block+0x2b/0xd0 RSP <ffff880111ce1b18>
I've pushed some patches which fix the BUG on my workload to drm-intel-staging, in particular the chief culprit was: commit 0e49892ed0bd2c34c1c3461d38a409ae18bfdc64 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Sep 29 23:51:17 2010 +0100 drm: Hold the mutex when dropping the last GEM reference In order to be fully threadsafe we need to check that the drm_gem_object refcount is still 0 after acquiring the mutex in order to call the free function. Otherwise, we may encounter scenarios like: Thread A: Thread B: drm_gem_close unreference_unlocked kref_put mutex_lock ... i915_gem_evict ... kref_get -> BUG ... i915_gem_unbind ... kref_put ... i915_gem_object_free ... mutex_unlock mutex_lock i915_gem_object_free -> BUG i915_gem_object_unbind kfree mutex_unlock To remmedy this, we could break the kref api and do: if (atomic_dec_and_test(&obj->refcount)) { mutex_lock(&dev->mutex); if (atomic_read(&obj->refcount) == 0) dev->funcs->free(obj); mutex_unlock(&dev->mutex); } and similarly remove the BUG_ON(kref_get(obj->refcounts)==0). It is simpler instead to simply rewrite the unlocked unreference variant to take the dev->struct_mutex around the kref_put. Note that no driver is currently using the free_unlocked vfunc and it is scheduled for removal. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30454 Reported-by: Magnus Kessler <Magnus.Kessler@gmx.net> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: stable@kernel.org
Applied to airlied/drm-fixes will be upstream and in stable shortly.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.