Summary: | [regression] Badness during evict | ||
---|---|---|---|
Product: | DRI | Reporter: | Chris Wilson <chris> |
Component: | DRM/Intel | Assignee: | Chris Wilson <chris> |
Status: | CLOSED FIXED | QA Contact: | |
Severity: | critical | ||
Priority: | high | CC: | jbarnes |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: |
Description
Chris Wilson
2010-09-29 02:54:31 UTC
Jumping back to v2.6.34 shows no sign of the kref badness. I have first seen this behaviour after 2.6.36-rc4 (which was rather stable for me). Sometimes only the graphics seems to be hung, and the machine is reachable over ssh, sometimes it shows a kernel oops on the console, or even hard freezes. Most crashes appeared after running heavy 3D scenes in OpenSceneGraph derived applications. This is the most recent backtrace: ------------[ cut here ]------------ WARNING: at lib/kref.c:34 kref_get+0x23/0x2c() Hardware name: 208252G Modules linked in: act_police sch_hfsc cls_fw cls_u32 sch_htb sch_ingress xt_realm sch_sfq iptable_raw ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ip6t_LOG ipt_ah nf_nat_tftp xt_connlimit nf_nat_snmp_basic ip6table_raw nf_nat_sip xt_comment nf_nat_pptp nf_nat_proto_gre ip6t_REJECT nf_nat_irc ip6table_mangle nf_nat_h323 nf_nat_ftp nf_conntrack_ipv6 nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc ts_kmp nf_conntrack_amanda nf_conntrack_h323 xt_time nf_conntrack_ftp xt_TCPMSS xt_tcpmss xt_policy xt_NFLOG nfnetlink_log xt_pkttype xt_mac xt_limit xt_length ipt_LOG xt_helper iptable_nat xt_dccp nf_nat xt_CLASSIFY nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp iptable_mangle xt_state nfnetlink iptable_filter ipt_addrtype xt_DSCP xt_dscp ip6table_filter xt_string xt_NFQUEUE xt_multiport xt_iprange xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack ip_tables ip6_tables x_tables tun ipv6 snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device acpi_cpufreq mperf arc4 ecb iwlagn iwlcore mac80211 sdhci_pci sdhci mmc_core cfg80211 pcmcia joydev cdc_acm cdc_wdm i2c_i801 btusb yenta_socket pcmcia_rsrc cdc_ether e1000e bluetooth usbnet tpm_tis pcmcia_core mii pcspkr scsi_transport_iscsi fuse nfs auth_rpcgss nfs_acl lockd sunrpc dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan sbp2 ohci1394 ieee1394 usbhid usb_storage scsi_transport_fc scsi_tgt sg [last unloaded: microcode] Pid: 12736, comm: kwin Not tainted 2.6.36-rc5-00226-g050026f #1 Call Trace: [<ffffffff810390ed>] warn_slowpath_common+0x85/0x9d [<ffffffff8103911f>] warn_slowpath_null+0x1a/0x1c [<ffffffff811e926f>] kref_get+0x23/0x2c [<ffffffff81301e38>] i915_gem_evict_something+0xc4/0x545 [<ffffffff8146235e>] ? _raw_spin_lock+0xe/0x10 [<ffffffff812fd289>] i915_gem_object_bind_to_gtt+0x1d3/0x289 [<ffffffff812ff264>] i915_gem_object_pin+0xa3/0x148 [<ffffffff812ff914>] i915_gem_do_execbuffer+0x500/0xe2c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff8102b77d>] ? should_resched+0xe/0x2e [<ffffffff81460e0a>] ? _cond_resched+0xe/0x22 [<ffffffff8130030e>] i915_gem_execbuffer2+0xce/0x12c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff81279e0b>] drm_ioctl+0x2a8/0x385 [<ffffffff81300240>] ? i915_gem_execbuffer2+0x0/0x12c [<ffffffff810bee78>] ? free_pages_and_swap_cache+0x26/0x80 [<ffffffff811e6a85>] ? cpumask_any_but+0x29/0x3c [<ffffffff810d70e2>] do_vfs_ioctl+0x4b0/0x4f1 [<ffffffff810b6575>] ? remove_vma+0x65/0x6d [<ffffffff810cafea>] ? fput+0x22/0x1b0 [<ffffffff810d716a>] sys_ioctl+0x47/0x6d [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b ---[ end trace 275b898f5f5f3900 ]--- general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/status CPU 0 Modules linked in: act_police sch_hfsc cls_fw cls_u32 sch_htb sch_ingress xt_realm sch_sfq iptable_raw ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ip6t_LOG ipt_ah nf_nat_tftp xt_connlimit nf_nat_snmp_basic ip6table_raw nf_nat_sip xt_comment nf_nat_pptp nf_nat_proto_gre ip6t_REJECT nf_nat_irc ip6table_mangle nf_nat_h323 nf_nat_ftp nf_conntrack_ipv6 nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc ts_kmp nf_conntrack_amanda nf_conntrack_h323 xt_time nf_conntrack_ftp xt_TCPMSS xt_tcpmss xt_policy xt_NFLOG nfnetlink_log xt_pkttype xt_mac xt_limit xt_length ipt_LOG xt_helper iptable_nat xt_dccp nf_nat xt_CLASSIFY nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp iptable_mangle xt_state nfnetlink iptable_filter ipt_addrtype xt_DSCP xt_dscp ip6table_filter xt_string xt_NFQUEUE xt_multiport xt_iprange xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack ip_tables ip6_tables x_tables tun ipv6 snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device acpi_cpufreq mperf arc4 ecb iwlagn iwlcore mac80211 sdhci_pci sdhci mmc_core cfg80211 pcmcia joydev cdc_acm cdc_wdm i2c_i801 btusb yenta_socket pcmcia_rsrc cdc_ether e1000e bluetooth usbnet tpm_tis pcmcia_core mii pcspkr scsi_transport_iscsi fuse nfs auth_rpcgss nfs_acl lockd sunrpc dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan sbp2 ohci1394 ieee1394 usbhid usb_storage scsi_transport_fc scsi_tgt sg [last unloaded: microcode] Pid: 12736, comm: kwin Tainted: G W 2.6.36-rc5-00226-g050026f #1 208252G/208252G RIP: 0010:[<ffffffff812828fa>] [<ffffffff812828fa>] drm_mm_scan_remove_block+0x2b/0xd0 RSP: 0018:ffff880111ce1b18 EFLAGS: 00010286 RAX: dead000000200200 RBX: ffff880052d76c00 RCX: ffff88013b403300 RDX: ffff8800438ecd40 RSI: 0000000000000037 RDI: ffff8800438ec640 RBP: ffff880111ce1b18 R08: ffff880111e4b7d0 R09: ffff880111e4bb90 R10: ffff88001eec6008 R11: ffff880111ce1938 R12: ffff88008bd71000 R13: 0000000000800000 R14: 0000000000001000 R15: ffff88013b403300 FS: 00007fe6b3609780(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffebb8e0a0 CR3: 000000012d655000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kwin (pid: 12736, threadinfo ffff880111ce0000, task ffff8800bb67b800) Stack: ffff880111ce1bb8 ffffffff813021cb ffff880111ce1b38 ffff880111ce1b68 <0> ffff880111ce1b68 ffff880111ce1b68 ffff880111ce1b78 ffff880111ce1b68 <0> ffff880052d76cb0 ffff88013b4033a0 ffff88004734c6b0 ffff880052fce6b0 Call Trace: [<ffffffff813021cb>] i915_gem_evict_something+0x457/0x545 [<ffffffff812fd289>] i915_gem_object_bind_to_gtt+0x1d3/0x289 [<ffffffff812ff264>] i915_gem_object_pin+0xa3/0x148 [<ffffffff812ff914>] i915_gem_do_execbuffer+0x500/0xe2c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff8102b77d>] ? should_resched+0xe/0x2e [<ffffffff81460e0a>] ? _cond_resched+0xe/0x22 [<ffffffff8130030e>] i915_gem_execbuffer2+0xce/0x12c [<ffffffff8102b765>] ? need_resched+0x23/0x2d [<ffffffff81279e0b>] drm_ioctl+0x2a8/0x385 [<ffffffff81300240>] ? i915_gem_execbuffer2+0x0/0x12c [<ffffffff810bee78>] ? free_pages_and_swap_cache+0x26/0x80 [<ffffffff811e6a85>] ? cpumask_any_but+0x29/0x3c [<ffffffff810d70e2>] do_vfs_ioctl+0x4b0/0x4f1 [<ffffffff810b6575>] ? remove_vma+0x65/0x6d [<ffffffff810cafea>] ? fput+0x22/0x1b0 [<ffffffff810d716a>] sys_ioctl+0x47/0x6d [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b Code: 55 48 89 e5 0f 1f 44 00 00 48 8b 4f 38 ff 49 54 8a 47 20 a8 02 75 02 0f 0b 83 e0 fc 48 8b 17 88 47 20 48 8b 47 08 48 85 c0 74 42 <40> 8a 70 20 40 f6 c6 04 75 02 0f 0b 4c 8d 40 10 83 e6 fb 4c 8d RIP [<ffffffff812828fa>] drm_mm_scan_remove_block+0x2b/0xd0 RSP <ffff880111ce1b18> I've pushed some patches which fix the BUG on my workload to drm-intel-staging, in particular the chief culprit was: commit 0e49892ed0bd2c34c1c3461d38a409ae18bfdc64 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Sep 29 23:51:17 2010 +0100 drm: Hold the mutex when dropping the last GEM reference In order to be fully threadsafe we need to check that the drm_gem_object refcount is still 0 after acquiring the mutex in order to call the free function. Otherwise, we may encounter scenarios like: Thread A: Thread B: drm_gem_close unreference_unlocked kref_put mutex_lock ... i915_gem_evict ... kref_get -> BUG ... i915_gem_unbind ... kref_put ... i915_gem_object_free ... mutex_unlock mutex_lock i915_gem_object_free -> BUG i915_gem_object_unbind kfree mutex_unlock To remmedy this, we could break the kref api and do: if (atomic_dec_and_test(&obj->refcount)) { mutex_lock(&dev->mutex); if (atomic_read(&obj->refcount) == 0) dev->funcs->free(obj); mutex_unlock(&dev->mutex); } and similarly remove the BUG_ON(kref_get(obj->refcounts)==0). It is simpler instead to simply rewrite the unlocked unreference variant to take the dev->struct_mutex around the kref_put. Note that no driver is currently using the free_unlocked vfunc and it is scheduled for removal. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30454 Reported-by: Magnus Kessler <Magnus.Kessler@gmx.net> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: stable@kernel.org Applied to airlied/drm-fixes will be upstream and in stable shortly. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.