Bug 30454 - [regression] Badness during evict
Summary: [regression] Badness during evict
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: unspecified
Hardware: Other Linux (All)
: high critical
Assignee: Chris Wilson
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-29 02:54 UTC by Chris Wilson
Modified: 2017-07-24 23:06 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments

Description Chris Wilson 2010-09-29 02:54:31 UTC
[   81.615971] WARNING: at lib/kref.c:34 kref_get+0x21/0x30()
[   81.615973] Hardware name:         
[   81.615976] Pid: 817, comm: lt-cairo-perf-t Not tainted 2.6.36-rc5+ #64
[   81.615978] Call Trace:
[   81.615983]  [<c12edbd6>] ? printk+0x18/0x1a
[   81.615986]  [<c1030178>] warn_slowpath_common+0x68/0xa0
[   81.615989]  [<c114f3f1>] ? kref_get+0x21/0x30
[   81.616016]  [<c114f3f1>] ? kref_get+0x21/0x30
[   81.616019]  [<c10301cd>] warn_slowpath_null+0x1d/0x20
[   81.616031]  [<c114f3f1>] kref_get+0x21/0x30
[   81.616041]  [<c11ea27d>] mark_free+0x1d/0x30
[   81.616050]  [<c11ea33f>] i915_gem_evict_something+0xaf/0x430
[   81.616062]  [<c11e507a>] i915_gem_object_bind_to_gtt+0x14a/0x2c0
[   81.616073]  [<c11e5f71>] i915_gem_object_pin+0x151/0x190
[   81.616083]  [<c11e8369>] i915_gem_do_execbuffer+0x499/0x1040
[   81.616094]  [<c108d58f>] ? check_object+0x12f/0x200
[   81.616103]  [<c11e8f55>] ? i915_gem_execbuffer2+0x45/0x1b0
[   81.616114]  [<c1155f98>] ? _copy_from_user+0x38/0x130
[   81.616124]  [<c11e8f87>] i915_gem_execbuffer2+0x77/0x1b0
[   81.616134]  [<c11c8add>] drm_ioctl+0x1cd/0x420
[   81.616143]  [<c11e8f10>] ? i915_gem_execbuffer2+0x0/0x1b0
[   81.616152]  [<c11c8910>] ? drm_ioctl+0x0/0x420
[   81.616160]  [<c10a03bd>] do_vfs_ioctl+0x7d/0x610
[   81.616170]  [<c12661a6>] ? sys_recv+0x36/0x40
[   81.616180]  [<c1267320>] ? sys_socketcall+0x190/0x270
[   81.616189]  [<c10a0989>] sys_ioctl+0x39/0x60
[   81.616198]  [<c1002a10>] sysenter_do_call+0x12/0x26
[   81.616207] ---[ end trace 9a773246c0cdb582 ]---
[   81.642630] BUG: unable to handle kernel paging request at 6b6b6d8b
[   81.642686] IP: [<c11e3979>] i915_gem_object_unpin+0x19/0xb0
[   81.642718] *pde = 00000000 
[   81.642736] Oops: 0000 [#1] SMP 
[   81.642758] last sysfs file: /sys/devices/virtual/vc/vcsa2/uevent
[   81.642788] 
[   81.642799] Pid: 820, comm: lt-cairo-perf-t Tainted: G        W   2.6.36-rc5+ #64 DQ35MP/        
[   81.642841] EIP: 0060:[<c11e3979>] EFLAGS: 00010296 CPU: 0
[   81.642869] EIP is at i915_gem_object_unpin+0x19/0xb0
[   81.642894] EAX: f4c875a0 EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: 6b6b6b6b
[   81.642924] ESI: 6b6b6b6b EDI: f712bab4 EBP: f65d8df4 ESP: f65d8de8
[   81.642954]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[   81.642980] Process lt-cairo-perf-t (pid: 820, ti=f65d8000 task=f60518f0 task.ti=f65d8000)
[   81.643019] Stack:
[   81.643031]  f4c875a0 6b6b6b6b f712bab4 f65d8e04 c11e5ccf f4c875a0 f712baa0 f65d8e18
[   81.643081] <0> c11c9be2 f4c875a0 c11c9b80 00000051 f65d8e28 c114f39d f64d4370 f4c875a0
[   81.643135] <0> f65d8e40 c11c9f27 f64d4394 f64d4370 f712baa0 40086409 f65d8f18 c11c8add
[   81.643190] Call Trace:
[   81.643205]  [<c11e5ccf>] ? i915_gem_free_object+0x1f/0x50
[   81.643233]  [<c11c9be2>] ? drm_gem_object_free_unlocked+0x62/0x80
[   81.643264]  [<c11c9b80>] ? drm_gem_object_free_unlocked+0x0/0x80
[   81.643294]  [<c114f39d>] ? kref_put+0x2d/0x60
[   81.643317]  [<c11c9f27>] ? drm_gem_close_ioctl+0x77/0xa0
[   81.643344]  [<c11c8add>] ? drm_ioctl+0x1cd/0x420
[   81.643368]  [<c11c9eb0>] ? drm_gem_close_ioctl+0x0/0xa0
[   81.643397]  [<c104c39e>] ? sched_clock_local+0xbe/0x1a0
[   81.643424]  [<c11c8910>] ? drm_ioctl+0x0/0x420
[   81.643447]  [<c10a03bd>] ? do_vfs_ioctl+0x7d/0x610
[   81.643471]  [<c1001593>] ? __switch_to+0x113/0x1a0
[   81.643496]  [<c1028d13>] ? finish_task_switch+0x33/0x80
[   81.643522]  [<c12edefa>] ? schedule+0x1ba/0x680
[   81.643547]  [<c107fa0f>] ? remove_vma+0x3f/0x60
[   81.643571]  [<c1080fea>] ? do_munmap+0x20a/0x2f0
[   81.643595]  [<c10a0989>] ? sys_ioctl+0x39/0x60
[   81.643617]  [<c1002a10>] ? sysenter_do_call+0x12/0x26
[   81.643642] Code: 66 90 e9 37 ff ff ff 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 83 ec 0c 89 1c 24 89 74 24 04 89 7c 24 08 8b 50 70 8b 48 08 89 d3 <8b> b1 20 02 00 00 c1 eb 0e 81 e2 ff 3f fc ff 83 c3 0f 83 e3 0f 
[   81.643867] EIP: [<c11e3979>] i915_gem_object_unpin+0x19/0xb0 SS:ESP 068:f65d8de8
[   81.643908] CR2: 000000006b6b6d8b

running stress:
for i in `seq 1 10`; do DISPLAY=:0 CAIRO_TEST_TARGET=xlib ./cairo-perf-trace -i3 /usr/src/cairo-traces/full/firefox-36-20090609.trace >/dev/null 2>/dev/null & done
for i in `seq 1 10`; do DISPLAY=:0 CAIRO_TEST_TARGET=gl ./cairo-perf-trace -i3 /usr/src/cairo-traces/full/firefox-36-20090609.trace >/dev/null 2>/dev/null & done
Comment 1 Chris Wilson 2010-09-29 04:19:10 UTC
Jumping back to v2.6.34 shows no sign of the kref badness.
Comment 2 Magnus Kessler 2010-09-29 05:32:12 UTC
I have first seen this behaviour after 2.6.36-rc4 (which was rather stable for me). Sometimes only the graphics seems to be hung, and the machine is reachable over ssh, sometimes it shows a kernel oops on the console, or even hard freezes.

Most crashes appeared after running heavy 3D scenes in OpenSceneGraph derived applications.

This is the most recent backtrace:

------------[ cut here ]------------
WARNING: at lib/kref.c:34 kref_get+0x23/0x2c()
Hardware name: 208252G
Modules linked in: act_police sch_hfsc cls_fw cls_u32 sch_htb sch_ingress xt_realm sch_sfq iptable_raw ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ip6t_LOG ipt_ah nf_nat_tftp xt_connlimit nf_nat_snmp_basic ip6table_raw nf_nat_sip xt_comment nf_nat_pptp nf_nat_proto_gre ip6t_REJECT nf_nat_irc ip6table_mangle nf_nat_h323 nf_nat_ftp nf_conntrack_ipv6 nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc ts_kmp nf_conntrack_amanda nf_conntrack_h323 xt_time nf_conntrack_ftp xt_TCPMSS xt_tcpmss xt_policy xt_NFLOG nfnetlink_log xt_pkttype xt_mac xt_limit xt_length ipt_LOG xt_helper iptable_nat xt_dccp nf_nat xt_CLASSIFY nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp iptable_mangle xt_state nfnetlink iptable_filter ipt_addrtype xt_DSCP xt_dscp ip6table_filter xt_string xt_NFQUEUE xt_multiport xt_iprange xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack ip_tables ip6_tables x_tables tun ipv6 snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device acpi_cpufreq mperf arc4 ecb iwlagn iwlcore mac80211 sdhci_pci sdhci mmc_core cfg80211 pcmcia joydev cdc_acm cdc_wdm i2c_i801 btusb yenta_socket pcmcia_rsrc cdc_ether e1000e bluetooth usbnet tpm_tis pcmcia_core mii pcspkr scsi_transport_iscsi fuse nfs auth_rpcgss nfs_acl lockd sunrpc dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan sbp2 ohci1394 ieee1394 usbhid usb_storage scsi_transport_fc scsi_tgt sg [last unloaded: microcode]
Pid: 12736, comm: kwin Not tainted 2.6.36-rc5-00226-g050026f #1
Call Trace:
 [<ffffffff810390ed>] warn_slowpath_common+0x85/0x9d
 [<ffffffff8103911f>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff811e926f>] kref_get+0x23/0x2c
 [<ffffffff81301e38>] i915_gem_evict_something+0xc4/0x545
 [<ffffffff8146235e>] ? _raw_spin_lock+0xe/0x10
 [<ffffffff812fd289>] i915_gem_object_bind_to_gtt+0x1d3/0x289
 [<ffffffff812ff264>] i915_gem_object_pin+0xa3/0x148
 [<ffffffff812ff914>] i915_gem_do_execbuffer+0x500/0xe2c
 [<ffffffff8102b765>] ? need_resched+0x23/0x2d
 [<ffffffff8102b77d>] ? should_resched+0xe/0x2e
 [<ffffffff81460e0a>] ? _cond_resched+0xe/0x22
 [<ffffffff8130030e>] i915_gem_execbuffer2+0xce/0x12c
 [<ffffffff8102b765>] ? need_resched+0x23/0x2d
 [<ffffffff81279e0b>] drm_ioctl+0x2a8/0x385
 [<ffffffff81300240>] ? i915_gem_execbuffer2+0x0/0x12c
 [<ffffffff810bee78>] ? free_pages_and_swap_cache+0x26/0x80
 [<ffffffff811e6a85>] ? cpumask_any_but+0x29/0x3c
 [<ffffffff810d70e2>] do_vfs_ioctl+0x4b0/0x4f1
 [<ffffffff810b6575>] ? remove_vma+0x65/0x6d
 [<ffffffff810cafea>] ? fput+0x22/0x1b0
 [<ffffffff810d716a>] sys_ioctl+0x47/0x6d
 [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b
---[ end trace 275b898f5f5f3900 ]---
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/status
CPU 0 
Modules linked in: act_police sch_hfsc cls_fw cls_u32 sch_htb sch_ingress xt_realm sch_sfq iptable_raw ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ip6t_LOG ipt_ah nf_nat_tftp xt_connlimit nf_nat_snmp_basic ip6table_raw nf_nat_sip xt_comment nf_nat_pptp nf_nat_proto_gre ip6t_REJECT nf_nat_irc ip6table_mangle nf_nat_h323 nf_nat_ftp nf_conntrack_ipv6 nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc ts_kmp nf_conntrack_amanda nf_conntrack_h323 xt_time nf_conntrack_ftp xt_TCPMSS xt_tcpmss xt_policy xt_NFLOG nfnetlink_log xt_pkttype xt_mac xt_limit xt_length ipt_LOG xt_helper iptable_nat xt_dccp nf_nat xt_CLASSIFY nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp iptable_mangle xt_state nfnetlink iptable_filter ipt_addrtype xt_DSCP xt_dscp ip6table_filter xt_string xt_NFQUEUE xt_multiport xt_iprange xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack ip_tables ip6_tables x_tables tun ipv6 snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device acpi_cpufreq mperf arc4 ecb iwlagn iwlcore mac80211 sdhci_pci sdhci mmc_core cfg80211 pcmcia joydev cdc_acm cdc_wdm i2c_i801 btusb yenta_socket pcmcia_rsrc cdc_ether e1000e bluetooth usbnet tpm_tis pcmcia_core mii pcspkr scsi_transport_iscsi fuse nfs auth_rpcgss nfs_acl lockd sunrpc dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan sbp2 ohci1394 ieee1394 usbhid usb_storage scsi_transport_fc scsi_tgt sg [last unloaded: microcode]

Pid: 12736, comm: kwin Tainted: G        W   2.6.36-rc5-00226-g050026f #1 208252G/208252G
RIP: 0010:[<ffffffff812828fa>]  [<ffffffff812828fa>] drm_mm_scan_remove_block+0x2b/0xd0
RSP: 0018:ffff880111ce1b18  EFLAGS: 00010286
RAX: dead000000200200 RBX: ffff880052d76c00 RCX: ffff88013b403300
RDX: ffff8800438ecd40 RSI: 0000000000000037 RDI: ffff8800438ec640
RBP: ffff880111ce1b18 R08: ffff880111e4b7d0 R09: ffff880111e4bb90
R10: ffff88001eec6008 R11: ffff880111ce1938 R12: ffff88008bd71000
R13: 0000000000800000 R14: 0000000000001000 R15: ffff88013b403300
FS:  00007fe6b3609780(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffebb8e0a0 CR3: 000000012d655000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kwin (pid: 12736, threadinfo ffff880111ce0000, task ffff8800bb67b800)
Stack:
 ffff880111ce1bb8 ffffffff813021cb ffff880111ce1b38 ffff880111ce1b68
<0> ffff880111ce1b68 ffff880111ce1b68 ffff880111ce1b78 ffff880111ce1b68
<0> ffff880052d76cb0 ffff88013b4033a0 ffff88004734c6b0 ffff880052fce6b0
Call Trace:
 [<ffffffff813021cb>] i915_gem_evict_something+0x457/0x545
 [<ffffffff812fd289>] i915_gem_object_bind_to_gtt+0x1d3/0x289
 [<ffffffff812ff264>] i915_gem_object_pin+0xa3/0x148
 [<ffffffff812ff914>] i915_gem_do_execbuffer+0x500/0xe2c
 [<ffffffff8102b765>] ? need_resched+0x23/0x2d
 [<ffffffff8102b77d>] ? should_resched+0xe/0x2e
 [<ffffffff81460e0a>] ? _cond_resched+0xe/0x22
 [<ffffffff8130030e>] i915_gem_execbuffer2+0xce/0x12c
 [<ffffffff8102b765>] ? need_resched+0x23/0x2d
 [<ffffffff81279e0b>] drm_ioctl+0x2a8/0x385
 [<ffffffff81300240>] ? i915_gem_execbuffer2+0x0/0x12c
 [<ffffffff810bee78>] ? free_pages_and_swap_cache+0x26/0x80
 [<ffffffff811e6a85>] ? cpumask_any_but+0x29/0x3c
 [<ffffffff810d70e2>] do_vfs_ioctl+0x4b0/0x4f1
 [<ffffffff810b6575>] ? remove_vma+0x65/0x6d
 [<ffffffff810cafea>] ? fput+0x22/0x1b0
 [<ffffffff810d716a>] sys_ioctl+0x47/0x6d
 [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b
Code: 55 48 89 e5 0f 1f 44 00 00 48 8b 4f 38 ff 49 54 8a 47 20 a8 02 75 02 0f 0b 83 e0 fc 48 8b 17 88 47 20 48 8b 47 08 48 85 c0 74 42 <40> 8a 70 20 40 f6 c6 04 75 02 0f 0b 4c 8d 40 10 83 e6 fb 4c 8d 
RIP  [<ffffffff812828fa>] drm_mm_scan_remove_block+0x2b/0xd0
 RSP <ffff880111ce1b18>
Comment 3 Chris Wilson 2010-09-29 16:04:27 UTC
I've pushed some patches which fix the BUG on my workload to drm-intel-staging, in particular the chief culprit was:

commit 0e49892ed0bd2c34c1c3461d38a409ae18bfdc64
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 29 23:51:17 2010 +0100

    drm: Hold the mutex when dropping the last GEM reference
    
    In order to be fully threadsafe we need to check that the drm_gem_object
    refcount is still 0 after acquiring the mutex in order to call the free
    function. Otherwise, we may encounter scenarios like:
    
    Thread A:                                        Thread B:
    drm_gem_close
    unreference_unlocked
    kref_put                                         mutex_lock
    ...                                              i915_gem_evict
    ...                                              kref_get -> BUG
    ...                                              i915_gem_unbind
    ...                                              kref_put
    ...                                              i915_gem_object_free
    ...                                              mutex_unlock
    mutex_lock
    i915_gem_object_free -> BUG
    i915_gem_object_unbind
    kfree
    mutex_unlock
    
    To remmedy this, we could break the kref api and do:
      if (atomic_dec_and_test(&obj->refcount)) {
    	  mutex_lock(&dev->mutex);
    	  if (atomic_read(&obj->refcount) == 0)
    		  dev->funcs->free(obj);
    	  mutex_unlock(&dev->mutex);
      }
    and similarly remove the BUG_ON(kref_get(obj->refcounts)==0).
    
    It is simpler instead to simply rewrite the unlocked unreference variant
    to take the dev->struct_mutex around the kref_put.
    
    Note that no driver is currently using the free_unlocked vfunc and it is
    scheduled for removal.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30454
    Reported-by: Magnus Kessler <Magnus.Kessler@gmx.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
Comment 4 Chris Wilson 2010-10-01 01:40:59 UTC
Applied to airlied/drm-fixes will be upstream and in stable shortly.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.