[ 105.087133] WARNING: CPU: 0 PID: 155 at drivers/gpu/drm/i915/i915_gem_userptr.c:89 cancel_userptr+0x1de/0x210 [i915]
[ 105.087134] WARN_ON(i915_gem_object_put_pages(obj))
[ 105.087135] Modules linked in:
[ 105.087136] rfcomm fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep arc4 iwlmvm mac80211 iTCO_wdt iTCO_vendor_support iwlwifi intel_rapl x86_pkg_temp_thermal coretemp snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic uvcvideo cfg80211 snd_hda_intel snd_hda_codec videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 btusb videobuf2_core btrtl
[ 105.087171] snd_hwdep btbcm i2c_i801 videodev snd_hda_core btintel intel_pch_thermal i2c_smbus bluetooth snd_seq rtsx_pci_ms memstick joydev snd_seq_device lpc_ich media shpchp snd_pcm mei_me mei thinkpad_acpi snd_timer snd soundcore rfkill wmi tpm_tis nfsd tpm_tis_core tpm intel_rst auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt hid_microsoft i915 i2c_algo_bit drm_kms_helper drm rtsx_pci_sdmmc mmc_core e1000e rtsx_pci crct10dif_pclmul crc32_pclmul crc32c_intel ptp serio_raw pps_core fjes video
[ 105.087203] CPU: 0 PID: 155 Comm: kworker/u16:5 Tainted: G W 4.8.0-rc1-drm-intel+ #61
[ 105.087204] Hardware name: LENOVO 20BW000FUK/20BW000FUK, BIOS JBET54WW (1.19 ) 11/06/2015
[ 105.087228] Workqueue: i915-userptr-release cancel_userptr [i915]
[ 105.087231] 0000000000000286 000000000d997dcc ffff88022bb83d20 ffffffff813dc74d
[ 105.087234] ffff88022bb83d70 0000000000000000 ffff88022bb83d60 ffffffff810a750b
[ 105.087236] 000000592b951d40 ffff880227640068 ffff880191d9e300 ffffffff81551700
[ 105.087239] Call Trace:
[ 105.087244] [<ffffffff813dc74d>] dump_stack+0x63/0x86
[ 105.087248] [<ffffffff810a750b>] __warn+0xcb/0xf0
[ 105.087251] [<ffffffff81551700>] ? fence_context_alloc+0x20/0x20
[ 105.087253] [<ffffffff810a758f>] warn_slowpath_fmt+0x5f/0x80
[ 105.087273] [<ffffffffa01d1fde>] cancel_userptr+0x1de/0x210 [i915]
[ 105.087276] [<ffffffff810c0824>] process_one_work+0x184/0x410
[ 105.087278] [<ffffffff810c0afe>] worker_thread+0x4e/0x480
[ 105.087279] [<ffffffff810c0ab0>] ? process_one_work+0x410/0x410
[ 105.087281] [<ffffffff810c0ab0>] ? process_one_work+0x410/0x410
[ 105.087283] [<ffffffff810c6618>] kthread+0xd8/0xf0
[ 105.087286] [<ffffffff817e18bf>] ret_from_fork+0x1f/0x40
[ 105.087288] [<ffffffff810c6540>] ? kthread_worker_fn+0x180/0x180
[ 105.087290] ---[ end trace 50e95d2a797c1d6f ]---

Happens consistently during video playback in vlc. i915_gem_object_put_pages returns -EBUSY, because the previous call to i915_gem_object_unbind doesn't unbind anything in our vma_list, maybe because it's empty? Any ideas Chris?
Try https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fence and tell me what the warn reports (if it fires)?
[ 136.189244] WARNING: CPU: 2 PID: 113 at drivers/gpu/drm/i915/i915_gem_userptr.c:94 cancel_userptr+0x208/0x250 [i915]
[ 136.189245] Failed to release pages: bind_count=1, pages_pin_count=1, pin_display=0
[ 136.189245] Modules linked in: rfcomm fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_security ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw iptable_security iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep arc4 iwlmvm mac80211 iTCO_wdt iTCO_vendor_support uvcvideo iwlwifi intel_rapl x86_pkg_temp_thermal coretemp videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_hda_codec_realtek videobuf2_core snd_hda_codec_hdmi snd_hda_codec_generic videodev cfg80211 snd_hda_intel
[ 136.189271] btusb snd_hda_codec btrtl media joydev btbcm snd_hwdep rtsx_pci_ms snd_hda_core btintel memstick bluetooth nfsd snd_seq thinkpad_acpi wmi snd_seq_device rfkill snd_pcm auth_rpcgss intel_rst snd_timer nfs_acl snd mei_me lockd mei tpm_tis shpchp i2c_i801 lpc_ich tpm_tis_core intel_pch_thermal soundcore tpm i2c_smbus grace sunrpc dm_crypt hid_microsoft i915 i2c_algo_bit drm_kms_helper rtsx_pci_sdmmc mmc_core drm e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ptp serio_raw rtsx_pci pps_core fjes video
[ 136.189294] CPU: 2 PID: 113 Comm: kworker/u16:3 Tainted: G W 4.8.0-rc1-drm-intel+ #62
[ 136.189295] Hardware name: LENOVO 20BW000FUK/20BW000FUK, BIOS JBET54WW (1.19 ) 11/06/2015
[ 136.189312] Workqueue: i915-userptr-release cancel_userptr [i915]
[ 136.189313] 0000000000000286 0000000052dfdf5c ffff88022ba13d20 ffffffff813dcf2d
[ 136.189315] ffff88022ba13d70 0000000000000000 ffff88022ba13d60 ffffffff810a750b
[ 136.189316] 0000005e3dc99538 ffff880228e70068 ffff8801afb68f00 ffffffff81552000
[ 136.189318] Call Trace:
[ 136.189321] [<ffffffff813dcf2d>] dump_stack+0x63/0x86
[ 136.189324] [<ffffffff810a750b>] __warn+0xcb/0xf0
[ 136.189325] [<ffffffff81552000>] ? fence_wait_timeout.part.9+0xc0/0xc0
[ 136.189327] [<ffffffff810a758f>] warn_slowpath_fmt+0x5f/0x80
[ 136.189340] [<ffffffffa01e71b8>] cancel_userptr+0x208/0x250 [i915]
[ 136.189342] [<ffffffff810c0824>] process_one_work+0x184/0x410
[ 136.189343] [<ffffffff810c0afe>] worker_thread+0x4e/0x480
[ 136.189344] [<ffffffff810c0ab0>] ? process_one_work+0x410/0x410
[ 136.189346] [<ffffffff810c6618>] kthread+0xd8/0xf0
[ 136.189348] [<ffffffff817e38bf>] ret_from_fork+0x1f/0x40
[ 136.189350] [<ffffffff810c6540>] ? kthread_worker_fn+0x180/0x180
[ 136.189351] ---[ end trace 694ecd151f58f1a6 ]---
Created attachment 125781
unbind userptr uninterruptibly

Still has a binding, let's restore the uninterruptible unbind.
Still hit the same warning unfortunately.
A bit more information about the unbind fail perhaps?

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fdd7c0a12127..2138e5eea31d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2773,7 +2773,9 @@ int i915_vma_unbind(struct i915_vma *vma)
 		GEM_BUG_ON(i915_vma_is_active(vma));
 	}

-	if (i915_vma_is_pinned(vma))
+	if (WARN(i915_vma_is_pinned(vma),
+		 "vma is still pinned [%d], flags=%x\n",
+		 i915_vma_pin_count(vma), vma->flags))
 		return -EBUSY;

 	if (!drm_mm_node_allocated(&vma->node))
For fun, also

diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 8292e797d9b5..cec1aa4e1152 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -25,10 +25,6 @@
 #ifndef __I915_GEM_H__
 #define __I915_GEM_H__

-#ifdef CONFIG_DRM_I915_DEBUG_GEM
-#define GEM_BUG_ON(expr) BUG_ON(expr)
-#else
-#define GEM_BUG_ON(expr)
-#endif
+#define GEM_BUG_ON(expr) WARN_ON(expr)

 #endif /* __I915_GEM_H__ */
Same thing again. Interestingly I never actually hit the WARN in vma_unbind.

hmmm, if I add:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 94fc051..8b66098 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -286,7 +286,7 @@ i915_gem_object_unbind(struct drm_i915_gem_object *obj)
 {
 	struct i915_vma *vma;
 	LIST_HEAD(still_in_list);
-	int ret;
+	int ret = -42;

 	/* The vma will only be freed if it is marked as closed, and if we wait
 	 * upon rendering to the vma, we may unbind anything in the list.
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index e20b653..06b3ed1 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -77,6 +77,7 @@ static void cancel_userptr(struct work_struct *work)
 	struct drm_i915_gem_object *obj = mo->obj;
 	struct drm_device *dev = obj->base.dev;
 	bool was_interruptible;
+	int ret;

 	wait_rendering(obj);

@@ -89,10 +90,13 @@ static void cancel_userptr(struct work_struct *work)
 	to_i915(dev)->mm.interruptible = false;

 	/* We are inside a kthread context and can't be interrupted */
-	if (i915_gem_object_unbind(obj) == 0)
+	ret = i915_gem_object_unbind(obj);
+	if (ret == 0)
 		__i915_gem_object_put_pages(obj);
+
 	WARN_ONCE(obj->mm.pages,
-		  "Failed to release pages: bind_count=%d, pages_pin_count=%d, pin_display=%d\n",
+		  "Failed to release pages: ret=%d, bind_count=%d, pages_pin_count=%d, pin_display=%d\n",
+		  ret,
 		  obj->bind_count,
 		  atomic_read(&obj->mm.pages_pin_count),
 		  obj->pin_display);

I get:

Failed to release pages: ret=-42, bind_count=1, pages_pin_count=1, pin_display=0
Hmm, yup, ret is not initialised for the empty vma_list.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fdd7c0a12127..8f7bc47e5f5d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -286,7 +286,7 @@ i915_gem_object_unbind(struct drm_i915_gem_object *obj)
 {
 	struct i915_vma *vma;
 	LIST_HEAD(still_in_list);
-	int ret;
+	int ret = 0;

is a definite fix. But you have a bind_count != 0, so you must have some vma in there. :|
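For anyone following along, the uninitialised-return problem can be shown outside the driver. This is only a userspace sketch: toy_vma, toy_vma_unbind and toy_object_unbind are made-up stand-ins, not the i915 functions, and the loop is simplified to a plain singly linked list. The point is just that ret is assigned only inside the loop body, so an empty list returns whatever ret started as.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma on an object's vma_list. */
struct toy_vma {
	struct toy_vma *next;
	int pinned;
};

/* Hypothetical stand-in for i915_vma_unbind(): refuses a pinned vma. */
static int toy_vma_unbind(struct toy_vma *vma)
{
	assert(vma != NULL);
	return vma->pinned ? -EBUSY : 0;
}

/*
 * Same shape as the loop under discussion: ret is only written inside
 * the loop, so it has to start at 0 for an empty list to mean success.
 */
static int toy_object_unbind(struct toy_vma *vma_list)
{
	struct toy_vma *vma;
	int ret = 0; /* drop the "= 0" and an empty list returns garbage */

	for (vma = vma_list; vma; vma = vma->next) {
		ret = toy_vma_unbind(vma);
		if (ret)
			break;
	}

	return ret;
}
```

Calling toy_object_unbind(NULL) models the empty vma_list case from the report. With the initialiser it returns 0; without it the value is indeterminate, which fits the -EBUSY guess in the original report and the -42 sentinel experiment.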
Ah. The obj_link is removed on i915_vma_close() not upon free. Time to think why.
IIRC, my thinking was to remove it upon close so that it was unavailable for lookup immediately afterwards.
Magic fix:

index 1f63a45fd6b0..fa9486608ddf 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -279,12 +279,15 @@ static const struct drm_i915_gem_object_ops i915_gem_phys_ops = {
 	.release = i915_gem_object_release_phys,
 };

-int
-i915_gem_object_unbind(struct drm_i915_gem_object *obj)
+int i915_gem_object_unbind(struct drm_i915_gem_object *obj)
 {
 	struct i915_vma *vma;
 	LIST_HEAD(still_in_list);
-	int ret = 0;
+	int ret;
+
+	ret = i915_gem_object_wait_rendering(obj, false);
+	if (ret)
+		return ret;
Not magic enough...
More subtle fix:

 	struct i915_vma *vma;
 	LIST_HEAD(still_in_list);
+	unsigned long active;
 	int ret = 0;

+	active = i915_gem_object_get_active(obj);
+	for_each_active(active, idx) {
+		ret = i915_gem_active_retire(&obj->last_read[idx],
+					     &obj->base.dev->struct_mutex);
+		if (ret)
+			return ret;
+	}
+
Created attachment 125782
Third go
Nice, that does indeed fix it.
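A rough userspace model of why retiring before the unbind helps, going by the comments above. Every name here (toy_vma, toy_obj, last_active, the toy_* functions) is invented for illustration and the real driver is considerably more involved. The assumptions are the ones stated in this thread: close unlinks the vma from the object's list straight away, and a closed vma that is still active only drops its binding once it is retired.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names echo the thread, none of this is i915 code. */
struct toy_vma {
	struct toy_vma *next;	/* link on the object's vma_list */
	bool bound;
	bool active;		/* the GPU may still be using it */
	bool closed;
};

struct toy_obj {
	struct toy_vma *vma_list;
	int bind_count;
	struct toy_vma *last_active;	/* assumed stand-in for last_read[] */
};

static void toy_vma_unbind(struct toy_obj *obj, struct toy_vma *vma)
{
	if (vma->bound) {
		assert(obj->bind_count > 0);
		vma->bound = false;
		obj->bind_count--;
	}
}

/* Close unlinks at once; an active vma keeps its binding for now. */
static void toy_vma_close(struct toy_obj *obj, struct toy_vma *vma)
{
	struct toy_vma **p;

	for (p = &obj->vma_list; *p; p = &(*p)->next) {
		if (*p == vma) {
			*p = vma->next;
			break;
		}
	}
	vma->next = NULL;
	vma->closed = true;
	if (!vma->active)
		toy_vma_unbind(obj, vma);
}

/* Retiring a closed vma is what finally releases its binding. */
static void toy_vma_retire(struct toy_obj *obj, struct toy_vma *vma)
{
	vma->active = false;
	if (vma->closed)
		toy_vma_unbind(obj, vma);
}

/* Walks only vma_list, so a closed-but-bound vma is invisible to it. */
static int toy_object_unbind(struct toy_obj *obj, bool retire_first)
{
	struct toy_vma *vma;

	if (retire_first && obj->last_active) {
		toy_vma_retire(obj, obj->last_active);
		obj->last_active = NULL;
	}

	for (vma = obj->vma_list; vma; vma = vma->next)
		toy_vma_unbind(obj, vma);

	return 0;
}
```

In this model, closing an active, bound vma leaves vma_list empty with bind_count still at 1, so toy_object_unbind(obj, false) reports success without releasing anything. That matches the ret/bind_count combination seen in the warning. Passing retire_first = true brings bind_count down to 0 before the list walk.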