Created attachment 88338 [details] dmesg

System Environment:
--------------------------
Platform: Q965/Pineview/GM45/Ironlake/Sandybridge/Ivybridge/Haswell
Kernel: (drm-intel-nightly) d3c493af99141b2ca1f189bce56ba7e8c2257aa6

Bug detailed description:
-------------------------
This is a new igt case. It fails on all platforms with the -fixes, -nightly and -queued kernels.

output:
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: drmtest.c:925: igt_skip: Assertion `!test_child' failed.
Subtest gem-close-race: SUCCESS
child 0 died with signal 6, Aborted
gem_close_race: drmtest.c:994: igt_fail: Assertion `!test_with_subtests || in_fi
gem_close_race: drmtest.c:1161: children_exit_handler: Assertion `kill(test_chil
Aborted (core dumped)
[root@x-pk2 tests]# gem_close_race: drmtest.c:925: igt_skip: Assertion `!test_ch
gem_close_race: drmtest.c:925: igt_skip: Assertion `!test_child' failed.
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
(the last line repeated 7 times)

Reproduce steps:
-------------------------
1. ./gem_close_race
See http://lists.freedesktop.org/archives/dri-devel/2013-October/048016.html
Tested on the latest -nightly kernel. It works well on Sandybridge, Ivybridge and Haswell. On Ironlake it causes a call trace, and the test does not exit.

output:
Subtest process-exit: SUCCESS

Call Trace:
[  962.964724]  [<ffffffffa0077091>] ? i915_gem_retire_requests_ring+0x3a/0x8e [i915]
[  962.964777]  [<ffffffffa0077c96>] ? i915_gem_busy_ioctl+0x63/0xbb [i915]
[  962.964825]  [<ffffffffa0009f94>] ? drm_ioctl+0x26c/0x38d [drm]
[  962.964868]  [<ffffffffa0077c33>] ? i915_gem_set_seqno+0x4a/0x4a [i915]
[  962.964913]  [<ffffffff8104be83>] ? hrtimer_try_to_cancel+0x67/0x70
[  962.964954]  [<ffffffff810e84ae>] ? vfs_ioctl+0x1e/0x31
[  962.964988]  [<ffffffff810e8c84>] ? do_vfs_ioctl+0x3ad/0x3ef
[  962.965026]  [<ffffffff81074d75>] ? SyS_futex+0x12e/0x161
[  962.965060]  [<ffffffff810e8d14>] ? SyS_ioctl+0x4e/0x7e
[  962.965095]  [<ffffffff8170ebe2>] ? system_call_fastpath+0x16/0x1b
[  962.965133] Code: 8b 6b 40 48 89 7b 30 4c 89 43 38 48 85 ed 74 40 48 8d 7d 08 e8 44 3c 69 e1 48 8b 53 48 48 8b 43 50 48 b9 00 02 20 00 00 00 ad de <48> 89 42 08 48 89 10 48 ba 00 01 10 00 00 00 ad de 48 89 53 48
[  962.965413]  RIP  [<ffffffffa0075cff>] i915_gem_free_request+0x55/0x9e [i915]
[  962.965462]  RSP <ffff88009e2b5d48>

Also tested the comment 1 patch; it has the same result as the latest kernel.
Created attachment 88586 [details] dmesg on ilk
Apparently the test works ...
See bug 70874.
(In reply to comment #5)
> See bug 70874.

I.e. please test the patch attached there.
(In reply to comment #6)
> (In reply to comment #5)
> > See bug 70874.
>
> I.e. please test the patch attached there.

Tested the patch on Ironlake. After running for more than 1 hour, the test had not exited.

output:
Subtest process-exit: SUCCESS

dmesg:
[ 5201.186618] [drm:i915_gem_open],
[ 5201.195147] [drm:i915_gem_open],
[ 5201.219552] [drm:i915_gem_open],
[ 5201.227760] [drm:i915_gem_open],
[ 5201.236246] [drm:i915_gem_open],
[ 5201.258691] [drm:i915_gem_open],
[ 5201.266885] [drm:i915_gem_open],
[ 5201.266911] [drm:i915_gem_open],
[ 5201.290080] [drm:i915_gem_open],
[ 5201.290105] [drm:i915_gem_open],
[ 5201.297611] [drm:i915_gem_open],
[ 5201.305690] [drm:i915_gem_open],
Created attachment 88667 [details] dmesg with Serialise i915_gem_release() and i915_gem_free_request()
The test works, but is just slow; for some reason, it is abnormally slow on Ironlake.
Should be "fixed" now ...
(In reply to comment #10)
> Should be "fixed" now ...

Tested on the latest -nightly kernel. The failure goes away. The test takes about 25 minutes; is that expected? It takes about 2 minutes on Sandybridge.
Reopening. Tested on Pineview: the test takes about 5 minutes there, but about 25 minutes on Ironlake.
Is the ilk w/a active?

E.g. does

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 998f9a0b322a..2adfef0fa6ac 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1727,6 +1727,9 @@ static int i915_gmch_probe(struct drm_device *dev,
 	dev_priv->gtt.do_idle_maps = needs_idle_maps(dev_priv->dev);
 	dev_priv->gtt.base.clear_range = i915_ggtt_clear_range;
 
+	if (unlikely(dev_priv->gtt.do_idle_maps))
+		DRM_INFO("applying Ironlake quirks for intel_iommu\n");
+
 	return 0;
 }

report anything?
(In reply to comment #13)
> Is the ilk w/a active?
>
> [diff snipped]
>
> report anything?

output:
Subtest process-exit: SUCCESS
Subtest gem-close-race: SUCCESS

dmesg:
[  173.963588] systemd-journald[2374]: Got invalid event from epoll.
[  174.229685] systemd[1]: Unit systemd-journald.service entered failed state.
[  174.313572] systemd[1]: systemd-journald.service holdoff time over, scheduling restart.
[  174.409767] systemd[1]: Stopping Journal Service...
[  174.468447] systemd[1]: Starting Journal Service...
[  174.528119] systemd[1]: Started Journal Service.
[  174.583752] systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
(the same systemd-journald restart cycle repeats from [  280.871090] through [  409.721899], including one "systemd-journald.service: main process exited, code=exited, status=1/FAILURE")
It would be reported during boot. Can you please attach a dmesg from the ilk machine upon booting with the current drm-intel-nightly?
Created attachment 91730 [details] dmesg after clean boot
Created attachment 91731 [details] dmesg after run case
Finally booted an Ironlake (i3-330m). Woohoo!

The time here is spent in mutex contention (specifically, throwing away cycles in mutex_spin_on_owner whilst we are clflushing). I think this is just a pathologically bad condition and not one to worry about (just yet).
For reference:

ickle@u160 ~/intel-gpu-tools/tests $ time sudo ./gem_close_race
IGT-Version: 1.5-ga9d1fe0 (x86_64) (Linux: 3.13.0-rc8+ x86_64)
Subtest process-exit: SUCCESS
Subtest gem-close-race: SUCCESS

real	211m48.483s
user	0m5.860s
sys	432m50.890s
One idea is to recognise that we have some extremely long mutex hold times, and to disable spinning for those, i.e.:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 338f80c..ed45276 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3769,7 +3769,10 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 		return false;
 
 	trace_i915_gem_object_clflush(obj);
+
+	mutex_spin_disable(&obj->base.dev->struct_mutex);
 	drm_clflush_sg(obj->pages);
+	mutex_spin_enable(&obj->base.dev->struct_mutex);
 
 	return true;
 }

That's obviously not going to fly upstream. A better suggestion would perhaps be to autotune the spinning by timing out. So something like:

diff --git a/kernel/locking/mcs_spinlock.c b/kernel/locking/mcs_spinlock.c
index 838dc9e..5e0f27e 100644
--- a/kernel/locking/mcs_spinlock.c
+++ b/kernel/locking/mcs_spinlock.c
@@ -63,6 +63,7 @@ bool osq_lock(struct optimistic_spin_queue **lock)
 {
 	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
 	struct optimistic_spin_queue *prev, *next;
+	unsigned long timeout;
 
 	node->locked = 0;
 	node->next = NULL;
@@ -82,7 +83,11 @@ bool osq_lock(struct optimistic_spin_queue **lock)
 	 * cmpxchg in an attempt to undo our queueing.
 	 */
 
+	timeout = jiffies + 1;
 	while (!smp_load_acquire(&node->locked)) {
+		if (time_after(jiffies, timeout))
+			goto unqueue;
+
 		/*
 		 * If we need to reschedule bail... so we can block.
 		 */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index bc73d33..ac8c435 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -147,8 +147,17 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
 static noinline
 int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 {
+	unsigned long timeout;
+
+	if (need_resched())
+		return false;
+
 	rcu_read_lock();
+	timeout = jiffies + 1;
 	while (owner_running(lock, owner)) {
+		if (time_after(jiffies, timeout))
+			break;
+
 		if (need_resched())
commit 98165d60bec04d258eeb9623550e58a8bc2f3906
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 11 23:36:15 2014 +0100

    igt/gem_close_race: Trim object size

    The challenge here is to race GPU activity versus the gem_close. Using
    large objects makes the window of GPU activity larger - except on !llc
    systems we then incur massive overhead from clflush, likely destroying
    any race (due to mutex contention). And that mutex contention is
    amplified by the number of cores - bad news for Baytrail.

    Give up and make the objects smaller and hope that the test is run
    frequently enough to catch the race.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=71029
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Fixed.

IGT-Version: 1.7-ge74dd22 (x86_64) (Linux: 3.16.0-rc4_drm-intel-nightly_271bff_20140713+ x86_64)
Subtest process-exit: SUCCESS
Subtest gem-close-race: SUCCESS

real	0m9.192s
user	0m0.336s
sys	0m14.457s
Closing old verified.