Summary | [ilk] igt/gem_close_race is slow due to mutex contention vs clflush
---|---
Product | DRI
Component | DRM/Intel
Status | CLOSED FIXED
Severity | minor
Priority | low
Version | unspecified
Hardware | All
OS | Linux (All)
Reporter | lu hua <huax.lu>
Assignee | Damien Lespiau <damien.lespiau>
QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs>
CC | intel-gfx-bugs, yi.sun
Tested on the latest -nightly kernel. It works well on Sandybridge, Ivybridge and Haswell, but on Ironlake it produces a call trace and the test does not exit.

Output:

```
Subtest process-exit: SUCCESS
```

Call trace:

```
[ 962.964724] [<ffffffffa0077091>] ? i915_gem_retire_requests_ring+0x3a/0x8e [i915]
[ 962.964777] [<ffffffffa0077c96>] ? i915_gem_busy_ioctl+0x63/0xbb [i915]
[ 962.964825] [<ffffffffa0009f94>] ? drm_ioctl+0x26c/0x38d [drm]
[ 962.964868] [<ffffffffa0077c33>] ? i915_gem_set_seqno+0x4a/0x4a [i915]
[ 962.964913] [<ffffffff8104be83>] ? hrtimer_try_to_cancel+0x67/0x70
[ 962.964954] [<ffffffff810e84ae>] ? vfs_ioctl+0x1e/0x31
[ 962.964988] [<ffffffff810e8c84>] ? do_vfs_ioctl+0x3ad/0x3ef
[ 962.965026] [<ffffffff81074d75>] ? SyS_futex+0x12e/0x161
[ 962.965060] [<ffffffff810e8d14>] ? SyS_ioctl+0x4e/0x7e
[ 962.965095] [<ffffffff8170ebe2>] ? system_call_fastpath+0x16/0x1b
[ 962.965133] Code: 8b 6b 40 48 89 7b 30 4c 89 43 38 48 85 ed 74 40 48 8d 7d 08 e8 44 3c 69 e1 48 8b 53 48 48 8b 43 50 48 b9 00 02 20 00 00 00 ad de <48> 89 42 08 48 89 10 48 ba 00 01 10 00 00 00 ad de 48 89 53 48
[ 962.965413] RIP [<ffffffffa0075cff>] i915_gem_free_request+0x55/0x9e [i915]
[ 962.965462] RSP <ffff88009e2b5d48>
```

I also tested the patch from comment 1; it gives the same result as the latest kernel.

Created attachment 88586 [details]
dmesg on ilk
Apparently the test works ...

(In reply to comment #5)
> See bug 70874.

I.e. please test the patch attached there.

(In reply to comment #6)
> (In reply to comment #5)
> > See bug 70874.
>
> I.e. please test the patch attached there.

Tested the patch on Ironlake. After running for more than an hour, the test still does not exit.

Output:

```
Subtest process-exit: SUCCESS
```

dmesg:

```
[ 5201.186618] [drm:i915_gem_open],
[ 5201.195147] [drm:i915_gem_open],
[ 5201.219552] [drm:i915_gem_open],
[ 5201.227760] [drm:i915_gem_open],
[ 5201.236246] [drm:i915_gem_open],
[ 5201.258691] [drm:i915_gem_open],
[ 5201.266885] [drm:i915_gem_open],
[ 5201.266911] [drm:i915_gem_open],
[ 5201.290080] [drm:i915_gem_open],
[ 5201.290105] [drm:i915_gem_open],
[ 5201.297611] [drm:i915_gem_open],
[ 5201.305690] [drm:i915_gem_open],
```

Created attachment 88667 [details]
dmesg with "Serialise i915_gem_release() and i915_gem_free_request()"
The test works, but it is just slow, and for some reason abnormally slow on Ironlake.

Should be "fixed" now ...

(In reply to comment #10)
> Should be "fixed" now ...

Tested on the latest -nightly kernel. The failure goes away, but the test takes about 25 minutes. Is that expected? It takes about 2 minutes on Sandybridge. Reopening.

Tested on Pineview: it takes about 5 minutes, versus about 25 minutes on Ironlake.

Is the ilk w/a active? E.g. does

```diff
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 998f9a0b322a..2adfef0fa6ac 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1727,6 +1727,9 @@ static int i915_gmch_probe(struct drm_device *dev,
 	dev_priv->gtt.do_idle_maps = needs_idle_maps(dev_priv->dev);
 	dev_priv->gtt.base.clear_range = i915_ggtt_clear_range;
 
+	if (unlikely(dev_priv->gtt.do_idle_maps))
+		DRM_INFO("applying Ironlake quirks for intel_iommu\n");
+
 	return 0;
 }
```

report anything?

(In reply to comment #13)
> Is the ilk w/a active?
> E.g. does [the patch above] report anything?

Output:

```
Subtest process-exit: SUCCESS
Subtest gem-close-race: SUCCESS
```

dmesg:

```
[ 173.963588] systemd-journald[2374]: Got invalid event from epoll.
[ 174.229685] systemd[1]: Unit systemd-journald.service entered failed state.
[ 174.313572] systemd[1]: systemd-journald.service holdoff time over, scheduling restart.
[ 174.409767] systemd[1]: Stopping Journal Service...
[ 174.468447] systemd[1]: Starting Journal Service...
[ 174.528119] systemd[1]: Started Journal Service.
[ 174.583752] systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
... (the same journald restart cycle repeats throughout the run) ...
[ 408.799599] systemd[1]: systemd-journald.service: main process exited, code=exited, status=1/FAILURE
... 
[ 409.666021] systemd[1]: Started Journal Service.
[ 409.721899] systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
```

It would be reported during boot. Can you please attach a dmesg from the ilk machine upon booting with the current drm-intel-nightly?

Created attachment 91730 [details]
dmesg after clean boot
Created attachment 91731 [details]
dmesg after run case
Finally booted an Ironlake (i3-330m). Woohoo!

The time here is spent in mutex contention (specifically, throwing away cycles in mutex_spin_on_owner whilst we are clflushing). I think this is just a pathologically bad condition and not one to worry about (just yet). For reference:

```
ickle@u160 ~/intel-gpu-tools/tests $ time sudo ./gem_close_race
IGT-Version: 1.5-ga9d1fe0 (x86_64) (Linux: 3.13.0-rc8+ x86_64)
Subtest process-exit: SUCCESS
Subtest gem-close-race: SUCCESS

real    211m48.483s
user    0m5.860s
sys     432m50.890s
```

One idea is to recognise that we have some extremely long mutex hold times, and to disable spinning for those, i.e.:

```diff
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 338f80c..ed45276 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3769,7 +3769,10 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 		return false;
 
 	trace_i915_gem_object_clflush(obj);
+
+	mutex_spin_disable(&obj->base.dev->struct_mutex);
 	drm_clflush_sg(obj->pages);
+	mutex_spin_enable(&obj->base.dev->struct_mutex);
 
 	return true;
 }
```

That's obviously not going to fly upstream. A better suggestion would perhaps be to autotune the spinning by timing out. So something like:

```diff
diff --git a/kernel/locking/mcs_spinlock.c b/kernel/locking/mcs_spinlock.c
index 838dc9e..5e0f27e 100644
--- a/kernel/locking/mcs_spinlock.c
+++ b/kernel/locking/mcs_spinlock.c
@@ -63,6 +63,7 @@ bool osq_lock(struct optimistic_spin_queue **lock)
 {
 	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
 	struct optimistic_spin_queue *prev, *next;
+	unsigned long timeout;
 
 	node->locked = 0;
 	node->next = NULL;
@@ -82,7 +83,11 @@ bool osq_lock(struct optimistic_spin_queue **lock)
 	 * cmpxchg in an attempt to undo our queueing.
 	 */
 
+	timeout = jiffies + 1;
 	while (!smp_load_acquire(&node->locked)) {
+		if (time_after(jiffies, timeout))
+			goto unqueue;
+
 		/*
 		 * If we need to reschedule bail... so we can block.
 		 */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index bc73d33..ac8c435 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -147,8 +147,17 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
 static noinline
 int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 {
+	unsigned long timeout;
+
+	if (need_resched())
+		return false;
+
 	rcu_read_lock();
+	timeout = jiffies + 1;
 	while (owner_running(lock, owner)) {
+		if (time_after(jiffies, timeout))
+			break;
+
 		if (need_resched())
```

commit 98165d60bec04d258eeb9623550e58a8bc2f3906
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 11 23:36:15 2014 +0100

    igt/gem_close_race: Trim object size

    The challenge here is to race GPU activity versus the gem_close. Using
    large objects makes the window of GPU activity larger - except on !llc
    systems we then incur massive overhead from clflush, likely destroying
    any race (due to mutex contention). And that mutex contention is
    amplified by the number of cores - bad news for Baytrail.

    Give up and make the objects smaller and hope that the test is run
    frequently enough to catch the race.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=71029
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Fixed.

```
IGT-Version: 1.7-ge74dd22 (x86_64) (Linux: 3.16.0-rc4_drm-intel-nightly_271bff_20140713+ x86_64)
Subtest process-exit: SUCCESS
Subtest gem-close-race: SUCCESS

real    0m9.192s
user    0m0.336s
sys     0m14.457s
```

Closing old verified.
Created attachment 88338 [details]
dmesg

System Environment:
--------------------------
Platform: Q965/Pineview/GM45/Ironlake/Sandybridge/Ivybridge/Haswell
Kernel: (drm-intel-nightly) d3c493af99141b2ca1f189bce56ba7e8c2257aa6

Bug detailed description:
-------------------------
It fails on all platforms with the -fixes, -nightly and -queued kernels. It's a new igt case.

Output:

```
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: drmtest.c:925: igt_skip: Assertion `!test_child' failed.
Subtest gem-close-race: SUCCESS
child 0 died with signal 6, Aborted
gem_close_race: drmtest.c:994: igt_fail: Assertion `!test_with_subtests || in_fi
gem_close_race: drmtest.c:1161: children_exit_handler: Assertion `kill(test_chil
Aborted (core dumped)
[root@x-pk2 tests]# gem_close_race: drmtest.c:925: igt_skip: Assertion `!test_ch
gem_close_race: drmtest.c:925: igt_skip: Assertion `!test_child' failed.
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
gem_close_race: intel_drm.c:69: intel_get_drm_devid: Assertion `ret == 0' failed
```

Reproduce steps:
-------------------------
1. ./gem_close_race