Description:
------
Running gem_reset_stats@reset-stats-ctx-default on SKL causes a deadlock.

What I think is happening is that the test uses both gem_context_destroy() and drop_caches_set(), which contend on dev->struct_mutex. If the context-destroy free worker gets stuck, it occupies i915->wq, so nothing can progress: the retire worker cannot be scheduled, and drop_caches_set() keeps waiting for the GPU to go idle while holding the mutex the free worker needs.

Steps:
------
1. Execute gem_reset_stats@reset-stats-ctx-default

Actual results:
------
Driver gets deadlocked; the test never completes.

Expected results:
------
Test passes.

Dmesg output:
------
[ 7484.031148] [IGT] gem_reset_stats: starting subtest reset-stats-ctx-default
[ 7613.403760] INFO: task kworker/u8:3:1714 blocked for more than 120 seconds.
[ 7613.403815] Tainted: G U 4.15.0-rc9+ #44
[ 7613.403844] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7613.403884] kworker/u8:3 D 0 1714 2 0x80000000
[ 7613.403999] Workqueue: i915 __i915_gem_free_work [i915]
[ 7613.404007] Call Trace:
[ 7613.404026]  ? __schedule+0x345/0xc50
[ 7613.404044]  schedule+0x39/0x90
[ 7613.404051]  schedule_preempt_disabled+0x11/0x20
[ 7613.404057]  __mutex_lock+0x3b7/0x8d0
[ 7613.404063]  ? __mutex_lock+0x122/0x8d0
[ 7613.404072]  ? trace_buffer_unlock_commit_regs+0x37/0x90
[ 7613.404151]  ? __i915_gem_free_objects+0x89/0x540 [i915]
[ 7613.404243]  __i915_gem_free_objects+0x89/0x540 [i915]
[ 7613.404319]  __i915_gem_free_work+0x51/0x90 [i915]
[ 7613.404335]  process_one_work+0x1b4/0x5d0
[ 7613.404342]  ? process_one_work+0x130/0x5d0
[ 7613.404361]  worker_thread+0x4a/0x3e0
[ 7613.404378]  kthread+0x100/0x140
[ 7613.404385]  ? process_one_work+0x5d0/0x5d0
[ 7613.404390]  ? kthread_delayed_work_timer_fn+0x80/0x80
[ 7613.404402]  ? do_group_exit+0x46/0xc0
[ 7613.404409]  ret_from_fork+0x3a/0x50
[ 7613.404437] Showing all locks held in the system:
[ 7613.404447] 1 lock held by khungtaskd/39:
[ 7613.404458]  #0:  (tasklist_lock){.+.+}, at: [<0000000088c6a651>] debug_show_all_locks+0x39/0x1b0
[ 7613.404489] 1 lock held by in:imklog/809:
[ 7613.404492]  #0:  (&f->f_pos_lock){+.+.}, at: [<00000000cf80f1c9>] __fdget_pos+0x3f/0x50
[ 7613.404519] 1 lock held by dmesg/1652:
[ 7613.404523]  #0:  (&user->lock){+.+.}, at: [<00000000dd4aba83>] devkmsg_read+0x3a/0x2f0
[ 7613.404543] 3 locks held by gem_reset_stats/1713:
[ 7613.404547]  #0:  (sb_writers#10){.+.+}, at: [<00000000aadbc565>] vfs_write+0x18a/0x1c0
[ 7613.404571]  #1:  (&attr->mutex){+.+.}, at: [<000000000e818033>] simple_attr_write+0x35/0xc0
[ 7613.404590]  #2:  (&dev->struct_mutex){+.+.}, at: [<0000000000b72f77>] i915_drop_caches_set+0x4e/0x1a0 [i915]
[ 7613.404669] 3 locks held by kworker/u8:3/1714:
[ 7613.404672]  #0:  ((wq_completion)"i915"){+.+.}, at: [<00000000d83ffa4e>] process_one_work+0x130/0x5d0
[ 7613.404693]  #1:  ((work_completion)(&i915->mm.free_work)){+.+.}, at: [<00000000d83ffa4e>] process_one_work+0x130/0x5d0
[ 7613.404713]  #2:  (&dev->struct_mutex){+.+.}, at: [<000000007b02c7ef>] __i915_gem_free_objects+0x89/0x540 [i915]
[ 7613.404795] =============================================
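For reference, a minimal userspace sketch of the deadlock shape described above, assuming a single-threaded work queue as a stand-in for i915->wq. The names (struct_mutex, free_work, retire_work, queue_work) mirror the report but are purely illustrative, not real kernel API. Build with: gcc -pthread -o deadlock deadlock.c

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t struct_mutex = PTHREAD_MUTEX_INITIALIZER;
static sem_t work_pending;              /* counts queued work items */
static void (*queued_fn)(void);         /* single-slot "workqueue" slot */
static volatile int retired;            /* set once retire_work has run */

static void free_work(void)
{
    /* Plays __i915_gem_free_objects(): must take struct_mutex to free
     * objects, but "drop_caches" already holds it, so we block here. */
    pthread_mutex_lock(&struct_mutex);
    pthread_mutex_unlock(&struct_mutex);
}

static void retire_work(void)
{
    retired = 1;                        /* would retire requests -> idle */
}

static void queue_work(void (*fn)(void))
{
    queued_fn = fn;                     /* sketch only: no real queueing */
    sem_post(&work_pending);
}

static void *worker(void *arg)
{
    /* The one worker thread is the one busy slot on the workqueue. */
    for (;;) {
        sem_wait(&work_pending);
        queued_fn();                    /* never returns from free_work() */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;

    sem_init(&work_pending, 0, 0);
    pthread_create(&t, NULL, worker, NULL);

    /* i915_drop_caches_set() takes struct_mutex... */
    pthread_mutex_lock(&struct_mutex);

    /* ...while the context destroy's free worker is queued and now
     * blocks on struct_mutex, occupying the only worker. */
    queue_work(free_work);
    sleep(1);

    /* "Wait for idle" needs retire_work, but the queue cannot run it. */
    queue_work(retire_work);

    alarm(3);                           /* SIGALRM ends the demo */
    while (!retired)
        usleep(1000);                   /* never becomes true: deadlock */

    pthread_mutex_unlock(&struct_mutex);
    printf("unreachable in this sketch\n");
    return 0;
}

Run it and the process sits until the alarm kills it, mirroring the hung-task report: the main thread holds struct_mutex while waiting for idle, and the only worker is blocked trying to take struct_mutex.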
Hint: cat /sys/kernel/debug/dri/0/i915_hangcheck_info or see https://patchwork.freedesktop.org/series/37281/
I was just about to try your patch :)
After applying https://patchwork.freedesktop.org/series/37281/, the issue can no longer be reproduced.
commit 889230489b6b138ba97ba2f13fc9644a3d16d0d2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 29 14:41:04 2018 +0000

    drm/i915: Always run hangcheck while the GPU is busy

    Previously, we relied on only running the hangcheck while somebody was
    waiting on the GPU, in order to minimise the amount of time hangcheck
    had to run. (If nobody was watching the GPU, nobody would notice if
    the GPU wasn't responding -- eventually somebody would care and so
    kick hangcheck into action.) However, this falls apart from around
    commit 4680816be336 ("drm/i915: Wait first for submission, before
    waiting for request completion"), as not all waiters declare
    themselves to hangcheck and so we could switch off hangcheck and miss
    GPU hangs even when waiting under the struct_mutex.

    If we enable hangcheck from the first request submission, and let it
    run until the GPU is idle again, we forgo all the complexity involved
    with only enabling around waiters. We just have to remember to be
    careful that we do not declare a GPU hang when idly waiting for the
    next request to become ready, as we will run hangcheck continuously
    even when the engines are stalled waiting for external events. This
    should be true already as we should only be tracking requests
    submitted to hardware for execution as an indicator that the engine
    is busy.

    Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104840
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180129144104.3921-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
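To illustrate the policy this commit adopts (not the actual i915 code; all names here are made up for the sketch): hangcheck is armed on the idle-to-busy transition at request submission, keeps sampling progress while any request is in flight, and simply stops once the engine is idle, so waiting for the next request is never mistaken for a hang. Build with: gcc -pthread -o hangcheck hangcheck.c

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_uint submitted, completed;  /* per-engine request counters */
static pthread_t checker;

static void *hangcheck(void *arg)
{
    unsigned int last_seen = atomic_load(&completed);

    /* Keep checking for as long as any request is in flight... */
    while (atomic_load(&completed) != atomic_load(&submitted)) {
        sleep(1);                       /* stand-in for the periodic timer */
        unsigned int now = atomic_load(&completed);
        if (now == last_seen)
            printf("hangcheck: engine made no progress -> reset\n");
        last_seen = now;
    }

    /* ...and stop once idle: an idle engine makes no "progress", but
     * that must not be declared a hang. */
    printf("hangcheck: engine idle, stopping\n");
    return NULL;
}

static void submit_request(void)
{
    /* Arm hangcheck on the transition from idle to busy, i.e. from the
     * first request submission, as the commit message describes. */
    if (atomic_fetch_add(&submitted, 1) == atomic_load(&completed))
        pthread_create(&checker, NULL, hangcheck, NULL);
}

int main(void)
{
    submit_request();
    sleep(2);                           /* no progress: hangcheck fires */
    atomic_fetch_add(&completed, 1);    /* request retires, engine idle */
    pthread_join(checker, NULL);
    return 0;
}

The point of the design, as the commit says, is that the checker's lifetime is tied to "requests submitted to hardware" rather than to waiters, so hangs are caught even when no waiter has registered with hangcheck.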