Bug 104840

Summary: missed hangcheck wakeup
Product: DRI
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: CLOSED FIXED
Severity: major
Priority: medium
Reporter: Antonio Argenziano <antonio.argenziano>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs
Whiteboard:
i915 platform: SKL
i915 features: GEM/execlists

Description Antonio Argenziano 2018-01-29 17:49:27 UTC
Description:
------
Running gem_reset_stats@reset-stats-ctx-default on SKL causes a deadlock. What I think is happening is that the test uses both gem_context_destroy() and drop_caches_set(), which contend for struct_mutex: if the deferred context destroy gets stuck, it occupies i915->wq, so retire cannot be scheduled and nothing can make progress, while drop_caches_set() keeps waiting for idle.
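
A minimal sketch of the two contending paths described above (not the IGT subtest itself, just an illustration): one thread churns context create/destroy so deferred frees land on i915->wq, while the main thread writes the drop-caches debugfs file, which takes struct_mutex and waits for the GPU to idle. The device node, the debugfs path, the 0xffffffff mask and the loop counts are assumptions made for the sketch.

/*
 * Sketch only, not the IGT subtest. Thread A churns contexts so that
 * deferred frees land on i915->wq; the main thread writes the
 * drop-caches debugfs file, which takes struct_mutex and waits for idle.
 * Device node, debugfs path and drop-caches mask are assumptions.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <drm/i915_drm.h>       /* uapi header; include path may differ with libdrm */

static void *churn_contexts(void *arg)
{
        int fd = *(int *)arg;

        for (int i = 0; i < 1000; i++) {
                struct drm_i915_gem_context_create create = { 0 };
                struct drm_i915_gem_context_destroy destroy = { 0 };

                if (ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create))
                        break;
                destroy.ctx_id = create.ctx_id;
                ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
        }

        return NULL;
}

int main(void)
{
        int fd = open("/dev/dri/card0", O_RDWR);                 /* assumed node */
        int debugfs = open("/sys/kernel/debug/dri/0/i915_gem_drop_caches",
                           O_WRONLY);                            /* assumed path */
        pthread_t thread;

        if (fd < 0 || debugfs < 0) {
                perror("open");
                return 1;
        }

        pthread_create(&thread, NULL, churn_contexts, &fd);

        /* Repeatedly ask the driver to drop everything and wait for idle. */
        for (int i = 0; i < 100; i++)
                write(debugfs, "0xffffffff", strlen("0xffffffff"));

        pthread_join(thread, NULL);
        close(debugfs);
        close(fd);

        return 0;
}

Built with something like "gcc repro.c -o repro -lpthread" and run as root (for debugfs access), this exercises the same lock ordering; actually tripping the missed-hangcheck wakeup still needs a hanging batch as in the real subtest, which this sketch does not submit.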

Steps:
------
1. Execute gem_reset_stats@reset-stats-ctx-default

Actual results:
------
Driver gets deadlocked, test never completes.

Expected results:
------
Test passes.

Dmesg output:
------
[ 7484.031148] [IGT] gem_reset_stats: starting subtest reset-stats-ctx-default

[ 7613.403760] INFO: task kworker/u8:3:1714 blocked for more than 120 seconds.
[ 7613.403815]       Tainted: G     U           4.15.0-rc9+ #44
[ 7613.403844] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7613.403884] kworker/u8:3    D    0  1714      2 0x80000000
[ 7613.403999] Workqueue: i915 __i915_gem_free_work [i915]
[ 7613.404007] Call Trace:
[ 7613.404026]  ? __schedule+0x345/0xc50
[ 7613.404044]  schedule+0x39/0x90
[ 7613.404051]  schedule_preempt_disabled+0x11/0x20
[ 7613.404057]  __mutex_lock+0x3b7/0x8d0
[ 7613.404063]  ? __mutex_lock+0x122/0x8d0
[ 7613.404072]  ? trace_buffer_unlock_commit_regs+0x37/0x90
[ 7613.404151]  ? __i915_gem_free_objects+0x89/0x540 [i915]
[ 7613.404243]  __i915_gem_free_objects+0x89/0x540 [i915]
[ 7613.404319]  __i915_gem_free_work+0x51/0x90 [i915]
[ 7613.404335]  process_one_work+0x1b4/0x5d0
[ 7613.404342]  ? process_one_work+0x130/0x5d0
[ 7613.404361]  worker_thread+0x4a/0x3e0
[ 7613.404378]  kthread+0x100/0x140
[ 7613.404385]  ? process_one_work+0x5d0/0x5d0
[ 7613.404390]  ? kthread_delayed_work_timer_fn+0x80/0x80
[ 7613.404402]  ? do_group_exit+0x46/0xc0
[ 7613.404409]  ret_from_fork+0x3a/0x50
[ 7613.404437] 
               Showing all locks held in the system:
[ 7613.404447] 1 lock held by khungtaskd/39:
[ 7613.404458]  #0:  (tasklist_lock){.+.+}, at: [<0000000088c6a651>] debug_show_all_locks+0x39/0x1b0
[ 7613.404489] 1 lock held by in:imklog/809:
[ 7613.404492]  #0:  (&f->f_pos_lock){+.+.}, at: [<00000000cf80f1c9>] __fdget_pos+0x3f/0x50
[ 7613.404519] 1 lock held by dmesg/1652:
[ 7613.404523]  #0:  (&user->lock){+.+.}, at: [<00000000dd4aba83>] devkmsg_read+0x3a/0x2f0
[ 7613.404543] 3 locks held by gem_reset_stats/1713:
[ 7613.404547]  #0:  (sb_writers#10){.+.+}, at: [<00000000aadbc565>] vfs_write+0x18a/0x1c0
[ 7613.404571]  #1:  (&attr->mutex){+.+.}, at: [<000000000e818033>] simple_attr_write+0x35/0xc0
[ 7613.404590]  #2:  (&dev->struct_mutex){+.+.}, at: [<0000000000b72f77>] i915_drop_caches_set+0x4e/0x1a0 [i915]
[ 7613.404669] 3 locks held by kworker/u8:3/1714:
[ 7613.404672]  #0:  ((wq_completion)"i915"){+.+.}, at: [<00000000d83ffa4e>] process_one_work+0x130/0x5d0
[ 7613.404693]  #1:  ((work_completion)(&i915->mm.free_work)){+.+.}, at: [<00000000d83ffa4e>] process_one_work+0x130/0x5d0
[ 7613.404713]  #2:  (&dev->struct_mutex){+.+.}, at: [<000000007b02c7ef>] __i915_gem_free_objects+0x89/0x540 [i915]

[ 7613.404795] =============================================
Comment 1 Chris Wilson 2018-01-29 20:36:37 UTC
Hint: cat /sys/kernel/debug/dri/0/i915_hangcheck_info or see https://patchwork.freedesktop.org/series/37281/
Comment 2 Antonio Argenziano 2018-01-29 20:57:03 UTC
I was just about to try your patch :)
Comment 3 Antonio Argenziano 2018-01-29 22:53:43 UTC
After applying https://patchwork.freedesktop.org/series/37281/, the issue can no longer be observed.
Comment 4 Chris Wilson 2018-01-31 10:13:32 UTC
commit 889230489b6b138ba97ba2f13fc9644a3d16d0d2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 29 14:41:04 2018 +0000

    drm/i915: Always run hangcheck while the GPU is busy
    
    Previously, we relied on only running the hangcheck while somebody was
    waiting on the GPU, in order to minimise the amount of time hangcheck
    had to run. (If nobody was watching the GPU, nobody would notice if the
    GPU wasn't responding -- eventually somebody would care and so kick
    hangcheck into action.) However, this falls apart from around commit
    4680816be336 ("drm/i915: Wait first for submission, before waiting for
    request completion"), as not all waiters declare themselves to hangcheck
    and so we could switch off hangcheck and miss GPU hangs even when
    waiting under the struct_mutex.
    
    If we enable hangcheck from the first request submission, and let it run
    until the GPU is idle again, we forgo all the complexity involved with
    only enabling around waiters. We just have to remember to be careful that
    we do not declare a GPU hang when idly waiting for the next request to
    become ready, as we will run hangcheck continuously even when the
    engines are stalled waiting for external events. This should be true
    already as we should only be tracking requests submitted to hardware for
    execution as an indicator that the engine is busy.
    
    Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104840
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180129144104.3921-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
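
For reference, the approach the commit describes (arm hangcheck when the first request is submitted, keep re-arming while requests are in flight, and let it lapse once the GPU is idle) follows the pattern sketched below. This is an illustrative sketch, not the actual i915 code: struct gpu_watchdog, active_requests, check_for_progress() and the 1500 ms interval are placeholders standing in for the real driver state.

/*
 * Illustrative sketch of the "run hangcheck while busy" pattern, not the
 * actual i915 code. struct gpu_watchdog and its fields are placeholders
 * for the real driver state.
 */
#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

#define WATCHDOG_INTERVAL_MS	1500	/* assumed interval */

struct gpu_watchdog {
	struct delayed_work work;
	atomic_t active_requests;	/* submitted but not yet retired */
};

/* Placeholder for per-engine hang detection. */
static void check_for_progress(struct gpu_watchdog *wd);

static void watchdog_arm(struct gpu_watchdog *wd)
{
	queue_delayed_work(system_long_wq, &wd->work,
			   msecs_to_jiffies(WATCHDOG_INTERVAL_MS));
}

/* Request submission: the first request arms the watchdog. */
static void watchdog_request_submitted(struct gpu_watchdog *wd)
{
	if (atomic_inc_return(&wd->active_requests) == 1)
		watchdog_arm(wd);
}

/* Request retirement: once this reaches zero the watchdog simply lapses. */
static void watchdog_request_retired(struct gpu_watchdog *wd)
{
	atomic_dec(&wd->active_requests);
}

static void watchdog_fn(struct work_struct *work)
{
	struct gpu_watchdog *wd =
		container_of(work, struct gpu_watchdog, work.work);

	check_for_progress(wd);

	/* Re-arm while the GPU is busy; no per-waiter bookkeeping needed. */
	if (atomic_read(&wd->active_requests))
		watchdog_arm(wd);
}

Tying the watchdog's lifetime to GPU activity rather than to the presence of a registered waiter is what closes the gap: a waiter that never declares itself to hangcheck can no longer leave the GPU stuck with no watchdog running.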
