108456 – userptr deadlock in Vulkan CTS: mmu_notifier vs workers vs struct_mutex

Bug 108456 - userptr deadlock in Vulkan CTS: mmu_notifier vs workers vs struct_mutex

Summary: userptr deadlock in Vulkan CTS: mmu_notifier vs workers vs struct_mutex

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-10-16 14:16 UTC by Mark Janes
Modified:	2019-11-29 17:56 UTC (History)
CC List:	8 users (show)

See Also:
i915 platform:	SKL
i915 features:	GEM/Other

Attachments
*/proc//stack report** (30.79 KB, text/plain) 2018-10-16 23:54 UTC, Mark Janes	no flags	Details
dmesg from hung sklgt2, linux 4.15 (84.34 KB, text/plain) 2018-10-16 23:55 UTC, Mark Janes	no flags	Details
backtrace of spinning kernel worker (122.39 KB, text/plain) 2018-10-17 19:16 UTC, Mark Janes	no flags	Details
card error for vulkancts gpu hang on 5255 CI-tested kernel (256.74 KB, text/plain) 2018-12-04 23:33 UTC, Mark Janes	no flags	Details
View All

Description Mark Janes 2018-10-16 14:16:22 UTC

VulkanCTS 1.1.2 causes machines to intermittently hang.  In dmesg, the following output was found:

[78782.499488] deqp-vk         D    0 19687  19392 0x00000000
[78782.499488] Call Trace:
[78782.499489]  ? __schedule+0x291/0x880
[78782.499490]  schedule+0x28/0x80
[78782.499491]  schedule_preempt_disabled+0xa/0x10
[78782.499492]  __mutex_lock.isra.1+0x1a0/0x4e0
[78782.499496]  ? drm_gem_handle_create+0x40/0x40 [drm]
[78782.499505]  ? i915_gem_close_object+0x3a/0x160 [i915]
[78782.499514]  i915_gem_close_object+0x3a/0x160 [i915]
[78782.499519]  ? drm_gem_handle_create+0x40/0x40 [drm]
[78782.499523]  drm_gem_object_release_handle+0x2c/0x90 [drm]
[78782.499527]  drm_gem_handle_delete+0x57/0x80 [drm]
[78782.499531]  drm_ioctl_kernel+0x59/0xb0 [drm]
[78782.499535]  drm_ioctl+0x2cb/0x380 [drm]
[78782.499539]  ? drm_gem_handle_create+0x40/0x40 [drm]
[78782.499541]  do_vfs_ioctl+0xa1/0x610
[78782.499542]  ? do_munmap+0x32e/0x440
[78782.499543]  SyS_ioctl+0x74/0x80
[78782.499544]  entry_SYSCALL_64_fastpath+0x24/0x87
[78782.499545] RIP: 0033:0x7fd6e249af07
[78782.499545] RSP: 002b:00007ffe035cb448 EFLAGS: 00000246
[78782.499546] INFO: task kworker/7:2:19952 blocked for more than 120 seconds.
[78782.499564]       Tainted: G     U           4.15.0 #4
[78782.499576] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[78782.499595] kworker/7:2     D    0 19952      2 0x80000000
[78782.499600] Workqueue: events drm_fb_helper_dirty_work [drm_kms_helper]

This error is easier to reproduce on gen8 systems.  Often, the hung machines reported  VK_ERROR_OUT_OF_DEVICE_MEMORY for tests like:  

  dEQP-VK.pipeline.render_to_image.core.cube_array.huge.width_height_layers.r8g8b8a8_unorm_d16_unorm

Machines may complete tests, then hang on subsequent CI runs.

Comment 1 Chris Wilson 2018-10-16 14:37:46 UTC

The oom may be symptomatic of the same failure, if it is struggling to allocate memory, then it is entirely possible for the machine to be starved an no further progress made under the mutex for a long time.

For a report like this, a "cat /proc/*/stack" is useful to see what might be holding the struct_mutex.

Comment 2 Mark Janes 2018-10-16 23:54:59 UTC

Created attachment 142065 [details]
/proc/*/stack report

generated with 

ls /proc/ | xargs -ifoo  sh -c 'echo /proc/foo; cat /proc/foo/cmdline; cat /proc/foo/stack;' >> /tmp/stacks.txt

Comment 3 Mark Janes 2018-10-16 23:55:30 UTC

Created attachment 142066 [details]
dmesg from hung sklgt2, linux 4.15

Comment 4 Daniel Vetter 2018-10-17 17:16:13 UTC

Smells like deadlock in userptr. Tons of backtraces stuck in places where most likely they want to acquire the dev->struct_mutex, except one:

[<0>] 0xffffffffffffffff
/proc/16502
/tmp/build_root/m64/opt/deqp/modules/vulkan/deqp-vk�--deqp-log-images=disable�--deqp-gl-config-name=rgba8888d24s8�--deqp-surface-width=400�--deqp-surface-height=300�--deqp-visibility=hidden�--deqp-surface-type=fbo�--deqp-shadercache=disable�--deqp-log-filename=TestResults-6.qpa�--deqp-caselist-file=mesa-ci-caselist-6.txt�[<0>] flush_workqueue+0x132/0x420
[<0>] i915_gem_userptr_mn_invalidate_range_start+0x174/0x1d0 [i915]
[<0>] __mmu_notifier_invalidate_range_start+0x4f/0x70
[<0>] try_to_unmap_one+0x894/0x960
[<0>] rmap_walk_file+0xec/0x240
[<0>] try_to_unmap+0x93/0xf0
[<0>] migrate_pages+0x7be/0xaa0
[<0>] compact_zone+0x5d9/0x860
[<0>] compact_zone_order+0xa0/0xd0
[<0>] try_to_compact_pages+0xa5/0x250
[<0>] __alloc_pages_direct_compact+0x4a/0xf0
[<0>] __alloc_pages_slowpath+0x462/0xd80
[<0>] __alloc_pages_nodemask+0x233/0x250
[<0>] setup_scratch_page+0xfd/0x240 [i915]
[<0>] i915_ppgtt_create+0xad/0xa10 [i915]
[<0>] i915_gem_create_context+0x254/0x440 [i915]
[<0>] i915_gem_context_create_ioctl+0x55/0x100 [i915]
[<0>] drm_ioctl_kernel+0x59/0xb0 [drm]
[<0>] drm_ioctl+0x2cb/0x380 [drm]
[<0>] do_vfs_ioctl+0xa1/0x610
[<0>] SyS_ioctl+0x74/0x80
[<0>] entry_SYSCALL_64_fastpath+0x24/0x87
[<0>] 0xffffffffffffffff

Well I lied, there's more than one stuck on that flush_workqueue():

[<0>] 0xffffffffffffffff
/proc/65
[<0>] flush_workqueue+0x132/0x420
[<0>] i915_gem_userptr_mn_invalidate_range_start+0x174/0x1d0 [i915]
[<0>] __mmu_notifier_invalidate_range_start+0x4f/0x70
[<0>] try_to_unmap_one+0x894/0x960
[<0>] rmap_walk_file+0xec/0x240
[<0>] try_to_unmap+0x93/0xf0
[<0>] migrate_pages+0x7be/0xaa0
[<0>] compact_zone+0x5d9/0x860
[<0>] compact_zone_order+0xa0/0xd0
[<0>] try_to_compact_pages+0xa5/0x250
[<0>] __alloc_pages_direct_compact+0x4a/0xf0
[<0>] __alloc_pages_slowpath+0x462/0xd80
[<0>] __alloc_pages_nodemask+0x233/0x250
[<0>] khugepaged+0xb4d/0x2060
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x35/0x40

I think this is one for Tvrtko and Chris.

Comment 5 Mark Janes 2018-10-17 18:12:41 UTC

/proc/meminfo from hung system:

MemTotal:        7848548 kB
MemFree:          328460 kB
MemAvailable:    6171272 kB
Buffers:          224404 kB
Cached:           460808 kB
SwapCached:        49444 kB
Active:          1120888 kB
Inactive:         787176 kB
Active(anon):     768736 kB
Inactive(anon):   481944 kB
Active(file):     352152 kB
Inactive(file):   305232 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8068092 kB
SwapFree:        7879420 kB
Dirty:               328 kB
Writeback:             0 kB
AnonPages:       1180400 kB
Mapped:            94784 kB
Shmem:             27744 kB
Slab:            5528728 kB
SReclaimable:    5494752 kB
SUnreclaim:        33976 kB
KernelStack:        3232 kB
PageTables:        11108 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    11992364 kB
Committed_AS:    2017224 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:    520192 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       99176 kB
DirectMap2M:     7962624 kB
DirectMap1G:           0 kB

Comment 6 Daniel Vetter 2018-10-17 19:13:38 UTC

Note there's a few worker_thread stuck in i915, one of them in cancel_userptr. Jerome Glisse is also betting on a userptr deadlock.

Comment 7 Mark Janes 2018-10-17 19:16:55 UTC

Created attachment 142074 [details]
backtrace of spinning kernel worker

on a hung system, a single kernel thread is spinning at 100%.  This trace was obtained with:

# dmesg -c > /dev/null
# echo 1 > /proc/sys/kernel/sysrq
# echo 1 > /proc/sysrq-trigger
# dmesg

Comment 8 Mark Janes 2018-10-17 19:45:04 UTC

in the previous attachment, pid 16562 is spinning

Comment 9 Daniel Vetter 2018-10-17 20:58:26 UTC

(In reply to Mark Janes from comment #8)
> in the previous attachment, pid 16562 is spinning

Seems to be process_srcu. drm uses srcu to handle drm_dev_unplug(), and I guess srcu is getting mightly pissed that we never srcu_unlock anymore. Guess, because I don't understand the patalogicals of srcu at all, so can't say with any certainty. But at least the spinner isn't surprising.

Comment 10 Daniel Vetter 2018-10-18 07:05:54 UTC

Reassigning to kernel, since clearly a kernel bug. Note originally discovered on 4.15, 4.18 still goes boom, and from what I've read, still the same code in upstream. At least the flush_workqueue is still in the mmu notifier callback.

Comment 11 Daniel Vetter 2018-10-26 09:04:00 UTC

Probably related to bug #108375

Comment 12 Francesco Balestrieri 2018-11-01 13:36:34 UTC

Joonas, could you take a look?

Comment 13 Eero Tamminen 2018-11-12 12:25:04 UTC

(In reply to Chris Wilson from comment #1)
> The oom may be symptomatic of the same failure, if it is struggling to
> allocate memory, then it is entirely possible for the machine to be starved
> an no further progress made under the mutex for a long time.

(In reply to Mark Janes from comment #5)
> /proc/meminfo from hung system:
> 
> MemTotal:        7848548 kB
> MemFree:          328460 kB
> MemAvailable:    6171272 kB
...
> SwapTotal:       8068092 kB
> SwapFree:        7879420 kB

Memory usage at this point doesn't seem to be problem, but I think that's because the OOM kill causing program may have already gone.  You would need to track memory usage before the OOM.

Btw. for i915, it's relevant to track also driver memory as that's not really visible in meminfo until other things start getting pushed to swap:
  sudo watch cat /sys/kernel/debug/dri/0/i915_gem_objects


> VmallocTotal:   34359738367 kB

34 TB allocated on machine with 8 GB of RAM.  That's a lot of (currently) unused memory...

Comment 14 Emil Velikov 2018-11-29 13:46:21 UTC

Mark, is this a regression? How often does the failure occur?

I'm wondering if we could drop this from the Mesa 18.3.0 blocker list.
We're at RC5, as we normally aim to have around 4-6 release candidates.

What do you think?

Comment 15 Mark Janes 2018-11-29 17:11:20 UTC

This bug does appear to be a regression.  Something

Comment 16 Mark Janes 2018-11-29 17:51:42 UTC

(In reply to Emil Velikov from comment #14)
> Mark, is this a regression? How often does the failure occur?

I've been treating this as a regression, because it began in the past few months.  Because it is intermittent, I have been unable to figure out what triggered the hangs.

The failure occurs on about 30% of our CI runs.  Each run invokes vulkancts on more than a dozen machines.  The per-machine failure rate is about 1-2%.  Unfortunately, the impact is high: the machines are unusable until they are rebooted.

It's possible that a kernel update or some other change (eg CTS version) triggered these failures.  I'll do more testing on our 18.2 branches to rule that out.  If the hang reproduces on 18.2, then there is no reason to block the release.

> I'm wondering if we could drop this from the Mesa 18.3.0 blocker list.
> We're at RC5, as we normally aim to have around 4-6 release candidates.

Rafael has been working to eliminate userptr in Mesa, which should fix this bug.  Ideally, we would backport his work to 18.3 for the release.

Comment 17 Mark Janes 2018-11-30 22:22:44 UTC

I was able to reproduce this on mesa 18.2, so it cannot be a blocker for 18.3

Comment 18 Francesco Balestrieri 2018-12-01 11:37:38 UTC

There is a proposed fix (still under debate) here: 

https://patchwork.freedesktop.org/series/52209/ 
 
Meanwhile, this patch that was merged a few weeks ago should have provided a workaround: 

https://patchwork.freedesktop.org/patch/258901/ 

Did it help?

Comment 19 Mark Janes 2018-12-01 17:09:15 UTC

(In reply to Francesco Balestrieri from comment #18)
> There is a proposed fix (still under debate) here: 
> 
> https://patchwork.freedesktop.org/series/52209/ 
>  
> Meanwhile, this patch that was merged a few weeks ago should have provided a
> workaround: 
> 
> https://patchwork.freedesktop.org/patch/258901/ 
> 
> Did it help?

Mesa i965 CI tests with upstream kernels, to ensure stability for the developers testing Mesa patches.

I built a kernel with these patches, and deployed it on a subset of systems in Mesa CI.  Due to the intermittent nature of the bug, it will be days before we have the data to say it helped.

Comment 20 Mark Janes 2018-12-03 23:20:36 UTC

(In reply to Francesco Balestrieri from comment #18)
> https://patchwork.freedesktop.org/patch/258901/ 
> 
> Did it help?

I ran vulkancts on this system, and found that found that it encountered GPU hang on at least 25% of vulkancts runs.  Usually, this test crashed together with the GPU Hang.

dEQP-VK.memory.pipeline_barrier.host_write_uniform_buffer.1048576


The GPU hang was unrecoverable.  The system locked up and required a manual power off.  I was able to catch the dmesg log:

[ 1166.700509] [drm] GPU HANG: ecode 9:0:0x84d5fffb, in deqp-vk [18358], reason: no progress on rcs0, action: reset
[ 1166.700510] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1166.700510] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1166.700511] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1166.700511] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1166.700512] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1166.700591] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0


Because the system was hard-hung, I could not retrieve the error state.

Comment 21 Mark Janes 2018-12-04 00:09:57 UTC

(In reply to Mark Janes from comment #20)
> I ran vulkancts on this system, and found that found that it encountered GPU
> hang on at least 25% of vulkancts runs. 

The same hard-hang occurs with drm-tip.  Unfortunately, I can't help verify Chris's patches until there is a working drm-tip kernel.

Comment 22 Tvrtko Ursulin 2018-12-04 13:22:27 UTC

I have just pushed some patches to drm-tip which fix the "system hangs after a GPU hang" symptom. So if you re-test now things might be better in this respect.

Comment 23 Mark Janes 2018-12-04 16:52:14 UTC

Is there a CI tag that I can use to build a tested drm-tip?  I have no way to recover machines that are taken offline by an issue like this.

Also, my previous testing was on top of the CI_DRM_5103 tag.  Did the CI testing show the issue you fixed?  I'd like to understand how to look at intel-gfx-ci to check that a drm-tip has passed CI.

Comment 24 Tvrtko Ursulin 2018-12-04 17:02:35 UTC

CI_DRM_5255 or newer runs contain the fix.

You look for the "success" in the summary block in the results email, but it is not 100% accurate. There are false positives, and of course false negatives as well, for all the issues which have no sufficient test coverage.

The one my fix handles is unfortunately one of those - which answers the second part of your question. So you won't see the fail -> pass transition reliably before and after this fix. It does include new tests so from this point onward it will be caught.

Note that this is not fixing the userptr problem, just the system hang after GPU hang which should enable more comfortable testing of Chris' proposed userptr fixes.

Comment 25 Chris Wilson 2018-12-04 20:17:09 UTC

(In reply to Tvrtko Ursulin from comment #24)
> CI_DRM_5255 or newer runs contain the fix.
> 
> Note that this is not fixing the userptr problem, just the system hang after
> GPU hang which should enable more comfortable testing of Chris' proposed
> userptr fixes.

Also note that mesa kbl hangs were quite capable of locking up the system before v4.18... To fully protect ourselves we have to prevent any of the incremental mesa batches from continuing to execute after a hang, as the context contains none of the register settings or accrued GPU state that the batches depend upon.

Comment 26 Mark Janes 2018-12-04 23:33:47 UTC

Created attachment 142727 [details]
card error for vulkancts gpu hang on 5255 CI-tested kernel

Comment 27 Mark Janes 2018-12-04 23:35:43 UTC

With the newer kernel gpu hang is recoverable, and I have attached the error state.  GPU hang occurs reliably running vulkancts.

I haven't seen the userptr hang yet, but the gpu hang certainly needs to be fixed.

Comment 28 Mark Janes 2018-12-05 01:04:02 UTC

(In reply to Chris Wilson from comment #25)
> Also note that mesa kbl hangs were quite capable of locking up the system
> before v4.18... To fully protect ourselves we have to prevent any of the
> incremental mesa batches from continuing to execute after a hang, as the
> context contains none of the register settings or accrued GPU state that the
> batches depend upon.

This comment and my own statements in this bug have generated some confusion as to how well the kernel recovers from GPU hang.  I'd like to provide some background data on the effectiveness gpu hang recovery, from the i965 CI runs.

Mesa CI runs production kernels 4.14 - 4.18.  Occasionally (<1% of runs), a random GPU hang may occur on older systems (esp SNB) or on newer systems due to a bug with a mesa developer's patch.  In the vast majority of cases, the system recovers from gpu hang.

In rare instances, Mesa developers test patches that are broken enough to cause unrecoverable gpu hangs.  This happens a few times per month, requiring manual/physical reboot on many systems (holding down power button or cutting power).

drm-tip kernel currently generates a gpu hang on 80% of vulkancts runs with our stable Mesa release.  Before the CI_DRM_5255 tag, 100% of the gpu hangs were unrecoverable.  After the tag, the gpu hangs recover in the same way that the production kernels do.

The drm-tip gpu hangs are separate from *this* bug, which is about deadlock.  Unlike gpu hangs, we have no way to recover from the kernel deadlock documented here.  The reboot command does work in this situation, so we can get the machines back online in an automated way -- but the system will not continue/finish the test workload.

I can't verify Chris's patch (which are on top of drm-tip) until the gpu hang is fixed.  Chris, if you backport the patches to 4.18, I can test them in CI.  I imagine they will need to be backported anyways, since the deadlock has been seen as far back as 4.14.

Comment 29 Francesco Balestrieri 2018-12-05 05:04:10 UTC

> With the newer kernel gpu hang is recoverable, and I have attached the error state.  > GPU hang occurs reliably running vulkancts.

Would it make sense to create a separate bug out of this?

Comment 30 Chris Wilson 2019-01-15 17:23:52 UTC

commit 484d9a844d0d0aeaa4cd3cec20885b7de9986a55 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 15 12:44:42 2019 +0000

    drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start
    
    Since commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
    notifiers") we have been able to report failure from
    mmu_invalidate_range_start which allows us to use a trylock on the
    struct_mutex to avoid potential recursion and report -EBUSY instead.
    Furthermore, this allows us to pull the work into the main callback and
    avoid the sleight-of-hand in using a workqueue to avoid lockdep.
    
    However, not all paths to mmu_invalidate_range_start are prepared to
    handle failure, so instead of reporting the recursion, deal with it by
    propagating the failure upwards, who can decide themselves to handle it
    or report it.
    
    v2: Mark up the recursive lock behaviour and comment on the various weak
    points.
    
    v3: Follow commit 3824e41975ae ("drm/i915: Use mutex_lock_killable() from
    inside the shrinker") and also use mutex_lock_killable().
    v3.1: No leak on EINTR.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108375
    References: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190115124442.3500-1-chris@chris-wilson.co.uk

Comment 31 Mark Janes 2019-01-29 17:02:53 UTC

It seems likely that this bug was caused by the kernel leaks fixed by this patch:

  https://patchwork.freedesktop.org/patch/252573/

The fix for bug 107899 is in v4.19 and was applied to earlier kernels, however some distros have not yet shipped the fix.

Mesa i965 switched to softpin for gen8+ due to this bug.  The userptr deadlocks ceased, but systems became unresponsive or OOMed after several hours of running graphics workloads.  As a result, we started thinking more about a kernel memory leak, and Jason remembered his fix.

*** This bug has been marked as a duplicate of bug 107899 ***

Comment 32 Mark Janes 2019-01-31 18:12:20 UTC

Subsequent evidence has shown that this bug is not a duplicate of bug 107899.  After deploying patched kernels (4.19) we continue to see the deadlock on HSW systems.  BDW and later have switched to softpin to avoid this bug.

If there is a backport to 4.19 of the reported kernel fix, I can verify.

Comment 33 Francesco Balestrieri 2019-06-03 05:13:01 UTC

As this doesn't seem to block Mesa CI anymore, I'm changing the priority to medium.

Comment 34 Martin Peres 2019-11-29 17:56:09 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/174.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.