VulkanCTS 1.1.2 causes machines to intermittently hang. In dmesg, the following output was found:
[78782.499488] deqp-vk D 0 19687 19392 0x00000000
[78782.499488] Call Trace:
[78782.499489] ? __schedule+0x291/0x880
[78782.499496] ? drm_gem_handle_create+0x40/0x40 [drm]
[78782.499505] ? i915_gem_close_object+0x3a/0x160 [i915]
[78782.499514] i915_gem_close_object+0x3a/0x160 [i915]
[78782.499519] ? drm_gem_handle_create+0x40/0x40 [drm]
[78782.499523] drm_gem_object_release_handle+0x2c/0x90 [drm]
[78782.499527] drm_gem_handle_delete+0x57/0x80 [drm]
[78782.499531] drm_ioctl_kernel+0x59/0xb0 [drm]
[78782.499535] drm_ioctl+0x2cb/0x380 [drm]
[78782.499539] ? drm_gem_handle_create+0x40/0x40 [drm]
[78782.499542] ? do_munmap+0x32e/0x440
[78782.499545] RIP: 0033:0x7fd6e249af07
[78782.499545] RSP: 002b:00007ffe035cb448 EFLAGS: 00000246
[78782.499546] INFO: task kworker/7:2:19952 blocked for more than 120 seconds.
[78782.499564] Tainted: G U 4.15.0 #4
[78782.499576] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[78782.499595] kworker/7:2 D 0 19952 2 0x80000000
[78782.499600] Workqueue: events drm_fb_helper_dirty_work [drm_kms_helper]
This error is easier to reproduce on gen8 systems. Often, the hung machines reported VK_ERROR_OUT_OF_DEVICE_MEMORY for tests like:
Machines may complete tests, then hang on subsequent CI runs.
The oom may be symptomatic of the same failure, if it is struggling to allocate memory, then it is entirely possible for the machine to be starved an no further progress made under the mutex for a long time.
For a report like this, a "cat /proc/*/stack" is useful to see what might be holding the struct_mutex.
Created attachment 142065 [details]
ls /proc/ | xargs -ifoo sh -c 'echo /proc/foo; cat /proc/foo/cmdline; cat /proc/foo/stack;' >> /tmp/stacks.txt
Created attachment 142066 [details]
dmesg from hung sklgt2, linux 4.15
Smells like deadlock in userptr. Tons of backtraces stuck in places where most likely they want to acquire the dev->struct_mutex, except one:
[<0>] i915_gem_userptr_mn_invalidate_range_start+0x174/0x1d0 [i915]
[<0>] setup_scratch_page+0xfd/0x240 [i915]
[<0>] i915_ppgtt_create+0xad/0xa10 [i915]
[<0>] i915_gem_create_context+0x254/0x440 [i915]
[<0>] i915_gem_context_create_ioctl+0x55/0x100 [i915]
[<0>] drm_ioctl_kernel+0x59/0xb0 [drm]
[<0>] drm_ioctl+0x2cb/0x380 [drm]
Well I lied, there's more than one stuck on that flush_workqueue():
[<0>] i915_gem_userptr_mn_invalidate_range_start+0x174/0x1d0 [i915]
I think this is one for Tvrtko and Chris.
/proc/meminfo from hung system:
MemTotal: 7848548 kB
MemFree: 328460 kB
MemAvailable: 6171272 kB
Buffers: 224404 kB
Cached: 460808 kB
SwapCached: 49444 kB
Active: 1120888 kB
Inactive: 787176 kB
Active(anon): 768736 kB
Inactive(anon): 481944 kB
Active(file): 352152 kB
Inactive(file): 305232 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8068092 kB
SwapFree: 7879420 kB
Dirty: 328 kB
Writeback: 0 kB
AnonPages: 1180400 kB
Mapped: 94784 kB
Shmem: 27744 kB
Slab: 5528728 kB
SReclaimable: 5494752 kB
SUnreclaim: 33976 kB
KernelStack: 3232 kB
PageTables: 11108 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 11992364 kB
Committed_AS: 2017224 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 520192 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
Hugepagesize: 2048 kB
DirectMap4k: 99176 kB
DirectMap2M: 7962624 kB
DirectMap1G: 0 kB
Note there's a few worker_thread stuck in i915, one of them in cancel_userptr. Jerome Glisse is also betting on a userptr deadlock.
Created attachment 142074 [details]
backtrace of spinning kernel worker
on a hung system, a single kernel thread is spinning at 100%. This trace was obtained with:
# dmesg -c > /dev/null
# echo 1 > /proc/sys/kernel/sysrq
# echo 1 > /proc/sysrq-trigger
in the previous attachment, pid 16562 is spinning
(In reply to Mark Janes from comment #8)
> in the previous attachment, pid 16562 is spinning
Seems to be process_srcu. drm uses srcu to handle drm_dev_unplug(), and I guess srcu is getting mightly pissed that we never srcu_unlock anymore. Guess, because I don't understand the patalogicals of srcu at all, so can't say with any certainty. But at least the spinner isn't surprising.
Reassigning to kernel, since clearly a kernel bug. Note originally discovered on 4.15, 4.18 still goes boom, and from what I've read, still the same code in upstream. At least the flush_workqueue is still in the mmu notifier callback.
Probably related to bug #108375
Joonas, could you take a look?
(In reply to Chris Wilson from comment #1)
> The oom may be symptomatic of the same failure, if it is struggling to
> allocate memory, then it is entirely possible for the machine to be starved
> an no further progress made under the mutex for a long time.
(In reply to Mark Janes from comment #5)
> /proc/meminfo from hung system:
> MemTotal: 7848548 kB
> MemFree: 328460 kB
> MemAvailable: 6171272 kB
> SwapTotal: 8068092 kB
> SwapFree: 7879420 kB
Memory usage at this point doesn't seem to be problem, but I think that's because the OOM kill causing program may have already gone. You would need to track memory usage before the OOM.
Btw. for i915, it's relevant to track also driver memory as that's not really visible in meminfo until other things start getting pushed to swap:
sudo watch cat /sys/kernel/debug/dri/0/i915_gem_objects
> VmallocTotal: 34359738367 kB
34 TB allocated on machine with 8 GB of RAM. That's a lot of (currently) unused memory...
Mark, is this a regression? How often does the failure occur?
I'm wondering if we could drop this from the Mesa 18.3.0 blocker list.
We're at RC5, as we normally aim to have around 4-6 release candidates.
What do you think?
This bug does appear to be a regression. Something
(In reply to Emil Velikov from comment #14)
> Mark, is this a regression? How often does the failure occur?
I've been treating this as a regression, because it began in the past few months. Because it is intermittent, I have been unable to figure out what triggered the hangs.
The failure occurs on about 30% of our CI runs. Each run invokes vulkancts on more than a dozen machines. The per-machine failure rate is about 1-2%. Unfortunately, the impact is high: the machines are unusable until they are rebooted.
It's possible that a kernel update or some other change (eg CTS version) triggered these failures. I'll do more testing on our 18.2 branches to rule that out. If the hang reproduces on 18.2, then there is no reason to block the release.
> I'm wondering if we could drop this from the Mesa 18.3.0 blocker list.
> We're at RC5, as we normally aim to have around 4-6 release candidates.
Rafael has been working to eliminate userptr in Mesa, which should fix this bug. Ideally, we would backport his work to 18.3 for the release.
I was able to reproduce this on mesa 18.2, so it cannot be a blocker for 18.3
There is a proposed fix (still under debate) here:
Meanwhile, this patch that was merged a few weeks ago should have provided a workaround:
Did it help?
(In reply to Francesco Balestrieri from comment #18)
> There is a proposed fix (still under debate) here:
> Meanwhile, this patch that was merged a few weeks ago should have provided a
> Did it help?
Mesa i965 CI tests with upstream kernels, to ensure stability for the developers testing Mesa patches.
I built a kernel with these patches, and deployed it on a subset of systems in Mesa CI. Due to the intermittent nature of the bug, it will be days before we have the data to say it helped.
(In reply to Francesco Balestrieri from comment #18)
> Did it help?
I ran vulkancts on this system, and found that found that it encountered GPU hang on at least 25% of vulkancts runs. Usually, this test crashed together with the GPU Hang.
The GPU hang was unrecoverable. The system locked up and required a manual power off. I was able to catch the dmesg log:
[ 1166.700509] [drm] GPU HANG: ecode 9:0:0x84d5fffb, in deqp-vk , reason: no progress on rcs0, action: reset
[ 1166.700510] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1166.700510] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1166.700511] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1166.700511] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1166.700512] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1166.700591] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0
Because the system was hard-hung, I could not retrieve the error state.
(In reply to Mark Janes from comment #20)
> I ran vulkancts on this system, and found that found that it encountered GPU
> hang on at least 25% of vulkancts runs.
The same hard-hang occurs with drm-tip. Unfortunately, I can't help verify Chris's patches until there is a working drm-tip kernel.
I have just pushed some patches to drm-tip which fix the "system hangs after a GPU hang" symptom. So if you re-test now things might be better in this respect.
Is there a CI tag that I can use to build a tested drm-tip? I have no way to recover machines that are taken offline by an issue like this.
Also, my previous testing was on top of the CI_DRM_5103 tag. Did the CI testing show the issue you fixed? I'd like to understand how to look at intel-gfx-ci to check that a drm-tip has passed CI.
CI_DRM_5255 or newer runs contain the fix.
You look for the "success" in the summary block in the results email, but it is not 100% accurate. There are false positives, and of course false negatives as well, for all the issues which have no sufficient test coverage.
The one my fix handles is unfortunately one of those - which answers the second part of your question. So you won't see the fail -> pass transition reliably before and after this fix. It does include new tests so from this point onward it will be caught.
Note that this is not fixing the userptr problem, just the system hang after GPU hang which should enable more comfortable testing of Chris' proposed userptr fixes.
(In reply to Tvrtko Ursulin from comment #24)
> CI_DRM_5255 or newer runs contain the fix.
> Note that this is not fixing the userptr problem, just the system hang after
> GPU hang which should enable more comfortable testing of Chris' proposed
> userptr fixes.
Also note that mesa kbl hangs were quite capable of locking up the system before v4.18... To fully protect ourselves we have to prevent any of the incremental mesa batches from continuing to execute after a hang, as the context contains none of the register settings or accrued GPU state that the batches depend upon.
Created attachment 142727 [details]
card error for vulkancts gpu hang on 5255 CI-tested kernel
With the newer kernel gpu hang is recoverable, and I have attached the error state. GPU hang occurs reliably running vulkancts.
I haven't seen the userptr hang yet, but the gpu hang certainly needs to be fixed.
(In reply to Chris Wilson from comment #25)
> Also note that mesa kbl hangs were quite capable of locking up the system
> before v4.18... To fully protect ourselves we have to prevent any of the
> incremental mesa batches from continuing to execute after a hang, as the
> context contains none of the register settings or accrued GPU state that the
> batches depend upon.
This comment and my own statements in this bug have generated some confusion as to how well the kernel recovers from GPU hang. I'd like to provide some background data on the effectiveness gpu hang recovery, from the i965 CI runs.
Mesa CI runs production kernels 4.14 - 4.18. Occasionally (<1% of runs), a random GPU hang may occur on older systems (esp SNB) or on newer systems due to a bug with a mesa developer's patch. In the vast majority of cases, the system recovers from gpu hang.
In rare instances, Mesa developers test patches that are broken enough to cause unrecoverable gpu hangs. This happens a few times per month, requiring manual/physical reboot on many systems (holding down power button or cutting power).
drm-tip kernel currently generates a gpu hang on 80% of vulkancts runs with our stable Mesa release. Before the CI_DRM_5255 tag, 100% of the gpu hangs were unrecoverable. After the tag, the gpu hangs recover in the same way that the production kernels do.
The drm-tip gpu hangs are separate from *this* bug, which is about deadlock. Unlike gpu hangs, we have no way to recover from the kernel deadlock documented here. The reboot command does work in this situation, so we can get the machines back online in an automated way -- but the system will not continue/finish the test workload.
I can't verify Chris's patch (which are on top of drm-tip) until the gpu hang is fixed. Chris, if you backport the patches to 4.18, I can test them in CI. I imagine they will need to be backported anyways, since the deadlock has been seen as far back as 4.14.
> With the newer kernel gpu hang is recoverable, and I have attached the error state. > GPU hang occurs reliably running vulkancts.
Would it make sense to create a separate bug out of this?
commit 484d9a844d0d0aeaa4cd3cec20885b7de9986a55 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <firstname.lastname@example.org>
Date: Tue Jan 15 12:44:42 2019 +0000
drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start
Since commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
notifiers") we have been able to report failure from
mmu_invalidate_range_start which allows us to use a trylock on the
struct_mutex to avoid potential recursion and report -EBUSY instead.
Furthermore, this allows us to pull the work into the main callback and
avoid the sleight-of-hand in using a workqueue to avoid lockdep.
However, not all paths to mmu_invalidate_range_start are prepared to
handle failure, so instead of reporting the recursion, deal with it by
propagating the failure upwards, who can decide themselves to handle it
or report it.
v2: Mark up the recursive lock behaviour and comment on the various weak
v3: Follow commit 3824e41975ae ("drm/i915: Use mutex_lock_killable() from
inside the shrinker") and also use mutex_lock_killable().
v3.1: No leak on EINTR.
References: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Signed-off-by: Chris Wilson <email@example.com>
Cc: Tvrtko Ursulin <firstname.lastname@example.org>
Reviewed-by: Tvrtko Ursulin <email@example.com>
It seems likely that this bug was caused by the kernel leaks fixed by this patch:
The fix for bug 107899 is in v4.19 and was applied to earlier kernels, however some distros have not yet shipped the fix.
Mesa i965 switched to softpin for gen8+ due to this bug. The userptr deadlocks ceased, but systems became unresponsive or OOMed after several hours of running graphics workloads. As a result, we started thinking more about a kernel memory leak, and Jason remembered his fix.
*** This bug has been marked as a duplicate of bug 107899 ***
Subsequent evidence has shown that this bug is not a duplicate of bug 107899. After deploying patched kernels (4.19) we continue to see the deadlock on HSW systems. BDW and later have switched to softpin to avoid this bug.
If there is a backport to 4.19 of the reported kernel fix, I can verify.
As this doesn't seem to block Mesa CI anymore, I'm changing the priority to medium.