Setup 1:
* HW: BXT J4205
* OS: ClearLinux
* kernel: drm-tip compiled from git
* media: MediaSDK and its deps compiled from git (GitHub)
* FFmpeg: month-old git version: 2019-05-08 c636dc9819 "libavfilter/dnn: add more data type support for dnn model input"
* GUI: Weston / Wayland / Mesa compiled from git

Setup 2 (differences from setup 1):
* OS: Ubuntu 18.04
* FFmpeg: latest git version
* GUI: Unity from Ubuntu with X server & Mesa compiled from git

Test-case:
1. Run 3D benchmarks
2. Do several runs of 50 parallel instances of the following H264 transcode operation:
   ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264
3. Do a single non-parallel run of the above

Expected outcome:
* Everything works fine, like with the month-older 5.1 drm-tip kernel, or with GEN9 Core devices

Actual outcome:
* FFmpeg freezes at step 3
* No errors or warnings in dmesg
* Anything else trying to use the GPU (even just glxinfo, clinfo) freezes
* Some processes not using the GPU also freeze when started

There are no warnings or errors in dmesg.

The first time I saw the media tests freezing was around the 28th of May, so the regression may have come between the following drm-tip commits:
* 2019-05-27 14:42:27 e8f06c34fa: drm-tip: 2019y-05m-27d-14h-41m-23s UTC integration manifest
* 2019-05-28 15:48:05 8991a80f85: drm-tip: 2019y-05m-28d-15h-47m-22s UTC integration manifest

Which media case freezes differs a bit, e.g. based on FFmpeg version, and from week to week. It's not always the same, but it's always the single-instance test freezing after the parallel tests have seemingly [1] finished. Both FFmpeg QSV and MediaSDK sample application H264 transcode cases have frozen; which one differs between setup 1 & setup 2.

If I run just the media cases after boot, I don't see the freeze, so there's some interaction with what the 3D benchmarks do.

[1] With last night's drm-tip kernel, the "ps" output for processes in D state, and a few other relevant processes, is the following:
---------------------------------------------
 ...
   38 ?        DN     0:00 [khugepaged]
 ...
  396 tty7     Ssl+   5:08 /opt/install/bin/weston --tty=7 --idle-time=0 --xwayland
 ...
  545 tty7     Dl+    0:00 Xwayland :0 -rootless -listen 60 -listen 61 -wm 62 -terminate
 ...
 8444 ?        D      0:06 [ffmpeg]
 8447 ?        Zl     0:06 [ffmpeg] <defunct>
 8448 ?        Zl     0:06 [ffmpeg] <defunct>
 8449 ?        Zl     0:06 [ffmpeg] <defunct>
 8451 ?        Zl     0:06 [ffmpeg] <defunct>
 8453 ?        D      0:06 [ffmpeg]
 8483 ?        Zl     0:06 [ffmpeg] <defunct>
 8497 ?        Zl     0:06 [ffmpeg] <defunct>
 8512 ?        Zl     0:06 [ffmpeg] <defunct>
 8525 ?        Zl     0:06 [ffmpeg] <defunct>
 8531 ?        Zl     0:06 [ffmpeg] <defunct>
 8546 ?        Zl     0:06 [ffmpeg] <defunct>
 8559 ?        Zl     0:06 [ffmpeg] <defunct>
 8574 ?        Zl     0:06 [ffmpeg] <defunct>
 8585 ?        Zl     0:06 [ffmpeg] <defunct>
 8603 ?        Zl     0:06 [ffmpeg] <defunct>
 8623 ?        D      0:06 [ffmpeg]
 8642 ?        Zl     0:06 [ffmpeg] <defunct>
 8650 ?        D      0:06 [ffmpeg]
 8678 ?        Zl     0:06 [ffmpeg] <defunct>
 8697 ?        Zl     0:06 [ffmpeg] <defunct>
 8704 ?        D      0:06 [ffmpeg]
 8711 ?        D      0:06 [ffmpeg]
 8723 ?        D      0:06 [ffmpeg]
 8733 ?        D      0:06 [ffmpeg]
 8756 ?        Zl   143:22 [ffmpeg] <defunct>
 8793 ?        Zl     0:06 [ffmpeg] <defunct>
 8822 ?        Zl     0:06 [ffmpeg] <defunct>
 8837 ?        Zl     0:06 [ffmpeg] <defunct>
 8845 ?        D      0:06 [ffmpeg]
 8851 ?        Zl     0:06 [ffmpeg] <defunct>
 8858 ?        Zl     0:06 [ffmpeg] <defunct>
 8871 ?        Zl     0:06 [ffmpeg] <defunct>
 8893 ?        Zl     0:06 [ffmpeg] <defunct>
 8942 ?        Zl     0:06 [ffmpeg] <defunct>
 8958 ?        D      0:06 [ffmpeg]
 8983 ?        Zl     0:06 [ffmpeg] <defunct>
 8991 ?        D      0:06 [ffmpeg]
 8999 ?        Zl     0:06 [ffmpeg] <defunct>
 9013 ?        D      0:06 [ffmpeg]
 9017 ?        D      0:06 [ffmpeg]
 9035 ?        Zl     0:06 [ffmpeg] <defunct>
 9058 ?        Zl     0:06 [ffmpeg] <defunct>
 9071 ?        Zl     0:06 [ffmpeg] <defunct>
 9117 ?        Zl     0:06 [ffmpeg] <defunct>
 9122 ?        D      0:06 [ffmpeg]
 9152 ?        Zl     0:06 [ffmpeg] <defunct>
 9165 ?        Zl     0:06 [ffmpeg] <defunct>
 9180 ?        D      0:06 [ffmpeg]
 9250 ?        D      0:06 [ffmpeg]
 9758 ?        Ds     0:00 ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264
 ...
10184 ?        D      0:00 top
 ...
---------------------------------------------

50 ffmpeg instances can use quite a few GEM objects (and FFmpeg QSV uses more of them than MediaSDK alone, or FFmpeg VAAPI), which can put memory pressure on the system, so the kernel "khugepaged" thread being in D state is suspicious.

Other notes:
* I also had a process running in the background which uses ftrace uprobes to track frame update functions from the 3D and media processes, and which tracks "i915:intel_gpu_freq_change" events
* Killing that process after the freeze caused ssh to stop working, so it's possible that there's some connection to ftrace
What's the latest tip tested? We're carrying a couple of workarounds for mm/ bugs:

commit f27a5d91201639161d6f6e25af1c89c9cbb3cac7
Author: Hugh Dickins <hughd@google.com>
Date:   Wed May 29 09:25:40 2019 +0200

    x86/fpu: Use fault_in_pages_writeable() for pre-faulting

commit 2e5e960659d76767ff9e9ef04306dbf36f07f46f (topic/core-for-CI)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 3 08:08:43 2019 +0100

    mm: Band aid for 5fd4ca2d84b249f0858ce28cf637cf25b61a398f
(In reply to Chris Wilson from comment #1)
> What's the latest tip tested?

Yesterday evening:
2019-06-05 16:28:24 7a984cf096: drm-tip: 2019y-06m-05d-16h-27m-26s UTC integration manifest

(In reply to Eero Tamminen from comment #0)
> 50 ffmpeg instances can use quite a few GEM objects (and FFmpeg QSV uses more
> of them than MediaSDK alone, or FFmpeg VAAPI)

About this much (2 GB in total, which shouldn't be a problem):
------------------------------------------
12769 shrinkable objects, 2378752000 bytes
491 unbound objects, 29552640 bytes
12277 bound objects, 2349133824 bytes
51 purgeable objects, 10121216 bytes
220 mapped objects, 11239424 bytes
2220 huge-paged objects (2M, 64K, 4K) 2080702464 bytes
12 display objects (globally pinned), 8851456 bytes
4294967296 [0x0000000010000000] gtt total
Supported page sizes: 2M, 64K, 4K
------------------------------------------

On each round the max number of huge-paged objects seems to grow, although it drops back to the same number after the tests have finished.
(In reply to Eero Tamminen from comment #0)
> Test-case:
> 1. Run 3D benchmarks
> 2. Do several runs of 50 parallel instances of the following H264 transcode
>    operation:
>    ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i
>    1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf
>    scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264

And on each round, kill the whole process group with the 50 parallel transcoders after a few minutes.

> 3. Do a single non-parallel run of the above
>
> Expected outcome:
> * Everything works fine, like with the month-older 5.1 drm-tip kernel, or with
>   GEN9 Core devices

I'm pretty sure the reason this happens only on BXT is that it's much slower than Core devices, so this test often hits the specified timeout before it finishes. At timeout, the whole transcoding process group is killed.

=> I think the trigger for this issue is the killing of a large number of media processes, and there being some issue / race / deadlock in the kernel media resource de-allocation path.

=> I'll try to get sysrq task info / backtraces for them after I've reproduced the issue again.

> Which media case freezes differs a bit, e.g. based on FFmpeg version, and
> from week to week. It's not always the same, but it's always the single-
> instance test freezing after the parallel tests have seemingly [1] finished.

The reason the single process seems to be the one that freezes after the parallel run is that when the parallel transcoding process group is killed, only the parallelization wrapper actually dies, because the transcoding processes are in D state. Because the wrapper died, the test runner thinks that the test finished, and starts the next one. And as the single transcoding process invoked directly after them freezes and refuses to die, it looks like that's where it froze, although the freeze already happened in the previous test.

PS: while the sets of 3D tests run before the media tests in setup 1 and setup 2 differ, some of the tests in setup 1 are Wayland/EGL variants of the X/GLX tests in setup 2. Those are GfxBench 5.0 and SynMark2 v7. The X-based GPU memory bandwidth tests are the same in both setups. The rest of the tests differ, but are also simpler, so most likely they don't matter for this bug.
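For reference, the parallel step plus the group kill is roughly equivalent to the following sketch (illustrative only, not the actual test runner; the client count, sleep time, output names and signal are placeholders):
---------------------------------------------
#!/bin/bash
# illustrative repro sketch of steps 2-3 plus the group kill
set -m        # job control: each background job gets its own process group

transcode() {
    ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv \
        -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K \
        -vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 \
        -frames 2400 -y "$1"
}

# step 2: 50 parallel transcoders in their own process group
( for i in $(seq 50); do transcode "out_$i.h264" & done; wait ) &
PGID=$!

sleep 180                  # simulated test timeout
kill -INT -- "-$PGID"      # kill the whole transcoding process group

# step 3: a single transcode right after the kill -- this is what appears to freeze
transcode single.h264
---------------------------------------------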
When it gets stuck, grab the stacktraces of all processes, sysrq-t or cat /proc/*/stack.
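For example, something like this should do it (as root):
---------------------------------------------
# sysrq-t: dump all task states + kernel stacks into the kernel log
echo t > /proc/sysrq-trigger
dmesg > sysrq-t.txt

# or collect the kernel stack of each task via procfs
for p in /proc/[0-9]*; do echo "== $p"; cat "$p/stack"; done > proc-stacks.txt
---------------------------------------------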
Created attachment 144476 [details]
Kernel stacks for all tasks when the problem happens

# for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/stack; done | grep -B1 -A4 "disk sleep"
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: kworker/u8:0+i915
State: D (disk sleep)
Pid: 7274
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
--
Name: ffmpeg
State: D (disk sleep)
Pid: 7293
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
--- lots of similar ffmpeg tasks ---
Name: ffmpeg
State: D (disk sleep)
Pid: 8606
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
--- lots of similar ffmpeg tasks ---
Name: ps
State: D (disk sleep)
Pid: 8751
[<0>] proc_pid_cmdline_read+0x1e3/0x330
[<0>] vfs_read+0x91/0x140
[<0>] ksys_read+0x91/0xe0

I guess the ffmpeg processes in the exit stage are the ones from the previous test, and the ones trying to open the device file are ffmpeg processes from the next test. To me it looks like the kworker doing the __i915_gem_free_work() stuff is blocking the khugepaged task and everything else.

PS. The freeze when "ps" reads PID command lines seems to happen for running processes:

$ for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/cmdline; done
...
Name: ffmpeg
State: Z (zombie)
Pid: 7937
Name: ffmpeg
State: Z (zombie)
Pid: 7977
Name: ffmpeg
State: R (running)
Pid: 8008
<freeze>
From the full ffmpeg stack the trigger is clearer. SIGINT is sent to the transcoding process group, the kernel reacts to that, and on do_exit() for FFmpeg it calls task_work_run(), which I assume queues the work for the kworker that then calls __i915_gem_free_work()?

Name: kworker/u8:0+i915
State: D (disk sleep)
Pid: 7274
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40

Name: ffmpeg
State: Z (zombie)
Pid: 7285
Name: ffmpeg
State: Z (zombie)
Pid: 7292
Name: ffmpeg
State: D (disk sleep)
Pid: 7293
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
[<0>] do_signal+0x30/0x6b0
[<0>] exit_to_usermode_loop+0x96/0xb0
[<0>] prepare_exit_to_usermode+0x8c/0xb0
[<0>] retint_user+0x8/0x8

(In reply to Eero Tamminen from comment #3)
> PS: while the sets of 3D tests run before the media tests in setup 1 and setup 2
> differ, some of the tests in setup 1 are Wayland/EGL variants of the X/GLX
> tests in setup 2. Those are GfxBench 5.0 and SynMark2 v7. The X-based GPU
> memory bandwidth tests are the same in both setups. The rest of the tests differ,
> but are also simpler, so most likely they don't matter for this bug.

Running just the gfxbench, synmark & membw 3D tests before the media tests wasn't enough either, so for now I can test this only after a full 3-4h test round => I can't really bisect this down to steps that would take only minutes; iterating from the full set down to something like that would take way more time than I'm allowed for this. :-/
I've now verified that every time this freeze happens, these are stuck:
-----------------------------------------------
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: kworker/u8:2+i915
State: D (disk sleep)
Pid: 7911
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
-----------------------------------------------

in addition to the stuck transcoding tasks:
-----------------------------------------------
...
Name: sample_multi_tr
State: D (disk sleep)
Pid: 20118
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
[<0>] do_signal+0x30/0x6b0
[<0>] exit_to_usermode_loop+0x96/0xb0
[<0>] do_syscall_64+0x103/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
--
Name: sample_multi_tr
State: D (disk sleep)
Pid: 20221
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
[<0>] path_openat+0x564/0x11f0
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1bd/0x260
[<0>] do_syscall_64+0x4f/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
-----------------------------------------------

The above is from last evening's drm-tip git build. For some reason on ClearLinux (comment 5) the ffmpeg git build gets stuck, and on Ubuntu 18.04 (this comment) it's the MSDK git build test app. In both setups, the kernel & media stack are using the same binaries.

The main differences between these setups are:
- the HW is supposed to be identical, but they are separate machines
- Weston vs. X server running in the background (which aren't used in the media tests)
- optimizations in the ClearLinux libc

I think the differences in which app freezes are just due to some differences in timing.
Created attachment 144520 [details]
Kernel stacks for all tasks when the problem happens

(In reply to Eero Tamminen from comment #7)
> I've now verified that every time this freeze happens, these are stuck:
> -----------------------------------------------
> Name: khugepaged
> State: D (disk sleep)
> Pid: 38
> [<0>] kthread+0x113/0x130
> [<0>] ret_from_fork+0x35/0x40
> --
> Name: kworker/u8:2+i915
> State: D (disk sleep)
> Pid: 7911
> [<0>] __i915_gem_free_work+0x5f/0x90 [i915]
> [<0>] process_one_work+0x1e9/0x410
> [<0>] worker_thread+0x2d/0x3d0
> [<0>] kthread+0x113/0x130
> [<0>] ret_from_fork+0x35/0x40
> -----------------------------------------------
>
> in addition to the stuck transcoding tasks: [...]

Today, there was no frozen khugepaged, just these:
-----------------------------------------------
Name: kworker/u8:0+i915
State: D (disk sleep)
Pid: 7194
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
...
Name: ffmpeg
State: D (disk sleep)
Pid: 9407
[<0>] drm_file_free.part.0+0x1c8/0x280
[<0>] drm_release+0x9c/0xd0
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
[<0>] do_signal+0x30/0x6b0
[<0>] exit_to_usermode_loop+0x96/0xb0
[<0>] do_syscall_64+0x103/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
-----------------------------------------------

In addition to the earlier listed transcode processes stuck at fput() and chrdev_open(). See the attachment.

=> I assume the khugepaged stuff isn't related to the issue; it was just an innocent bystander.
Those stacks are all waiting for struct_mutex, but nothing appears to be actually holding it.
(In reply to Chris Wilson from comment #9)
> Those stacks are all waiting for struct_mutex, but nothing appears to be
> actually holding it.

Any suggestions on how to get more info on the issue for you, either with some kernel command line option [1], or after the issue has triggered?

[1] anything affecting performance could change when/how the issue triggers.
(In reply to Chris Wilson from comment #9)
> Those stacks are all waiting for struct_mutex, but nothing appears to be
> actually holding it.

It would be a kernel bug, but is it possible that the mutex had been taken by one of the killed processes, which was then killed before releasing the mutex, and the kernel somehow missed releasing the lock for the killed process?
It is possible; that is the type of bug lockdep complains loudly about. If you do get the opportunity to run with lockdep and khungtaskd, that should dump all locks held and their owners when stuck.
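For reference, enabling the hung task detector alongside lockdep is roughly this (a sketch; the timeout value is just an example):
---------------------------------------------
# kernel config (in addition to CONFIG_PROVE_LOCKING):
./scripts/config --enable CONFIG_DETECT_HUNG_TASK

# at runtime, warn earlier than the default 120s:
echo 60 > /proc/sys/kernel/hung_task_timeout_secs

# held locks (with their owners) can also be dumped on demand via sysrq-d:
echo d > /proc/sysrq-trigger
---------------------------------------------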
It can also trigger with a smaller number of parallel media processes; 5 seems to be enough:

# ps ax|grep -e ' Z' -e ' D'
   38 ?        DN     0:00 [khugepaged]
   41 ?        D      0:05 [kworker/u8:1+i9]
18218 ?        Zl   146:04 [sample_multi_tr] <defunct>
18219 ?        Zl     0:05 [sample_multi_tr] <defunct>
18221 ?        Zl     0:05 [sample_multi_tr] <defunct>
18225 ?        D      0:05 [sample_multi_tr]
18244 ?        D      0:05 [sample_multi_tr]
18415 ?        Ds     0:00 sample_multi_transcode -i::mpeg2 1920x1080i_29.97_20mb_mpeg2_high.mpv -o::h264 0024_HD17i7_1.0.h264 -b 6000 -u 7 -n 2400 -async 4 -hw

# for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/stack; done | grep -B1 -A4 "disk sleep"
Name: sample_multi_tr
State: D (disk sleep)
Pid: 18225
[<0>] drm_file_free.part.0+0x1c8/0x280
[<0>] drm_release+0x9c/0xd0
[<0>] __fput+0xae/0x200
--
Name: sample_multi_tr
State: D (disk sleep)
Pid: 18244
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
--
Name: sample_multi_tr
State: D (disk sleep)
Pid: 18415
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
--
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: kworker/u8:1+i915
State: D (disk sleep)
Pid: 41
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
(In reply to Chris Wilson from comment #12)
> It is possible; that is the type of bug lockdep complains loudly about. If
> you do get the opportunity to run with lockdep and khungtaskd, that should
> dump all locks held and their owners when stuck.

I could give it a quick try. Which exact kernel build CONFIG_* options and kernel boot command line options do I need for these with the drm-tip git kernel version?
Created attachment 144597 [details]
KBL GT3e i915 error state for recoverable GPU hang in the same use-case

I'm not sure whether they are at all relevant for this deadlock(?) freeze issue, but I'm also seeing some random recoverable GPU hangs in the same test-case on Core devices.
A large number of parallel media workloads, which get a group KILL signal (because they took too much time), still continues to deadlock i915 and other parts of the kernel in a most annoying way with the latest drm-tip 5.2.0-rc7 git kernel. The device doesn't appear frozen to normal freeze checks, but "random" operations on it get stuck:

# for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/stack; done | grep -B1 -A6 "disk sleep"
Name: kworker/u8:1+i915
State: D (disk sleep)
Pid: 10660
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: ffmpeg
State: D (disk sleep)
Pid: 10704
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
--
<17 similar FFmpeg processes>
--
Name: compiz
State: D (disk sleep)
Pid: 1195
[<0>] i915_gem_create_ioctl+0x17/0x30 [i915]
[<0>] drm_ioctl_kernel+0x88/0xf0
[<0>] drm_ioctl+0x2f8/0x3b0
[<0>] do_vfs_ioctl+0xa4/0x630
[<0>] ksys_ioctl+0x3a/0x70
--
Name: ffmpeg
State: D (disk sleep)
Pid: 12011
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1bd/0x260
--
Name: ps
State: D (disk sleep)
Pid: 16106
[<0>] proc_pid_cmdline_read+0x1e3/0x330
[<0>] vfs_read+0x91/0x140
[<0>] ksys_read+0x91/0xe0
[<0>] do_syscall_64+0x4f/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
--
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40

(This semi-zombie device state breaks our automation, because it somehow manages to freeze a Jenkins job that shouldn't be doing anything on the device, just trigger the start of testing on that machine with a forced reboot job.)
So far the smallest set of tests with which I've managed to reproduce the freeze is this (in this order):
* running 3x 33 Synmark tests in fullscreen
* running 5x 3 GLBenchmark windowed tests, all both onscreen & offscreen
* running 3x 10 GpuTest windowed & fullscreen tests
* running 3x 5 different FFmpeg VAAPI transcode operations, alternating between multiple processes (parallel process count being from 5 to 50) and a single transcode process
* at the end, doing a process group kill for the 50-parallel-processes case before the next single transcode process

However, that freeze wasn't reproducible on the next boot. From earlier data, it seems that running this amount of 3D tests before the media tests results in approximately 1/2 reproducibility, and a 2x larger set of 3D tests before the media tests makes it nearly fully reproducible (not sure whether use of MediaSDK also made it more reproducible, as I hadn't tried the plain VAAPI backend before).

I haven't been able to trigger the media freeze by using just fullscreen 3D applications, so I assume some parallel 3D engine usage is needed (X or Weston compositing in addition to the 3D benchmark itself) before the parallel media engine usage. But running just the (above) windowed 3D benchmarks before the media tests hasn't yet been enough to trigger the freeze either.

=> I think i915 would need some specific IGT tests that:
* access i915 from multiple processes in parallel
* do that for each supported engine in turn, and then all of them at the same time
* have some of those processes in a separate process group that gets killed as a group during testing, with the tester just forking more of them as replacements
* are run under lockdep
(In reply to Eero Tamminen from comment #17)
> So far the smallest set of tests with which I've managed to reproduce the
> freeze is this (in this order):
[...]

I have a slightly smaller test-case, with a freeze reproducibility of about 1/3:
1. Start Ubuntu 18.04 with X/Unity & the latest git gfx/media stack
2. Run a large number of Wayland test cases, which fail immediately at start because there's no Wayland. I don't think these are relevant, but if some of them briefly open a window before exiting, that does exercise i915
3. Run 3x 7 (very heavy / low FPS) windowed GPU bandwidth tests
4. Run 3x 8 different FFmpeg QSV (= using MediaSDK) transcode operations, alternating between multiple processes (parallel process count being from 5 to 50) and single transcode process versions of them
5. Run the same things as in 4), but using MediaSDK (transcode sample app) directly

In some of these 1/3 freeze occurrences, the freeze happens in step 3), sometimes in step 4).

After the media freeze, reboot gets stuck at shutdown and the device needs to be power-cycled.
The patch we identified as being problematic was reverted just before 5.2; tbf I don't have a lot of confidence that was the only problematic patch in that mm/ series. e8f06c34fa..8991a80f85 does contain more fences, and so increased opportunity for a missed wakeup. They would however be shown as processes stuck waiting for a fence inside i915; whereas what your kernel stacks show are processes stuck waiting for page io (if I read that correctly, generally from __dput).
(In reply to Chris Wilson from comment #19)
> e8f06c34fa..8991a80f85 does contain more fences, and so increased
> opportunity for a missed wakeup. They would however be shown as processes
> stuck waiting for a fence inside i915; whereas what your kernel stacks show
> are processes stuck waiting for page io (if I read that correctly, generally
> from __dput).

Do you have detailed instructions on how to enable lockdep (Google didn't tell)? I.e. what *specific* kernel config options need to be enabled, what to set on the kernel command line, etc.?

I could give it a try after I come back from summer vacation, if this is still happening.
$ ./scripts/config --enable CONFIG_PROVE_LOCKING

selects everything required for lockdep, i.e.

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
# CONFIG_LOCK_STAT is not set
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)
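For completeness, a sketch of the full flow after toggling the option (the build/install commands are the usual ones, adjust to your packaging):
---------------------------------------------
./scripts/config --enable CONFIG_PROVE_LOCKING
make olddefconfig                  # resolve the newly selected options
make -j"$(nproc)" && make modules_install install
---------------------------------------------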
Quick testing didn't give any warnings from lockdep, but I wasn't able to trigger the freeze yet either. David said that he could try the full test set with lockdep some time next week.
Created attachment 144814 [details]
dmesg from a Broxton stuck during media testing
I managed to find an easy way to reproduce unrecoverable hangs. The full test set isn't necessary; it's enough to run 5-10 iterations of the gstreamer_hd22_p50 testcase.

Eero's lockdep-enabled configuration wasn't enough (I'm assuming he used the options you provided), but I enabled a few more debugging options, which gave quite a lot of output -- hopefully some of it is useful. dmesg attached (well, as much of it as fit in the scrollback on the console where I followed it).
The dmesg is too short; it needed the stack traces for kworker/u8:1 and khugepaged.
Damn. OK, I'll try to get a longer trace.
If you do get good at reproducing this :), could you try https://patchwork.freedesktop.org/patch/318722/?series=63857&rev=1
Created attachment 144824 [details]
dmesg from a Broxton stuck during media testing w/ Chris's patch applied

I tested with your patch applied, but I can still reproduce a hang. I've attached the log; hopefully it'll be of some use.
This is the only process of interest that is active:

[ 3810.865923] h264parse0:sink R running task 0 10012 1 0x00004004
[ 3810.865928] Call Trace:
[ 3810.865931]  ? mark_held_locks+0x49/0x70
[ 3810.865936]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 3810.865939]  ? lockdep_hardirqs_on+0xeb/0x180
[ 3810.865943]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 3810.865950]  ? retint_kernel+0x2b/0x2b
[ 3810.866009]  ? i915_gem_evict_something+0x2df/0x4e0 [i915]
[ 3810.866054]  ? i915_gem_evict_something+0x2fc/0x4e0 [i915]
[ 3810.866105]  ? i915_gem_gtt_insert+0x1ce/0x270 [i915]
[ 3810.866152]  ? __i915_vma_do_pin+0x40c/0x530 [i915]
[ 3810.866200]  ? i915_gem_object_ggtt_pin+0x11f/0x1e0 [i915]
[ 3810.866241]  ? i915_gem_fault+0x291/0x6d0 [i915]
[ 3810.866245]  ? __lock_acquire+0x2bf/0x22e0
[ 3810.866255]  ? __do_fault+0x31/0xb0
[ 3810.866262]  ? __handle_mm_fault+0x9ab/0xfc0
[ 3810.866274]  ? handle_mm_fault+0x15a/0x350
[ 3810.866280]  ? __do_page_fault+0x24d/0x500
[ 3810.866287]  ? page_fault+0x8/0x30
[ 3810.866291]  ? page_fault+0x1e/0x30

What does the profiler say is going on? From the above, it seems the system is still running, just struggling? That eviction is going to struggle to clear context objects...
The kernel still has ggtt_flush() inside evict, so it should still be flushing context objects from the ggtt.
Hmmm, so you're saying that given enough time the machine could/should recover?

I left the machine on for 4h after the test concluded (hung), but no luck. The desktop clock of the machine stopped updating at the end of the task, the machine stopped replying to ping, and when I checked 4h later the machine still didn't respond to keyboard input.
(In reply to David Weinehall from comment #31)
> Hmmm, so you're saying that given enough time the machine could/should
> recover?
>
> I left the machine on for 4h after the test concluded (hung), but no luck.

There are a few corner cases where oom just eats itself, but nowadays they usually result in a panic + reboot. (Do you have reboot on panic set?) Give or take a few HW faults, a lockup usually means a panic. I would personally use remote logging (via netconsole) in case it is a panic. If not, and it just quietly hangs with lockdep and the kernel NMI watchdog running, we have found ourselves a more "interesting" issue.
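A netconsole setup for this could look something like the following (IPs, interface name and MAC are placeholders; netcat flag syntax varies between implementations):
---------------------------------------------
# on the logging host, listen for the UDP log stream
nc -u -l 6666 | tee netconsole.log

# on the test machine, either load netconsole as a module...
modprobe netconsole netconsole=6665@192.168.1.2/enp2s0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff

# ...or put the same netconsole=... string on the kernel command line.
# Also make sure a panic actually reboots the box (seconds to wait):
sysctl -w kernel.panic=30
---------------------------------------------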
Fwiw,

commit 515b8b7e935ef3f59c4efda04a3b05353ed6fbb7 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 2 22:21:37 2019 +0100

    drm/i915: Flush the freed object list on file close

    As we increase the number of RCU objects, it becomes easier for us to
    have several hundred thousand objects in the deferred RCU free queues.
    An example is gem_ctx_create/files which continually creates active
    contexts, which are not immediately freed upon close as they are kept
    alive by outstanding requests. This lack of backpressure allows the
    context objects to persist until they overwhelm and starve the system.
    We can increase our backpressure by flushing the freed object queue
    upon closing the device fd which should then not impact other clients.

    Testcase: igt/gem_ctx_create/*files
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190802212137.22207-2-chris@chris-wilson.co.uk

commit 1aff1903d0ff53f055088a77948ac8d8224d42db
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 2 22:21:36 2019 +0100

    drm/i915: Hide unshrinkable context objects from the shrinker

    The shrinker cannot touch objects used by the contexts (logical state
    and ring). Currently we mark those as "pin_global" to let the shrinker
    skip over them, however, if we remove them from the shrinker lists
    entirely, we don't even have to include them in our shrink accounting.

    By keeping the unshrinkable objects in our shrinker tracking, we report
    a large number of objects available to be shrunk, and leave the shrinker
    deeply unsatisfied when we fail to reclaim those. The shrinker will
    persist in trying to reclaim the unavailable objects, forcing the system
    into a livelock (not even hitting the dread oomkiller).

    v2: Extend unshrinkable protection for perma-pinned scratch and guc
    allocations (Tvrtko)
    v3: Notice that we should be pinned when marking unshrinkable and so the
    link cannot be empty; merge duplicate paths.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190802212137.22207-1-chris@chris-wilson.co.uk

It didn't seem to help earlier though -- but I still think they are related.
Created attachment 144960 [details]
Netconsole log with drm-tip from 2019y-08m-05d-17h-28m-24s

Both of the patches you reference in comment 33 are in the drm-tip used for this test run; the machine still hangs. The log has been taken from netconsole after boot (hence the entire dmesg isn't there). Reboot on panic was set, but a reboot wasn't triggered for some reason.

I had htop running in a separate connection to the machine; the memory usage at the point it stopped responding was high, but not alarmingly so (maybe 5 out of 8 GB).

What I find interesting is that in most cases all test runs complete; the machine hangs when finalising the test run.
(In reply to David Weinehall from comment #34)
> Created attachment 144960 [details]
> Netconsole log with drm-tip from 2019y-08m-05d-17h-28m-24s

At the instant of that hungtask, it was waiting for

[ 2582.003387] R13: 0000000000000000 R14: 00007fc30c02be50 R15: 0000000000000008
[ 2582.003405] h264parse0:sink R running task 0 6195 1 0x00004004
[ 2582.003418] Call Trace:
[ 2582.003429]  ? mark_held_locks+0x49/0x70
[ 2582.003437]  ? trace_hardirqs_on_thunk+0x1a/0x20
[ 2582.003445]  ? lockdep_hardirqs_on+0xeb/0x180
[ 2582.003454]  ? trace_hardirqs_on_thunk+0x1a/0x20
[ 2582.003465]  ? retint_kernel+0x2b/0x2b
[ 2582.003481]  ? drm_mm_scan_remove_block+0x23/0x50
[ 2582.003543]  ? i915_gem_evict_something+0x2fc/0x4e0 [i915]
[ 2582.003601]  ? i915_gem_gtt_insert+0x1ce/0x270 [i915]
[ 2582.003655]  ? __i915_vma_do_pin+0x40c/0x530 [i915]
[ 2582.003707]  ? i915_gem_object_ggtt_pin+0x11f/0x1e0 [i915]
[ 2582.003755]  ? i915_gem_fault+0x291/0x6d0 [i915]
[ 2582.003787]  ? __do_fault+0x31/0xb0
[ 2582.003801]  ? __handle_mm_fault+0x90d/0xfc0
[ 2582.003820]  ? handle_mm_fault+0x15a/0x360
[ 2582.003832]  ? __do_page_fault+0x24e/0x500
[ 2582.003845]  ? page_fault+0x34/0x40

Other than complaining about the stupidity of using mmap_gtt, I guess it means we were badly fragmented and eviction was slow.
Well, define "slow"? Half an hour? Hour? Multiple hours?
Have you heard the tale of the pagefault of doom? It's not a story any developer would tell you... Under exactly the wrong conditions, eviction of entire surfaces (up to the GGTT) can occur between every single byte read and written via the mmap. We are not meant to be that stupid any more (although it comes close), and instead of evicting the entire GGTT it may just be ending up in a ping-pong of the memcpy(dst, src). For moderately sized buffers (say a few 10s of MB) you are not likely to see that complete within a year. I hope we haven't regressed to that level of stupidity, but I wouldn't put it past me.
Is there any way to debug whether that may be the case?
That shows up in perf as a process spending 100% inside the pagefault performing eviction.
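If you want to check, something along these lines should make it obvious (<pid> is whatever process is stuck in 'R'):
---------------------------------------------
# sample the stuck process for 10s, including kernel callchains
perf record -g -p <pid> -- sleep 10
perf report --stdio | head -40

# or watch live; nearly all samples landing in i915_gem_fault /
# i915_gem_evict_something would point at the pagefault-eviction theory
perf top -g -p <pid>
---------------------------------------------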
Try https://cgit.freedesktop.org/~ickle/linux-2.6/patch/?id=3b7910f6317981233154d4cb6473648a1d0404af as a potential idea
Sadly the patch in comment #40 isn't enough to fix the issue.
Is it still making "forward" progress though? So far the only lead has been that one 'R' process still doing a map_gtt pagefault
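One cheap way to check for forward progress (with <pid> being that 'R' process):
---------------------------------------------
# if the kernel stack / wchan keep changing between samples, it's still
# grinding through eviction; if they never change, it's truly stuck
for i in $(seq 5); do
    cat /proc/<pid>/stack
    cat /proc/<pid>/wchan; echo
    echo ---; sleep 30
done
---------------------------------------------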
(In reply to David Weinehall from comment #24)
> I managed to find an easy way to reproduce unrecoverable hangs. The full
> test set isn't necessary; it's enough to run 5-10 iterations of the
> gstreamer_hd22_p50 testcase.

I.e. no 3D tests need to be run before that; just booting to the (Ubuntu 18.04 Unity) desktop is enough. It hasn't been tested whether the issue can be reproduced on a headless setup without X11, but I assume it could.
So let's try and reproduce that. A script or download & setup instructions, or disk image of the test setup?
FWIW, with Chris's patch from comment #40 the tests complete without any ominous log messages in dmesg (the kernel is configured with lockdep + netconsole, and I only got the log messages telling which test was ongoing).

BUT: it still hangs the same way as before. So the patch probably fixes something, but it isn't enough to make things work.
(In reply to Chris Wilson from comment #44)
> So let's try and reproduce that. A script or download & setup instructions,
> or disk image of the test setup?

I sent instructions to your Intel email.
Created attachment 145076 [details]
MediaSDK sample transcode app input file for running 5 parallel MPEG2->AVC transcoding streams

After removing the above test-case from regular testing, there's another one that hangs, a slightly simpler one. As the earlier test-case can trigger the deadlock alone, I'm assuming this one can also.

1. Build media-driver & MediaSDK:
   https://github.com/intel/media-driver/wiki/How-to-build-and-use-media-driver%3F
   https://github.com/Intel-Media-SDK/MediaSDK/wiki/Build-Media-SDK-on-Ubuntu
2. Run 5 instances of the following multi-stream process in parallel in the same process group, with the attached parameter file & a suitably sized MPEG2 input file:
   sample_multi_transcode -par parameters.txt
3. Kill the process group before the above command finishes
4. Repeat from 2)
I've never seen this happen with FFmpeg using media-driver directly through libva, only when media-driver is used in the (FFmpeg ->) MediaSDK -> libva -> media-driver -> kernel/i915 pipeline (I've run far fewer tests with FFmpeg / VA-API, though).

I've seen some mentions in the FFmpeg/MediaSDK bug trackers that MediaSDK would do some media-driver access directly, without always going through libva, potentially something related to VPP (video post-processing; not sure whether it uses e.g. SFC or shaders).
(In reply to Eero Tamminen from comment #48)
> I've never seen this happen with FFmpeg using media-driver directly
> through libva, only when media-driver is used in the (FFmpeg ->) MediaSDK ->
> libva -> media-driver -> kernel/i915 pipeline (I've run far fewer tests
> with FFmpeg / VA-API, though).

And a few days afterwards I got a test machine disconnect with FFmpeg VA-API after the 50x test-case. This was followed by an automated reboot, but I'm pretty sure the cause of the issue was the same as with the MediaSDK-using media applications.

I haven't seen this on KBL anymore, so I'm removing that from the list of platforms (I assume it's fast enough not to trigger the deadlock). In case it matters, all of these machines have 8 GB of RAM.

Also updating the title, as David found out that triggering the issue doesn't need 3D test-cases.
If you get the chance, it would be nice to have some feedback on https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=wip-mutex under the presumption that this code is stuck under struct_mutex in the shrinker/evict and that branch eliminates struct_mutex from those paths.
Testing with a kernel using your wip-mutex branch gives the following: [ 774.826181] general protection fault: 0000 [#1] PREEMPT SMP PTI [ 774.826192] CPU: 3 PID: 4859 Comm: h264parse0:sink Not tainted 5.3.0-rc6-CI-Custom_4349+ #1 [ 774.826195] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017 [ 774.826253] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 774.826258] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 774.826261] RSP: 0018:ffffc900018b39b0 EFLAGS: 00010283 [ 774.826265] RAX: ffff888272178800 RBX: ffff8881ca561080 RCX: 0000000000000000 [ 774.826267] RDX: ffff888263f80000 RSI: ffffffff82661d60 RDI: ffff888204495f40 [ 774.826270] RBP: ffffc900018b39c8 R08: 0000000000000001 R09: ffffffffa0208c00 [ 774.826273] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881e2813d48 [ 774.826275] R13: ffffc900018b39c8 R14: ffff8881e2813c80 R15: dead000000000122 [ 774.826279] FS: 00007f927aef6700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 774.826281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 774.826284] CR2: 00007f1da83b2000 CR3: 00000001e467e000 CR4: 00000000003406e0 [ 774.826287] Call Trace: [ 774.826338] i915_request_await_object+0x248/0x380 [i915] [ 774.826345] ? __mutex_unlock_slowpath+0x4b/0x2b0 [ 774.826387] i915_gem_do_execbuffer+0x18a1/0x1e00 [i915] [ 774.826404] ? __lock_acquire+0x25f/0x1cc0 [ 774.826452] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 774.826492] i915_gem_execbuffer2_ioctl+0x103/0x420 [i915] [ 774.826534] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 774.826540] drm_ioctl_kernel+0x88/0xf0 [ 774.826546] drm_ioctl+0x2f8/0x3b0 [ 774.826587] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 774.826598] do_vfs_ioctl+0xa5/0x6f0 [ 774.826605] ? 
__fget+0x114/0x200 [ 774.826611] ksys_ioctl+0x3a/0x70 [ 774.826616] __x64_sys_ioctl+0x16/0x20 [ 774.826621] do_syscall_64+0x5a/0x1c0 [ 774.826627] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 774.826631] RIP: 0033:0x7f92861395d7 [ 774.826635] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 [ 774.826638] RSP: 002b:00007f927aef5068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 774.826641] RAX: ffffffffffffffda RBX: 0000000000000043 RCX: 00007f92861395d7 [ 774.826644] RDX: 00007f927aef5110 RSI: 00000000c0406469 RDI: 0000000000000006 [ 774.826647] RBP: 00007f927aef5090 R08: 0000555c6cfa1890 R09: 0000000000000000 [ 774.826649] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f927aef5110 [ 774.826652] R13: 00000000c0406469 R14: 0000000000000006 R15: 0000555c6cfa1770 [ 774.826661] Modules linked in: fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel snd_intel_nhlt lpc_ich snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 pinctrl_broxton mei_me pinctrl_intel realtek mei sunrpc [ 774.826681] ---[ end trace 867abd8886fa79e0 ]--- [ 774.826724] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 774.826728] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 774.826731] RSP: 0018:ffffc900018b39b0 EFLAGS: 00010283 [ 774.826734] RAX: ffff888272178800 RBX: ffff8881ca561080 RCX: 0000000000000000 [ 774.826737] RDX: ffff888263f80000 RSI: ffffffff82661d60 RDI: ffff888204495f40 [ 774.826740] RBP: ffffc900018b39c8 R08: 0000000000000001 R09: ffffffffa0208c00 [ 774.826742] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881e2813d48 [ 774.826745] R13: ffffc900018b39c8 R14: ffff8881e2813c80 R15: dead000000000122 [ 774.826748] FS: 00007f927aef6700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 774.826751] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 774.826754] CR2: 00007f1da83b2000 CR3: 00000001e467e000 CR4: 00000000003406e0
Round 2?

commit 33437a8f0703f9eed186a11484b9f475de1c743f (HEAD, fdo/wip-mutex)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 18 13:39:56 2019 +0100

    drm/i915: Move global activity tracking from GEM to GT
I'll give it a whirl tomorrow; the machine in question is currently busy doing nightly testing.
The latest patch still isn't quite perfect; already during the first test iteration the following happens: [ 256.976769] general protection fault: 0000 [#1] PREEMPT SMP PTI [ 256.976781] CPU: 3 PID: 2601 Comm: h264parse0:sink Not tainted 5.3.0-rc6-CI-Custom_4350+ #1 [ 256.976784] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017 [ 256.976852] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 256.976857] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 256.976861] RSP: 0018:ffffc90001f839b0 EFLAGS: 00010283 [ 256.976865] RAX: ffff88818075c600 RBX: ffff88811cf0aec0 RCX: 0000000000000000 [ 256.976869] RDX: ffff88827180c000 RSI: 0000000000000000 RDI: 00000000ffffffff [ 256.976872] RBP: ffffc90001f839c8 R08: ffff88827180c550 R09: 0000000000000000 [ 256.976875] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88811a8cd148 [ 256.976878] R13: ffffc90001f839c8 R14: ffff88811a8cd080 R15: dead000000000122 [ 256.976882] FS: 00007f932cba4700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 256.976886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 256.976889] CR2: 00007fbae3374000 CR3: 0000000270288000 CR4: 00000000003406e0 [ 256.976892] Call Trace: [ 256.976954] i915_request_await_object+0x248/0x380 [i915] [ 256.976963] ? __mutex_unlock_slowpath+0x4b/0x2b0 [ 256.977014] i915_gem_do_execbuffer+0x18a1/0x1e00 [i915] [ 256.977034] ? __lock_acquire+0x25f/0x1cc0 [ 256.977114] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 256.977154] i915_gem_execbuffer2_ioctl+0x103/0x420 [i915] [ 256.977197] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 256.977202] drm_ioctl_kernel+0x88/0xf0 [ 256.977209] drm_ioctl+0x2f8/0x3b0 [ 256.977249] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 256.977260] do_vfs_ioctl+0xa5/0x6f0 [ 256.977268] ? 
__fget+0x114/0x200 [ 256.977273] ksys_ioctl+0x3a/0x70 [ 256.977278] __x64_sys_ioctl+0x16/0x20 [ 256.977283] do_syscall_64+0x5a/0x1c0 [ 256.977289] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 256.977293] RIP: 0033:0x7f9337de75d7 [ 256.977298] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 [ 256.977300] RSP: 002b:00007f932cba3068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 256.977304] RAX: ffffffffffffffda RBX: 0000000000000043 RCX: 00007f9337de75d7 [ 256.977307] RDX: 00007f932cba3110 RSI: 00000000c0406469 RDI: 0000000000000006 [ 256.977310] RBP: 00007f932cba3090 R08: 00005561beddcf90 R09: 0000000000000000 [ 256.977312] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f932cba3110 [ 256.977315] R13: 00000000c0406469 R14: 0000000000000006 R15: 00005561beddce70 [ 256.977323] Modules linked in: fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_nhlt snd_hda_codec r8169 realtek lpc_ich snd_hwdep snd_hda_core snd_pcm pinctrl_broxton pinctrl_intel mei_me mei sunrpc [ 256.978130] ---[ end trace 7db3409bc5d5dfb1 ]--- [ 256.978194] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 256.978199] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 256.978204] RSP: 0018:ffffc90001f839b0 EFLAGS: 00010283 [ 256.978209] RAX: ffff88818075c600 RBX: ffff88811cf0aec0 RCX: 0000000000000000 [ 256.978211] RDX: ffff88827180c000 RSI: 0000000000000000 RDI: 00000000ffffffff [ 256.978214] RBP: ffffc90001f839c8 R08: ffff88827180c550 R09: 0000000000000000 [ 256.978217] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88811a8cd148 [ 256.978219] R13: ffffc90001f839c8 R14: ffff88811a8cd080 R15: dead000000000122 [ 256.978223] FS: 00007f932cba4700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 256.978225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 256.978228] CR2: 00007fbae3374000 CR3: 0000000270288000 CR4: 00000000003406e0 The machine doesn't hang, but that's about the only good news; after this the other iterations continue (without additional errors in dmesg), but none of the test runs are successful. A point of reference might be that doing: cat /sys/kernel/debug/dri/0/i915_gem_objects seems either take roughly forever, or is stuck somehow.
commit d22dd615143516b49ca61d6a800719b7fbafafce (fdo/wip-mutex)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 18 13:39:56 2019 +0100

    drm/i915: Move global activity tracking from GEM to GT

?

That's annoying, I was sure that was ready. Could I get you to squeeze a lockdep + CONFIG_KASAN run in?
That kernel is (well, should be) built with lockdep. Not KASAN though.
Actually, the kernel I tested with has KASAN enabled too.
(In reply to David Weinehall from comment #57)
> Actually, the kernel I tested with has KASAN enabled too.

Could you attach the entire dmesg please?
Created attachment 145219 [details] dmesg with latest wip-mutex applied
Ok, it hasn't picked up CONFIG_KASAN. It's complaining about a dead list, but I'm not sure which one. KASAN should ideally show which particular object is being used after free. Could you double check what happened to KASAN in the build :)
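Note that CONFIG_HAVE_ARCH_KASAN=y only means the architecture supports it; to actually enable it, something like:
---------------------------------------------
./scripts/config --enable CONFIG_KASAN
make olddefconfig
grep '^CONFIG_KASAN=' .config     # should now report CONFIG_KASAN=y
---------------------------------------------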
Sorry, my bad--I just grepped for KASAN, noticed CONFIG_HAVE_ARCH_KASAN=y and thought nothing more about it. New kernel built, will run tests.
[ 2138.371643] ================================================================== [ 2138.371911] BUG: KASAN: use-after-free in per_file_stats+0x43/0x380 [i915] [ 2138.371924] Read of size 8 at addr ffff888223651000 by task cat/8293 [ 2138.371947] CPU: 0 PID: 8293 Comm: cat Not tainted 5.3.0-rc6-CI-Custom_4352+ #1 [ 2138.371953] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017 [ 2138.371959] Call Trace: [ 2138.371974] dump_stack+0x7c/0xbb [ 2138.372099] ? per_file_stats+0x43/0x380 [i915] [ 2138.372108] print_address_description+0x73/0x3a0 [ 2138.372231] ? per_file_stats+0x43/0x380 [i915] [ 2138.372352] ? per_file_stats+0x43/0x380 [i915] [ 2138.372362] __kasan_report+0x14e/0x192 [ 2138.372489] ? per_file_stats+0x43/0x380 [i915] [ 2138.372502] kasan_report+0xe/0x20 [ 2138.372625] per_file_stats+0x43/0x380 [i915] [ 2138.372751] ? i915_panel_show+0x110/0x110 [i915] [ 2138.372761] idr_for_each+0xa7/0x160 [ 2138.372773] ? idr_get_next_ul+0x110/0x110 [ 2138.372782] ? do_raw_spin_lock+0x10a/0x1d0 [ 2138.372923] print_context_stats+0x264/0x510 [i915] [ 2138.373051] ? i915_interrupt_info+0x1140/0x1140 [i915] [ 2138.373065] ? preempt_count_sub+0x14/0xc0 [ 2138.373074] ? __mutex_lock+0x656/0xcb0 [ 2138.373092] ? __mutex_add_waiter+0x90/0x90 [ 2138.373121] ? seq_vprintf+0xb0/0xb0 [ 2138.373262] i915_gem_object_info+0xc8/0xe0 [i915] [ 2138.373276] seq_read+0x1a4/0x6b0 [ 2138.373306] full_proxy_read+0x8e/0xc0 [ 2138.373323] vfs_read+0xc3/0x1e0 [ 2138.373338] ksys_read+0x116/0x170 [ 2138.373348] ? kernel_write+0xb0/0xb0 [ 2138.373361] ? lockdep_hardirqs_off+0xb5/0x100 [ 2138.373368] ? mark_held_locks+0x1a/0x90 [ 2138.373385] do_syscall_64+0x72/0x260 [ 2138.373397] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2138.373405] RIP: 0033:0x7f7da3356081 [ 2138.373415] Code: fe ff ff 48 8d 3d 67 9c 0a 00 48 83 ec 08 e8 a6 4c 02 00 66 0f 1f 44 00 00 48 8d 05 81 08 2e 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53 [ 2138.373421] RSP: 002b:00007ffeed129258 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 2138.373429] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f7da3356081 [ 2138.373435] RDX: 0000000000020000 RSI: 00007f7da381b000 RDI: 0000000000000003 [ 2138.373441] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000 [ 2138.373447] R10: 0000000000000022 R11: 0000000000000246 R12: 00007f7da381b000 [ 2138.373453] R13: 0000000000000003 R14: 00007f7da381b00f R15: 0000000000020000 [ 2138.373494] Allocated by task 7737: [ 2138.373507] __kasan_kmalloc.constprop.0+0xcb/0x130 [ 2138.373515] kmem_cache_alloc+0xdf/0x2d0 [ 2138.373641] i915_gem_object_create_shmem.part.1+0x1d/0x2c0 [i915] [ 2138.373767] i915_gem_create+0x96/0x140 [i915] [ 2138.373776] drm_ioctl_kernel+0x114/0x190 [ 2138.373783] drm_ioctl+0x4ba/0x580 [ 2138.373790] do_vfs_ioctl+0x134/0x9d0 [ 2138.373797] ksys_ioctl+0x3a/0x70 [ 2138.373803] __x64_sys_ioctl+0x3d/0x50 [ 2138.373810] do_syscall_64+0x72/0x260 [ 2138.373817] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2138.373832] Freed by task 187: [ 2138.373843] __kasan_slab_free+0x146/0x200 [ 2138.373851] kmem_cache_free+0xb3/0x390 [ 2138.373975] __i915_gem_free_object_rcu+0x3c/0x60 [i915] [ 2138.373982] rcu_core+0x326/0xa10 [ 2138.373990] __do_softirq+0x12f/0x618 [ 2138.374006] The buggy address belongs to the object at ffff888223650f00 which belongs to the cache drm_i915_gem_object of size 1136 [ 2138.374018] The buggy address is located 256 bytes inside of 1136-byte region [ffff888223650f00, 
ffff888223651370) [ 2138.374028] The buggy address belongs to the page: [ 2138.374040] page:ffffea00088d9400 refcount:1 mapcount:0 mapping:ffff888235ed5b80 index:0x0 compound_mapcount: 0 [ 2138.374050] flags: 0x8000000000010200(slab|head) [ 2138.374061] raw: 8000000000010200 0000000000000000 0000000100000001 ffff888235ed5b80 [ 2138.374068] raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000 [ 2138.374074] page dumped because: kasan: bad access detected [ 2138.374087] Memory state around the buggy address: [ 2138.374099] ffff888223650f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374110] ffff888223650f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374121] >ffff888223651000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374130] ^ [ 2138.374141] ffff888223651080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374152] ffff888223651100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374161] ==================================================================
Note: the i915_gem_objects info is saved after all tests have finished, not during testing, so the trigger for the above-listed KASAN issue is not directly related to the deadlock. It shows that there's some brokenness in i915 GEM object accounting, though.
(In reply to Eero Tamminen from comment #63)
> Note: the i915_gem_objects info is saved after all tests have finished, not
> during testing, so the trigger for the above-listed KASAN issue is not
> directly related to the deadlock. It shows that there's some brokenness in
> i915 GEM object accounting, though.

The good news / bad news: it's just a bug in the debugfs. Was there anything else?
David is still running tests, but so far it seems that the deadlock is finally gone (he's running them with a lockdep + KASAN kernel, but I guess testing without those is also needed). When using smaller-resolution video, it's possible to run a bit over a hundred parallel ffmpeg instances without the device OOMing; that's what David is currently testing.
I compiled a new kernel using your latest wip-mutex tree (as of Tue, 03 Sep 2019 10:41:32 +0000). The following tests were run (20 iterations each):

gstreamer_hd22_p50
ffmpeg_qsv_hd22_p50

Both passed. We upped the number of parallel processes a bit; with 100 processes the tests still passed. With 150 processes the test failed due to out of memory, but the failure was graceful -- the machine recovered properly and no panic, OOPS, or i915 GPU hang was triggered.

Once our nightly test finishes for the night I'll run one more test run with a standard kernel (no lockdep, no KASAN).
A test run without KASAN & lockdep enabled completed successfully, BUT the following log messages could be found in dmesg (triggered during the test runs):

71 Resetting rcs0 for preemption time out
 3 Resetting vcs0 for preemption time out
50 Resetting vecs0 for preemption time out

So it seems likely that there's still some nasty issue remaining.
Created attachment 145307 [details]
Broxton error state for only partly recoverable VCS0 hang in a similar test-case

Got a recoverable GPU hang with 5 parallel test-cases in another media case on BXT, after which everything following it also gets GPU reset. This is with the v5.2 drm-tip kernel and the i965 media driver from Ubuntu, so I'm not sure it's related to the resets David mentioned above. Chris?
Created attachment 145596 [details]
Broxton error state for recoverable RCS0 hang in a similar test-case
Time to test drm-tip finally, as of

commit 7e8057626640cfedbae000c5032be32269713687
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 4 14:40:02 2019 +0100

    drm/i915: Drop struct_mutex from around i915_retire_requests()
(In reply to Chris Wilson from comment #70)
> Time to test drm-tip finally, as of
>
> commit 7e8057626640cfedbae000c5032be32269713687
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Oct 4 14:40:02 2019 +0100
>
>     drm/i915: Drop struct_mutex from around i915_retire_requests()

Thanks, I've re-enabled the tests triggering the issue, and will report back after a few nights of runs.
Two nights in a row without any problems with FFmpeg QSV or MSDK in the tests that earlier deadlocked -> FIXED / VERIFIED.

PS. The (easily reproducible) Iris-related GPU hangs were not impacted by this fix:
* HdrBloom: bug 111385
* ICL / TerrainFlyTess: bug 111936
* ICL / CSDof: https://gitlab.freedesktop.org/mesa/mesa/issues/1368
(In reply to Eero Tamminen from comment #72)
> Two nights in a row without any problems with FFmpeg QSV or MSDK in the tests
> that earlier deadlocked -> FIXED / VERIFIED.

Chris, something similar started to happen in the same MSDK test within the past few days. Is there some regression with mutex handling?

This time it led to a system hang: the network dropped and the machine didn't react to any other input (from USB mouse/keyboard) besides SysRq-Boot.

Should I re-open this or file a new bug if it continues?
(In reply to Eero Tamminen from comment #73)
> Chris, something similar started to happen in the same MSDK test within the
> past few days. Is there some regression with mutex handling?
>
> This time it led to a system hang: the network dropped and the machine
> didn't react to any other input (from USB mouse/keyboard) besides SysRq-Boot.
>
> Should I re-open this or file a new bug if it continues?

I have one in mind, but its effect is just lost powersaving. Losing the network suggests spinlock/irq issues, which is a much different bug. So best to open a new bug so the information is fresh.