Setup 1:
* HW: BXT J4205
* OS: ClearLinux
* kernel: drm-tip compiled from git
* media: MediaSDK and its deps compiled from git (GitHub)
* FFmpeg: month-old git version: 2019-05-08 c636dc9819 "libavfilter/dnn: add more data type support for dnn model input"
* GUI: Weston / Wayland / Mesa compiled from git

Setup 2 (differences from setup 1):
* OS: Ubuntu 18.04
* FFmpeg: latest git version
* GUI: Unity from Ubuntu with X server & Mesa compiled from git

Test-case:
1. Run 3D benchmarks
2. Do several runs of 50 parallel instances of the following H264 transcode operation:
   ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264
3. Do a single non-parallel run of the above

Expected outcome:
* Everything works fine, like with the month-older 5.1 drm-tip kernel, or with GEN9 Core devices

Actual outcome:
* FFmpeg freezes at step 3
* No errors or warnings in dmesg
* Anything else trying to use the GPU (even just glxinfo, clinfo) freezes
* Some processes not using the GPU also freeze when started

There are no warnings or errors in dmesg.

The first time I saw the media tests freezing was around the 28th of May, so the regression may have come between the following drm-tip commits:
* 2019-05-27 14:42:27 e8f06c34fa: drm-tip: 2019y-05m-27d-14h-41m-23s UTC integration manifest
* 2019-05-28 15:48:05 8991a80f85: drm-tip: 2019y-05m-28d-15h-47m-22s UTC integration manifest

Which media case freezes differs a bit, e.g. based on FFmpeg version, and from week to week. It's not always the same, but it's always the single-instance test freezing after the parallel tests have seemingly [1] finished. Both FFmpeg QSV and MediaSDK sample application H264 transcode cases have frozen; which one differs between setup 1 & setup 2.

If I run just the media cases after boot, I don't see the freeze, so there's some interaction with what the 3D benchmarks do.

[1] With last night's drm-tip kernel, the "ps" output for processes in D state, and a few other relevant processes, is the following:
---------------------------------------------
 ...
   38 ?        DN     0:00 [khugepaged]
 ...
  396 tty7     Ssl+   5:08 /opt/install/bin/weston --tty=7 --idle-time=0 --xwayland
 ...
  545 tty7     Dl+    0:00 Xwayland :0 -rootless -listen 60 -listen 61 -wm 62 -terminate
 ...
 8444 ?        D      0:06 [ffmpeg]
 8447 ?        Zl     0:06 [ffmpeg] <defunct>
 8448 ?        Zl     0:06 [ffmpeg] <defunct>
 8449 ?        Zl     0:06 [ffmpeg] <defunct>
 8451 ?        Zl     0:06 [ffmpeg] <defunct>
 8453 ?        D      0:06 [ffmpeg]
 8483 ?        Zl     0:06 [ffmpeg] <defunct>
 8497 ?        Zl     0:06 [ffmpeg] <defunct>
 8512 ?        Zl     0:06 [ffmpeg] <defunct>
 8525 ?        Zl     0:06 [ffmpeg] <defunct>
 8531 ?        Zl     0:06 [ffmpeg] <defunct>
 8546 ?        Zl     0:06 [ffmpeg] <defunct>
 8559 ?        Zl     0:06 [ffmpeg] <defunct>
 8574 ?        Zl     0:06 [ffmpeg] <defunct>
 8585 ?        Zl     0:06 [ffmpeg] <defunct>
 8603 ?        Zl     0:06 [ffmpeg] <defunct>
 8623 ?        D      0:06 [ffmpeg]
 8642 ?        Zl     0:06 [ffmpeg] <defunct>
 8650 ?        D      0:06 [ffmpeg]
 8678 ?        Zl     0:06 [ffmpeg] <defunct>
 8697 ?        Zl     0:06 [ffmpeg] <defunct>
 8704 ?        D      0:06 [ffmpeg]
 8711 ?        D      0:06 [ffmpeg]
 8723 ?        D      0:06 [ffmpeg]
 8733 ?        D      0:06 [ffmpeg]
 8756 ?        Zl   143:22 [ffmpeg] <defunct>
 8793 ?        Zl     0:06 [ffmpeg] <defunct>
 8822 ?        Zl     0:06 [ffmpeg] <defunct>
 8837 ?        Zl     0:06 [ffmpeg] <defunct>
 8845 ?        D      0:06 [ffmpeg]
 8851 ?        Zl     0:06 [ffmpeg] <defunct>
 8858 ?        Zl     0:06 [ffmpeg] <defunct>
 8871 ?        Zl     0:06 [ffmpeg] <defunct>
 8893 ?        Zl     0:06 [ffmpeg] <defunct>
 8942 ?        Zl     0:06 [ffmpeg] <defunct>
 8958 ?        D      0:06 [ffmpeg]
 8983 ?        Zl     0:06 [ffmpeg] <defunct>
 8991 ?        D      0:06 [ffmpeg]
 8999 ?        Zl     0:06 [ffmpeg] <defunct>
 9013 ?        D      0:06 [ffmpeg]
 9017 ?        D      0:06 [ffmpeg]
 9035 ?        Zl     0:06 [ffmpeg] <defunct>
 9058 ?        Zl     0:06 [ffmpeg] <defunct>
 9071 ?        Zl     0:06 [ffmpeg] <defunct>
 9117 ?        Zl     0:06 [ffmpeg] <defunct>
 9122 ?        D      0:06 [ffmpeg]
 9152 ?        Zl     0:06 [ffmpeg] <defunct>
 9165 ?        Zl     0:06 [ffmpeg] <defunct>
 9180 ?        D      0:06 [ffmpeg]
 9250 ?        D      0:06 [ffmpeg]
 9758 ?        Ds     0:00 ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264
 ...
10184 ?        D      0:00 top
 ...
---------------------------------------------

50 ffmpeg instances can use quite a few GEM objects (and FFmpeg QSV uses more of them than MediaSDK alone, or FFmpeg VAAPI), which can put memory pressure on the system, so the kernel "khugepaged" thread being in D state is suspicious.

Other notes:
* I also had a process running in the background which uses ftrace uprobes to track frame update functions from the 3D and media processes, and which tracks "i915:intel_gpu_freq_change" events
* Killing that process after the freeze caused ssh to stop working, so it's possible that there's some connection to ftrace
What's the latest tip tested? We're carrying a couple of workarounds for mm/ bugs:

commit f27a5d91201639161d6f6e25af1c89c9cbb3cac7
Author: Hugh Dickins <hughd@google.com>
Date:   Wed May 29 09:25:40 2019 +0200

    x86/fpu: Use fault_in_pages_writeable() for pre-faulting

commit 2e5e960659d76767ff9e9ef04306dbf36f07f46f (topic/core-for-CI)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 3 08:08:43 2019 +0100

    mm: Band aid for 5fd4ca2d84b249f0858ce28cf637cf25b61a398f
(In reply to Chris Wilson from comment #1)
> What's the latest tip tested?

Yesterday evening:
2019-06-05 16:28:24 7a984cf096: drm-tip: 2019y-06m-05d-16h-27m-26s UTC integration manifest

(In reply to Eero Tamminen from comment #0)
> 50 ffmpeg instances can use quite a few GEM objects (and FFmpeg QSV uses more
> of them than MediaSDK alone, or FFmpeg VAAPI)

About this much (2 GB in total, which shouldn't be a problem):
------------------------------------------
12769 shrinkable objects, 2378752000 bytes
491 unbound objects, 29552640 bytes
12277 bound objects, 2349133824 bytes
51 purgeable objects, 10121216 bytes
220 mapped objects, 11239424 bytes
2220 huge-paged objects (2M, 64K, 4K) 2080702464 bytes
12 display objects (globally pinned), 8851456 bytes
4294967296 [0x0000000010000000] gtt total
Supported page sizes: 2M, 64K, 4K
------------------------------------------

On each round the max number of huge-paged objects seems to grow, although it drops back to the same number after the tests have finished.
(In reply to Eero Tamminen from comment #0)
> Test-case:
> 1. Run 3D benchmarks
> 2. Do several runs of 50 parallel instances of the following H264 transcode
>    operation:
>    ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i
>    1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf
>    scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264

And on each round, kill the whole process group with the 50 parallel transcoders after a few minutes.

> 3. Do a single non-parallel run of the above
>
> Expected outcome:
> * Everything works fine, like with the month-older 5.1 drm-tip kernel, or with
>   GEN9 Core devices

I'm pretty sure the reason this happens only on BXT is that it's much slower than Core devices, so this test often hits the specified timeout before it finishes. At timeout, the whole transcoding process group is killed.

=> I think the trigger for this issue is the killing of a large number of media processes, and there being some issue / race / deadlock in the kernel media resource de-allocation path.

=> I'll try to get sysrq task info / backtraces for them after I've reproduced the issue again.

> Which media case freezes differs a bit, e.g. based on FFmpeg version, and
> from week to week. It's not always the same, but it's always the single-
> instance test freezing after the parallel tests have seemingly [1] finished.

The reason the single process seems to be the one that freezes after the parallel run is that when the parallel transcoding process group is killed, only the parallelization wrapper actually dies, because the transcoding processes are in D state. Because the wrapper died, the test runner thinks that the test finished, and starts the next one. And as the single transcoding process invoked directly after them freezes and refuses to die, it looks like that's where it froze, although the freeze already happened in the previous test.

PS: while the sets of 3D tests run before the media tests in setup 1 and setup 2 differ, some of the tests in setup 1 are Wayland/EGL variants of the X/GLX tests in setup 2. Those are GfxBench 5.0 and SynMark2 v7. The X-based GPU memory bandwidth tests are the same in both setups. The rest of the tests differ, but are also simpler, so most likely they don't matter for this bug.
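For reference, the parallel step plus the group kill is roughly equivalent to the following sketch (illustrative only, not the actual test runner; the client count, sleep time, output names and signal are placeholders):
---------------------------------------------
#!/bin/bash
# illustrative repro sketch of steps 2-3 plus the group kill
set -m        # job control: each background job gets its own process group

transcode() {
    ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv \
        -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K \
        -vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 \
        -frames 2400 -y "$1"
}

# step 2: 50 parallel transcoders in their own process group
( for i in $(seq 50); do transcode "out_$i.h264" & done; wait ) &
PGID=$!

sleep 180                  # simulated test timeout
kill -INT -- "-$PGID"      # kill the whole transcoding process group

# step 3: a single transcode right after the kill -- this is what appears to freeze
transcode single.h264
---------------------------------------------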
When it gets stuck, grab the stacktraces of all processes, sysrq-t or cat /proc/*/stack.
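For example, something like this should do it (as root):
---------------------------------------------
# sysrq-t: dump all task states + kernel stacks into the kernel log
echo t > /proc/sysrq-trigger
dmesg > sysrq-t.txt

# or collect the kernel stack of each task via procfs
for p in /proc/[0-9]*; do echo "== $p"; cat "$p/stack"; done > proc-stacks.txt
---------------------------------------------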
Created attachment 144476 [details]
Kernel stacks for all tasks when the problem happens

# for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/stack; done | grep -B1 -A4 "disk sleep"
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: kworker/u8:0+i915
State: D (disk sleep)
Pid: 7274
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
--
Name: ffmpeg
State: D (disk sleep)
Pid: 7293
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
--- lots of similar ffmpeg tasks ---
Name: ffmpeg
State: D (disk sleep)
Pid: 8606
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
--- lots of similar ffmpeg tasks ---
Name: ps
State: D (disk sleep)
Pid: 8751
[<0>] proc_pid_cmdline_read+0x1e3/0x330
[<0>] vfs_read+0x91/0x140
[<0>] ksys_read+0x91/0xe0

I guess the ffmpeg processes in the exit stage are the ones from the previous test, and the ones trying to open the device file are ffmpeg processes from the next test. To me it looks like the kworker doing the __i915_gem_free_work() stuff is blocking the khugepaged task and everything else.

PS. The freeze when "ps" reads PID command lines seems to happen for running processes:

$ for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/cmdline; done
...
Name: ffmpeg
State: Z (zombie)
Pid: 7937
Name: ffmpeg
State: Z (zombie)
Pid: 7977
Name: ffmpeg
State: R (running)
Pid: 8008
<freeze>
From the full ffmpeg stack the trigger is clearer. SIGINT is sent to the transcoding process group, the kernel reacts to that, and on do_exit() for FFmpeg it calls task_work_run(), which I assume queues the work for the kworker that then calls __i915_gem_free_work()?

Name: kworker/u8:0+i915
State: D (disk sleep)
Pid: 7274
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40

Name: ffmpeg
State: Z (zombie)
Pid: 7285
Name: ffmpeg
State: Z (zombie)
Pid: 7292
Name: ffmpeg
State: D (disk sleep)
Pid: 7293
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
[<0>] do_signal+0x30/0x6b0
[<0>] exit_to_usermode_loop+0x96/0xb0
[<0>] prepare_exit_to_usermode+0x8c/0xb0
[<0>] retint_user+0x8/0x8

(In reply to Eero Tamminen from comment #3)
> PS: while the sets of 3D tests run before the media tests in setup 1 and setup 2
> differ, some of the tests in setup 1 are Wayland/EGL variants of the X/GLX
> tests in setup 2. Those are GfxBench 5.0 and SynMark2 v7. The X-based GPU
> memory bandwidth tests are the same in both setups. The rest of the tests differ,
> but are also simpler, so most likely they don't matter for this bug.

Running just the gfxbench, synmark & membw 3D tests before the media tests wasn't enough either, so for now I can test this only after a full 3-4h test round => I can't really bisect this down to steps that would take only minutes; iterating from the full set down to something like that would take way more time than I'm allowed for this. :-/
I've now verified that every time this freeze happens, these are stuck:
-----------------------------------------------
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: kworker/u8:2+i915
State: D (disk sleep)
Pid: 7911
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
-----------------------------------------------

in addition to the stuck transcoding tasks:
-----------------------------------------------
...
Name: sample_multi_tr
State: D (disk sleep)
Pid: 20118
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
[<0>] do_signal+0x30/0x6b0
[<0>] exit_to_usermode_loop+0x96/0xb0
[<0>] do_syscall_64+0x103/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
--
Name: sample_multi_tr
State: D (disk sleep)
Pid: 20221
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
[<0>] path_openat+0x564/0x11f0
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1bd/0x260
[<0>] do_syscall_64+0x4f/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
-----------------------------------------------

The above is from last evening's drm-tip git build. For some reason on ClearLinux (comment 5) the ffmpeg git build gets stuck, and on Ubuntu 18.04 (this comment) it's the MSDK git build test app. In both setups, the kernel & media stack are using the same binaries.

The main differences between these setups are:
- the HW is supposed to be identical, but they are separate machines
- Weston vs. X server running in the background (which aren't used in the media tests)
- optimizations in the ClearLinux libc

I think the differences in which app freezes are just due to some differences in timing.
Created attachment 144520 [details]
Kernel stacks for all tasks when the problem happens

(In reply to Eero Tamminen from comment #7)
> I've now verified that every time this freeze happens, these are stuck:
> -----------------------------------------------
> Name: khugepaged
> State: D (disk sleep)
> Pid: 38
> [<0>] kthread+0x113/0x130
> [<0>] ret_from_fork+0x35/0x40
> --
> Name: kworker/u8:2+i915
> State: D (disk sleep)
> Pid: 7911
> [<0>] __i915_gem_free_work+0x5f/0x90 [i915]
> [<0>] process_one_work+0x1e9/0x410
> [<0>] worker_thread+0x2d/0x3d0
> [<0>] kthread+0x113/0x130
> [<0>] ret_from_fork+0x35/0x40
> -----------------------------------------------
>
> in addition to the stuck transcoding tasks: [...]

Today, there was no frozen khugepaged, just these:
-----------------------------------------------
Name: kworker/u8:0+i915
State: D (disk sleep)
Pid: 7194
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
...
Name: ffmpeg
State: D (disk sleep)
Pid: 9407
[<0>] drm_file_free.part.0+0x1c8/0x280
[<0>] drm_release+0x9c/0xd0
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
[<0>] do_signal+0x30/0x6b0
[<0>] exit_to_usermode_loop+0x96/0xb0
[<0>] do_syscall_64+0x103/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
-----------------------------------------------

In addition to the earlier listed transcode processes stuck at fput() and chrdev_open(). See the attachment.

=> I assume the khugepaged stuff isn't related to the issue; it was just an innocent bystander.
Those stacks are all waiting for struct_mutex, but nothing appears to be actually holding it.
(In reply to Chris Wilson from comment #9)
> Those stacks are all waiting for struct_mutex, but nothing appears to be
> actually holding it.

Any suggestions on how to get more info on the issue for you, either with some kernel command line option [1], or after the issue has triggered?

[1] anything affecting performance could change when/how the issue triggers.
(In reply to Chris Wilson from comment #9)
> Those stacks are all waiting for struct_mutex, but nothing appears to be
> actually holding it.

It would be a kernel bug, but is it possible that the mutex had been taken by one of the killed processes, which was then killed before releasing the mutex, and the kernel somehow missed releasing the lock for the killed process?
It is possible; that is the type of bug lockdep complains loudly about. If you do get the opportunity to run with lockdep and khungtaskd, that should dump all locks held and their owners when stuck.
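For reference, enabling the hung task detector alongside lockdep is roughly this (a sketch; the timeout value is just an example):
---------------------------------------------
# kernel config (in addition to CONFIG_PROVE_LOCKING):
./scripts/config --enable CONFIG_DETECT_HUNG_TASK

# at runtime, warn earlier than the default 120s:
echo 60 > /proc/sys/kernel/hung_task_timeout_secs

# held locks (with their owners) can also be dumped on demand via sysrq-d:
echo d > /proc/sysrq-trigger
---------------------------------------------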
It can also trigger with a smaller number of parallel media processes; 5 seems to be enough:

# ps ax|grep -e ' Z' -e ' D'
   38 ?        DN     0:00 [khugepaged]
   41 ?        D      0:05 [kworker/u8:1+i9]
18218 ?        Zl   146:04 [sample_multi_tr] <defunct>
18219 ?        Zl     0:05 [sample_multi_tr] <defunct>
18221 ?        Zl     0:05 [sample_multi_tr] <defunct>
18225 ?        D      0:05 [sample_multi_tr]
18244 ?        D      0:05 [sample_multi_tr]
18415 ?        Ds     0:00 sample_multi_transcode -i::mpeg2 1920x1080i_29.97_20mb_mpeg2_high.mpv -o::h264 0024_HD17i7_1.0.h264 -b 6000 -u 7 -n 2400 -async 4 -hw

# for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/stack; done | grep -B1 -A4 "disk sleep"
Name: sample_multi_tr
State: D (disk sleep)
Pid: 18225
[<0>] drm_file_free.part.0+0x1c8/0x280
[<0>] drm_release+0x9c/0xd0
[<0>] __fput+0xae/0x200
--
Name: sample_multi_tr
State: D (disk sleep)
Pid: 18244
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
--
Name: sample_multi_tr
State: D (disk sleep)
Pid: 18415
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
--
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: kworker/u8:1+i915
State: D (disk sleep)
Pid: 41
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
(In reply to Chris Wilson from comment #12)
> It is possible; that is the type of bug lockdep complains loudly about. If
> you do get the opportunity to run with lockdep and khungtaskd, that should
> dump all locks held and their owners when stuck.

I could give it a quick try. Which exact kernel build CONFIG_* options and kernel boot command line options do I need for these with the drm-tip git kernel version?
Created attachment 144597 [details]
KBL GT3e i915 error state for recoverable GPU hang in the same use-case

I'm not sure whether they are at all relevant for this deadlock(?) freeze issue, but I'm also seeing some random recoverable GPU hangs in the same test-case on Core devices.
A large number of parallel media workloads, which get a group KILL signal (because they took too much time), still continues to deadlock i915 and other parts of the kernel in a most annoying way with the latest drm-tip 5.2.0-rc7 git kernel. The device doesn't appear frozen to normal freeze checks, but "random" operations on it get stuck:

# for i in /proc/[0-9]*; do egrep '^(Name|State|Pid)' $i/status; cat $i/stack; done | grep -B1 -A6 "disk sleep"
Name: kworker/u8:1+i915
State: D (disk sleep)
Pid: 10660
[<0>] __i915_gem_free_work+0x5f/0x90 [i915]
[<0>] process_one_work+0x1e9/0x410
[<0>] worker_thread+0x2d/0x3d0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
--
Name: ffmpeg
State: D (disk sleep)
Pid: 10704
[<0>] __fput+0xae/0x200
[<0>] task_work_run+0x84/0xa0
[<0>] do_exit+0x308/0xba0
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0x121/0x910
--
<17 similar FFmpeg processes>
--
Name: compiz
State: D (disk sleep)
Pid: 1195
[<0>] i915_gem_create_ioctl+0x17/0x30 [i915]
[<0>] drm_ioctl_kernel+0x88/0xf0
[<0>] drm_ioctl+0x2f8/0x3b0
[<0>] do_vfs_ioctl+0xa4/0x630
[<0>] ksys_ioctl+0x3a/0x70
--
Name: ffmpeg
State: D (disk sleep)
Pid: 12011
[<0>] chrdev_open+0xa3/0x1b0
[<0>] do_dentry_open+0x1c4/0x380
[<0>] path_openat+0x564/0x11f0
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1bd/0x260
--
Name: ps
State: D (disk sleep)
Pid: 16106
[<0>] proc_pid_cmdline_read+0x1e3/0x330
[<0>] vfs_read+0x91/0x140
[<0>] ksys_read+0x91/0xe0
[<0>] do_syscall_64+0x4f/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
--
Name: khugepaged
State: D (disk sleep)
Pid: 38
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40

(This semi-zombie device state breaks our automation, because it somehow manages to freeze a Jenkins job that shouldn't be doing anything on the device, just trigger the start of testing on that machine with a forced reboot job.)
So far the smallest set of tests with which I've managed to reproduce the freeze is this (in this order):
* running 3x 33 Synmark tests in fullscreen
* running 5x 3 GLBenchmark windowed tests, all both onscreen & offscreen
* running 3x 10 GpuTest windowed & fullscreen tests
* running 3x 5 different FFmpeg VAAPI transcode operations, alternating between multiple processes (parallel process count being from 5 to 50) and a single transcode process
* at the end, doing a process group kill for the 50-parallel-processes case before the next single transcode process

However, that freeze wasn't reproducible on the next boot. From earlier data, it seems that running this amount of 3D tests before the media tests results in approximately 1/2 reproducibility, and a 2x larger set of 3D tests before the media tests makes it nearly fully reproducible (not sure whether use of MediaSDK also made it more reproducible, as I hadn't tried the plain VAAPI backend before).

I haven't been able to trigger the media freeze by using just fullscreen 3D applications, so I assume some parallel 3D engine usage is needed (X or Weston compositing in addition to the 3D benchmark itself) before the parallel media engine usage. But running just the (above) windowed 3D benchmarks before the media tests hasn't yet been enough to trigger the freeze either.

=> I think i915 would need some specific IGT tests that:
* access i915 from multiple processes in parallel
* do that for each supported engine in turn, and then all of them at the same time
* have some of those processes in a separate process group that gets killed as a group during testing, with the tester just forking more of them as replacements
* are run under lockdep
(In reply to Eero Tamminen from comment #17)
> So far the smallest set of tests with which I've managed to reproduce the
> freeze is this (in this order):
[...]

I have a slightly smaller test-case, with a freeze reproducibility of about 1/3:
1. Start Ubuntu 18.04 with X/Unity & the latest git gfx/media stack
2. Run a large number of Wayland test cases, which fail immediately at start because there's no Wayland. I don't think these are relevant, but if some of them briefly open a window before exiting, that does exercise i915
3. Run 3x 7 (very heavy / low FPS) windowed GPU bandwidth tests
4. Run 3x 8 different FFmpeg QSV (= using MediaSDK) transcode operations, alternating between multiple processes (parallel process count being from 5 to 50) and single transcode process versions of them
5. Run the same things as in 4), but using MediaSDK (transcode sample app) directly

In some of these 1/3 freeze occurrences, the freeze happens in step 3), sometimes in step 4).

After the media freeze, reboot gets stuck at shutdown and the device needs to be power-cycled.
The patch we identified as being problematic was reverted just before 5.2; tbf I don't have a lot of confidence that was the only problematic patch in that mm/ series. e8f06c34fa..8991a80f85 does contain more fences, and so increased opportunity for a missed wakeup. They would however be shown as processes stuck waiting for a fence inside i915; whereas what your kernel stacks show are processes stuck waiting for page io (if I read that correctly, generally from __dput).
(In reply to Chris Wilson from comment #19)
> e8f06c34fa..8991a80f85 does contain more fences, and so increased
> opportunity for a missed wakeup. They would however be shown as processes
> stuck waiting for a fence inside i915; whereas what your kernel stacks show
> are processes stuck waiting for page io (if I read that correctly, generally
> from __dput).

Do you have detailed instructions on how to enable lockdep (Google didn't tell)? I.e. what *specific* kernel config options need to be enabled, what to set on the kernel command line, etc.?

I could give it a try after I come back from summer vacation, if this is still happening.
$ ./scripts/config --enable CONFIG_PROVE_LOCKING

selects everything required for lockdep, i.e.

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
# CONFIG_LOCK_STAT is not set
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)
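For completeness, a sketch of the full flow after toggling the option (the build/install commands are the usual ones, adjust to your packaging):
---------------------------------------------
./scripts/config --enable CONFIG_PROVE_LOCKING
make olddefconfig                  # resolve the newly selected options
make -j"$(nproc)" && make modules_install install
---------------------------------------------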
Quick testing didn't give any warnings from lockdep, but I wasn't able to trigger the freeze yet either. David said that he could try the full test set with lockdep some time next week.
Created attachment 144814 [details]
dmesg from a Broxton stuck during media testing
I managed to find an easy way to reproduce unrecoverable hangs. The full test set isn't necessary; it's enough to run 5-10 iterations of the gstreamer_hd22_p50 testcase.

Eero's lockdep-enabled configuration wasn't enough (I'm assuming he used the options you provided), but I enabled a few more debugging options, which gave quite a lot of output -- hopefully some of it is useful. dmesg attached (well, as much of it as fit in the scrollback on the console where I followed it).
The dmesg is too short; it needed the stack traces for kworker/u8:1 and khugepaged.
Damn. OK, I'll try to get a longer trace.
If you do get good at reproducing this :), could you try https://patchwork.freedesktop.org/patch/318722/?series=63857&rev=1
Created attachment 144824 [details]
dmesg from a Broxton stuck during media testing w/ Chris's patch applied

I tested with your patch applied, but I can still reproduce a hang. I've attached the log; hopefully it'll be of some use.
This is the only process of interest that is active:

[ 3810.865923] h264parse0:sink R running task 0 10012 1 0x00004004
[ 3810.865928] Call Trace:
[ 3810.865931]  ? mark_held_locks+0x49/0x70
[ 3810.865936]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 3810.865939]  ? lockdep_hardirqs_on+0xeb/0x180
[ 3810.865943]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 3810.865950]  ? retint_kernel+0x2b/0x2b
[ 3810.866009]  ? i915_gem_evict_something+0x2df/0x4e0 [i915]
[ 3810.866054]  ? i915_gem_evict_something+0x2fc/0x4e0 [i915]
[ 3810.866105]  ? i915_gem_gtt_insert+0x1ce/0x270 [i915]
[ 3810.866152]  ? __i915_vma_do_pin+0x40c/0x530 [i915]
[ 3810.866200]  ? i915_gem_object_ggtt_pin+0x11f/0x1e0 [i915]
[ 3810.866241]  ? i915_gem_fault+0x291/0x6d0 [i915]
[ 3810.866245]  ? __lock_acquire+0x2bf/0x22e0
[ 3810.866255]  ? __do_fault+0x31/0xb0
[ 3810.866262]  ? __handle_mm_fault+0x9ab/0xfc0
[ 3810.866274]  ? handle_mm_fault+0x15a/0x350
[ 3810.866280]  ? __do_page_fault+0x24d/0x500
[ 3810.866287]  ? page_fault+0x8/0x30
[ 3810.866291]  ? page_fault+0x1e/0x30

What does the profiler say is going on? From the above, it seems the system is still running, just struggling? That eviction is going to struggle to clear context objects...
The kernel still has ggtt_flush() inside evict, so it should still be flushing context objects from the ggtt.
Hmmm, so you're saying that given enough time the machine could/should recover?

I left the machine on for 4h after the test concluded (hung), but no luck. The desktop clock of the machine stopped updating at the end of the task, the machine stopped replying to ping, and when I checked 4h later the machine still didn't respond to keyboard input.
(In reply to David Weinehall from comment #31)
> Hmmm, so you're saying that given enough time the machine could/should
> recover?
>
> I left the machine on for 4h after the test concluded (hung), but no luck.

There are a few corner cases where oom just eats itself, but nowadays they usually result in a panic + reboot. (Do you have reboot on panic set?) Give or take a few HW faults, a lockup usually means a panic. I would personally use remote logging (via netconsole) in case it is a panic. If not, and it just quietly hangs with lockdep and the kernel NMI watchdog running, we have found ourselves a more "interesting" issue.
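A netconsole setup for this could look something like the following (IPs, interface name and MAC are placeholders; netcat flag syntax varies between implementations):
---------------------------------------------
# on the logging host, listen for the UDP log stream
nc -u -l 6666 | tee netconsole.log

# on the test machine, either load netconsole as a module...
modprobe netconsole netconsole=6665@192.168.1.2/enp2s0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff

# ...or put the same netconsole=... string on the kernel command line.
# Also make sure a panic actually reboots the box (seconds to wait):
sysctl -w kernel.panic=30
---------------------------------------------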
Fwiw,

commit 515b8b7e935ef3f59c4efda04a3b05353ed6fbb7 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 2 22:21:37 2019 +0100

    drm/i915: Flush the freed object list on file close

    As we increase the number of RCU objects, it becomes easier for us to
    have several hundred thousand objects in the deferred RCU free queues.
    An example is gem_ctx_create/files which continually creates active
    contexts, which are not immediately freed upon close as they are kept
    alive by outstanding requests. This lack of backpressure allows the
    context objects to persist until they overwhelm and starve the system.
    We can increase our backpressure by flushing the freed object queue
    upon closing the device fd which should then not impact other clients.

    Testcase: igt/gem_ctx_create/*files
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190802212137.22207-2-chris@chris-wilson.co.uk

commit 1aff1903d0ff53f055088a77948ac8d8224d42db
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 2 22:21:36 2019 +0100

    drm/i915: Hide unshrinkable context objects from the shrinker

    The shrinker cannot touch objects used by the contexts (logical state
    and ring). Currently we mark those as "pin_global" to let the shrinker
    skip over them, however, if we remove them from the shrinker lists
    entirely, we don't even have to include them in our shrink accounting.

    By keeping the unshrinkable objects in our shrinker tracking, we report
    a large number of objects available to be shrunk, and leave the shrinker
    deeply unsatisfied when we fail to reclaim those. The shrinker will
    persist in trying to reclaim the unavailable objects, forcing the system
    into a livelock (not even hitting the dread oomkiller).

    v2: Extend unshrinkable protection for perma-pinned scratch and guc
    allocations (Tvrtko)
    v3: Notice that we should be pinned when marking unshrinkable and so the
    link cannot be empty; merge duplicate paths.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190802212137.22207-1-chris@chris-wilson.co.uk

It didn't seem to help earlier though -- but I still think they are related.
Created attachment 144960 [details]
Netconsole log with drm-tip from 2019y-08m-05d-17h-28m-24s

Both of the patches you reference in comment 33 are in the drm-tip used for this test run; the machine still hangs. The log has been taken from netconsole after boot (hence the entire dmesg isn't there). Reboot on panic was set, but a reboot wasn't triggered for some reason.

I had htop running in a separate connection to the machine; the memory usage at the point it stopped responding was high, but not alarmingly so (maybe 5 out of 8 GB).

What I find interesting is that in most cases all test runs complete; the machine hangs when finalising the test run.
(In reply to David Weinehall from comment #34)
> Created attachment 144960 [details]
> Netconsole log with drm-tip from 2019y-08m-05d-17h-28m-24s

At the instant of that hungtask, it was waiting for

[ 2582.003387] R13: 0000000000000000 R14: 00007fc30c02be50 R15: 0000000000000008
[ 2582.003405] h264parse0:sink R running task 0 6195 1 0x00004004
[ 2582.003418] Call Trace:
[ 2582.003429]  ? mark_held_locks+0x49/0x70
[ 2582.003437]  ? trace_hardirqs_on_thunk+0x1a/0x20
[ 2582.003445]  ? lockdep_hardirqs_on+0xeb/0x180
[ 2582.003454]  ? trace_hardirqs_on_thunk+0x1a/0x20
[ 2582.003465]  ? retint_kernel+0x2b/0x2b
[ 2582.003481]  ? drm_mm_scan_remove_block+0x23/0x50
[ 2582.003543]  ? i915_gem_evict_something+0x2fc/0x4e0 [i915]
[ 2582.003601]  ? i915_gem_gtt_insert+0x1ce/0x270 [i915]
[ 2582.003655]  ? __i915_vma_do_pin+0x40c/0x530 [i915]
[ 2582.003707]  ? i915_gem_object_ggtt_pin+0x11f/0x1e0 [i915]
[ 2582.003755]  ? i915_gem_fault+0x291/0x6d0 [i915]
[ 2582.003787]  ? __do_fault+0x31/0xb0
[ 2582.003801]  ? __handle_mm_fault+0x90d/0xfc0
[ 2582.003820]  ? handle_mm_fault+0x15a/0x360
[ 2582.003832]  ? __do_page_fault+0x24e/0x500
[ 2582.003845]  ? page_fault+0x34/0x40

Other than complaining about the stupidity of using mmap_gtt, I guess it means we were badly fragmented and eviction was slow.
Well, define "slow"? Half an hour? Hour? Multiple hours?
Have you heard the tale of the pagefault of doom? It's not a story any developer would tell you... Under exactly the wrong conditions, eviction of entire surfaces (up to the GGTT) can occur between every single byte read and written via the mmap. We are not meant to be that stupid any more (although it comes close), and instead of evicting the entire GGTT it may just be ending up in a ping-pong of the memcpy(dst, src). For moderately sized buffers (say a few 10s of MB) you are not likely to see that complete within a year. I hope we haven't regressed to that level of stupidity, but I wouldn't put it past me.
Is there any way to debug whether that may be the case?
That shows up in perf as a process spending 100% inside the pagefault performing eviction.
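If you want to check, something along these lines should make it obvious (<pid> is whatever process is stuck in 'R'):
---------------------------------------------
# sample the stuck process for 10s, including kernel callchains
perf record -g -p <pid> -- sleep 10
perf report --stdio | head -40

# or watch live; nearly all samples landing in i915_gem_fault /
# i915_gem_evict_something would point at the pagefault-eviction theory
perf top -g -p <pid>
---------------------------------------------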
Try https://cgit.freedesktop.org/~ickle/linux-2.6/patch/?id=3b7910f6317981233154d4cb6473648a1d0404af as a potential idea
Sadly the patch in comment #40 isn't enough to fix the issue.
Is it still making "forward" progress though? So far the only lead has been that one 'R' process still doing a map_gtt pagefault
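One cheap way to check for forward progress (with <pid> being that 'R' process):
---------------------------------------------
# if the kernel stack / wchan keep changing between samples, it's still
# grinding through eviction; if they never change, it's truly stuck
for i in $(seq 5); do
    cat /proc/<pid>/stack
    cat /proc/<pid>/wchan; echo
    echo ---; sleep 30
done
---------------------------------------------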
(In reply to David Weinehall from comment #24)
> I managed to find an easy way to reproduce unrecoverable hangs. The full
> test set isn't necessary; it's enough to run 5-10 iterations of the
> gstreamer_hd22_p50 testcase.

I.e. no 3D tests need to be run before that; just booting to the (Ubuntu 18.04 Unity) desktop is enough. It hasn't been tested whether the issue can be reproduced on a headless setup without X11, but I assume it could.
So let's try and reproduce that. A script or download & setup instructions, or disk image of the test setup?
FWIW, with Chris's patch from comment #40 the tests complete without any ominous log messages in dmesg (the kernel is configured with lockdep + netconsole, and I only got the log messages telling which test was ongoing).

BUT: it still hangs the same way as before. So the patch probably fixes something, but it isn't enough to make things work.
(In reply to Chris Wilson from comment #44)
> So let's try and reproduce that. A script or download & setup instructions,
> or disk image of the test setup?

I sent instructions to your Intel email.
Created attachment 145076 [details]
MediaSDK sample transcode app input file for running 5 parallel MPEG2->AVC transcoding streams

After removing the above test-case from regular testing, there's another one that hangs, a slightly simpler one. As the earlier test-case can trigger the deadlock alone, I'm assuming this one can also.

1. Build media-driver & MediaSDK:
   https://github.com/intel/media-driver/wiki/How-to-build-and-use-media-driver%3F
   https://github.com/Intel-Media-SDK/MediaSDK/wiki/Build-Media-SDK-on-Ubuntu
2. Run 5 instances of the following multi-stream process in parallel in the same process group, with the attached parameter file & a suitably sized MPEG2 input file:
   sample_multi_transcode -par parameters.txt
3. Kill the process group before the above command finishes
4. Repeat from 2)
I've never seen this happen with FFmpeg using media-driver directly through libva, only when media-driver is used in the (FFmpeg ->) MediaSDK -> libva -> media-driver -> kernel/i915 pipeline (I've run far fewer tests with FFmpeg / VA-API, though).

I've seen some mentions in the FFmpeg/MediaSDK bug trackers that MediaSDK would do some media-driver access directly, without always going through libva, potentially something related to VPP (video post-processing; not sure whether it uses e.g. SFC or shaders).
(In reply to Eero Tamminen from comment #48)
> I've never seen this happen with FFmpeg using media-driver directly
> through libva, only when media-driver is used in the (FFmpeg ->) MediaSDK ->
> libva -> media-driver -> kernel/i915 pipeline (I've run far fewer tests
> with FFmpeg / VA-API, though).

And a few days afterwards I got a test machine disconnect with FFmpeg VA-API after the 50x test-case. This was followed by an automated reboot, but I'm pretty sure the cause of the issue was the same as with the MediaSDK-using media applications.

I haven't seen this on KBL anymore, so I'm removing that from the list of platforms (I assume it's fast enough not to trigger the deadlock). In case it matters, all of these machines have 8 GB of RAM.

Also updating the title, as David found out that triggering the issue doesn't need 3D test-cases.
If you get the chance, it would be nice to have some feedback on https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=wip-mutex under the presumption that this code is stuck under struct_mutex in the shrinker/evict and that branch eliminates struct_mutex from those paths.
Testing with a kernel using your wip-mutex branch gives the following: [ 774.826181] general protection fault: 0000 [#1] PREEMPT SMP PTI [ 774.826192] CPU: 3 PID: 4859 Comm: h264parse0:sink Not tainted 5.3.0-rc6-CI-Custom_4349+ #1 [ 774.826195] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017 [ 774.826253] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 774.826258] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 774.826261] RSP: 0018:ffffc900018b39b0 EFLAGS: 00010283 [ 774.826265] RAX: ffff888272178800 RBX: ffff8881ca561080 RCX: 0000000000000000 [ 774.826267] RDX: ffff888263f80000 RSI: ffffffff82661d60 RDI: ffff888204495f40 [ 774.826270] RBP: ffffc900018b39c8 R08: 0000000000000001 R09: ffffffffa0208c00 [ 774.826273] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881e2813d48 [ 774.826275] R13: ffffc900018b39c8 R14: ffff8881e2813c80 R15: dead000000000122 [ 774.826279] FS: 00007f927aef6700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 774.826281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 774.826284] CR2: 00007f1da83b2000 CR3: 00000001e467e000 CR4: 00000000003406e0 [ 774.826287] Call Trace: [ 774.826338] i915_request_await_object+0x248/0x380 [i915] [ 774.826345] ? __mutex_unlock_slowpath+0x4b/0x2b0 [ 774.826387] i915_gem_do_execbuffer+0x18a1/0x1e00 [i915] [ 774.826404] ? __lock_acquire+0x25f/0x1cc0 [ 774.826452] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 774.826492] i915_gem_execbuffer2_ioctl+0x103/0x420 [i915] [ 774.826534] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 774.826540] drm_ioctl_kernel+0x88/0xf0 [ 774.826546] drm_ioctl+0x2f8/0x3b0 [ 774.826587] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 774.826598] do_vfs_ioctl+0xa5/0x6f0 [ 774.826605] ? 
__fget+0x114/0x200 [ 774.826611] ksys_ioctl+0x3a/0x70 [ 774.826616] __x64_sys_ioctl+0x16/0x20 [ 774.826621] do_syscall_64+0x5a/0x1c0 [ 774.826627] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 774.826631] RIP: 0033:0x7f92861395d7 [ 774.826635] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 [ 774.826638] RSP: 002b:00007f927aef5068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 774.826641] RAX: ffffffffffffffda RBX: 0000000000000043 RCX: 00007f92861395d7 [ 774.826644] RDX: 00007f927aef5110 RSI: 00000000c0406469 RDI: 0000000000000006 [ 774.826647] RBP: 00007f927aef5090 R08: 0000555c6cfa1890 R09: 0000000000000000 [ 774.826649] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f927aef5110 [ 774.826652] R13: 00000000c0406469 R14: 0000000000000006 R15: 0000555c6cfa1770 [ 774.826661] Modules linked in: fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel snd_intel_nhlt lpc_ich snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 pinctrl_broxton mei_me pinctrl_intel realtek mei sunrpc [ 774.826681] ---[ end trace 867abd8886fa79e0 ]--- [ 774.826724] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 774.826728] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 774.826731] RSP: 0018:ffffc900018b39b0 EFLAGS: 00010283 [ 774.826734] RAX: ffff888272178800 RBX: ffff8881ca561080 RCX: 0000000000000000 [ 774.826737] RDX: ffff888263f80000 RSI: ffffffff82661d60 RDI: ffff888204495f40 [ 774.826740] RBP: ffffc900018b39c8 R08: 0000000000000001 R09: ffffffffa0208c00 [ 774.826742] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881e2813d48 [ 774.826745] R13: ffffc900018b39c8 R14: ffff8881e2813c80 R15: dead000000000122 [ 774.826748] FS: 00007f927aef6700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 774.826751] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 774.826754] CR2: 00007f1da83b2000 CR3: 00000001e467e000 CR4: 00000000003406e0
Round 2?

commit 33437a8f0703f9eed186a11484b9f475de1c743f (HEAD, fdo/wip-mutex)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 18 13:39:56 2019 +0100

    drm/i915: Move global activity tracking from GEM to GT
I'll give it a whirl tomorrow; the machine in question is currently busy doing nightly testing.
The latest patch still isn't quite perfect; already during the first test iteration the following happens: [ 256.976769] general protection fault: 0000 [#1] PREEMPT SMP PTI [ 256.976781] CPU: 3 PID: 2601 Comm: h264parse0:sink Not tainted 5.3.0-rc6-CI-Custom_4350+ #1 [ 256.976784] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017 [ 256.976852] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 256.976857] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 256.976861] RSP: 0018:ffffc90001f839b0 EFLAGS: 00010283 [ 256.976865] RAX: ffff88818075c600 RBX: ffff88811cf0aec0 RCX: 0000000000000000 [ 256.976869] RDX: ffff88827180c000 RSI: 0000000000000000 RDI: 00000000ffffffff [ 256.976872] RBP: ffffc90001f839c8 R08: ffff88827180c550 R09: 0000000000000000 [ 256.976875] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88811a8cd148 [ 256.976878] R13: ffffc90001f839c8 R14: ffff88811a8cd080 R15: dead000000000122 [ 256.976882] FS: 00007f932cba4700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 256.976886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 256.976889] CR2: 00007fbae3374000 CR3: 0000000270288000 CR4: 00000000003406e0 [ 256.976892] Call Trace: [ 256.976954] i915_request_await_object+0x248/0x380 [i915] [ 256.976963] ? __mutex_unlock_slowpath+0x4b/0x2b0 [ 256.977014] i915_gem_do_execbuffer+0x18a1/0x1e00 [i915] [ 256.977034] ? __lock_acquire+0x25f/0x1cc0 [ 256.977114] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 256.977154] i915_gem_execbuffer2_ioctl+0x103/0x420 [i915] [ 256.977197] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 256.977202] drm_ioctl_kernel+0x88/0xf0 [ 256.977209] drm_ioctl+0x2f8/0x3b0 [ 256.977249] ? i915_gem_execbuffer_ioctl+0x2e0/0x2e0 [i915] [ 256.977260] do_vfs_ioctl+0xa5/0x6f0 [ 256.977268] ? 
__fget+0x114/0x200 [ 256.977273] ksys_ioctl+0x3a/0x70 [ 256.977278] __x64_sys_ioctl+0x16/0x20 [ 256.977283] do_syscall_64+0x5a/0x1c0 [ 256.977289] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 256.977293] RIP: 0033:0x7f9337de75d7 [ 256.977298] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 [ 256.977300] RSP: 002b:00007f932cba3068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 256.977304] RAX: ffffffffffffffda RBX: 0000000000000043 RCX: 00007f9337de75d7 [ 256.977307] RDX: 00007f932cba3110 RSI: 00000000c0406469 RDI: 0000000000000006 [ 256.977310] RBP: 00007f932cba3090 R08: 00005561beddcf90 R09: 0000000000000000 [ 256.977312] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f932cba3110 [ 256.977315] R13: 00000000c0406469 R14: 0000000000000006 R15: 00005561beddce70 [ 256.977323] Modules linked in: fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_nhlt snd_hda_codec r8169 realtek lpc_ich snd_hwdep snd_hda_core snd_pcm pinctrl_broxton pinctrl_intel mei_me mei sunrpc [ 256.978130] ---[ end trace 7db3409bc5d5dfb1 ]--- [ 256.978194] RIP: 0010:i915_request_await_dma_fence+0x303/0x420 [i915] [ 256.978199] Code: 00 84 4a 30 0f 85 26 ff ff ff 48 8b 83 a0 00 00 00 4c 8b bb 90 02 00 00 48 05 c0 00 00 00 49 39 c7 74 29 49 8b 86 a0 00 00 00 <49> 8b b7 98 fd ff ff 41 8b 97 a0 fd ff ff 48 8d b8 f0 00 00 00 e8 [ 256.978204] RSP: 0018:ffffc90001f839b0 EFLAGS: 00010283 [ 256.978209] RAX: ffff88818075c600 RBX: ffff88811cf0aec0 RCX: 0000000000000000 [ 256.978211] RDX: ffff88827180c000 RSI: 0000000000000000 RDI: 00000000ffffffff [ 256.978214] RBP: ffffc90001f839c8 R08: ffff88827180c550 R09: 0000000000000000 [ 256.978217] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88811a8cd148 [ 256.978219] R13: ffffc90001f839c8 R14: ffff88811a8cd080 R15: dead000000000122 [ 256.978223] FS: 00007f932cba4700(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 256.978225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 256.978228] CR2: 00007fbae3374000 CR3: 0000000270288000 CR4: 00000000003406e0 The machine doesn't hang, but that's about the only good news; after this the other iterations continue (without additional errors in dmesg), but none of the test runs are successful. A point of reference might be that doing: cat /sys/kernel/debug/dri/0/i915_gem_objects seems either take roughly forever, or is stuck somehow.
commit d22dd615143516b49ca61d6a800719b7fbafafce (fdo/wip-mutex)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 18 13:39:56 2019 +0100

    drm/i915: Move global activity tracking from GEM to GT

?

That's annoying, I was sure that was ready. Could I get you to squeeze a lockdep + CONFIG_KASAN run in?
That kernel is (well, should be) built with lockdep. Not KASAN though.
Actually, the kernel I tested with has KASAN enabled too.
(In reply to David Weinehall from comment #57)
> Actually, the kernel I tested with has KASAN enabled too.

Could you attach the entire dmesg please?
Created attachment 145219 [details] dmesg with latest wip-mutex applied
Ok, it hasn't picked up CONFIG_KASAN. It's complaining about a dead list, but I'm not sure which one. KASAN should ideally show which particular object is being used after free. Could you double check what happened to KASAN in the build :)
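Note that CONFIG_HAVE_ARCH_KASAN=y only means the architecture supports it; to actually enable it, something like:
---------------------------------------------
./scripts/config --enable CONFIG_KASAN
make olddefconfig
grep '^CONFIG_KASAN=' .config     # should now report CONFIG_KASAN=y
---------------------------------------------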
Sorry, my bad--I just grepped for KASAN, noticed CONFIG_HAVE_ARCH_KASAN=y and thought nothing more about it. New kernel built, will run tests.
[ 2138.371643] ================================================================== [ 2138.371911] BUG: KASAN: use-after-free in per_file_stats+0x43/0x380 [i915] [ 2138.371924] Read of size 8 at addr ffff888223651000 by task cat/8293 [ 2138.371947] CPU: 0 PID: 8293 Comm: cat Not tainted 5.3.0-rc6-CI-Custom_4352+ #1 [ 2138.371953] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017 [ 2138.371959] Call Trace: [ 2138.371974] dump_stack+0x7c/0xbb [ 2138.372099] ? per_file_stats+0x43/0x380 [i915] [ 2138.372108] print_address_description+0x73/0x3a0 [ 2138.372231] ? per_file_stats+0x43/0x380 [i915] [ 2138.372352] ? per_file_stats+0x43/0x380 [i915] [ 2138.372362] __kasan_report+0x14e/0x192 [ 2138.372489] ? per_file_stats+0x43/0x380 [i915] [ 2138.372502] kasan_report+0xe/0x20 [ 2138.372625] per_file_stats+0x43/0x380 [i915] [ 2138.372751] ? i915_panel_show+0x110/0x110 [i915] [ 2138.372761] idr_for_each+0xa7/0x160 [ 2138.372773] ? idr_get_next_ul+0x110/0x110 [ 2138.372782] ? do_raw_spin_lock+0x10a/0x1d0 [ 2138.372923] print_context_stats+0x264/0x510 [i915] [ 2138.373051] ? i915_interrupt_info+0x1140/0x1140 [i915] [ 2138.373065] ? preempt_count_sub+0x14/0xc0 [ 2138.373074] ? __mutex_lock+0x656/0xcb0 [ 2138.373092] ? __mutex_add_waiter+0x90/0x90 [ 2138.373121] ? seq_vprintf+0xb0/0xb0 [ 2138.373262] i915_gem_object_info+0xc8/0xe0 [i915] [ 2138.373276] seq_read+0x1a4/0x6b0 [ 2138.373306] full_proxy_read+0x8e/0xc0 [ 2138.373323] vfs_read+0xc3/0x1e0 [ 2138.373338] ksys_read+0x116/0x170 [ 2138.373348] ? kernel_write+0xb0/0xb0 [ 2138.373361] ? lockdep_hardirqs_off+0xb5/0x100 [ 2138.373368] ? mark_held_locks+0x1a/0x90 [ 2138.373385] do_syscall_64+0x72/0x260 [ 2138.373397] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2138.373405] RIP: 0033:0x7f7da3356081 [ 2138.373415] Code: fe ff ff 48 8d 3d 67 9c 0a 00 48 83 ec 08 e8 a6 4c 02 00 66 0f 1f 44 00 00 48 8d 05 81 08 2e 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53 [ 2138.373421] RSP: 002b:00007ffeed129258 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 2138.373429] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f7da3356081 [ 2138.373435] RDX: 0000000000020000 RSI: 00007f7da381b000 RDI: 0000000000000003 [ 2138.373441] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000 [ 2138.373447] R10: 0000000000000022 R11: 0000000000000246 R12: 00007f7da381b000 [ 2138.373453] R13: 0000000000000003 R14: 00007f7da381b00f R15: 0000000000020000 [ 2138.373494] Allocated by task 7737: [ 2138.373507] __kasan_kmalloc.constprop.0+0xcb/0x130 [ 2138.373515] kmem_cache_alloc+0xdf/0x2d0 [ 2138.373641] i915_gem_object_create_shmem.part.1+0x1d/0x2c0 [i915] [ 2138.373767] i915_gem_create+0x96/0x140 [i915] [ 2138.373776] drm_ioctl_kernel+0x114/0x190 [ 2138.373783] drm_ioctl+0x4ba/0x580 [ 2138.373790] do_vfs_ioctl+0x134/0x9d0 [ 2138.373797] ksys_ioctl+0x3a/0x70 [ 2138.373803] __x64_sys_ioctl+0x3d/0x50 [ 2138.373810] do_syscall_64+0x72/0x260 [ 2138.373817] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2138.373832] Freed by task 187: [ 2138.373843] __kasan_slab_free+0x146/0x200 [ 2138.373851] kmem_cache_free+0xb3/0x390 [ 2138.373975] __i915_gem_free_object_rcu+0x3c/0x60 [i915] [ 2138.373982] rcu_core+0x326/0xa10 [ 2138.373990] __do_softirq+0x12f/0x618 [ 2138.374006] The buggy address belongs to the object at ffff888223650f00 which belongs to the cache drm_i915_gem_object of size 1136 [ 2138.374018] The buggy address is located 256 bytes inside of 1136-byte region [ffff888223650f00, 
ffff888223651370) [ 2138.374028] The buggy address belongs to the page: [ 2138.374040] page:ffffea00088d9400 refcount:1 mapcount:0 mapping:ffff888235ed5b80 index:0x0 compound_mapcount: 0 [ 2138.374050] flags: 0x8000000000010200(slab|head) [ 2138.374061] raw: 8000000000010200 0000000000000000 0000000100000001 ffff888235ed5b80 [ 2138.374068] raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000 [ 2138.374074] page dumped because: kasan: bad access detected [ 2138.374087] Memory state around the buggy address: [ 2138.374099] ffff888223650f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374110] ffff888223650f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374121] >ffff888223651000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374130] ^ [ 2138.374141] ffff888223651080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374152] ffff888223651100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2138.374161] ==================================================================
Note: the i915_gem_objects info is saved after all tests have finished, not during testing, so the trigger for the above-listed KASAN issue is not directly related to the deadlock. It shows that there's some brokenness in i915 GEM object accounting, though.
(In reply to Eero Tamminen from comment #63)
> Note: the i915_gem_objects info is saved after all tests have finished, not
> during testing, so the trigger for the above-listed KASAN issue is not
> directly related to the deadlock. It shows that there's some brokenness in
> i915 GEM object accounting, though.

The good news / bad news: it's just a bug in the debugfs. Was there anything else?
David is still running tests, but so far it seems that the deadlock is finally gone (he's running them with a lockdep + KASAN kernel, but I guess testing without those is also needed). When using smaller-resolution video, it's possible to run a bit over a hundred parallel ffmpeg instances without the device OOMing; that's what David is currently testing.
I compiled a new kernel using your latest wip-mutex tree (as of Tue, 03 Sep 2019 10:41:32 +0000). The following tests were run (20 iterations each):

gstreamer_hd22_p50
ffmpeg_qsv_hd22_p50

Both passed. We upped the number of parallel processes a bit; with 100 processes the tests still passed. With 150 processes the test failed due to out of memory, but the failure was graceful -- the machine recovered properly and no panic, OOPS, or i915 GPU hang was triggered.

Once our nightly test finishes for the night I'll run one more test run with a standard kernel (no lockdep, no KASAN).
A test run without KASAN & lockdep enabled completed successfully, BUT the following log messages could be found in dmesg (triggered during the test runs):

71 Resetting rcs0 for preemption time out
 3 Resetting vcs0 for preemption time out
50 Resetting vecs0 for preemption time out

So it seems likely that there's still some nasty issue remaining.
Created attachment 145307 [details]
Broxton error state for only partly recoverable VCS0 hang in a similar test-case

Got a recoverable GPU hang with 5 parallel test-cases in another media case on BXT, after which everything following it also gets GPU reset. This is with the v5.2 drm-tip kernel and the i965 media driver from Ubuntu, so I'm not sure it's related to the resets David mentioned above. Chris?
Created attachment 145596 [details]
Broxton error state for recoverable RCS0 hang in a similar test-case
Time to test drm-tip finally, as of

commit 7e8057626640cfedbae000c5032be32269713687
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 4 14:40:02 2019 +0100

    drm/i915: Drop struct_mutex from around i915_retire_requests()
(In reply to Chris Wilson from comment #70)
> Time to test drm-tip finally, as of
>
> commit 7e8057626640cfedbae000c5032be32269713687
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Oct 4 14:40:02 2019 +0100
>
>     drm/i915: Drop struct_mutex from around i915_retire_requests()

Thanks, I've re-enabled the tests triggering the issue, and will report back after a few nights of runs.
Two nights in a row without any problems with FFmpeg QSV or MSDK in the tests that earlier deadlocked -> FIXED / VERIFIED.

PS. The (easily reproducible) Iris-related GPU hangs were not impacted by this fix:
* HdrBloom: bug 111385
* ICL / TerrainFlyTess: bug 111936
* ICL / CSDof: https://gitlab.freedesktop.org/mesa/mesa/issues/1368
(In reply to Eero Tamminen from comment #72)
> Two nights in a row without any problems with FFmpeg QSV or MSDK in the tests
> that earlier deadlocked -> FIXED / VERIFIED.

Chris, something similar started to happen in the same MSDK test within the past few days. Is there some regression with mutex handling?

This time it led to a system hang: the network dropped and the machine didn't react to any other input (from USB mouse/keyboard) besides SysRq-Boot.

Should I re-open this or file a new bug if it continues?
(In reply to Eero Tamminen from comment #73)
> Chris, something similar started to happen in the same MSDK test within the
> past few days. Is there some regression with mutex handling?
>
> This time it led to a system hang: the network dropped and the machine
> didn't react to any other input (from USB mouse/keyboard) besides SysRq-Boot.
>
> Should I re-open this or file a new bug if it continues?

I have one in mind, but its effect is just lost powersaving. Losing the network suggests spinlock/irq issues, which is a much different bug. So best to open a new bug so the information is fresh.