Bug 109692

Summary: deadlock occurs during GPU reset
Product: DRI
Reporter: mikhail.v.gavrilov
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
Severity: normal
Priority: medium
CC: andrey.grodzovsky, christian.koenig, freedesktop, rverschelde
Version: XOrg git
Hardware: Other
OS: All
Attachments:
- dmesg
- yet another dmesg
- yet another dmesg
- [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* - kernel log
- "suspicious RCU usage" with commits 3741540 and 222b5f0 reverted
- "amd-staging-drm-next" with commits 3741540, 222b5f0 and 5936131 reverted
- 0001-drm-scheduler-rework-job-destruction.patch
- 0002-drm-sched-Adapt-drivers-to-new-job-destruction-logic.patch
- dmesg from patched kernel [0001 + 0002]
- [PATCH] drm_amd_display: wait for fence without holding reservation lock.eml
- [PATCH] drm_amd_display: wait for fence without holding reservation lock.patch
- dmesg from patched kernel [0001 + 0002 + 3]
- dmesg without patches
- dmesg (with successful GPU reset after applying patches from Comment 34)
- Demonstration of how the message "drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!" is a real pain
- dmesg of another GPU reset issue

Description mikhail.v.gavrilov 2019-02-20 17:30:54 UTC
Created attachment 143419 [details]
dmesg

Steps to reproduce:
1. $ git clone git://people.freedesktop.org/~agd5f/linux -b amd-staging-drm-next
2. $ make bzImage && make modules
3. # make modules_install && make install
4. Launch "Shadow of the Tomb Raider"
--- Here the GPU hang occurs ---
and after some time
--- Here the GPU reset starts ---
--- Here the deadlock occurs ---

[  291.746741] amdgpu 0000:0b:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:7 pasid:32774, for process SOTTR.exe pid 5250 thread SOTTR.exe pid 5250)
[  291.746750] amdgpu 0000:0b:00.0:   in page starting at address 0x0000000000002000 from 27
[  291.746754] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0070113C
[  297.135183] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[  302.255032] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[  302.265813] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=13292, emitted seq=13293
[  302.265950] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SOTTR.exe pid 5250 thread SOTTR.exe pid 5250
[  302.265974] amdgpu 0000:0b:00.0: GPU reset begin!

[  302.266337] ======================================================
[  302.266338] WARNING: possible circular locking dependency detected
[  302.266340] 5.0.0-rc1-drm-next-kernel+ #1 Tainted: G         C       
[  302.266341] ------------------------------------------------------
[  302.266343] kworker/5:2/871 is trying to acquire lock:
[  302.266345] 000000000abbb16a (&(&ring->fence_drv.lock)->rlock){-.-.}, at: dma_fence_remove_callback+0x1a/0x60
[  302.266352] 
               but task is already holding lock:
[  302.266353] 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140 [gpu_sched]
[  302.266358] 
               which lock already depends on the new lock.

[  302.266360] 
               the existing dependency chain (in reverse order) is:
[  302.266361] 
               -> #1 (&(&sched->job_list_lock)->rlock){-.-.}:
[  302.266366]        drm_sched_process_job+0x4d/0x180 [gpu_sched]
[  302.266368]        dma_fence_signal+0x111/0x1a0
[  302.266414]        amdgpu_fence_process+0xa3/0x100 [amdgpu]
[  302.266470]        sdma_v4_0_process_trap_irq+0x6e/0xa0 [amdgpu]
[  302.266523]        amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
[  302.266576]        amdgpu_ih_process+0x84/0xf0 [amdgpu]
[  302.266628]        amdgpu_irq_handler+0x1b/0x50 [amdgpu]
[  302.266632]        __handle_irq_event_percpu+0x3f/0x290
[  302.266635]        handle_irq_event_percpu+0x31/0x80
[  302.266637]        handle_irq_event+0x34/0x51
[  302.266639]        handle_edge_irq+0x7c/0x1a0
[  302.266643]        handle_irq+0xbf/0x100
[  302.266646]        do_IRQ+0x61/0x120
[  302.266648]        ret_from_intr+0x0/0x22
[  302.266651]        cpuidle_enter_state+0xbf/0x470
[  302.266654]        do_idle+0x1ec/0x280
[  302.266657]        cpu_startup_entry+0x19/0x20
[  302.266660]        start_secondary+0x1b3/0x200
[  302.266663]        secondary_startup_64+0xa4/0xb0
[  302.266664] 
               -> #0 (&(&ring->fence_drv.lock)->rlock){-.-.}:
[  302.266668]        _raw_spin_lock_irqsave+0x49/0x83
[  302.266670]        dma_fence_remove_callback+0x1a/0x60
[  302.266673]        drm_sched_stop+0x59/0x140 [gpu_sched]
[  302.266717]        amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
[  302.266761]        amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
[  302.266822]        amdgpu_job_timedout+0x109/0x130 [amdgpu]
[  302.266827]        drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[  302.266831]        process_one_work+0x272/0x5d0
[  302.266834]        worker_thread+0x50/0x3b0
[  302.266836]        kthread+0x108/0x140
[  302.266839]        ret_from_fork+0x27/0x50
[  302.266840] 
               other info that might help us debug this:

[  302.266841]  Possible unsafe locking scenario:

[  302.266842]        CPU0                    CPU1
[  302.266843]        ----                    ----
[  302.266844]   lock(&(&sched->job_list_lock)->rlock);
[  302.266846]                                lock(&(&ring->fence_drv.lock)->rlock);
[  302.266847]                                lock(&(&sched->job_list_lock)->rlock);
[  302.266849]   lock(&(&ring->fence_drv.lock)->rlock);
[  302.266850] 
                *** DEADLOCK ***

[  302.266852] 5 locks held by kworker/5:2/871:
[  302.266853]  #0: 00000000d133fb6e ((wq_completion)"events"){+.+.}, at: process_one_work+0x1e9/0x5d0
[  302.266857]  #1: 000000008a5c3f7e ((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at: process_one_work+0x1e9/0x5d0
[  302.266862]  #2: 00000000b9b2c76f (&adev->lock_reset){+.+.}, at: amdgpu_device_lock_adev+0x17/0x40 [amdgpu]
[  302.266908]  #3: 00000000ac637728 (&dqm->lock_hidden){+.+.}, at: kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
[  302.266965]  #4: 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140 [gpu_sched]
[  302.266971] 
               stack backtrace:
[  302.266975] CPU: 5 PID: 871 Comm: kworker/5:2 Tainted: G         C        5.0.0-rc1-drm-next-kernel+ #1
[  302.266976] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
[  302.266980] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  302.266982] Call Trace:
[  302.266987]  dump_stack+0x85/0xc0
[  302.266991]  print_circular_bug.isra.0.cold+0x15c/0x195
[  302.266994]  __lock_acquire+0x134c/0x1660
[  302.266998]  ? add_lock_to_list.isra.0+0x67/0xb0
[  302.267003]  lock_acquire+0xa2/0x1b0
[  302.267006]  ? dma_fence_remove_callback+0x1a/0x60
[  302.267011]  _raw_spin_lock_irqsave+0x49/0x83
[  302.267013]  ? dma_fence_remove_callback+0x1a/0x60
[  302.267016]  dma_fence_remove_callback+0x1a/0x60
[  302.267020]  drm_sched_stop+0x59/0x140 [gpu_sched]
[  302.267065]  amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
[  302.267110]  amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
[  302.267173]  amdgpu_job_timedout+0x109/0x130 [amdgpu]
[  302.267178]  drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[  302.267183]  process_one_work+0x272/0x5d0
[  302.267188]  worker_thread+0x50/0x3b0
[  302.267191]  kthread+0x108/0x140
[  302.267194]  ? process_one_work+0x5d0/0x5d0
[  302.267196]  ? kthread_park+0x90/0x90
[  302.267199]  ret_from_fork+0x27/0x50
[  302.692194] amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[  302.692234] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  302.768931] amdgpu 0000:0b:00.0: GPU BACO reset
[  303.278874] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[  303.279006] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[  303.279072] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[  303.279234] [drm] PSP is resuming...
[  303.426601] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[  303.572227] [drm] UVD and UVD ENC initialized successfully.
[  303.687727] [drm] VCE initialized successfully.
[  303.689585] [drm] recover vram bo from shadow start
[  303.722757] [drm] recover vram bo from shadow done
[  303.722761] [drm] Skip scheduling IBs!
[  303.722791] amdgpu 0000:0b:00.0: GPU reset(2) succeeded!
[  303.722811] [drm] Skip scheduling IBs!
[  303.722838] [drm] Skip scheduling IBs!
[  303.722846] [drm] Skip scheduling IBs!
[  303.722854] [drm] Skip scheduling IBs!
[  303.722863] [drm] Skip scheduling IBs!
[  303.722871] [drm] Skip scheduling IBs!
Comment 1 Andrey Grodzovsky 2019-02-20 19:27:49 UTC
Thanks, I will try to find time sometime soon to investigate this new deadlock.
Comment 2 Andrey Grodzovsky 2019-02-20 19:41:57 UTC
Does this deadlock happen every time now, or only sometimes?
Comment 3 mikhail.v.gavrilov 2019-02-20 21:01:21 UTC
Created attachment 143422 [details]
yet another dmesg
Comment 4 mikhail.v.gavrilov 2019-02-20 21:01:39 UTC
Created attachment 143423 [details]
yet another dmesg
Comment 5 mikhail.v.gavrilov 2019-02-20 21:02:49 UTC
(In reply to Andrey Grodzovsky from comment #2)
> Is this deadlock happens to you always now or only sometimes ?

Looks like the deadlock happens every time.
Comment 6 Andrey Grodzovsky 2019-02-20 21:06:12 UTC
(In reply to Andrey Grodzovsky from comment #2)
> Is this deadlock happens to you always now or only sometimes ?

I think I see the problem, but I'm not sure yet how to fix it.
Comment 7 Tom St Denis 2019-02-21 11:46:15 UTC
Hi from Reddit!

I don't see the attached umr output in the bugzilla report.  Though it looks like Andrey (who is on my team) is looking at it.

It looks like an unhandled (or just bad) VM fault is locking up the GFX ring, and then a reset occurs, which then triggers your lockdep report.

So it's really two bugs (or one and a half):

1.  GPU hang due to a VM fault
2.  GPU reset not working properly

The VM fault shouldn't lock up the GPU, though, even if it's the application's fault (which does happen).

The VM fault could be because of a bug in the kernel processing a CS, could be from the UMD (mesa/etc), or could be from the application itself.
Comment 8 Tom St Denis 2019-02-21 14:23:35 UTC
Just a follow-up: if you "chmod +s" umr as a non-root user, then even if you sudo it, it won't have permission to read the debugfs files.

Make sure you "sudo chmod +s" the binary, or simply run "sudo umr".
Comment 9 Andrey Grodzovsky 2019-02-21 17:23:00 UTC
Adding Christian to consult on solution to the problem:

The deadlock happens because of reversed lock ordering: on one path we take fence->lock in dma_fence_signal and then sched->job_list_lock in drm_sched_process_job, while drm_sched_stop locks in the opposite order, taking sched->job_list_lock first and then fence->lock when it tries to remove the callback via dma_fence_remove_callback.

I see two possible solutions:

1) Change ring_mirror_list to a lock-less list (e.g. an RCU list). The problem is that I don't think there is a readily available implementation of a doubly linked lockless list; I didn't search too much, though...

2) Go back to removing a job from ring_mirror_list in drm_sched_job_finish. As I see it, for this to work we need to add a wait_queue (signal) to drm_sched_job which signals AFTER the job is removed from ring_mirror_list in drm_sched_job_finish. Then, in drm_sched_stop, when iterating over ring_mirror_list, you build a new list of all jobs still in progress (those for which dma_fence_remove_callback returns false) and wait on all their wait_queues; then you are sure they have removed themselves from ring_mirror_list and you can proceed.
Specific to AMDGPU: to ensure the jobs are not being freed in drm_sched_job_finish while we process them in drm_sched_stop, it's not enough to do cancel_delayed_work_sync(&sched->work_tdr) for this scheduler; we also have to do it for all the other schedulers in the device, and even in the hive for the XGMI use case.
Comment 10 mikhail.v.gavrilov 2019-02-26 10:42:22 UTC
Even without reproducing the GPU hang, I found a "suspicious RCU usage" warning and some errors in the kernel log.

[drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx 000000002caf7aed is still alive
[drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx 000000002caf7aed is still alive

=============================
WARNING: suspicious RCU usage
5.0.0-rc1-drm-next-kernel+ #1 Tainted: G         C       
-----------------------------
include/linux/rcupdate.h:280 Illegal context switch in RCU read-side critical section!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by CrashBandicootN/26312:
 #0: 00000000eb680bad (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
 #1: 00000000b3a3c406 (&p->lock){+.+.}, at: seq_read+0x38/0x410
 #2: 000000007c893f05 (rcu_read_lock){....}, at: dev_seq_start+0x5/0x100

stack backtrace:
CPU: 8 PID: 26312 Comm: CrashBandicootN Tainted: G         C        5.0.0-rc1-drm-next-kernel+ #1
Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
Call Trace:
 dump_stack+0x85/0xc0
 ___might_sleep+0x100/0x180
 __mutex_lock+0x61/0x930
 ? igb_get_stats64+0x29/0x80 [igb]
 ? seq_vprintf+0x33/0x50
 ? igb_get_stats64+0x29/0x80 [igb]
 igb_get_stats64+0x29/0x80 [igb]
 dev_get_stats+0x5c/0xc0
 dev_seq_printf_stats+0x33/0xe0
 dev_seq_show+0x10/0x30
 seq_read+0x2fa/0x410
 proc_reg_read+0x3c/0x60
 __vfs_read+0x37/0x1b0
 vfs_read+0xb2/0x170
 ksys_read+0x52/0xc0
 do_syscall_64+0x5c/0xa0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f2188d8934c
Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 79 c9 01 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 48 89 44 24 08 e8 af c9 01 00 48
RSP: 002b:000000000023f010 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000000007d11b6d0 RCX: 00007f2188d8934c
RDX: 0000000000000400 RSI: 000000007d0dd4f0 RDI: 000000000000007b
RBP: 0000000000000d68 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f2188621c40 R11: 0000000000000246 R12: 00007f2188e59740
R13: 00007f2188e5a340 R14: 00000000000001ff R15: 000000007d11b6d0

This only occurs when I use "amd-staging-drm-next".
Comment 11 mikhail.v.gavrilov 2019-02-26 10:43:46 UTC
Created attachment 143476 [details]
[drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* - kernel log
Comment 12 Andrey Grodzovsky 2019-02-26 15:17:31 UTC
Thanks for the additional info.
Please be patient with this deadlock bug: the fix is not trivial, and I am currently working on other pressing issues, so this might take a while.
Comment 13 Andrey Grodzovsky 2019-02-26 15:20:44 UTC
Also, in the meantime, try reverting these 2 patches and see if it helps:

3741540 Andrey Grodzovsky   5 weeks ago    drm/sched: Rework HW fence processing.   
222b5f0 Andrey Grodzovsky   5 weeks ago    drm/sched: Refactor ring mirror list handling.  

They are causing this deadlock.
Comment 14 mikhail.v.gavrilov 2019-02-27 04:24:14 UTC
Reverting commits 3741540 and 222b5f0 does not fix the "suspicious RCU usage".
Comment 15 mikhail.v.gavrilov 2019-02-27 04:25:56 UTC
Created attachment 143485 [details]
"suspicious RCU usage" with commits 3741540 and 222b5f0 reverted
Comment 16 Michel Dänzer 2019-02-27 09:23:31 UTC
The "suspicious RCU usage" looks related to the igb ethernet driver, not amdgpu. Quite likely it's already fixed in later 5.0-rc releases.
Comment 17 mikhail.v.gavrilov 2019-02-28 16:43:08 UTC
(In reply to Michel Dänzer from comment #16)
> The "suspicious RCU usage" looks related to the igb ethernet driver, not
> amdgpu. Quite likely it's already fixed in later 5.0-rc releases.

Thanks Michel, I knew that but forgot about it.
This can be fixed by reverting commit 59361316afcb08569af21e1af83e89c7051c055a, but even after that the kernel in this branch is still much buggier than the one from Linus' tree.
For example, when I launch Assassin's Creed Odyssey, filesystem-related errors start appearing and then the computer hangs.

[44188.533349] ------------[ cut here ]------------
[44188.533840] kernel BUG at fs/inode.c:515!
[44188.533906] invalid opcode: 0000 [#1] SMP NOPTI
[44188.533909] CPU: 2 PID: 114 Comm: kswapd0 Tainted: G         C        5.0.0-rc1-drm-next-kernel-3+ #3
[44188.533910] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
[44188.533915] RIP: 0010:clear_inode+0x7c/0x90
[44188.533917] Code: 2b a8 40 75 29 48 8b 83 e8 01 00 00 48 8d 93 e8 01 00 00 48 39 c2 75 18 48 c7 83 e0 00 00 00 60 00 00 00 5b 5d c3 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
[44188.533918] RSP: 0018:ffffb577c3effae8 EFLAGS: 00010216
[44188.533920] RAX: ff4d207cff51227d RBX: ffff8e4f4f12f208 RCX: 0000000000000007
[44188.533922] RDX: ffff8e4f4f12f5c8 RSI: ffff8e561c5ec020 RDI: ffff8e561c5eb2c0
[44188.533923] RBP: ffff8e4f4f12f440 R08: 0000283071cf739b R09: 0000000000000000
[44188.533924] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9b042f00
[44188.533925] R13: ffff8e4f4f120780 R14: ffff8e4f4f120810 R15: ffff8e4f4f12f208
[44188.533927] FS:  0000000000000000(0000) GS:ffff8e563ce00000(0000) knlGS:0000000000000000
[44188.533928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[44188.533929] CR2: 00007fcc2acc75d0 CR3: 00000005437e0000 CR4: 00000000003406e0
[44188.533930] Call Trace:
[44188.533935]  proc_evict_inode+0x1d/0x50
[44188.533937]  evict+0xd0/0x1a0
[44188.533940]  __dentry_kill+0xdf/0x190
[44188.533942]  shrink_dentry_list+0x11c/0x330
[44188.533944]  prune_dcache_sb+0x53/0x80
[44188.533947]  super_cache_scan+0xf7/0x1b0
[44188.533951]  do_shrink_slab+0x18a/0x3d0
[44188.533954]  shrink_slab+0x170/0x2c0
[44188.533957]  shrink_node+0x177/0x460
[44188.533960]  balance_pgdat+0x2d5/0x560
[44188.533965]  kswapd+0x25b/0x540
[44188.533968]  ? finish_wait+0x90/0x90
[44188.533972]  kthread+0x108/0x140
[44188.533973]  ? balance_pgdat+0x560/0x560
[44188.533975]  ? kthread_park+0x90/0x90
[44188.533978]  ret_from_fork+0x27/0x50
[44188.533981] Modules linked in: snd_seq_dummy uinput fuse rfcomm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat arc4 r8822be(C) edac_mce_amd mac80211 kvm_amd kvm uvcvideo irqbypass videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_hda_codec_realtek videobuf2_common snd_hda_codec_generic ledtrig_audio videodev snd_hda_codec_hdmi snd_usb_audio joydev snd_hda_intel cfg80211 snd_hda_codec media snd_usbmidi_lib btusb snd_hda_core snd_rawmidi btrtl snd_seq btbcm btintel crct10dif_pclmul snd_hwdep bluetooth snd_seq_device crc32_pclmul eeepc_wmi asus_wmi snd_pcm sparse_keymap video
[44188.534011]  wmi_bmof ghash_clmulni_intel snd_timer snd ccp ecdh_generic sp5100_tco soundcore rfkill i2c_piix4 k10temp pcspkr pcc_cpufreq gpio_amdpt gpio_generic acpi_cpufreq binfmt_misc hid_sony amdgpu ff_memless hid_logitech_hidpp chash amd_iommu_v2 gpu_sched ttm drm_kms_helper drm crc32c_intel igb nvme dca i2c_algo_bit hid_logitech_dj nvme_core wmi pinctrl_amd
[44188.534028] ---[ end trace 85933fec4c138d6a ]---
Comment 18 mikhail.v.gavrilov 2019-02-28 16:45:43 UTC
Created attachment 143497 [details]
"amd-staging-drm-next" with commits 3741540, 222b5f0 and 5936131 reverted
Comment 19 Andrey Grodzovsky 2019-04-05 20:41:52 UTC
Created attachment 143882 [details] [review]
0001-drm-scheduler-rework-job-destruction.patch
Comment 20 Andrey Grodzovsky 2019-04-05 20:51:04 UTC
Created attachment 143883 [details] [review]
0002-drm-sched-Adapt-drivers-to-new-job-destruction-logic.patch

Mikhail, the 2 attached patches should help avoid the deadlock you see. Please apply them on top of the latest https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next.
Comment 21 mikhail.v.gavrilov 2019-04-06 14:44:12 UTC
Created attachment 143886 [details]
dmesg from patched kernel [0001 + 0002]
Comment 22 mikhail.v.gavrilov 2019-04-06 14:48:46 UTC
GPU reset is not working again.
In dmesg I see only the message "GPU reset begin!", but the Alt-F3 .. F7 keys don't work, and later "init 6" couldn't reboot the PC.
Comment 23 Andrey Grodzovsky 2019-04-08 14:46:25 UTC
Created attachment 143897 [details] [review]
[PATCH] drm_amd_display: wait for fence without holding reservation lock.eml

This again looks like a deadlock between the GPU reset and the display code waiting on scanout FB fences. Please manually apply the attached patch and test again.
Comment 24 Andrey Grodzovsky 2019-04-08 14:48:19 UTC
Created attachment 143898 [details] [review]
[PATCH] drm_amd_display: wait for fence without holding reservation lock.patch

Reattaching.
Comment 25 Dieter Nützel 2019-04-08 21:28:37 UTC
(In reply to Andrey Grodzovsky from comment #24)
> Created attachment 143898 [details] [review] [review]
> [PATCH] drm_amd_display: wait for fence without holding reservation
> lock.patch
> 
> Reattaching.

Hello Andrey,

is this one needed even on your sent triple pack (dri-devel and amd-gfx)?
Comment 26 Andrey Grodzovsky 2019-04-09 01:51:53 UTC
(In reply to Dieter Nützel from comment #25)
> (In reply to Andrey Grodzovsky from comment #24)
> > Created attachment 143898 [details] [review] [review] [review]
> > [PATCH] drm_amd_display: wait for fence without holding reservation
> > lock.patch
> > 
> > Reattaching.
> 
> Hello Andrey,
> 
> is this one needed even on your sent triple pack (dri-devel and amd-gfx)?

If by "triple pack" you mean the 3 patches I sent out today, then yes, why not. This patch was sent a while ago and even got a Reviewed-by, but I don't see it in the latest code, so it probably was never submitted.
Comment 27 mikhail.v.gavrilov 2019-04-09 03:15:01 UTC
Created attachment 143904 [details]
dmesg from patched kernel [0001 + 0002 + 3]
Comment 28 mikhail.v.gavrilov 2019-04-09 03:16:00 UTC
And again the deadlock occurred.
Comment 29 Andrey Grodzovsky 2019-04-09 16:34:57 UTC
(In reply to mikhail.v.gavrilov from comment #28)
> And again deadlock occurred.

Yeah, looks like another annoying regression. First of all, try reverting all 3 patches I uploaded here and see whether you are back to the original deadlock, or whether this issue is still there masking our problem.
Comment 30 mikhail.v.gavrilov 2019-04-10 03:41:52 UTC
Created attachment 143908 [details]
dmesg without patches

Without the patches, the backtrace looks like the initial deadlock backtrace.
Comment 31 Andrey Grodzovsky 2019-04-10 14:25:16 UTC
(In reply to mikhail.v.gavrilov from comment #30)
> Created attachment 143908 [details]
> dmesg without patches
> 
> Without patches backtrace are looks like initial deadlock backtrace.

OK, it seems the patches somehow break display engine reset. Let me look into it.
Comment 32 Andrey Grodzovsky 2019-04-10 19:01:54 UTC
(In reply to mikhail.v.gavrilov from comment #30)
> Created attachment 143908 [details]
> dmesg without patches
> 
> Without patches backtrace are looks like initial deadlock backtrace.

Can you please apply back the first 2 patches and then do the following:

Install the trace-cmd utility, then run
sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
to enable kernel event tracing.

After the hang happens, run
cd /sys/kernel/debug/tracing && cat trace > event_dump
and upload the dump here.
Comment 33 Andrey Grodzovsky 2019-04-10 20:11:44 UTC
Another thing: I noticed a regression in the existing driver code. Can you manually apply this patch and test?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f7c8930d9e8b188caa53705883373c49f95fd284
Comment 34 Andrey Grodzovsky 2019-04-10 20:14:48 UTC
(In reply to Andrey Grodzovsky from comment #33)
> Another thing - I noted a regression in existing driver code - can you
> manually apply this patch and test ?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=f7c8930d9e8b188caa53705883373c49f95fd284

Just to be clear, apply the first 2 patches and this one on top, i.e.:
0001-drm-scheduler-rework-job-destruction.patch
0002-drm-sched-Adapt-drivers-to-new-job-destruction-logic.patch
and then the patch from the link.
Comment 35 mikhail.v.gavrilov 2019-04-11 03:53:30 UTC
Created attachment 143927 [details]
dmesg (with successful GPU reset after applying patches from Comment 34)

Hooray, it's a win! Now the GPU successfully resets without a deadlock.

1. When will these patches land in mainline?

2. Is it possible to reduce the number of messages in the terminal (better yet, not show these messages at all)? It's very hard to kill the `gnome-session` process when new messages like "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!" constantly appear in the console.

This flooding message is a real headache.
Comment 36 mikhail.v.gavrilov 2019-04-11 04:03:30 UTC
Created attachment 143928 [details]
Demonstration of how the message "drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!" is a real pain
Comment 37 Andrey Grodzovsky 2019-04-11 14:25:39 UTC
(In reply to mikhail.v.gavrilov from comment #36)
> Created attachment 143928 [details]
> Demonstration how message "drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to
> initialize parser -125!" get a real pain

Those messages are actually expected for this particular use case; I might consider switching them to info.

I am working on completing the review and pushing those patches into our upstream branch.

Thanks for testing.
Comment 38 Michel Dänzer 2019-04-11 14:41:00 UTC
(In reply to Andrey Grodzovsky from comment #37)
> Those messages are actually expected for this particular  use case, i might
> consider switching them to info.

I'd argue they should be debug, so they don't spam dmesg by default.
Comment 39 mikhail.v.gavrilov 2019-04-11 14:52:28 UTC
(In reply to Michel Dänzer from comment #38)
> 
> I'd argue they should be debug, so they don't spam dmesg by default.

Let these messages stay in the kernel log (dmesg), but I don't want to see them when I enter commands in the terminal. They make entering commands in the terminal very annoying.
Comment 40 Andrey Grodzovsky 2019-04-11 16:32:12 UTC
(In reply to Michel Dänzer from comment #38)
> (In reply to Andrey Grodzovsky from comment #37)
> > Those messages are actually expected for this particular  use case, i might
> > consider switching them to info.
> 
> I'd argue they should be debug, so they don't spam dmesg by default.

Actually, since in the non-reset case this is an actual error, maybe I should just print it when it happens NOT during a reset.
Comment 41 Michel Dänzer 2019-04-11 16:42:55 UTC
(In reply to Andrey Grodzovsky from comment #40)
> Actually since in non reset case this is an actual error maybe i just print
> it this happens NOT during reset.

Buggy / malicious userspace can probably trigger an error here, allowing it to spam dmesg, so it should be debug.
Comment 42 mikhail.v.gavrilov 2019-05-12 20:58:56 UTC
Andrey, could you look at another GPU reset issue?

[18735.255511] sony 0003:054C:09CC.0008: input,hidraw4: USB HID v81.11 Gamepad [Sony Interactive Entertainment Wireless Controller] on usb-0000:0c:00.3-1.4.4/input3
[18735.340415] usbcore: registered new interface driver snd-usb-audio
[18742.241131] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=201025, emitted seq=201027
[18742.241141] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[18742.241196] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[18742.241200] amdgpu 0000:0b:00.0: GPU reset begin!
[18742.251116] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2264508, emitted seq=2264510
[18742.251153] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chrome pid 17912 thread chrome:cs0 pid 17915
[18742.251156] amdgpu 0000:0b:00.0: GPU reset begin!
[18742.754134] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=15388, emitted seq=15388
[18742.754203] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[18742.754207] amdgpu 0000:0b:00.0: GPU reset begin!
[18751.968977] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=346, emitted seq=346
[18751.969042] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chrome pid 3587 thread chrome:cs0 pid 3604
[18751.969047] amdgpu 0000:0b:00.0: GPU reset begin!
[18753.504894] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[18763.744722] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
[18763.744758] amdgpu: [powerplay] Failed to send message 0x47, response 0xffffffff
[18763.744770] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
[18763.744772] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[18763.744784] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
[18763.744785] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
[18763.744796] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
[18763.744798] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[18763.744809] amdgpu: [powerplay] Failed to send message 0x26, response 0xffffffff
[18763.744810] amdgpu: [powerplay] Failed to set soft min gfxclk !
[18763.744811] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[18763.779547] [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:950
[18763.779651] WARNING: CPU: 4 PID: 25404 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:277 generic_reg_wait.cold+0x29/0x30 [amdgpu]
[18763.779652] Modules linked in: snd_usb_audio hid_sony ff_memless snd_usbmidi_lib snd_rawmidi snd_seq_dummy uinput fuse rfcomm cmac nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables bnep sunrpc vfat fat arc4 joydev r8822be(C) snd_hda_codec_realtek snd_hda_codec_generic edac_mce_amd ledtrig_audio eeepc_wmi snd_hda_codec_hdmi asus_wmi kvm_amd sparse_keymap snd_hda_intel video wmi_bmof mac80211 snd_hda_codec kvm snd_hda_core snd_hwdep snd_seq btusb btrtl snd_seq_device btbcm irqbypass btintel bluetooth snd_pcm crct10dif_pclmul cfg80211 crc32_pclmul snd_timer ghash_clmulni_intel snd sp5100_tco ecdh_generic soundcore i2c_piix4 k10temp rfkill ccp
[18763.779676]  gpio_amdpt gpio_generic pcc_cpufreq acpi_cpufreq xfs libcrc32c amdgpu chash gpu_sched amd_iommu_v2 ttm drm_kms_helper igb drm nvme dca i2c_algo_bit crc32c_intel nvme_core wmi pinctrl_amd
[18763.779687] CPU: 4 PID: 25404 Comm: kworker/4:0 Tainted: G         C        5.1.0-1.fc31.x86_64 #1
[18763.779689] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
[18763.779694] Workqueue: events drm_sched_job_timedout [gpu_sched]
[18763.779764] RIP: 0010:generic_reg_wait.cold+0x29/0x30 [amdgpu]
[18763.779766] Code: ff 44 8b 44 24 68 48 8b 4c 24 60 48 c7 c7 10 c0 8b c0 8b 54 24 58 8b 34 24 e8 b0 59 92 d0 41 83 7c 24 20 01 0f 84 9b 78 fe ff <0f> 0b e9 94 78 fe ff e8 9a 4d ed ff 48 c7 c7 00 b0 95 c0 e8 de 5f
[18763.779767] RSP: 0018:ffff9b5320bb7720 EFLAGS: 00010297
[18763.779769] RAX: 0000000000000052 RBX: 0000000000000bb9 RCX: 0000000000000000
[18763.779771] RDX: 0000000000000000 RSI: ffff8d40fd9168c8 RDI: ffff8d40fd9168c8
[18763.779772] RBP: 00000000000050e2 R08: ffff8d40fd9168c8 R09: 000000000000053d
[18763.779773] R10: ffff8d411e39d7d0 R11: ffff9b5320bb75d5 R12: ffff8d40f97ca580
[18763.779774] R13: 00000000000050e2 R14: 00000000ffffffff R15: 00000000ffffffff
[18763.779776] FS:  0000000000000000(0000) GS:ffff8d40fd900000(0000) knlGS:0000000000000000
[18763.779778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[18763.779779] CR2: 00000c0d6e33c000 CR3: 000000027ed00000 CR4: 00000000003406e0
[18763.779780] Call Trace:
[18763.779854]  dce110_stream_encoder_dp_blank+0x159/0x2e0 [amdgpu]
[18763.779921]  core_link_disable_stream+0x42/0x270 [amdgpu]
[18763.779987]  dce110_reset_hw_ctx_wrap+0xca/0x1f0 [amdgpu]
[18763.780053]  dce110_apply_ctx_to_hw+0x4a/0x490 [amdgpu]
[18763.780100]  ? amdgpu_pm_compute_clocks+0xb9/0x5e0 [amdgpu]
[18763.780168]  ? dm_pp_apply_display_requirements+0x19e/0x1b0 [amdgpu]
[18763.780232]  dc_commit_state+0x26b/0x570 [amdgpu]
[18763.780236]  ? vsnprintf+0x3aa/0x4f0
[18763.780304]  amdgpu_dm_atomic_commit_tail+0x3e2/0x1980 [amdgpu]
[18763.780308]  ? vt_console_print+0x74/0x3f0
[18763.780311]  ? up+0x12/0x60
[18763.780314]  ? irq_work_queue+0x91/0xa0
[18763.780316]  ? wake_up_klogd+0x30/0x40
[18763.780318]  ? vprintk_emit+0x1ef/0x250
[18763.780320]  ? printk+0x58/0x6f
[18763.780328]  ? drm_atomic_helper_wait_for_dependencies+0x1e4/0x1f0 [drm_kms_helper]
[18763.780341]  ? drm_err+0x72/0x90 [drm]
[18763.780349]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[18763.780356]  commit_tail+0x3c/0x70 [drm_kms_helper]
[18763.780364]  drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
[18763.780371]  drm_atomic_helper_disable_all+0x144/0x160 [drm_kms_helper]
[18763.780378]  drm_atomic_helper_suspend+0x60/0xf0 [drm_kms_helper]
[18763.780445]  dm_suspend+0x1c/0x60 [amdgpu]
[18763.780489]  amdgpu_device_ip_suspend_phase1+0x91/0xc0 [amdgpu]
[18763.780532]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[18763.780601]  amdgpu_device_pre_asic_reset+0x1f4/0x209 [amdgpu]
[18763.780668]  amdgpu_device_gpu_recover+0x77/0x731 [amdgpu]
[18763.780730]  amdgpu_job_timedout+0xf7/0x120 [amdgpu]
[18763.780733]  drm_sched_job_timedout+0x3a/0x70 [gpu_sched]
[18763.780737]  process_one_work+0x19d/0x380
[18763.780739]  worker_thread+0x50/0x3b0
[18763.780742]  kthread+0xfb/0x130
[18763.780744]  ? process_one_work+0x380/0x380
[18763.780745]  ? kthread_park+0x90/0x90
[18763.780748]  ret_from_fork+0x22/0x40
[18763.780751] ---[ end trace 1b9ec5613589027e ]---
Comment 43 mikhail.v.gavrilov 2019-05-12 21:00:22 UTC
Created attachment 144242 [details]
dmesg of another GPU reset issue
Comment 44 Martin Peres 2019-11-19 09:14:47 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/702.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.