Bug 108710

Summary: Since kernel 4.20, Vega 56 hangs when I browse pages in the Steam client
Product: DRI
Reporter: mikhail.v.gavrilov
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments (flags in parentheses):
dmesg (none)
yet another dmesg (none)
dmesg 4.20 rc2 (none)
dmesg 4.20 rc2 with patch from comment 4 (none)
dmesg 4.20 rc2 with patch from comment 4 (GPU hang again and again) (none)
4.20rc3 still freezes (none)
4.20 g94f371cb7394 + mesa 18.3.0-rc5 (none)

Description mikhail.v.gavrilov 2018-11-11 11:46:17 UTC
Created attachment 142434 [details]
dmesg

$ inxi -bM
System:    Host: localhost.localdomain Kernel: 4.20.0-0.rc1.git4.1.fc30.x86_64 x86_64 bits: 64 Desktop: Gnome 3.30.1 
           Distro: Fedora release 30 (Rawhide) 
Machine:   Type: Desktop Mobo: ASUSTeK model: ROG STRIX X470-I GAMING v: Rev 1.xx serial: <root required> 
           UEFI: American Megatrends v: 0901 date: 07/23/2018 
CPU:       8-Core: AMD Ryzen 7 2700X type: MT MCP speed: 3427 MHz min/max: 2200/4000 MHz 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel 
           Display: wayland server: Fedora Project X.org 1.20.3 driver: amdgpu resolution: 3840x2160~60Hz 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.27.0 4.20.0-0.rc1.git4.1.fc30.x86_64 LLVM 7.0.0) v: 4.5 Mesa 18.2.4 
Network:   Device-1: Intel I211 Gigabit Network driver: igb 
           Device-2: Realtek RTL8822BE 802.11a/b/g/n/ac WiFi adapter driver: r8822be 
Drives:    Local Storage: total: 11.36 TiB used: 5.93 TiB (52.2%) 
Info:      Processes: 455 Uptime: 16m Memory: 31.30 GiB used: 15.99 GiB (51.1%) Shell: bash inxi: 3.0.27 


[ 3852.511166] gmc_v9_0_process_interrupt: 56 callbacks suppressed
[ 3852.511182] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 3852.511184] amdgpu 0000:0b:00.0:   in page starting at address 0x000000401080c000 from 18
[ 3852.511186] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00040152
[ 3862.673344] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=72072, emitted seq=72074
[ 3862.673356] [drm] GPU recovery disabled.
[ 4044.170764] sysrq: SysRq : Show Blocked State
[ 4044.170959]   task                        PC stack   pid father
[ 4044.171026] kworker/u32:5   D10872   253      2 0x80000000
[ 4044.171060] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 4044.171063] Call Trace:
[ 4044.171073]  ? __schedule+0x2f3/0xb90
[ 4044.171077]  ? __lock_acquire+0x279/0x1650
[ 4044.171085]  ? dma_fence_default_wait+0x242/0x330
[ 4044.171089]  schedule+0x2f/0x90
[ 4044.171092]  schedule_timeout+0x31c/0x4f0
[ 4044.171096]  ? find_held_lock+0x34/0xa0
[ 4044.171099]  ? find_held_lock+0x34/0xa0
[ 4044.171104]  ? mark_held_locks+0x57/0x80
[ 4044.171134]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 4044.171140]  ? dma_fence_default_wait+0x242/0x330
[ 4044.171143]  dma_fence_default_wait+0x26e/0x330
[ 4044.171147]  ? dma_fence_release+0x120/0x120
[ 4044.171153]  dma_fence_wait_timeout+0x182/0x200
[ 4044.171160]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[ 4044.171263]  amdgpu_dm_do_flip+0x112/0x380 [amdgpu]
[ 4044.171378]  amdgpu_dm_atomic_commit_tail+0x6d0/0xd30 [amdgpu]
[ 4044.171386]  ? _raw_spin_unlock_irq+0x29/0x40
[ 4044.171391]  ? wait_for_completion_timeout+0x73/0x1a0
[ 4044.171408]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 4044.171413]  process_one_work+0x27d/0x600
[ 4044.171423]  worker_thread+0x3c/0x390
[ 4044.171428]  ? drain_workqueue+0x180/0x180
[ 4044.171433]  kthread+0x120/0x140
[ 4044.171437]  ? kthread_park+0x80/0x80
[ 4044.171442]  ret_from_fork+0x27/0x50
[ 4044.172479] (time-dir)      D13944 15221      1 0x00000000
[ 4044.172487] Call Trace:
[ 4044.172496]  ? __schedule+0x2f3/0xb90
[ 4044.172501]  ? prepare_to_wait_event+0xd2/0x180
[ 4044.172508]  schedule+0x2f/0x90
[ 4044.172514]  drm_sched_entity_flush+0x1df/0x1f0 [gpu_sched]
[ 4044.172518]  ? finish_wait+0x80/0x80
[ 4044.172580]  amdgpu_ctx_mgr_entity_flush+0x7c/0xc0 [amdgpu]
[ 4044.172637]  amdgpu_flush+0x1f/0x30 [amdgpu]
[ 4044.172640]  filp_close+0x34/0x70
[ 4044.172645]  __x64_sys_close+0x1e/0x50
[ 4044.172649]  do_syscall_64+0x60/0x1f0
[ 4044.172653]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 4044.172656] RIP: 0033:0x7f5a96622ec7
[ 4044.172662] Code: Bad RIP value.
[ 4044.172665] RSP: 002b:00007ffcce3d00e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[ 4044.172668] RAX: ffffffffffffffda RBX: 000000000000007c RCX: 00007f5a96622ec7
[ 4044.172671] RDX: 0000000000000000 RSI: 00007ffcce3d0180 RDI: 000000000000007c
[ 4044.172673] RBP: 000055d29a73aa60 R08: 000055d29a73b676 R09: 0000000000000000
[ 4044.172675] R10: 00007f5a965bbae0 R11: 0000000000000293 R12: 00007f5a95939750
[ 4044.172677] R13: 0000000000000000 R14: 0000000000000001 R15: 00007ffcce3d0180
[ 4057.229953] INFO: task kworker/u32:5:253 blocked for more than 120 seconds.
[ 4057.229957]       Tainted: G        WC        4.20.0-0.rc1.git4.1.fc30.x86_64 #1
[ 4057.229959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4057.229962] kworker/u32:5   D10872   253      2 0x80000000
[ 4057.229979] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 4057.229982] Call Trace:
[ 4057.229994]  ? __schedule+0x2f3/0xb90
[ 4057.229998]  ? __lock_acquire+0x279/0x1650
[ 4057.230006]  ? dma_fence_default_wait+0x242/0x330
[ 4057.230010]  schedule+0x2f/0x90
[ 4057.230013]  schedule_timeout+0x31c/0x4f0
[ 4057.230017]  ? find_held_lock+0x34/0xa0
[ 4057.230020]  ? find_held_lock+0x34/0xa0
[ 4057.230025]  ? mark_held_locks+0x57/0x80
[ 4057.230028]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 4057.230034]  ? dma_fence_default_wait+0x242/0x330
[ 4057.230037]  dma_fence_default_wait+0x26e/0x330
[ 4057.230041]  ? dma_fence_release+0x120/0x120
[ 4057.230047]  dma_fence_wait_timeout+0x182/0x200
[ 4057.230052]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[ 4057.230134]  amdgpu_dm_do_flip+0x112/0x380 [amdgpu]
[ 4057.230221]  amdgpu_dm_atomic_commit_tail+0x6d0/0xd30 [amdgpu]
[ 4057.230228]  ? _raw_spin_unlock_irq+0x29/0x40
[ 4057.230232]  ? wait_for_completion_timeout+0x73/0x1a0
[ 4057.230249]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 4057.230254]  process_one_work+0x27d/0x600
[ 4057.230263]  worker_thread+0x3c/0x390
[ 4057.230269]  ? drain_workqueue+0x180/0x180
[ 4057.230272]  kthread+0x120/0x140
[ 4057.230276]  ? kthread_park+0x80/0x80
[ 4057.230281]  ret_from_fork+0x27/0x50
[ 4057.230571] 
               Showing all locks held in the system:
[ 4057.230581] 1 lock held by khungtaskd/94:
[ 4057.230583]  #0: 00000000a1fc4e6f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x183
[ 4057.230596] 3 locks held by kworker/u32:5/253:
[ 4057.230597]  #0: 00000000156505f1 ((wq_completion)"events_unbound"){+.+.}, at: process_one_work+0x1f3/0x600
[ 4057.230603]  #1: 000000000d248f14 ((work_completion)(&state->commit_work)){+.+.}, at: process_one_work+0x1f3/0x600
[ 4057.230608]  #2: 000000003df03870 (reservation_ww_class_mutex){+.+.}, at: amdgpu_dm_do_flip+0xd6/0x380 [amdgpu]
[ 4057.230700] 2 locks held by gnome-shell/2152:
[ 4057.230702]  #0: 00000000a2cb2cbf (crtc_ww_class_acquire){+.+.}, at: drm_mode_cursor_common+0x95/0x220 [drm]
[ 4057.230721]  #1: 00000000e86bda0d (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x101/0x120 [drm]
[ 4057.230746] 5 locks held by Xwayland/2222:
[ 4057.230784] 1 lock held by htop/3225:
[ 4057.230848] 1 lock held by CPU 0/KVM/4333:
[ 4057.230989] 1 lock held by (time-dir)/15221:
[ 4057.230991]  #0: 000000006ef8a6af (&mgr->lock){+.+.}, at: amdgpu_ctx_mgr_entity_flush+0x3c/0xc0 [amdgpu]

[ 4057.231068] =============================================
Comment 1 mikhail.v.gavrilov 2018-11-12 06:00:02 UTC
Created attachment 142440 [details]
yet another dmesg
Comment 2 mikhail.v.gavrilov 2018-11-14 03:59:51 UTC
Unfortunately, this annoying bug is still not fixed in 4.20 rc2.
Comment 3 mikhail.v.gavrilov 2018-11-14 04:00:17 UTC
Created attachment 142458 [details]
dmesg 4.20 rc2
Comment 4 Alex Deucher 2018-11-14 14:52:44 UTC
Does this patch help?
https://patchwork.freedesktop.org/patch/261435/
Comment 5 mikhail.v.gavrilov 2018-11-14 19:35:57 UTC
Alex, unfortunately this patch did not help.

I no longer see messages like these from comment 3:
[ 1136.956119] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:171 vmid:2 pasid:32776, for process SOTTR.exe pid 12574 thread SOTTR.exe pid 12574)
[ 1136.956122] amdgpu 0000:0b:00.0:   in page starting at address 0x00008001802c0000 from 27

but the GPU still hangs with the usual message:
[  390.017999] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=58179, emitted seq=58215
[  390.018001] [drm] GPU recovery disabled.
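
To compare the hangs across the attached dmesg files, the timeout lines can be extracted with a small script. This is only a sketch: the function name `parse_ring_timeouts` and the `pending` field are made up for illustration, and reading "emitted seq minus signaled seq" as the number of fences still outstanding on the ring is an assumption, not something confirmed in this report.

```python
import re

# Matches the amdgpu_job_timedout lines as they are quoted in this report.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(dmesg_text):
    """Return one dict per ring timeout line found in dmesg_text."""
    results = []
    for line in dmesg_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m is None:
            continue
        signaled = int(m["signaled"])
        emitted = int(m["emitted"])
        results.append({
            "timestamp": float(m["ts"]),
            "ring": m["ring"],
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: the gap is the count of fences not yet signaled.
            "pending": emitted - signaled,
        })
    return results


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python ring_timeouts.py dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for hit in parse_ring_timeouts(fh.read()):
            print(hit)
```

Running it over each attachment makes it easy to see which rings (sdma0, sdma1) time out and how far apart the two sequence numbers are in each hang.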
Comment 6 mikhail.v.gavrilov 2018-11-14 19:36:46 UTC
Created attachment 142467 [details]
dmesg 4.20 rc2 with patch from comment 4
Comment 7 mikhail.v.gavrilov 2018-11-15 16:43:40 UTC
Created attachment 142482 [details]
dmesg 4.20 rc2 with patch from comment 4 (GPU hang again and again)
Comment 8 mikhail.v.gavrilov 2018-11-15 16:46:05 UTC
Oh, I see these messages again (even with the proposed patch and Mesa 18.3.0-rc2):

[ 1784.721401] gmc_v9_0_process_interrupt: 1 callbacks suppressed
[ 1784.721406] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 1784.721409] amdgpu 0000:0b:00.0:   in page starting at address 0x000000010001a000 from 18
[ 1784.721410] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00040152
[ 1795.007321] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=74616, emitted seq=74621
[ 1795.007324] [drm] GPU recovery disabled.
[ 1795.011389] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=233554, emitted seq=233557
[ 1795.011391] [drm] GPU recovery disabled.


$ inxi -bM
System:    Host: localhost.localdomain Kernel: 4.20.0-0.rc2.git0.1.local.fc30.x86_64 x86_64 bits: 64 Desktop: Gnome 3.31.2 
           Distro: Fedora release 30 (Rawhide) 
Machine:   Type: Desktop Mobo: ASUSTeK model: ROG STRIX X470-I GAMING v: Rev 1.xx serial: <root required> 
           UEFI: American Megatrends v: 0901 date: 07/23/2018 
CPU:       8-Core: AMD Ryzen 7 2700X type: MT MCP speed: 2506 MHz min/max: 2200/4000 MHz 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel 
           Display: wayland server: Fedora Project X.org 1.20.3 driver: amdgpu resolution: 3840x2160~60Hz 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.27.0 4.20.0-0.rc2.git0.1.local.fc30.x86_64 LLVM 7.0.0) 
           v: 4.5 Mesa 18.3.0-rc2 
Network:   Device-1: Intel I211 Gigabit Network driver: igb 
           Device-2: Realtek RTL8822BE 802.11a/b/g/n/ac WiFi adapter driver: r8822be 
Drives:    Local Storage: total: 11.36 TiB used: 5.95 TiB (52.4%) 
Info:      Processes: 443 Uptime: 15m Memory: 31.34 GiB used: 15.94 GiB (50.9%) Shell: bash inxi: 3.0.27
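
For checking whether the page faults still appear in a given dmesg, the fault lines can be pulled out and paired with the address line that follows them. This is a rough sketch under stated assumptions: `parse_page_faults` is a hypothetical helper name, the patterns are written against the exact line format quoted in this report, and the script only extracts the fields without interpreting them.

```python
import re

# Header line: "[mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0, ..."
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] VMC page fault \(src_id:(?P<src_id>\d+) "
    r"ring:(?P<ring>\d+) vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)"
)
# Follow-up line: "in page starting at address 0x... from 18"
ADDR_RE = re.compile(r"in page starting at address (?P<addr>0x[0-9a-fA-F]+)")


def parse_page_faults(dmesg_text):
    """Return one dict per VMC page fault, with its faulting page address."""
    faults = []
    for line in dmesg_text.splitlines():
        m = FAULT_RE.search(line)
        if m is not None:
            faults.append({
                "hub": m["hub"],
                "ring": int(m["ring"]),
                "vmid": int(m["vmid"]),
                "pasid": int(m["pasid"]),
                "address": None,  # filled in from the next address line
            })
            continue
        a = ADDR_RE.search(line)
        if a is not None and faults and faults[-1]["address"] is None:
            faults[-1]["address"] = int(a["addr"], 16)
    return faults


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python page_faults.py dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for fault in parse_page_faults(fh.read()):
            print(fault)
```

Comparing the output for the patched and unpatched kernels shows at a glance whether the mmhub and gfxhub faults are both still present.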
Comment 9 mikhail.v.gavrilov 2018-11-22 03:45:00 UTC
Created attachment 142564 [details]
4.20rc3 still freezes
Comment 10 mikhail.v.gavrilov 2018-12-02 12:15:05 UTC
It looks like the problem went away after commit 94f371cb7394.
In Fedora, this is package 4.20.0-0.rc4.git2.1.fc30.x86_64.
Comment 11 mikhail.v.gavrilov 2018-12-04 22:04:14 UTC
I was able to reproduce this issue again with mesa 18.3.0-rc5.
Comment 12 mikhail.v.gavrilov 2018-12-04 22:05:16 UTC
Created attachment 142726 [details]
4.20 g94f371cb7394 + mesa 18.3.0-rc5
Comment 13 Martin Peres 2019-11-19 09:03:58 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed to further activity.

You can subscribe to and participate in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/604.
