Bug 108710 - Since 4.20 kernel Vega 56 hangs when I surf pages in steam client
Summary: Since 4.20 kernel Vega 56 hangs when I surf pages in steam client
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-11-11 11:46 UTC by mikhail.v.gavrilov
Modified: 2018-12-04 22:05 UTC
CC List: 0 users


Attachments
dmesg (142.49 KB, text/plain), 2018-11-11 11:46 UTC, mikhail.v.gavrilov
yet another dmesg (144.89 KB, text/plain), 2018-11-12 06:00 UTC, mikhail.v.gavrilov
dmesg 4.20 rc2 (91.08 KB, text/plain), 2018-11-14 04:00 UTC, mikhail.v.gavrilov
dmesg 4.20 rc2 with patch from comment 4 (98.90 KB, text/plain), 2018-11-14 19:36 UTC, mikhail.v.gavrilov
dmesg 4.20 rc2 with patch from comment 4 (GPU hang again and again) (93.69 KB, text/plain), 2018-11-15 16:43 UTC, mikhail.v.gavrilov
4.20rc3 still freezes (97.46 KB, text/plain), 2018-11-22 03:45 UTC, mikhail.v.gavrilov
4.20 g94f371cb7394 + mesa 18.3.0-rc5 (163.28 KB, text/plain), 2018-12-04 22:05 UTC, mikhail.v.gavrilov

Description mikhail.v.gavrilov 2018-11-11 11:46:17 UTC
Created attachment 142434 [details]
dmesg

$ inxi -bM
System:    Host: localhost.localdomain Kernel: 4.20.0-0.rc1.git4.1.fc30.x86_64 x86_64 bits: 64 Desktop: Gnome 3.30.1 
           Distro: Fedora release 30 (Rawhide) 
Machine:   Type: Desktop Mobo: ASUSTeK model: ROG STRIX X470-I GAMING v: Rev 1.xx serial: <root required> 
           UEFI: American Megatrends v: 0901 date: 07/23/2018 
CPU:       8-Core: AMD Ryzen 7 2700X type: MT MCP speed: 3427 MHz min/max: 2200/4000 MHz 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel 
           Display: wayland server: Fedora Project X.org 1.20.3 driver: amdgpu resolution: 3840x2160~60Hz 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.27.0 4.20.0-0.rc1.git4.1.fc30.x86_64 LLVM 7.0.0) v: 4.5 Mesa 18.2.4 
Network:   Device-1: Intel I211 Gigabit Network driver: igb 
           Device-2: Realtek RTL8822BE 802.11a/b/g/n/ac WiFi adapter driver: r8822be 
Drives:    Local Storage: total: 11.36 TiB used: 5.93 TiB (52.2%) 
Info:      Processes: 455 Uptime: 16m Memory: 31.30 GiB used: 15.99 GiB (51.1%) Shell: bash inxi: 3.0.27 


[ 3852.511166] gmc_v9_0_process_interrupt: 56 callbacks suppressed
[ 3852.511182] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 3852.511184] amdgpu 0000:0b:00.0:   in page starting at address 0x000000401080c000 from 18
[ 3852.511186] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00040152
[ 3862.673344] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=72072, emitted seq=72074
[ 3862.673356] [drm] GPU recovery disabled.
[ 4044.170764] sysrq: SysRq : Show Blocked State
[ 4044.170959]   task                        PC stack   pid father
[ 4044.171026] kworker/u32:5   D10872   253      2 0x80000000
[ 4044.171060] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 4044.171063] Call Trace:
[ 4044.171073]  ? __schedule+0x2f3/0xb90
[ 4044.171077]  ? __lock_acquire+0x279/0x1650
[ 4044.171085]  ? dma_fence_default_wait+0x242/0x330
[ 4044.171089]  schedule+0x2f/0x90
[ 4044.171092]  schedule_timeout+0x31c/0x4f0
[ 4044.171096]  ? find_held_lock+0x34/0xa0
[ 4044.171099]  ? find_held_lock+0x34/0xa0
[ 4044.171104]  ? mark_held_locks+0x57/0x80
[ 4044.171134]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 4044.171140]  ? dma_fence_default_wait+0x242/0x330
[ 4044.171143]  dma_fence_default_wait+0x26e/0x330
[ 4044.171147]  ? dma_fence_release+0x120/0x120
[ 4044.171153]  dma_fence_wait_timeout+0x182/0x200
[ 4044.171160]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[ 4044.171263]  amdgpu_dm_do_flip+0x112/0x380 [amdgpu]
[ 4044.171378]  amdgpu_dm_atomic_commit_tail+0x6d0/0xd30 [amdgpu]
[ 4044.171386]  ? _raw_spin_unlock_irq+0x29/0x40
[ 4044.171391]  ? wait_for_completion_timeout+0x73/0x1a0
[ 4044.171408]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 4044.171413]  process_one_work+0x27d/0x600
[ 4044.171423]  worker_thread+0x3c/0x390
[ 4044.171428]  ? drain_workqueue+0x180/0x180
[ 4044.171433]  kthread+0x120/0x140
[ 4044.171437]  ? kthread_park+0x80/0x80
[ 4044.171442]  ret_from_fork+0x27/0x50
[ 4044.172479] (time-dir)      D13944 15221      1 0x00000000
[ 4044.172487] Call Trace:
[ 4044.172496]  ? __schedule+0x2f3/0xb90
[ 4044.172501]  ? prepare_to_wait_event+0xd2/0x180
[ 4044.172508]  schedule+0x2f/0x90
[ 4044.172514]  drm_sched_entity_flush+0x1df/0x1f0 [gpu_sched]
[ 4044.172518]  ? finish_wait+0x80/0x80
[ 4044.172580]  amdgpu_ctx_mgr_entity_flush+0x7c/0xc0 [amdgpu]
[ 4044.172637]  amdgpu_flush+0x1f/0x30 [amdgpu]
[ 4044.172640]  filp_close+0x34/0x70
[ 4044.172645]  __x64_sys_close+0x1e/0x50
[ 4044.172649]  do_syscall_64+0x60/0x1f0
[ 4044.172653]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 4044.172656] RIP: 0033:0x7f5a96622ec7
[ 4044.172662] Code: Bad RIP value.
[ 4044.172665] RSP: 002b:00007ffcce3d00e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[ 4044.172668] RAX: ffffffffffffffda RBX: 000000000000007c RCX: 00007f5a96622ec7
[ 4044.172671] RDX: 0000000000000000 RSI: 00007ffcce3d0180 RDI: 000000000000007c
[ 4044.172673] RBP: 000055d29a73aa60 R08: 000055d29a73b676 R09: 0000000000000000
[ 4044.172675] R10: 00007f5a965bbae0 R11: 0000000000000293 R12: 00007f5a95939750
[ 4044.172677] R13: 0000000000000000 R14: 0000000000000001 R15: 00007ffcce3d0180
[ 4057.229953] INFO: task kworker/u32:5:253 blocked for more than 120 seconds.
[ 4057.229957]       Tainted: G        WC        4.20.0-0.rc1.git4.1.fc30.x86_64 #1
[ 4057.229959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4057.229962] kworker/u32:5   D10872   253      2 0x80000000
[ 4057.229979] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 4057.229982] Call Trace:
[ 4057.229994]  ? __schedule+0x2f3/0xb90
[ 4057.229998]  ? __lock_acquire+0x279/0x1650
[ 4057.230006]  ? dma_fence_default_wait+0x242/0x330
[ 4057.230010]  schedule+0x2f/0x90
[ 4057.230013]  schedule_timeout+0x31c/0x4f0
[ 4057.230017]  ? find_held_lock+0x34/0xa0
[ 4057.230020]  ? find_held_lock+0x34/0xa0
[ 4057.230025]  ? mark_held_locks+0x57/0x80
[ 4057.230028]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 4057.230034]  ? dma_fence_default_wait+0x242/0x330
[ 4057.230037]  dma_fence_default_wait+0x26e/0x330
[ 4057.230041]  ? dma_fence_release+0x120/0x120
[ 4057.230047]  dma_fence_wait_timeout+0x182/0x200
[ 4057.230052]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[ 4057.230134]  amdgpu_dm_do_flip+0x112/0x380 [amdgpu]
[ 4057.230221]  amdgpu_dm_atomic_commit_tail+0x6d0/0xd30 [amdgpu]
[ 4057.230228]  ? _raw_spin_unlock_irq+0x29/0x40
[ 4057.230232]  ? wait_for_completion_timeout+0x73/0x1a0
[ 4057.230249]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 4057.230254]  process_one_work+0x27d/0x600
[ 4057.230263]  worker_thread+0x3c/0x390
[ 4057.230269]  ? drain_workqueue+0x180/0x180
[ 4057.230272]  kthread+0x120/0x140
[ 4057.230276]  ? kthread_park+0x80/0x80
[ 4057.230281]  ret_from_fork+0x27/0x50
[ 4057.230571] 
               Showing all locks held in the system:
[ 4057.230581] 1 lock held by khungtaskd/94:
[ 4057.230583]  #0: 00000000a1fc4e6f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x183
[ 4057.230596] 3 locks held by kworker/u32:5/253:
[ 4057.230597]  #0: 00000000156505f1 ((wq_completion)"events_unbound"){+.+.}, at: process_one_work+0x1f3/0x600
[ 4057.230603]  #1: 000000000d248f14 ((work_completion)(&state->commit_work)){+.+.}, at: process_one_work+0x1f3/0x600
[ 4057.230608]  #2: 000000003df03870 (reservation_ww_class_mutex){+.+.}, at: amdgpu_dm_do_flip+0xd6/0x380 [amdgpu]
[ 4057.230700] 2 locks held by gnome-shell/2152:
[ 4057.230702]  #0: 00000000a2cb2cbf (crtc_ww_class_acquire){+.+.}, at: drm_mode_cursor_common+0x95/0x220 [drm]
[ 4057.230721]  #1: 00000000e86bda0d (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x101/0x120 [drm]
[ 4057.230746] 5 locks held by Xwayland/2222:
[ 4057.230784] 1 lock held by htop/3225:
[ 4057.230848] 1 lock held by CPU 0/KVM/4333:
[ 4057.230989] 1 lock held by (time-dir)/15221:
[ 4057.230991]  #0: 000000006ef8a6af (&mgr->lock){+.+.}, at: amdgpu_ctx_mgr_entity_flush+0x3c/0xc0 [amdgpu]

[ 4057.231068] =============================================
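
When reading call traces like the ones above, it can help to drop the entries prefixed with "?", which the kernel's stack unwinder marks as unverified, and keep only the reliable frames. The following is a minimal illustrative sketch in Python, assuming the dmesg line layout shown in this report; the function name `reliable_frames` and the shortened sample are hypothetical, chosen for the example only.

```python
import re

# One stack-trace line: "[ timestamp ]  [? ]symbol+0xoff/0xsize [module]".
# The optional "? " prefix marks a frame the unwinder could not verify.
FRAME_RE = re.compile(
    r"^\[\s*[\d.]+\]\s+(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+",
    re.MULTILINE,
)

def reliable_frames(trace_text):
    """Return the symbols of frames that are not prefixed with '?'."""
    return [
        match.group("symbol")
        for match in FRAME_RE.finditer(trace_text)
        if match.group("unreliable") is None
    ]

# Shortened excerpt of the kworker/u32:5 trace from this report.
SAMPLE_TRACE = """\
[ 4044.171073]  ? __schedule+0x2f3/0xb90
[ 4044.171089]  schedule+0x2f/0x90
[ 4044.171092]  schedule_timeout+0x31c/0x4f0
[ 4044.171096]  ? find_held_lock+0x34/0xa0
[ 4044.171143]  dma_fence_default_wait+0x26e/0x330
[ 4044.171153]  dma_fence_wait_timeout+0x182/0x200
[ 4044.171160]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[ 4044.171263]  amdgpu_dm_do_flip+0x112/0x380 [amdgpu]
"""

for symbol in reliable_frames(SAMPLE_TRACE):
    print(symbol)
```

For the excerpt, this leaves the chain from `schedule` up to `amdgpu_dm_do_flip`, which suggests the page-flip worker is sleeping in a fence wait.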
Comment 1 mikhail.v.gavrilov 2018-11-12 06:00:02 UTC
Created attachment 142440 [details]
yet another dmesg
Comment 2 mikhail.v.gavrilov 2018-11-14 03:59:51 UTC
Unfortunately, this annoying bug is still not fixed in 4.20 rc2.
Comment 3 mikhail.v.gavrilov 2018-11-14 04:00:17 UTC
Created attachment 142458 [details]
dmesg 4.20 rc2
Comment 4 Alex Deucher 2018-11-14 14:52:44 UTC
Does this patch help?
https://patchwork.freedesktop.org/patch/261435/
Comment 5 mikhail.v.gavrilov 2018-11-14 19:35:57 UTC
Alex, unfortunately this patch did not help.

I no longer see messages like these from comment 3:
[ 1136.956119] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:171 vmid:2 pasid:32776, for process SOTTR.exe pid 12574 thread SOTTR.exe pid 12574)
[ 1136.956122] amdgpu 0000:0b:00.0:   in page starting at address 0x00008001802c0000 from 27

but the GPU still hangs with the usual message:
[  390.017999] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=58179, emitted seq=58215
[  390.018001] [drm] GPU recovery disabled.
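
To compare hangs across the attached logs, the ring timeout lines can be reduced to the ring name and the gap between the two sequence numbers. The following is a minimal Python sketch, assuming the message format quoted above. The helper name `parse_ring_timeouts` is hypothetical, and reading the difference as "submitted but not yet signaled" work is my interpretation of the message, not something stated in this report.

```python
import re

# Matches the amdgpu job-timeout lines quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeouts(dmesg_text):
    """Return (ring, signaled, emitted, pending) for each ring timeout line."""
    results = []
    for match in TIMEOUT_RE.finditer(dmesg_text):
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append((match.group("ring"), signaled, emitted, emitted - signaled))
    return results

# The two lines quoted in comment 5.
SAMPLE = """\
[  390.017999] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=58179, emitted seq=58215
[  390.018001] [drm] GPU recovery disabled.
"""

for ring, signaled, emitted, pending in parse_ring_timeouts(SAMPLE):
    print(f"{ring}: signaled={signaled} emitted={emitted} pending={pending}")
```

Applied to a full dmesg attachment, the same function lists every ring that timed out, which makes it easier to see whether only sdma1 is affected or sdma0 as well.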
Comment 6 mikhail.v.gavrilov 2018-11-14 19:36:46 UTC
Created attachment 142467 [details]
dmesg 4.20 rc2 with patch from comment 4
Comment 7 mikhail.v.gavrilov 2018-11-15 16:43:40 UTC
Created attachment 142482 [details]
dmesg 4.20 rc2 with patch from comment 4 (GPU hang again and again)
Comment 8 mikhail.v.gavrilov 2018-11-15 16:46:05 UTC
Oh, I see these messages again (even with the proposed patch and Mesa 18.3.0-rc2):

[ 1784.721401] gmc_v9_0_process_interrupt: 1 callbacks suppressed
[ 1784.721406] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 1784.721409] amdgpu 0000:0b:00.0:   in page starting at address 0x000000010001a000 from 18
[ 1784.721410] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00040152
[ 1795.007321] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=74616, emitted seq=74621
[ 1795.007324] [drm] GPU recovery disabled.
[ 1795.011389] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=233554, emitted seq=233557
[ 1795.011391] [drm] GPU recovery disabled.


$ inxi -bM
System:    Host: localhost.localdomain Kernel: 4.20.0-0.rc2.git0.1.local.fc30.x86_64 x86_64 bits: 64 Desktop: Gnome 3.31.2 
           Distro: Fedora release 30 (Rawhide) 
Machine:   Type: Desktop Mobo: ASUSTeK model: ROG STRIX X470-I GAMING v: Rev 1.xx serial: <root required> 
           UEFI: American Megatrends v: 0901 date: 07/23/2018 
CPU:       8-Core: AMD Ryzen 7 2700X type: MT MCP speed: 2506 MHz min/max: 2200/4000 MHz 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel 
           Display: wayland server: Fedora Project X.org 1.20.3 driver: amdgpu resolution: 3840x2160~60Hz 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.27.0 4.20.0-0.rc2.git0.1.local.fc30.x86_64 LLVM 7.0.0) 
           v: 4.5 Mesa 18.3.0-rc2 
Network:   Device-1: Intel I211 Gigabit Network driver: igb 
           Device-2: Realtek RTL8822BE 802.11a/b/g/n/ac WiFi adapter driver: r8822be 
Drives:    Local Storage: total: 11.36 TiB used: 5.95 TiB (52.4%) 
Info:      Processes: 443 Uptime: 15m Memory: 31.34 GiB used: 15.94 GiB (50.9%) Shell: bash inxi: 3.0.27
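
The VMC page fault lines in this thread come in two flavors: [mmhub] faults with vmid 0 and no process, and [gfxhub] faults attributed to a process. A small script can tally them per hub and VMID when going through the attachments. This is only an illustrative Python sketch that assumes the line format quoted in this report; `parse_page_faults` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the "VMC page fault" lines as they appear in this report.
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] VMC page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), "
    r"for process\s+(?P<process>.*?)\s*pid (?P<pid>\d+)"
)

def parse_page_faults(dmesg_text):
    """Return one dict per page fault line, with numeric fields converted to int."""
    faults = []
    for match in FAULT_RE.finditer(dmesg_text):
        fault = match.groupdict()
        for key in ("src_id", "ring", "vmid", "pasid", "pid"):
            fault[key] = int(fault[key])
        faults.append(fault)
    return faults

# One line from comment 8 and one from comment 5.
SAMPLE = """\
[ 1784.721406] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 1136.956119] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:171 vmid:2 pasid:32776, for process SOTTR.exe pid 12574 thread SOTTR.exe pid 12574)
"""

faults = parse_page_faults(SAMPLE)
# Count faults per (hub, vmid) pair.
print(Counter((fault["hub"], fault["vmid"]) for fault in faults))
```

An empty `process` field corresponds to the faults logged with "for process  pid 0", as in the mmhub lines above.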
Comment 9 mikhail.v.gavrilov 2018-11-22 03:45:00 UTC
Created attachment 142564 [details]
4.20rc3 still freezes
Comment 10 mikhail.v.gavrilov 2018-12-02 12:15:05 UTC
It looks like the problem went away after commit 94f371cb7394.
In Fedora, this is the package 4.20.0-0.rc4.git2.1.fc30.x86_64.
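
To keep track of which kernels were tested, the release strings that appear in this thread can be ordered by their rc and git snapshot numbers. A rough Python sketch follows. The pattern `.rc<N>.git<M>.` is an assumption inferred only from the three strings in this report, and `rc_git_key` is a hypothetical helper, not a Fedora tool.

```python
import re

# Assumed layout, inferred from the strings in this thread: "<version>-0.rc<N>.git<M>.<build>..."
RC_GIT_RE = re.compile(r"\.rc(?P<rc>\d+)\.git(?P<git>\d+)\.")

def rc_git_key(release):
    """Return (rc, git) as integers so release strings sort chronologically."""
    match = RC_GIT_RE.search(release)
    if match is None:
        raise ValueError(f"unrecognized kernel release string: {release}")
    return int(match.group("rc")), int(match.group("git"))

# Kernel release strings quoted in this report.
TESTED = [
    "4.20.0-0.rc4.git2.1.fc30.x86_64",
    "4.20.0-0.rc1.git4.1.fc30.x86_64",
    "4.20.0-0.rc2.git0.1.local.fc30.x86_64",
]

for release in sorted(TESTED, key=rc_git_key):
    print(rc_git_key(release), release)
```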
Comment 11 mikhail.v.gavrilov 2018-12-04 22:04:14 UTC
I was able to reproduce this issue again with Mesa 18.3.0-rc5.
Comment 12 mikhail.v.gavrilov 2018-12-04 22:05:16 UTC
Created attachment 142726 [details]
4.20 g94f371cb7394 + mesa 18.3.0-rc5

