Bug 108309

Summary: Raven Ridge 2700U system lock-up on multiple games
Product: DRI
Component: DRM/AMDgpu
Status: RESOLVED MOVED
Severity: normal
Priority: medium
Version: unspecified
Hardware: Other
OS: All
Reporter: Samantha McVey <samantham>
Assignee: Default DRI bug account <dri-devel>
CC: f.pinamartins, jason.oliveira
Attachments:
Xorg log (flags: none)
dmesg (flags: none)
4.18.12 dmesg log, was using vulkan at the time (flags: none)
4.18.12 Xorg log, was using vulkan at the time (flags: none)
dmesg from 4.19.0-rc7+ (commit bab5c80b2110 I believe) (flags: none)
Journalctl, from starting Steam to Magic Sysrq shutdown (flags: none)

Description Samantha McVey 2018-10-10 02:07:23 UTC
05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15dd] (rev d0)

System locks up partially with Plague Inc. (native Linux game). With Path of Exile (wine) it fully locks up, not able to respond to sysrq.

I haven't gotten any output from the full lockup, but for the partial lock-up, nothing showed up in the logs. Let me know what steps I can take to help diagnose the issue here.

Happens on 4.18.12 and 4.19.0-rc7+ (v4.19-rc7-15-g64c5e530ac2c).
Comment 1 Alex Deucher 2018-10-10 20:13:56 UTC
Please attach your dmesg output and xorg log if using X.
Comment 2 Samantha McVey 2018-10-12 04:40:34 UTC
Created attachment 142000
Xorg log
Comment 3 Samantha McVey 2018-10-12 04:42:02 UTC
Created attachment 142001
dmesg

It took me some time to respond because I waited until I had more RAM, to rule out running out of memory as the cause. The game that had a soft lockup seems to have been RAM-related and now works fine. The other one still locks up the system. I have attached the logs.
Comment 4 Samantha McVey 2018-10-13 00:54:15 UTC
This can be closed. I was able to get things working by adding `idle=nomwait` to my Linux cmdline.
Comment 5 Samantha McVey 2018-10-13 08:30:07 UTC
Created attachment 142013
4.18.12 dmesg log, was using vulkan at the time

Seems I spoke too soon. I am uploading some logs taken with 4.18.12 while using Vulkan. The screen was fully frozen; for example, trying to mute audio did not change the status indicator on my keyboard. Some extracts related to amdgpu are below, and the full dmesg is attached.

Oct 13 01:08:12 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1055330, last emitted seq=1055332
...
Oct 13 01:09:24 kernel: kworker/0:2: page allocation failure: order:10, mode:0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
Oct 13 01:09:24 kernel: kworker/0:2 cpuset=/ mems_allowed=0
Oct 13 01:09:24 kernel: CPU: 0 PID: 192238 Comm: kworker/0:2 Tainted: G         C        4.18.12 #1
Oct 13 01:09:24 kernel: Hardware name: LENOVO 20MUCTO1WW/20MUCTO1WW, BIOS R0WET34W (1.02 ) 07/05/2018
Oct 13 01:09:24 kernel: Workqueue: events do_poweroff
Oct 13 01:09:24 kernel: Call Trace:
Oct 13 01:09:24 kernel:  dump_stack+0x5c/0x7b
Oct 13 01:09:24 kernel:  warn_alloc+0xf7/0x180
Oct 13 01:09:24 kernel:  ? _cond_resched+0x11/0x40
Oct 13 01:09:24 kernel:  __alloc_pages_nodemask+0xee3/0x1030
Oct 13 01:09:24 kernel:  ? __radix_tree_delete+0x7e/0xa0
Oct 13 01:09:24 kernel:  cache_grow_begin+0x77/0x510
Oct 13 01:09:24 kernel:  fallback_alloc+0x15c/0x1f0
Oct 13 01:09:24 kernel:  ? amdgpu_vcn_suspend+0x47/0x80 [amdgpu]
Oct 13 01:09:24 kernel:  __kmalloc+0x1bf/0x240
Oct 13 01:09:24 kernel:  amdgpu_vcn_suspend+0x47/0x80 [amdgpu]
Oct 13 01:09:24 kernel:  amdgpu_device_ip_suspend+0xbd/0x160 [amdgpu]
Oct 13 01:09:24 kernel:  device_shutdown+0x13f/0x1e0
Oct 13 01:09:24 kernel:  kernel_power_off+0x2c/0x60
Oct 13 01:09:24 kernel:  process_one_work+0x1e0/0x3c0
Oct 13 01:09:24 kernel:  worker_thread+0x44/0x3f0
Oct 13 01:09:24 kernel:  kthread+0xf0/0x130
Oct 13 01:09:24 kernel:  ? process_one_work+0x3c0/0x3c0
Oct 13 01:09:24 kernel:  ? kthread_flush_work_fn+0x10/0x10
Oct 13 01:09:24 kernel:  ret_from_fork+0x22/0x40
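For anyone comparing logs, here is a minimal Python sketch (not part of the original report) that pulls the ring name and the two fence sequence numbers out of an `amdgpu_job_timedout` line. The function name `parse_ring_timeout` and the `pending` value (emitted minus signaled) are illustrative assumptions, not anything amdgpu itself reports.

```python
import re

# Matches the amdgpu timeout message with or without the "last " prefix,
# e.g. "ring gfx timeout, last signaled seq=N, last emitted seq=M".
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"(?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # "pending" is a hypothetical helper value: fences emitted but not yet signaled.
    return m.group("ring"), signaled, emitted, emitted - signaled


line = (
    "Oct 13 01:08:12 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, last signaled seq=1055330, last emitted seq=1055332"
)
print(parse_ring_timeout(line))  # ('gfx', 1055330, 1055332, 2)
```

A signaled sequence number that stays the same across repeated timeout messages would suggest the ring is stuck rather than just slow.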
Comment 6 Samantha McVey 2018-10-13 08:31:43 UTC
Created attachment 142014
4.18.12 Xorg log, was using vulkan at the time
Comment 7 Samantha McVey 2018-10-13 11:13:00 UTC
Created attachment 142015
dmesg from 4.19.0-rc7+ (commit bab5c80b2110 I believe)

Here is a log from a kernel built from git a day ago (commit bab5c80b2110). This log and the previous one I posted both seem to show that some part of the system is still responding, since my sysrq attempts were logged. However, trying to hard shutdown/restart with sysrq does not restart the system, and the screen keeps showing the same image from the time of the freeze.

DRM messages at the end of the log:

Oct 12 21:19:22 kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:41:crtc-0] flip_done timed out
Oct 12 21:19:32 kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:41:crtc-0] flip_done timed out
Oct 12 21:19:42 kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out
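As a rough aid for reading logs like the one above, this Python sketch (an illustration, not something from the attached logs) lists which KMS objects reported `flip_done timed out`. The helper name `flip_timeouts` is made up for this example, and the regular expression only covers the CRTC and PLANE forms quoted here.

```python
import re

# Matches e.g. "[CRTC:41:crtc-0] flip_done timed out" or "[PLANE:40:plane-4] ...".
FLIP_TIMEOUT = re.compile(
    r"\[(?P<kind>CRTC|PLANE):(?P<obj_id>\d+):(?P<name>[^\]]+)\] flip_done timed out"
)


def flip_timeouts(lines):
    """Return a (kind, id, name) tuple for every flip_done timeout found in lines."""
    found = []
    for line in lines:
        m = FLIP_TIMEOUT.search(line)
        if m:
            found.append((m.group("kind"), int(m.group("obj_id")), m.group("name")))
    return found


log = [
    "Oct 12 21:19:22 kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] "
    "*ERROR* [CRTC:41:crtc-0] flip_done timed out",
    "Oct 12 21:19:42 kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] "
    "*ERROR* [PLANE:40:plane-4] flip_done timed out",
]
print(flip_timeouts(log))  # [('CRTC', 41, 'crtc-0'), ('PLANE', 40, 'plane-4')]
```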
Comment 8 Samantha McVey 2018-10-13 11:25:52 UTC
Also to clarify:

The first dmesg I uploaded had
`rcu: INFO: rcu_sched detected stalls on CPUs/tasks:` in the log. Adding `idle=nomwait` to the kernel cmdline has so far fully resolved this, along with other messages that sometimes appeared; those did not always reference rcu_sched but seemed to imply that a CPU or thread had stalled. There seem to be several Ryzen errata related to mwait (https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf), so I think that issue is fixed, though the amdgpu/drm issues seem to have remained.
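To confirm that a parameter such as `idle=nomwait` actually reached the running kernel, something like the following Python sketch may help. It assumes a Linux system where the boot command line is exposed at `/proc/cmdline`; the helper names and the example command line are invented for illustration.

```python
from pathlib import Path


def has_kernel_param(cmdline, param):
    """True if param appears as a whole whitespace-separated token in cmdline."""
    return param in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text().strip()


# Made-up command line for demonstration; on a real system,
# pass running_cmdline() instead of this string.
example = "BOOT_IMAGE=/vmlinuz-4.18.12 root=/dev/sda2 ro quiet idle=nomwait"
print(has_kernel_param(example, "idle=nomwait"))  # True
```

Matching on whole tokens avoids false positives from parameters that merely contain the same text.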
Comment 9 Cameron Banfield 2019-04-14 17:26:47 UTC
I can also confirm this issue.

06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev d0)

Lenovo ThinkPad A485
AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx
Linux Mint 19.1
Kernel 5.0.7

Hard lockups of the GPU, requiring the laptop to be power cycled. Interestingly, SSH still works in the background while it's locked up.

It happens randomly when opening Firefox or doing very basic tasks, sometimes even while just sitting idle. However, it crashes 100% of the time when trying to play Cities: Skylines.

[37258.615599] gmc_v9_0_process_interrupt: 10 callbacks suppressed
[37258.615608] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615615] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107805000 from 27
[37258.615619] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[37258.615629] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615633] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107807000 from 27
[37258.615636] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615645] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615648] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107801000 from 27
[37258.615651] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615660] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615663] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107803000 from 27
[37258.615666] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615675] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615678] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107809000 from 27
[37258.615681] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615689] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615692] amdgpu 0000:06:00.0:   in page starting at address 0x000080010780b000 from 27
[37258.615695] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615704] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615707] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107805000 from 27
[37258.615710] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615740] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615743] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107807000 from 27
[37258.615746] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615756] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615759] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107801000 from 27
[37258.615762] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615771] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615774] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107803000 from 27
[37258.615777] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37268.712339] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37268.712387] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37268.712389] [drm] GPU recovery disabled.
[37278.952537] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37278.952624] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37278.952628] [drm] GPU recovery disabled.
[37289.192390] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37289.192478] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37289.192481] [drm] GPU recovery disabled.
[37299.432447] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37299.432534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37299.432538] [drm] GPU recovery disabled.
[37309.676431] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37309.676518] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37309.676522] [drm] GPU recovery disabled.
[37319.912444] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37319.912536] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37319.912541] [drm] GPU recovery disabled.
[37330.156619] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37330.156706] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37330.156710] [drm] GPU recovery disabled.
[37340.392424] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37340.392511] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37340.392515] [drm] GPU recovery disabled.
[37350.632424] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37350.632511] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37350.632514] [drm] GPU recovery disabled.
[37360.872417] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37360.872508] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37360.872511] [drm] GPU recovery disabled.
[37371.112436] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37371.112523] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37371.112527] [drm] GPU recovery disabled.
[37381.352427] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37381.352514] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37381.352517] [drm] GPU recovery disabled.
[37391.592410] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37391.592497] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37391.592500] [drm] GPU recovery disabled.
[37401.836426] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37401.836513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37401.836517] [drm] GPU recovery disabled.
[37412.072433] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37412.072520] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37412.072524] [drm] GPU recovery disabled.
[37422.312442] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37422.312528] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37422.312532] [drm] GPU recovery disabled.
[37432.552428] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37432.552515] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37432.552519] [drm] GPU recovery disabled.
[37442.792418] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37442.792506] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37442.792510] [drm] GPU recovery disabled.
[37453.032397] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37453.032483] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37453.032487] [drm] GPU recovery disabled.
[37463.272534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37463.272621] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37463.272624] [drm] GPU recovery disabled.
[37473.512589] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37473.512676] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37473.512680] [drm] GPU recovery disabled.
[37483.752954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37483.753041] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37483.753044] [drm] GPU recovery disabled.
[37493.992566] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37493.992654] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37493.992657] [drm] GPU recovery disabled.
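Since the excerpt above is long and repetitive, a small Python sketch like this one (illustrative only, not part of the original comment) can condense it by counting VMC page faults per faulting page address. The function name `fault_pages` is an assumption, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the address in "in page starting at address 0x... from 27".
FAULT_ADDR = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")


def fault_pages(lines):
    """Count how often each page address appears in VMC page fault messages."""
    counts = Counter()
    for line in lines:
        m = FAULT_ADDR.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


sample = [
    "[37258.615615] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107805000 from 27",
    "[37258.615633] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107807000 from 27",
    "[37258.615707] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107805000 from 27",
]
for address, count in fault_pages(sample).most_common():
    print(address, count)
```

Run over a full dmesg, this would show whether the faults cluster in a small address range, as they appear to in the excerpt.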
Comment 10 Jason Oliveira 2019-06-07 16:54:35 UTC
Having the same issue, but it seems to be localized to Vulkan. I've confirmed this on a Ryzen 2300U laptop and a desktop Ryzen 2400G. I have a save file in Pac-Man Champion Edition DX+ that is guaranteed to crash the system within 5 seconds of loading a time trial screen on both machines. The same save file works fine on a proprietary nvidia card and on (cherry trail) intel drivers, so this looks like a Vulkan bug exhibiting itself through DXVK on Raven Ridge.

Tested on 19.1.0_rc2 and git current as of this date: both exhibit identical issues in Ubuntu and Gentoo.
Comment 11 Francisco Pina Martins 2019-07-01 22:12:13 UTC
Created attachment 144687
Journalctl, from starting Steam to Magic Sysrq shutdown

I can also confirm this issue, with the same error message in the log.
I can reproduce the issue every single time by trying to start a new game in "Cities: Skylines".

System information:

```
AMD Ryzen 5 2400G

Linux ZenBox 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 x86_64 GNU/Linux

mesa-19.1.1-1
```
Comment 12 Martin Peres 2019-11-19 08:58:00 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/548.
