I've been experiencing random GPU hangs since I upgraded to Threadripper about a year ago.

Specs:
- Motherboard: ASUS Prime X399-A, all BIOS versions from stock until current 0808
- CPU: Threadripper 1950X, 32 threads
- GPU: MSI Radeon RX Vega 64 Air Boost 8G OC (was also happening on ASUS R9 Fury X on the same machine; that GPU was generally stable in the previous box)
- Displays:
  - 2x DELL U2412M 1920x1200x60 (DP)
  - 1x ASUS MG279Q 2560x1440x144 (DP)
- Kernel versions: 4.20, 5.0-rc2 (has been happening since at least 4.14; earlier versions weren't tried).
- linux-firmware: 20181218
- Mesa: 18.3.1
- X: 1.20.3
- libdrm: 2.4.96
- Possibly relevant kernel options: amd_iommu=on vfio-pci.ids=10de:1005,10de:0e1a,1912:0014,1106:3483 iommu=pt vfio-pci.disable_vga=1 hpet=disable nohpet amdgpu.ppfeaturemask=0xfffd7fff amdgpu.gpu_recovery=1 pcie_aspm=off

The problem usually manifests itself like this:
1. Screen suddenly freezes (sometimes it is possible to move the mouse cursor for a few seconds, but it will eventually freeze too)
2. GPU fan speeds up and remains high
3. Every process that talks to the GPU freezes and becomes impossible to kill.
4. I can SSH into the machine, and everything else besides the GPU works OK.
5. dmesg contains a message like this:

[Jan21 00:03] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=17188686, emitted seq=17188689
[ +0.000032] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process X pid 9315 thread X:cs0 pid 9335

or with a bit more stuff happening before:

[Jan18 19:43] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000003] amdgpu 0000:44:00.0: in page starting at address 0x0000800010607000 from 27
[ +0.000002] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0060153D
[ +0.000005] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000002] amdgpu 0000:44:00.0: in page starting at address 0x0000800010609000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000001] amdgpu 0000:44:00.0: in page starting at address 0x0000800010607000 from 27
[ +0.000002] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000001] amdgpu 0000:44:00.0: in page starting at address 0x0000800010609000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000002] amdgpu 0000:44:00.0: in page starting at address 0x0000800010607000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000001] amdgpu 0000:44:00.0: in page starting at address 0x0000800010609000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000001] amdgpu 0000:44:00.0: in page starting at address 0x0000800010607000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000002] amdgpu 0000:44:00.0: in page starting at address 0x0000800010609000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000001] amdgpu 0000:44:00.0: in page starting at address 0x0000800010607000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000001] amdgpu 0000:44:00.0: in page starting at address 0x0000800010609000 from 27
[ +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Jan18 19:44] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=40554, emitted seq=40556
[ +0.000047] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process superposition pid 11225 thread superposit:cs0 pid 11308

6. amdgpu reports near 100% cpu usage and high power draw, even if it was completely idle before the freeze.

If I enable amdgpu.gpu_recovery, then it tries to reset the GPU but fails most of the time:

[ +0.000005] amdgpu 0000:44:00.0: GPU reset begin!
[ +10.230091] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-2] hw_done or flip_done timed out

(there are no further logs)
(I've seen it successfully reset the GPU only *once*, and that obviously required an X restart)

These freezes happen pretty much randomly:
- Sometimes the GPU remains stable for weeks
- It will generally remain stable while just playing games or running benchmarks like Unigine Superposition for many hours
- There have been a couple of freezes when just watching YouTube in Firefox and not doing anything else
- It will sometimes freeze with the GPU being completely idle (but outputs on), while the CPU is at 100%
- It will sometimes freeze when opening shadertoy shaders. Not specific ones, just randomly.
- It will likely freeze within 1-2 hours of streaming using OBS:
  - XSHM is used to grab the 2560x1440 screen at 60fps
  - image downscaled to 1080p60 using whatever OBS does
  - a bunch of minor stuff added to the frame
  - software encoding using x264 medium preset resulting in 10-30% CPU load
  - It can freeze both when doing live shader programming (and the GPU is at 100% with heavy pathtracing compute), and when just editing text in vim.
  - It is still pretty random: sometimes it remains stable through a week of almost daily 2-4 hour streams, but on some days it will freeze 2-3 times within one evening.

This would suggest a hardware issue, but strangely enough I have never experienced this problem on Windows using the same PC. This also prevents me from filing an RMA, because there's no plausible way to reproduce the issue.

Other hardware is stable:
- CPU being 100% busy compiling some huge C++ codebases for hours remains stable
- many-hours memtest doesn't show any errors
- there's also an NVidia GPU installed in this machine that is being passed through to Windows running under qemu. This GPU is also stable under any load.
  - although it was throwing PCI AER errors into dmesg (without any other symptoms). This is believed to be a benign X399 issue, and is suppressed using the pcie_aspm=off kernel parameter
- Loading the entire system to 100% (simultaneously running GPU benchmarks on host and VM, and also compiling something on the CPU) generally doesn't trigger the issue. Adding OBS to that likely does.
- Three different PSUs were used on this system, with no difference in behaviour.

Other things:
- Power management on Linux is significantly different from that on Windows.
  - on Windows idle means idle: all clocks and voltages are as low as pp allows, power draw is ~20W
  - on Linux even idle (nothing is feeding the GPU with any work) will have sclk at 3 (1138MHz 1000mV) and mclk at 3 (max, 945MHz 1100mV), power draw is 40W
- I am unable to dump the BIOS of this card properly on Linux:
  - Both /sys/kernel/debug/dri/0/amdgpu_vbios and /sys/class/drm/card0/device/rom are truncated at 60928
  - Contents are different from what I could dump on Windows, e.g.:

@@ -1,6 +1,6 @@
-00000000: 55aa 77e9 eb02 0000 0000 0000 0000 0000  U.w.............
-00000010: 0000 0000 0000 0000 9c02 0000 0000 4942  ..............IB
-00000020: 4d9d ac8a 0000 0000 0000 0000 0000 0004  M...............
+00000000: 55aa 77e9 eb02 0000 00c0 0000 0000 0000  U.w.............
+00000010: 0000 0000 0044 0000 9c02 0000 0000 4942  .....D........IB
+00000020: 4d43 ac8a 0000 0000 0000 0000 0000 0004  MC..............
 00000030: 2037 3631 3239 3535 3230 0000 0000 0000   761295520......
 00000040: 0000 0000 0000 0000 7402 0000 0000 0000  ........t.......
 00000050: 3132 2f31 322f 3137 2030 313a 3237 0000  12/12/17 01:27..
@@ -38,13 +38,13 @@
 00000250: 315f 4d42 415f 4131 5f48 424d 5f38 4742  1_MBA_A1_HBM_8GB
 00000260: 5f56 3336 3831 305c 636f 6e66 6967 2e68  _V36810\config.h
 00000270: 0000 0090 2800 0202 4154 4f4d 00c0 eb03  ....(...ATOM....
-00000280: 1802 c102 6c01 1e04 0000 0000 6214 8036  ....l.......b..6
+00000280: 1802 c102 6c01 1e04 0000 0030 6214 8036  ....l......0b..6

- Under/over-volting doesn't work: any change to any of the default voltages, however insignificant, results in severe throttling, see https://github.com/RadeonOpenCompute/ROCm/issues/681

Is there anything else I could try? Is there a way to collect more info?

Links to (probably, superficially) similar problems:
- https://bugs.freedesktop.org/show_bug.cgi?id=105733
- https://bugs.freedesktop.org/show_bug.cgi?id=105819
- https://bugs.freedesktop.org/show_bug.cgi?id=109022
- https://bugs.freedesktop.org/show_bug.cgi?id=105251
- https://bugs.freedesktop.org/show_bug.cgi?id=108493
- https://github.com/RadeonOpenCompute/ROCm/issues/348
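
For anyone comparing several hangs, a small script can condense the dmesg output quoted in this report. The following is only a sketch in Python: the regular expressions are written against the exact message text shown above and may not match other kernel versions, and the names `summarize` and `pending` are hypothetical. Treating `emitted - signaled` as the number of fences still outstanding is my assumption about what the two sequence numbers mean.

```python
import re

# Patterns written against the message text quoted in this report (assumption:
# other kernel versions may word these lines differently).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
FAULT_RE = re.compile(r"VMC page fault \(src_id:\d+ ring:\d+ vmid:(?P<vmid>\d+)")
ADDR_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")


def summarize(log_text):
    """Count page faults, collect faulting pages, and list ring timeouts."""
    timeouts = []
    fault_count = 0
    addresses = set()
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            signaled = int(m["signaled"])
            emitted = int(m["emitted"])
            timeouts.append({
                "ring": m["ring"],
                "signaled": signaled,
                "emitted": emitted,
                # Assumption: fences emitted but not yet signaled.
                "pending": emitted - signaled,
            })
            continue
        if FAULT_RE.search(line):
            fault_count += 1
            continue
        m = ADDR_RE.search(line)
        if m:
            addresses.add(m.group(1))
    return {
        "timeouts": timeouts,
        "fault_count": fault_count,
        "addresses": sorted(addresses),
    }


# A few lines copied from the log in this report.
SAMPLE = """\
[Jan18 19:43] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000003] amdgpu 0000:44:00.0: in page starting at address 0x0000800010607000 from 27
[ +0.000002] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0060153D
[ +0.000005] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
[ +0.000002] amdgpu 0000:44:00.0: in page starting at address 0x0000800010609000 from 27
[Jan18 19:44] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=40554, emitted seq=40556
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it on a real log, the output of `dmesg` can be saved to a file and its contents passed to `summarize` in place of `SAMPLE`.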
Created attachment 143176: dmesg-5.0-rc2
I wonder if this is related to your motherboard. I have an ASUS Zenith with a 1950X, 128GB RAM and a Vega 64 LC that has been on kernels 4.20 through 5.0-rc4, the latter of which I'm currently on. I have no extra kernel parameters in my grub file, only the default 'splash quiet'. I have never run into hangs while gaming, watching YouTube, using OBS or compiling the Linux kernel. Just thought I would share my similar configuration. I can only suggest trying to update and/or downgrade your BIOS.
Hey, can you check whether adding amdgpu.vm_debug=1 makes the VMC page faults disappear? Regarding the hang you see during GPU reset: please provide dmesg for it, captured with the command line parameter drm.debug=0xff. You should probably also open a separate ticket for that specific issue.
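
Before collecting the requested logs, it can help to confirm that both parameters actually reached the kernel. Below is a minimal sketch in Python, assuming the parameters are passed on the kernel command line and are visible in /proc/cmdline; the helper names `parse_cmdline` and `missing_params` are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (e.g. 'nohpet') map to None. Quoted values that
    contain spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the wanted parameters that are absent or set differently."""
    current = parse_cmdline(cmdline)
    return {k: v for k, v in wanted.items() if current.get(k) != v}


# The two debug parameters asked for in the comment above.
WANTED = {"amdgpu.vm_debug": "1", "drm.debug": "0xff"}

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""  # not on Linux, or procfs is unavailable
    for key, value in missing_params(cmdline, WANTED).items():
        print(f"not set: {key}={value}")
```

If the script prints nothing, both parameters are present with the requested values.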
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/682.