Description
Adrià Cereto i Massagué
2018-02-26 09:39:09 UTC
Getting the exact same issue with my vega 56, system hangs when I log in to lightdm, fans spin up and just get louder and louder, shutting down doesn't work. Reverting to 4.15 didn't seem to fix the issue either, even though it was working fine before upgrading to 4.16 Also happens on Manjaro KDE with kernels 4.16 through the latest 4.17rc: [ 8164.289086] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289091] amdgpu 0000:38:00.0: at page 0x000000010d203000 from 27 [ 8164.289093] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00101031 [ 8164.289099] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289101] amdgpu 0000:38:00.0: at page 0x000000010d205000 from 27 [ 8164.289103] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289109] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289110] amdgpu 0000:38:00.0: at page 0x000000010d20b000 from 27 [ 8164.289112] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289118] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289119] amdgpu 0000:38:00.0: at page 0x000000010d20d000 from 27 [ 8164.289121] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289126] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289128] amdgpu 0000:38:00.0: at page 0x000000010d201000 from 27 [ 8164.289129] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289135] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289136] amdgpu 0000:38:00.0: at page 0x000000010d207000 from 27 [ 8164.289138] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289143] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289145] amdgpu 0000:38:00.0: at page 0x000000010d209000 from 27 [ 
8164.289146] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289152] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289153] amdgpu 0000:38:00.0: at page 0x000000010d201000 from 27 [ 8164.289154] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289160] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289161] amdgpu 0000:38:00.0: at page 0x000000010d20e000 from 27 [ 8164.289163] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8164.289168] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) [ 8164.289170] amdgpu 0000:38:00.0: at page 0x000000010d212000 from 27 [ 8164.289171] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 [ 8174.340966] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=401175, last emitted seq=401177 [ 8174.340974] [drm] No hardware hang detected. Did some blocks stall? Vega8 / Ryzen 2400G btw. amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768) amdgpu 0000:38:00.0: at page 0x000000010760d000 from 27 amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00101031 Got the issue on kernel 4.17-rc6 with Mesa 18.2 built against LLVM 7.0. 2400G with Vega 11 Graphics. Is there any additional info we need to get? Anything we can test? My system is currently unusable until this is fixed and it has been 3 months since being reported and haven't heard anything but more reports It seems I'm now affected by this bug too... 
Hardware: GPU: RX Vega 64 Liquid CPU: Ryzen R7 1800X Software: OS: OpenSUSE Tumbleweed Kernel: 4.17rc5 (from OpenSUSE Factory repos) Mesa: 18.1.0 (from OpenSUSE Tumbleweed repos) Kernel log - "journalctl -b -1 -r | grep amdgpu": May 31 20:38:04 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=2, last emitted seq=3 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) 
May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x001013BD May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: at page 0x00000005000c0000 from 27 May 31 20:37:54 kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:222 vmid:1 pasid:32768) May 31 20:35:48 kernel: [drm] Initialized amdgpu 3.25.0 20150101 for 0000:0d:00.0 on minor 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0 May 31 20:35:48 
kernel: amdgpu 0000:0d:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0 May 31 20:35:48 kernel: amdgpu 0000:0d:00.0: fb0: amdgpudrmfb frame buffer device May 31 20:35:48 kernel: fbcon: amdgpudrmfb (fb0) is primary device May 31 20:35:47 kernel: [drm] amdgpu: 8176M of GTT memory ready. May 31 20:35:47 kernel: [drm] amdgpu: 8176M of VRAM memory ready May 31 20:35:47 kernel: amdgpu 0000:0d:00.0: GTT: 512M 0x000000F600000000 - 0x000000F61FFFFFFF May 31 20:35:47 kernel: amdgpu 0000:0d:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used) May 31 20:35:47 kernel: [drm] add ip block number 6 <gfx_v9_0> May 31 20:35:47 kernel: amdgpu 0000:0d:00.0: enabling device (0006 -> 0007) May 31 20:35:47 kernel: fb: switching to amdgpudrmfb from EFI VGA May 31 20:35:47 kernel: [drm] amdgpu kernel modesetting enabled. VMC Page faults are now in the log always, but "amdgpu_job_timeout" is persistent: May 31 20:38:04 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=2, last emitted seq=3 I discovered that the cause of this for me was pywal, when I ran it my gpu hung, but if I didn't run it, it was otherwise fine. Another cause is cemu through wine with mesa_mild For me its hang immediately at boot (as soon as Xorg loads). 
The only way I was able to boot the machine successfully was by setting "NoAccel" "True" in Xorg.conf.d/10-amdgpu.conf. In some cases there is nothing in dmesg or Xorg.0.log; the machine just hangs with a "cursor" on the screen.

So one more update: my boot issue went away after updating:
- kernel-firmware to 20180525 (as there were some amdgpu firmware updates in 20180518).
- libLLVM6 from 6.0.0rc1 to 6.0.0 (I strongly suspect this was the cause, as I had VM page fault issues before with libLLVM5, but only in some OpenGL applications, not at boot).

Hi everybody, first of all, please add logs as attachments rather than inline in the bug report. Then make sure that the firmware files are up to date. It looks like we accidentally released corrupted firmware files once, but those should already have been replaced with working versions.

Created attachment 140645 [details]
Complete DMESG from boot to lockup
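Since several reports in this thread paste long runs of near-identical fault lines, a small summary can make an attached dmesg easier to compare across machines. The following is a minimal sketch, assuming the message shapes quoted in this report ("at page ..." in the earlier kernels, "at address ..." in later ones); the function name `summarize` and the `SAMPLE` text are illustrative only, and other kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Patterns follow the message shapes quoted in this report (assumption:
# other kernel versions may format these lines differently).
FAULT_ADDR = re.compile(
    r"amdgpu (\S+): at (?:page|address) (0x[0-9a-fA-F]+) from (\d+)"
)
FAULT_STATUS = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:(0x[0-9a-fA-F]+)")
RING_TIMEOUT = re.compile(
    r"ring (\w+) timeout, last signaled seq=(\d+), last emitted seq=(\d+)"
)


def summarize(log_text):
    """Count faulting addresses and status values, and list ring timeouts."""
    addresses = Counter(m.group(2) for m in FAULT_ADDR.finditer(log_text))
    statuses = Counter(m.group(1) for m in FAULT_STATUS.finditer(log_text))
    timeouts = [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in RING_TIMEOUT.finditer(log_text)
    ]
    return addresses, statuses, timeouts


# A few lines copied from the first log in this report, used as sample input.
SAMPLE = """\
[ 8164.289086] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[ 8164.289091] amdgpu 0000:38:00.0: at page 0x000000010d203000 from 27
[ 8164.289093] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
[ 8164.289101] amdgpu 0000:38:00.0: at page 0x000000010d205000 from 27
[ 8164.289103] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 8174.340966] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=401175, last emitted seq=401177
"""

addresses, statuses, timeouts = summarize(SAMPLE)
print(len(addresses), "distinct fault addresses;", len(timeouts), "ring timeout(s)")
```

The sketch only groups and counts what is in the log; it does not attempt to decode the individual bits of the status register.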
My system (Threadripper, Vega 64) started exhibiting the same issue on 4.17. It locks up hard under IO. I have a custom Python script that does NFS IO off my X550 network card and then invokes ImageMagick's convert to generate thumbnails. I wasn't experiencing issues on 4.16. On 4.17.5 it locks up 100% of the time when I run the script, with the attached ciri example dmesg. The only other system change I have made recently is adding the opencl-amd package in a failed attempt to make DaVinci Resolve run (https://aur.archlinux.org/packages/opencl-amd/). My linux-firmware package is linux-firmware-git 20171125.17e6288-1. There might be a better way to get the firmware version from the card itself, but I don't possess such knowledge (yet).

Barry

I upgraded my linux-firmware to 20180606.d114732-1 and it had no effect on the issue. It still locks up when running the script, with the same dmesg.

Did some more testing and found that I can trigger this issue repeatably by using ImageMagick convert to convert and resize a jpg image. Running the same convert with the environment variable MAGICK_OCL_DEVICE=OFF set works without a lockup. Some sort of OpenCL thing?

I think my issue is related. I get black screen boots roughly every 2/3 times I boot up my computer. This last time, it booted up, kernel panicked and I could still see the output, so I took some pictures.
https://imgur.com/gallery/T69zIjX

Info:

$ uname -a
Linux itx-dev.local 4.17.12-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 3 07:16:41 UTC 2018 x86_64 GNU/Linux

$ cat /proc/cpuinfo
processor : 5
vendor_id : AuthenticAMD
cpu family : 23
model : 17
model name : AMD Ryzen 5 2400G with Radeon Vega Graphics
stepping : 0

$ pacman -Qs amdgpu
local/xf86-video-amdgpu 18.0.1-2 (xorg-drivers)
    X.org amdgpu video driver

$ pacman -Qs mesa
local/glu 9.0.0-5
    Mesa OpenGL Utility library
local/lib32-libva-mesa-driver 18.1.5-1
    VA-API implementation for gallium (32-bit)
local/lib32-mesa 18.1.5-1
    An open-source implementation of the OpenGL specification (32-bit)
local/lib32-vulkan-radeon 18.1.5-1
    Radeon's Vulkan mesa driver (32-bit)
local/libva-mesa-driver 18.1.5-1
    VA-API implementation for gallium
local/mesa 18.1.5-1
    An open-source implementation of the OpenGL specification
local/mesa-vdpau 18.1.5-1
    Mesa VDPAU drivers
local/vulkan-radeon 18.1.5-1
    Radeon's Vulkan mesa driver

Hi everyone, I've tried with latest kernel and latest VEGA10 firmware and wasn't able to reproduce this problem.

From the logs it seems all of you are running 4.17.x kernel or earlier - try latest 4.18 and latest firmware from here -

https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

(In reply to Andrey Grodzovsky from comment #16)
> Hi everyone, I've tried with latest kernel and latest VEGA10 firmware and
> wasn't able to reproduce this problem.
>
> From the logs it seems all of you are running 4.17.x kernel or earlier - try
> latest 4.18 and latest firmware from here -
>
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

Hi,

I can reproduce this every time, on kernel 4.18 with mesa 18.3 and a Vega64.

Simply try to open Mario Kart 8 in Cemu with wine, and the system will crash with the exact same dmesg.
(In reply to CheatCodesOfLife from comment #17) > (In reply to Andrey Grodzovsky from comment #16) > > Hi everyone, I've tried with latest kernel and latest VEGA10 firmware and > > wasn't able to reproduce this problem. > > > > From the logs it seems all of you are running 4.17.x kernel or earlier - try > > latest 4.18 and latest firmware form here - > > > > https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next > > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ > > Hi, > > I can reproduce this every time, on kernel 4.18 with mesa 18.3 and a Vega64. > > Simply try to open Mario Kart 8 in Cemu with wine, and the system will crash > with the exact same dmesg. I had mesa 18.2 so I updated to 18.3 - still nothing. Could you provide glxinfo dump ? What LLVM are you using ? I have 7. (In reply to Andrey Grodzovsky from comment #18) > (In reply to CheatCodesOfLife from comment #17) > > (In reply to Andrey Grodzovsky from comment #16) > > > Hi everyone, I've tried with latest kernel and latest VEGA10 firmware and > > > wasn't able to reproduce this problem. > > > > > > From the logs it seems all of you are running 4.17.x kernel or earlier - try > > > latest 4.18 and latest firmware form here - > > > > > > https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next > > > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ > > > > Hi, > > > > I can reproduce this every time, on kernel 4.18 with mesa 18.3 and a Vega64. > > > > Simply try to open Mario Kart 8 in Cemu with wine, and the system will crash > > with the exact same dmesg. > > I had mesa 18.2 so I updated to 18.3 - still nothing. Could you provide > glxinfo dump ? What LLVM are you using ? I have 7. I have had this problem with mesa 18.2 and LLVM7. Currently on mesa 18.3 and LLVM8. I also had this result with a Vega56, and I know people online who have the same problem. 
Nobody can open Mario Kart 8 in Cemu with wine if they have a Vega card. I've attached my glxinfo > glxinfo.txt Created attachment 141210 [details]
glxinfo dump as requested
(In reply to CheatCodesOfLife from comment #17) > (In reply to Andrey Grodzovsky from comment #16) > > Hi everyone, I've tried with latest kernel and latest VEGA10 firmware and > > wasn't able to reproduce this problem. > > > > From the logs it seems all of you are running 4.17.x kernel or earlier - try > > latest 4.18 and latest firmware form here - > > > > https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next > > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ > > Hi, > > I can reproduce this every time, on kernel 4.18 with mesa 18.3 and a Vega64. > > Simply try to open Mario Kart 8 in Cemu with wine, and the system will crash > with the exact same dmesg. I had mesa 18.2 so I updated to 18.3 - still nothing. Could you provide glxinfo dump ? What LLVM are you using ? I have 7.(In reply to CheatCodesOfLife from comment #20) > Created attachment 141210 [details] > glxinfo dump as requested Thanks for the info, is there any other way you reproduce it without the wine platform ? You're welcome. Not the exact same problem, no. I can get a hard-lock by trying to use amdvlk to play rpcs3, but it doesn't produce the same error and it's not as consistent (takes up to 15 minutes to crash) Not sure if it's worth noting but I went back and tried every Cemu version back to 1.5 and a lot of wine versions going back to 2.8. It happens every time as soon as the game loads. (In reply to CheatCodesOfLife from comment #22) > You're welcome. > > Not the exact same problem, no. I can get a hard-lock by trying to use > amdvlk to play rpcs3, but it doesn't produce the same error and it's not as > consistent (takes up to 15 minutes to crash) > > Not sure if it's worth noting but I went back and tried every Cemu version > back to 1.5 and a lot of wine versions going back to 2.8. It happens every > time as soon as the game loads. 
Let's try to get some debug info for the VMC page fault then -

Clone and build our open source register analyzer from here - https://cgit.freedesktop.org/amd/umr/
Install trace-cmd utility
Load driver with cmd line parameter amdgpu.vm_fault_stop=2 from grub
P.S Best to use latest kernel from here - https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

After desktop is loaded type

sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
to enable kernel event tracing log

If possible to launch the game from shell then prepend the command with
GALLIUM_DDEBUG=always
to dump all the MESA commands into files in ~/ddebug_dumps/

Start the game. When the problem happens do the following -

as root
cd /sys/kernel/debug/tracing && cat trace > event_dump

as normal user or root
sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump
dmesg > dmesg_dump

Upload a tar/zip of all those files + all the files from ~/ddebug_dumps/

(In reply to Andrey Grodzovsky from comment #23)
> (In reply to CheatCodesOfLife from comment #22)
> > You're welcome.
> >
> > Not the exact same problem, no. I can get a hard-lock by trying to use
> > amdvlk to play rpcs3, but it doesn't produce the same error and it's not as
> > consistent (takes up to 15 minutes to crash)
> >
> > Not sure if it's worth noting but I went back and tried every Cemu version
> > back to 1.5 and a lot of wine versions going back to 2.8. It happens every
> > time as soon as the game loads.
> > Let's try to get some debug info for the VMC page fault then - > > Clone and build our open source register analyzer from here - > https://cgit.freedesktop.org/amd/umr/ > Install trace-cmd utility > Load driver with cmd line parameter amdgpu.vm_fault_stop=2 from grub > P.S Best to use latest kernel from here - > https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next > > After desktop is loaded type > > sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e > "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv" > to enable kernel event tracing log > > If possible to launch the game from shell then prepend the command with > GALLIUM_DDEBUG=always > to dump all the MESA commands into files in ~/ddebug_dumps/ > > > Start the game. When the problem happens do the following - > > as root > cd /sys/kernel/debug/tracing && cat trace > event_dump > > as normal user or root > sudo umr -lb > umr_dump > sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump > sudo umr -O halt_waves,use_colour -wa >> umr_dump > dmesg > dmesg_dump > > Upload a tar/zip of all those files + all the files from ~/ddebug_dumps/ Thanks for the instructions. I think I've followed them correctly. I didn't build the amd-drm-next kernel as it'll be an overnight job (slow internet speeds) but I did add the grub parameters. I have attached the files. Created attachment 141269 [details]
debug files
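Because several later attachments in this thread turned out to be missing one file or another, the post-hang capture steps from comment #23 could be wrapped so that they always run in the same order and land in one directory. This is only a sketch under stated assumptions: it must be run as root (it replaces the `sudo` prefixes), `umr` must be on the PATH, and the names `CAPTURE_STEPS` and `capture` are hypothetical. The commands and output file names themselves are the ones given in comment #23.

```python
import subprocess
from pathlib import Path

# (argv, output file, open mode) for each step from comment #23.
# "wb" mirrors the shell's ">" and "ab" mirrors ">>".
CAPTURE_STEPS = [
    (["cat", "/sys/kernel/debug/tracing/trace"], "event_dump", "wb"),
    (["umr", "-lb"], "umr_dump", "wb"),
    (["umr", "-O", "verbose,use_colour", "-R", "gfx[.]"], "umr_dump", "ab"),
    (["umr", "-O", "halt_waves,use_colour", "-wa"], "umr_dump", "ab"),
    (["dmesg"], "dmesg_dump", "wb"),
]


def capture(out_dir, steps=CAPTURE_STEPS, runner=subprocess.run):
    """Run each step in order, redirecting stdout into files under out_dir.

    Returns the output file names in the order they were first written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for argv, name, mode in steps:
        with open(out_dir / name, mode) as fh:
            # stderr is not redirected, so tool diagnostics stay on the
            # terminal; check=False keeps going if one step fails.
            runner(argv, stdout=fh, check=False)
        if name not in written:
            written.append(name)
    return written
```

A call such as `capture("amd_debug")` after the hang (for example over SSH) would then leave `event_dump`, `umr_dump` and `dmesg_dump` in one place, ready to be archived together with the files from ~/ddebug_dumps/.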
(In reply to CheatCodesOfLife from comment #25)
> Created attachment 141269 [details]
> debug files

Thanks a lot, I will find some time in the next few days to analyze it.

(In reply to CheatCodesOfLife from comment #25)
> Created attachment 141269 [details]
> debug files

Since your kernel build doesn't have the latest AMD code I don't have ALL the trace logs, so I can't be certain, but it does look like the address reported by the GPU fault is a bad address: it's above any VA range seen in the logs.

I would need you to run Cemu.exe with the GALLIUM_DDEBUG=always environment variable and upload the logs from ~/ddebug_dumps/

From googling it looks like WINE will pass down any ENVs picked up from the shell to the apps it runs, so it should be easy - just run GALLIUM_DDEBUG=always 'WINE launch commands' from the shell.

Also provide all the other logs like last time.

Also please verify your MESA build includes the following fix - https://cgit.freedesktop.org/mesa/mesa/commit/id=c5c6e0187fd5d535c304ca3fd62de0f5e636c0c2

I assume you are running WINE with MESA ?

Sorry, this link https://cgit.freedesktop.org/mesa/mesa/commit/?id=c5c6e0187fd5d535c304ca3fd62de0f5e636c0c2

Created attachment 141276 [details]
logs/trace with amd-drm-next and GALLIUM_DDEBUG=always
(In reply to Andrey Grodzovsky from comment #29)
> Sorry , this link
> https://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=c5c6e0187fd5d535c304ca3fd62de0f5e636c0c2

Yeah, I am using mesa.

I've set up the amd-drm-next kernel.

This is the command I used to launch Cemu:
GALLIUM_HUD="fps" GALLIUM_DDEBUG=always wine64 Cemu.exe
(I switched on the GALLIUM_HUD as well so that I could verify that wine was receiving the ENVs, which it is.)

uname -a
Linux nihonium2 4.18.0-rc1-5024f8dfe478 #1 SMP PREEMPT Sat Aug 25 05:10:49 AEST 2018 x86_64 GNU/Linux

The logs are attached (this time it took 3 tries to actually launch Cemu due to an unrelated issue, so the archive is 14 MB).

Looks like dmesg is missing. Can you recover the correct dmesg log for this last reproduction ? The bad address is there.

Created attachment 141277 [details]
amd3.tar.gz dmesg, trace, ddebug logs
Sorry about that.
I don't have that dmesg any more but I did the whole process again and attached it. This time I have confirmed all the files are in the archive.
(In reply to CheatCodesOfLife from comment #33)
> Created attachment 141277 [details]
> amd3.tar.gz dmesg, trace, ddebug logs
>
> Sorry about that.
> I don't have that dmesg any more but I did the whole process again and
> attached it. This time I have confirmed all the files are in the archive.

But where is the trace file ? :) Anyway, I will try to check with what I have.

(In reply to Andrey Grodzovsky from comment #34)
> (In reply to CheatCodesOfLife from comment #33)
> > Created attachment 141277 [details]
> > amd3.tar.gz dmesg, trace, ddebug logs
> >
> > Sorry about that.
> > I don't have that dmesg any more but I did the whole process again and
> > attached it. This time I have confirmed all the files are in the archive.
>
> But where is the trace file ? :) Anyway, I will try to check with what I
> have.

Anyway, it doesn't matter. Could you please redo the capture, and this time use GALLIUM_DDEBUG=1000 instead of GALLIUM_DDEBUG=always ? This way we can get one big dump file with all the info when the VM_FAULT happens.

Created attachment 141303 [details]
ddebug_dumps/Cemu.exe_2244_00000000 dmesg_dump event_dump umr_dump
Hi,
This time I double-checked the tar archive: the trace, dmesg, umr and ddebug files are all there. It's just one ddebug file this time, as you said, but it's only 368 KB.
Command I used was:
GALLIUM_HUD="fps" GALLIUM_DDEBUG=1000 wine64 Cemu.exe
Created attachment 141323 [details] [review]
patch - fix ddebug BO list reporting

Hi,

Can you please get a new ddebug report with the attached patch? Thanks.

(In reply to Marek Olšák from comment #37)
> Created attachment 141323 [details] [review]
> patch - fix ddebug BO list reporting
>
> Hi,
>
> Can you please get a new ddebug report with the attached patch? Thanks.

Just to be clear, you need to rebuild your mesa library with that patch on top.

Created attachment 141342 [details]
logs after building the patched mesa
Hi,
Thanks for the logging patch.
I applied that patch to the latest master branch from the mesa GitHub page, built it, and ran the game again with the new version.
The logs are attached.
Marek Olšák, I still don't see the expected debug output. I looked for 'Buffer list'

CheatCodesOfLife, can you verify please you are running the patched version of MESA ? We tested yesterday the new prints and they do show on VM_FAULTs.

(In reply to Andrey Grodzovsky from comment #40)
> Marek Olšák, I still don't see the expected debug output. I looked for
> 'Buffer list'
> CheatCodesOfLife, can you verify please you are running the patched version
> of MESA ? We tested yesterday the new prints and they do show on VM_FAULTs.

This is most likely my fault as I'm new to most of this sort of thing. This is what I did, maybe you'll see where I went wrong:

- Patch
This is the patched version of src/gallium/drivers/radeonsi/si_gfx_cs.c
http://termbin.com/ypet

- Build
I installed this build of mesa to a different prefix, rather than overriding my system install (I use this computer for work, everything).

System install:
glxinfo |grep Mesa\ 18
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.3.0-devel (git-e345247092)
OpenGL version string: 4.4 (Compatibility Profile) Mesa 18.3.0-devel (git-e345247092)
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 18.3.0-devel (git-e345247092)

New build:
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.3.0-devel (git-a72dbc461b)
OpenGL version string: 4.4 (Compatibility Profile) Mesa 18.3.0-devel (git-a72dbc461b)
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 18.3.0-devel (git-a72dbc461b)

- Running:
I then ran Cemu like this:
LD_LIBRARY_PATH=/home/paul/mesa_log/lib/ GALLIUM_HUD="fps" GALLIUM_DDEBUG=1000 wine64 Cemu.exe

I know wine lets you do this because this is how we used to use a fork of mesa called 'mesa_mild' to get the required compatibility profile prior to mesa 18.2 which provided core compatibility 4.4

If installing to a prefix like that isn't adequate for this testing, let me know and I'll re-install the OS on an external drive, do a system-wide install of this patched mesa and try again.
The file is incomplete, but I don't know why. Can you try it again? Maybe it'll be complete next time. It's better to use the REISUB key sequence to reboot the machine. (put it in google)

(In reply to Marek Olšák from comment #42)
> The file is incomplete, but I don't know why. Can you try it again? Maybe
> it'll be complete next time. It's better to use the REISUB key sequence to
> reboot the machine. (put it in google)

Hi Marek,

Yep, I'll do this tonight (including the REISUB to reboot).

In which file should I grep for 'Buffer list' to ensure it's worked before posting here? And is it fine that I've sandboxed the install to /home/paul/mesa_log rather than doing a system install?

If glxinfo picks up the correct driver, it's fine. The ddebug file should contain "Buffer list".

I've just tried it again a couple of times, and this time I'm sitting there tailing (-f) the ddebug file, and nothing is being added to it after GFX_

tail -f ~/ddebug_dumps/Cemu.13f.exe_1990_00000000
HQD_IB_BUSY = 0
CP_CPF_STALLED_STAT1 <- RING_FETCHING_DATA = 1
    INDR1_FETCHING_DATA = 1
    INDR2_FETCHING_DATA = 0
    STATE_FETCHING_DATA = 0
    TCIU_WAITING_ON_FREE = 0
    TCIU_WAITING_ON_TAGS = 0
    UTCL2IU_WAITING_ON_FREE = 0
    UTCL2IU_WAITING_ON_TAGS = 0
GFX_

It's been 10 minutes, that's the end of it. I don't think it's a reboot / flush logs to the filesystem issue since I'm still SSH'd in and following the log file.

No "Buffer list" in the file either :(

I also tried building the latest master branch and applying the patch again, same thing.

On the monitor in the terminal where I ran wine, it says "Hang detection timeout is 1000ms." Not sure if that's relevant.

Created attachment 141377 [details]
amd6.tar.gz and amd7.tar.gz with usual logs, 2 attempts
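Before uploading another archive, it may save a round trip to check locally whether any ddebug dump actually contains the "Buffer list" string that the developers are looking for. The following is a minimal sketch, assuming the dumps are small enough to read whole (the ones mentioned in this thread are a few hundred KB) and live in ~/ddebug_dumps/; the function name `dumps_with_marker` is hypothetical.

```python
from pathlib import Path

# The string the patched Mesa is expected to print into the dump.
MARKER = b"Buffer list"


def dumps_with_marker(dump_dir, marker=MARKER):
    """Map each regular file in dump_dir to whether it contains marker."""
    results = {}
    for path in sorted(Path(dump_dir).expanduser().iterdir()):
        if path.is_file():
            # Read as bytes so a truncated or partly binary dump cannot
            # raise a decoding error.
            results[path.name] = marker in path.read_bytes()
    return results
```

For example, `dumps_with_marker("~/ddebug_dumps")` returns a dictionary keyed by file name; a dump that maps to `False` is one where the buffer list is missing.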
The log is truncated for some reason. Can you apply this to make it shorter?

diff --git a/src/gallium/drivers/radeonsi/si_debug.c b/src/gallium/drivers/radeonsi/si_debug.c
index 5e80469cee1..325e1e3ed01 100644
--- a/src/gallium/drivers/radeonsi/si_debug.c
+++ b/src/gallium/drivers/radeonsi/si_debug.c
@@ -101,6 +101,7 @@ static void si_dump_shader(struct si_screen *sscreen,
                            enum pipe_shader_type processor,
                            const struct si_shader *shader, FILE *f)
 {
+       return;
        if (shader->shader_log)
                fwrite(shader->shader_log, shader->shader_log_size, 1, f);
        else

Created attachment 141425 [details]
logs and trace
Hi,
I have applied the patch, ran through the process and attached the logs. The file doesn't appear to be truncated anymore.
(In reply to CheatCodesOfLife from comment #48)
> Created attachment 141425 [details]
> logs and trace
>
> Hi,
>
> I have applied the patch, ran through the process and attached the logs. The
> file doesn't appear to be truncated anymore.

Looks like there is still no Buffer list in the log...

(In reply to Andrey Grodzovsky from comment #49)
> (In reply to CheatCodesOfLife from comment #48)
> > Created attachment 141425 [details]
> > logs and trace
> >
> > Hi,
> >
> > I have applied the patch, ran through the process and attached the logs. The
> > file doesn't appear to be truncated anymore.
>
> Looks like there is still no Buffer list in the log...

Hi Andrey, sorry for the late reply. I applied the patches and built it as you guys wanted.

Could something to do with the crash be causing the log file to be incomplete? I think the system is pretty unstable after the crash. Apart from all input/output on the desktop going away, I also can't 'reboot' or 'shutdown -h now' (I have to do REISUB). Perhaps something there is affecting the logging?

Anything else you can think of that I can try on my end?

Cheers

One thing you could try is setting the synchronous attribute on the ddebug dump file before the hang:

chattr +S ~/ddebug_dumps/*

Of course, you'll have to wait for the file to be created before doing this.

Created attachment 141522 [details]
Logs + trace with patched mesa, plus example code which consistently triggers crash.
I've been experiencing a random crash which seems a lot like this: the image freezes, the keyboard stops working, the mouse can still be moved for a second and then also freezes, and the "GPU usage" LEDs all light up and the fans spin up.
Oddly enough, while working on a toy OpenGL program I seem to have accidentally found a means of triggering it consistently. I've included the sources in the tarball; I didn't try to narrow down the exact cause, so please pardon any extra fluff which is no doubt in there.
I at least captured a trace which contains the string "Buffer list". I also noticed umr was spitting quite a bit of stuff out to stderr which isn't in the dump; if you want that too let me know.
Some version numbers:
Radeon RX Vega 64
Linux amd-staging-drm-next 4.19.0-rc1-d0a96214993c
Mesa 18.3.0-devel (git-133e12fb69) (with the si_debug.c patch applied)
Created attachment 141524 [details]
vega_crasher Logs + trace with patched mesa
Hi Michel,
Even with the chattr +S command, the buffer list is not present :(
I also ran the vega_crasher from zzyxpaw and am able to reproduce that on my system. I have attached the output and it includes the Buffer Lists.
For some reason, when Cemu + Mario Kart 8 crashes, the file gets truncated, but when the vega_crasher tool crashes, the files are not truncated. This leads me to believe I'm not doing anything wrong? lol.
Other than that, the symptoms are the same. Mouse moves for a little while then it stops.
Oh and my umr spits out a lot of things to stderr as well, with both this and the MK8 crash. Let me know if you want this.

Any updates to this? I can still reproduce with the latest amd-drm-next kernel.

I am still able to reproduce this, as is everybody with a Vega in the #linux channel in the Cemu discord server. Someone with a Vega8 has also reproduced it.

Hello there, I am seeing the same problems on my Ryzen 2700U which is freezing as well, with the latest Ubuntu kernel: 4.19.0-041900-generic. I am luckily able to ssh back in and try to shut it down, but it won't completely.
This is what I see on the kern.log, and in my case the whole graphics is frozen. I am running linuxmint cinnamon 19, with the latest Ubuntu kernel.
Oct 28 11:30:58 antonioRyzen kernel: [22639.758782] gmc_v9_0_process_interrupt: 10 callbacks suppressed
Oct 28 11:30:58 antonioRyzen kernel: [22639.758789] amdgpu 0000:02:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process cinnamon pid 1459 thread amdgpu_cs:0 pid 1463
Oct 28 11:30:58 antonioRyzen kernel: [22639.758789] )
Oct 28 11:30:58 antonioRyzen kernel: [22639.758797] amdgpu 0000:02:00.0: at address 0x00000001010e1000 from 27
Oct 28 11:30:58 antonioRyzen kernel: [22639.758801] amdgpu 0000:02:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00701031
Oct 28 11:30:58 antonioRyzen kernel: [22639.758818] amdgpu 0000:02:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process cinnamon pid 1459 thread amdgpu_cs:0 pid 1463
Oct 28 11:30:58 antonioRyzen kernel: [22639.758818] )
Oct 28 11:30:58 antonioRyzen kernel: [22639.758822] amdgpu 0000:02:00.0: at address 0x00000001010e0000 from 27
Oct 28 11:30:58 antonioRyzen kernel: [22639.758825] amdgpu 0000:02:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 28 11:30:58 antonioRyzen kernel: [22639.758834] amdgpu 0000:02:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process cinnamon pid 1459 thread amdgpu_cs:0 pid 1463
Oct 28 11:30:58 antonioRyzen kernel: [22639.758834] )
Oct 28 11:30:58 antonioRyzen kernel: [22639.758839] amdgpu 0000:02:00.0: at address 0x00000001010e0000 from 27
Oct 28 11:30:58 antonioRyzen kernel: [22639.758841] amdgpu 0000:02:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 28 11:30:58 antonioRyzen kernel: [22639.758850] amdgpu 0000:02:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process cinnamon pid 1459 thread amdgpu_cs:0 pid 1463
Oct 28 11:30:58 antonioRyzen kernel: [22639.758850] )
Oct 28 11:30:58 antonioRyzen kernel: [22639.758853] amdgpu 0000:02:00.0: at address 0x00000001010e0000 from 27
Oct 28 11:30:58 antonioRyzen kernel: [22639.758855] amdgpu 0000:02:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 28 11:30:58 antonioRyzen kernel: [22639.758863] amdgpu 0000:02:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process cinnamon pid 1459 thread amdgpu_cs:0 pid 1463
Oct 28 11:30:58 antonioRyzen kernel: [22639.758863] )
Oct 28 11:30:58 antonioRyzen kernel: [22639.758867] amdgpu 0000:02:00.0: at address 0x00000001010e0000 from 27
Oct 28 11:30:58 antonioRyzen kernel: [22639.758869] amdgpu 0000:02:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 28 11:30:58 antonioRyzen kernel: [22639.758877] amdgpu 0000:02:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process cinnamon pid 1459 thread amdgpu_cs:0 pid 1463
...
Oct 28 11:30:58 antonioRyzen kernel: [22639.758931] )
Oct 28 11:30:58 antonioRyzen kernel: [22639.758935] amdgpu 0000:02:00.0: at address 0x00000001010e1000 from 27
Oct 28 11:30:58 antonioRyzen kernel: [22639.758937] amdgpu 0000:02:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 28 11:31:08 antonioRyzen kernel: [22649.811916] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1445040, emitted seq=1445042
Oct 28 11:31:08 antonioRyzen kernel: [22649.811925] [drm] GPU recovery disabled.
Oct 28 11:33:58 antonioRyzen kernel: [22820.391906] INFO: task kworker/u32:1:19683 blocked for more than 120 seconds.
Oct 28 11:33:58 antonioRyzen kernel: [22820.391914] Not tainted 4.19.0-041900-generic #201810221809
Oct 28 11:33:58 antonioRyzen kernel: [22820.391917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 28 11:33:58 antonioRyzen kernel: [22820.391920] kworker/u32:1 D 0 19683 2 0x80000000
Oct 28 11:33:58 antonioRyzen kernel: [22820.391943] Workqueue: events_unbound commit_work [drm_kms_helper]
Oct 28 11:33:58 antonioRyzen kernel: [22820.391945] Call Trace:
Oct 28 11:33:58 antonioRyzen kernel: [22820.391956] __schedule+0x29e/0x840
Oct 28 11:33:58 antonioRyzen kernel: [22820.391959] schedule+0x2c/0x80
Oct 28 11:33:58 antonioRyzen kernel: [22820.391962] schedule_timeout+0x258/0x360
Oct 28 11:33:58 antonioRyzen kernel: [22820.392050] ? optc1_get_crtc_scanoutpos+0x69/0xa0 [amdgpu]
Oct 28 11:33:58 antonioRyzen kernel: [22820.392062] dma_fence_default_wait+0x20a/0x280
Oct 28 11:33:58 antonioRyzen kernel: [22820.392065] ? dma_fence_release+0xa0/0xa0
Oct 28 11:33:58 antonioRyzen kernel: [22820.392068] dma_fence_wait_timeout+0xe7/0x110
Oct 28 11:33:58 antonioRyzen kernel: [22820.392071] reservation_object_wait_timeout_rcu+0x201/0x340
Oct 28 11:33:58 antonioRyzen kernel: [22820.392140] ? amdgpu_get_vblank_counter_kms+0x111/0x160 [amdgpu]
Oct 28 11:33:58 antonioRyzen kernel: [22820.392222] amdgpu_dm_do_flip+0x12c/0x370 [amdgpu]
Oct 28 11:33:58 antonioRyzen kernel: [22820.392305] amdgpu_dm_atomic_commit_tail+0x7ac/0xea0 [amdgpu]

Hi Adrià,
Are you getting this by trying to run MK8 in Cemu? Or some other way?
Could you try running through Andrey's instructions here: https://bugs.freedesktop.org/show_bug.cgi?id=105251#c23
That didn't work on my system (log was truncated) so they've kinda stopped looking into it, but if you could get them the complete log maybe they'll find something?
I have the same thing where I can ssh back in, but can't fully `shutdown -h now` as it hangs part of the way through. You can reboot more gracefully by doing this: http://blog.kember.net/articles/reisub-the-gentle-linux-restart/

It's not just Cemu, it looks like it happens in Yuzu too. If you Google for "VMC page fault" then you'll find people running into that error in various other programs too.
Personally, this is what I got when running MK8 in Cemu:
==============
[Sat Nov 17 22:29:43 2018] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:155 vmid:3 pasid:32769, for process Cemu.exe pid 963 thread Cemu.exe:cs0 pid 1035)
at address 0x000080014189a000 from 27
VM_L2_PROTECTION_FAULT_STATUS:0x00301137
(repeated 7 times over the space of a few minutes)
[Sat Nov 17 22:32:44 2018] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1020, emitted seq=1023
==============
And then this when trying to run Super Mario Odyssey in Yuzu:
==============
[Sat Nov 17 22:47:26 2018] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:156 vmid:3 pasid:32769, for process yuzu pid 960 thread yuzu:cs0 pid 972
at address 0x000080bd27743000 from 27
VM_L2_PROTECTION_FAULT_STATUS:0x00301138
[Sat Nov 17 22:47:36 2018] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=24703, emitted seq=24704
==============
I'll look into getting the info dump that was requested earlier in the thread to see if that helps, but the seemingly abandoned state of this bug is rather concerning.

Created attachment 142505 [details]
Gallium, UMR, Dmesg Dump Package
Ok, following Andrey Grodzovsky's instructions to get the dumps didn't work for Yuzu but it did for Cemu.
In the case of Yuzu, it looks like it thought the GPU had hung well before it actually did. In fact, almost immediately. It then freaked out and went downhill from there, to the point where the program only appeared very briefly (well before it gets to the point where it hangs).
=================================
Gallium debugger active. Logging all calls.
Hang detection timeout is 1000ms.
GPU hang detected, collecting information...
Draw #   driver   prev BOP   TOP   BOP   dump file
-------------------------------------------------------------
     0   NO       NO         NO    NO    /home/arcade/ddebug_dumps/yuzu_894_00000000
Cannot open DRI name under debugfs: Permission denied
Cannot open DRI name under debugfs: Permission denied
Cannot open DRI name under debugfs: Permission denied
Done.
dd: Aborting the process...
Segmentation fault (core dumped)
=================================
Luckily Cemu seemed to be successful:
=================================
Gallium debugger active. Logging all calls.
Hang detection timeout is 1000ms.
GPU hang detected, collecting information...
Draw #   driver   prev BOP   TOP   BOP   dump file
-------------------------------------------------------------
 14626   YES      NO         NO    NO    /home/arcade/ddebug_dumps/Cemu.exe_1158_00014629
=================================
All the requested debug files are attached inside the archive for the Cemu attempt.
Created attachment 142797 [details]
New Error Since 4.19.X
Ok, so some time between my last report and now (started happening since 4.19.something, I don't know which version specifically) this problem has changed in how it manifests itself. Previously you'd get the "[gfxhub] VMC page fault" messages. Now it manifests itself in considerably more serious looking errors (with none of the "VMC page faults" in sight). Log file attached.
Once something triggers this, the card will become basically unresponsive and anything that tries to use it will start throwing more of the errors seen in the attached log.
It's not random though. For example I can run Unigine valley/superposition or Elder Scrolls Online (via Wine+DXVK) for as long as I like, stress-testing or benchmarking, and it'll be fine. But as soon as I try one of the problem programs, it'll basically "break" the graphics card until I hard reset.
(In reply to Benjamin Hodgetts from comment #61)
> [...] this problem has changed in how it manifests itself. Previously you'd get
> the "[gfxhub] VMC page fault" messages. Now it manifests itself in considerably
> more serious looking errors (with none of the "VMC page faults" in sight).

That might be a different issue, please file a separate report about it.

Hi
I came across this bug report that might be related to my bug report #109022
https://bugs.freedesktop.org/show_bug.cgi?id=109022
I got the VMC Page Fault error as well while playing Yuzu with RX580. The bug can be reproduced easily at the same spot each time. GPU crashed but can be accessed through SSH.
If you think my bug is very similar to this bug, maybe I can help debugging.

It'll be a different bug, this only affects Vega10 cards. Polaris is fine with Cemu and the other guy's Vega test app.

I've just tested the vega_crasher on the latest kernel from the linux-amd-staging-drm-next-git package (archlinux) and it didn't crash.
% uname -a
Linux erebor 5.0.0-rc1-amd-staging-drm-next-git-b8cd95e15410+ #1 SMP PREEMPT Sat Feb 16 02:30:22 PST 2019 x86_64 GNU/Linux
That said, I'm still experiencing random crashes. I'll try and get a debug dump next time it happens, but it looks a lot like what is described on this thread: https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1049483-amd-devs-error-ring-gfx-timeout

I jumped ship to nvidia months ago so this doesn't help me, but for you guys following this thread, the Cemu developers managed to fix this issue on their end.
If you install the latest public release of Cemu, all games will work with Vega + mesa under wine.
Since there are non-cemu cases in here, I won't close the issue (someone else can if appropriate).
I'm unsubscribing from this now.
I get multiple of these:
[ 392.377183] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:5 pasid:32772, for process firefox pid 4467 thread firefox:cs0 pid 4565 )
[ 392.377194] amdgpu 0000:09:00.0: at address 0x00000001013d4000 from 27
[ 392.377200] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00501031
(...)
[ 402.621544] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=28019, emitted seq=28022
[ 402.621551] [drm] GPU recovery disabled.
Fedora 30 on Gigabyte X470 AORUS ULTRA GAMING w/ AMD Ryzen 5 2400G with Radeon Vega Graphics running git mesa and git xf86-video-amdgpu.
I started getting these after/around commit 076159b40b96096ba01413abc011a26c9acf7176

I have this fault with 2400G and mesa 18.3 & 19.1.1 with Linux 4.19 (other versions haven't been tested).
It seems that Vega is unable to handle tiny VBO correctly.
I have an old application that uses a lot of immediate mode GL functions to create small billboards using GL_QUADS like the following one:
glTexCoord2f(0, 0); glVertex(v0 * Size);
glTexCoord2f(1, 0); glVertex(v1 * Size);
glTexCoord2f(1, 1); glVertex(v2 * Size);
glTexCoord2f(0, 1); glVertex(v3 * Size);
Initially I have replaced this code with
static GLfloat Vtx[] = {
-1, -1, 0, 0, 0,
1, -1, 0, 1, 0,
1, 1, 0, 1, 1,
-1, 1, 0, 0, 1
};
glBufferData(GL_ARRAY_BUFFER, sizeof(Vtx), Vtx, GL_STATIC_DRAW);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(3, GL_FLOAT, 5*sizeof(GLfloat), 0);
glTexCoordPointer(2, GL_FLOAT, 5*sizeof(GLfloat), 3*sizeof(GLfloat));
+ I use VAO if it's available.
As a variant I used independent arrays for position and texture coordinates. But with the same fault.
So as a result I added required data to another related VBO which contains 8192 vertices. Now I don't have this fault.
I know that OpenGL doesn't like herds of small VBOs, but the hardware failure is not an expected result if we use them.
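A minimal sketch of the byte-offset arithmetic behind the interleaved layout in the comment above (three position floats followed by two texture-coordinate floats per vertex). It is plain C with no GL calls. The Vertex struct and the helper names (vertex_stride, texcoord_offset, as_buffer_offset, texcoord_at) are illustrative assumptions and do not come from the reporter's application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mirroring the 5-float interleaved layout. */
typedef struct {
    float pos[3]; /* x, y, z */
    float uv[2];  /* s, t */
} Vertex;

/* Same values as the Vtx[] array quoted in the comment. */
static const Vertex quad[4] = {
    {{-1.0f, -1.0f, 0.0f}, {0.0f, 0.0f}},
    {{ 1.0f, -1.0f, 0.0f}, {1.0f, 0.0f}},
    {{ 1.0f,  1.0f, 0.0f}, {1.0f, 1.0f}},
    {{-1.0f,  1.0f, 0.0f}, {0.0f, 1.0f}},
};

/* Distance in bytes from one vertex to the next (the "stride" argument). */
size_t vertex_stride(void) { return sizeof(Vertex); }

/* Byte offsets of each attribute inside one vertex. */
size_t position_offset(void) { return offsetof(Vertex, pos); }
size_t texcoord_offset(void) { return offsetof(Vertex, uv); }

/* With a buffer object bound, the "pointer" argument carries a byte
 * offset, so the offset is cast to a pointer type rather than passed
 * as a real address. */
const void *as_buffer_offset(size_t byte_offset)
{
    return (const void *)(uintptr_t)byte_offset;
}

/* Locate the texture coordinate of vertex i by base + i * stride + offset,
 * which is the addressing the stride/offset pair describes. */
const float *texcoord_at(size_t i)
{
    assert(i < sizeof(quad) / sizeof(quad[0]));
    const unsigned char *base = (const unsigned char *)quad;
    return (const float *)(base + i * vertex_stride() + texcoord_offset());
}
```

Under these assumptions the stride equals 5*sizeof(float) and the texture-coordinate offset equals 3*sizeof(float), matching the arguments in the quoted calls; in real code the last argument would be written as as_buffer_offset(texcoord_offset()) or an equivalent cast.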
(In reply to zzyxpaw from comment #52)
> Created attachment 141522 [details]
> Logs + trace with patched mesa, plus example code which consistently
> triggers crash.
>

The example code is incorrect. Line 99:
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 5*sizeof(float), &vertices[3]);
Should be:
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 5*sizeof(float), (void *)(3 * sizeof(float)));
(cf glVertexAttribPointer documentation: "pointer is treated as a byte offset into the buffer object's data store")
With this change the program runs correctly.
Note that even if the program is invalid it shouldn't hang the GPU. I'm working on a fix for this.

I would like to pitch into this as it seems this particular problem has been plaguing me for some months now.
Currently running kernel 5.2.1-arch1-1-ARCH and I will still occasionally get errors like this when running minetest (they seem to be subtly different from the others in this thread upon reading):
[ 5699.136659] amdgpu 0000:0b:00.0: [gfxhub] no-retry page fault (src_id:0 ring:155 vmid:5 pasid:32770, for process minetest pid 7127 thread minetest:cs0 pid 7133)
[ 5699.136662] amdgpu 0000:0b:00.0: in page starting at address 0x000080014034d000 from 27
[ 5699.136664] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00501136
[ 5704.343299] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5709.259775] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5709.259860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5709.259862] [drm] GPU recovery disabled.
*repeat last four lines endlessly...*
Relevant hardware is a ryzen 2200G (vega 8 GPU). The issue has survived swapping almost every component in my system so I think it is safe to rule out hardware brokenness in my case at least.
Mercifully it seems the rest of the system survives this, hence being able to capture the dmesg output, but with the gpu hard locked obviously the only recourse is to then reboot (after gathering some output for a while).
I haven't yet been able to obtain an API trace from minetest when it becomes difficult. Furthermore it doesn't do so reliably - I can often play for hours, but then the crash will strike and then the issue can sometimes persist across a few reboots if I press minetest to try and load a world again fast enough. Heck idk, is it a case of the precise 3D cloud pattern in the menu background at the time? Sounds like it would be useful for me to have apitrace running in the background whenever I run it on the off chance I can catch it in the act.
zzyxpaw's "vega crasher" in message #52 has reliably been able to cause GPU lock-up. Same sort of story: black window will pop up, nothing happens, and either lock-up occurs after a moment, or (interestingly) attempting to move the window in X11 will cause the lock-up immediately.
If there is any more data (such as attempting to get an apitrace) that would be useful I am willing to attempt to gather it, as this issue is the only blemish on an otherwise perfectly stable system.

This MR https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1265 should improve the situation. It was merged last week.
An incorrect program (like "vega_crasher") should hit an assert (if they're enabled in Mesa) or produce an incorrect rendering but shouldn't hang the GPU anymore.

(In reply to Pierre-Eric Pelloux-Prayer from comment #72)
> This MR https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1265 should
> improve the situation. It has been merged last week.
>
> An incorrect program (like "vega_crasher") should hit an assert (if they're
> enabled in Mesa) or produce an incorrect rendering but shouldn't hang the
> GPU anymore.

It could be good if people could report here if this improved with this MR.
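To illustrate why the &vertices[3] argument in vega_crasher counts as an "incorrect program": with a buffer object bound, the last argument of glVertexAttribPointer is read as a byte offset, so a real address becomes an enormous offset far outside the buffer. The sketch below is a hypothetical application-side sanity check in plain C, with no GL calls. It is not the check implemented in the Mesa MR, whose details are not given in this thread, and the vertex values and all helper names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder interleaved data: 3 position floats + 2 texcoord floats
 * per vertex, 4 vertices. The values are illustrative only. */
static const float vertices[4 * 5] = {
    -1.0f, -1.0f, 0.0f, 0.0f, 0.0f,
     1.0f, -1.0f, 0.0f, 1.0f, 0.0f,
     1.0f,  1.0f, 0.0f, 1.0f, 1.0f,
    -1.0f,  1.0f, 0.0f, 0.0f, 1.0f,
};

/* Hypothetical helper: returns 1 if the attribute of every vertex lies
 * inside a buffer of buffer_size bytes, 0 otherwise. The comparison is
 * arranged so that a huge offset cannot wrap around. */
int attrib_fits_in_buffer(size_t offset, size_t stride, size_t attrib_size,
                          size_t vertex_count, size_t buffer_size)
{
    if (vertex_count == 0)
        return 1;
    /* Bytes from the first attribute's start to the last attribute's end. */
    size_t span = (vertex_count - 1) * stride + attrib_size;
    return span <= buffer_size && offset <= buffer_size - span;
}

/* The corrected argument: a byte offset into the buffer's data store. */
size_t correct_texcoord_offset(void) { return 3 * sizeof(float); }

/* The original mistake: a client-memory address reinterpreted as an offset. */
size_t mistaken_texcoord_offset(void)
{
    return (size_t)(uintptr_t)&vertices[3];
}
```

With a stride of 5*sizeof(float), an attribute size of 2*sizeof(float), 4 vertices and a buffer of sizeof(vertices) bytes, the corrected offset passes this check while the address-as-offset value does not, which is the kind of out-of-range access a driver has to survive without hanging the GPU.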
(In reply to Juan A. Suarez from comment #73)
> It could be good if people could report here if this improved with this MR.

I can utilise the mesa-git package in the arch user repository to compile from latest sources. I will then test both vega_crasher and minetest with that package installed to see what occurs. Stay tuned for updates, though it may take a couple of days while I juggle $dayjob.

After compiling mesa-git on commit 0661c357c60 from the AUR pkgbuild, I can now confirm my system seems to have become impervious to the above "vega_crasher" program.
Output from said program after resizing and moving vega_crasher's window a bit, in case it was important:
L CALLBACK: type = 0x8251, severity = 0x826b, message = LLVM diagnostic (remark): <unknown>:0:0: 9 instructions in function
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 16 VGPRS: 24 Code Size: 52 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
GL CALLBACK: type = 0x8251, severity = 0x826b, message = LLVM diagnostic (remark): <unknown>:0:0: 12 instructions in function
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 16 VGPRS: 8 Code Size: 92 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 16 VGPRS: 24 Code Size: 44 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 16 VGPRS: 8 Code Size: 80 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
GL CALLBACK: type = 0x8251, severity = 0x826b, message = LLVM diagnostic (remark): <unknown>:0:0: 2 instructions in function
GL CALLBACK: type = 0x8251, severity = 0x826b, message = LLVM diagnostic (remark): <unknown>:0:0: 3 instructions in function
GL CALLBACK: type = 0x8251, severity = 0x826b, message = LLVM diagnostic (remark): <unknown>:0:0: 4 instructions in function
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 16 VGPRS: 24 Code Size: 44 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 16 VGPRS: 8 Code Size: 136 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
GL CALLBACK: type = 0x8251, severity = 0x826b, message = LLVM diagnostic (remark): <unknown>:0:0: 16 instructions in function
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 24 VGPRS: 24 Code Size: 92 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
GL CALLBACK: type = 0x8251, severity = 0x826b, message = Shader Stats: SGPRS: 24 VGPRS: 24 Code Size: 88 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
Minetest will take longer to test as the pkgbuild doesn't enable asserts, and also because of aforementioned $dayjob. I guess in that case I'd only know if I saw garbled output; it was never very consistent when it occurred but would always be during the loading bar screen (but when it did happen some very colourful blocky corruption would result).

(In reply to deltasquared from comment #75)
> L CALLBACK: type = 0x8251, severity = 0x826b, message = LLVM diagnostic

GL_CALLBACK rather on that first line. terminal copypaste fail.

Created attachment 144845 [details]
vega_crasher after patch, black central output, on ryzen 2200G with vega 8 graphics
Screenshot time (1/2). It seems sometimes vega_crasher will render black - I haven't thoroughly looked over the code so I'm not sure if this is the aforementioned incorrect result (where an assert would have been hit).
Created attachment 144846 [details]
vega_crasher after patch, colour shaded central output, on ryzen 2200G with vega 8 graphics
Screenshot 2/2 of vega_crasher after patch. It seems to indeterministically switch between shaded and black central regions - I can only assume this is down to whether or not the offending index ends up out of bounds?
If it helps I can attempt more tests with an asserts-enabled build, though that will take some more time, a resource I am out of for today. (Will have to look at how to do that also - just a question of a debug build or another flag that needs passing?)
OK, have managed to get an unrelated crash from starting minetest now with the mentioned patch so at this point I think that case is unrelated. (Certainly seems to be more subtle, this MT crash has never been as reliable to trigger as some of the other things on this thread). Will endeavour to file a bug separately. Any suggestions on information to kickstart such a related bug would be appreciated, else I will reach out to various channels on freenode first to get that ball rolling.

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/311.