Summary: | System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung! | ||
---|---|---|---|
Product: | DRI | Reporter: | dwagner <jb5sgc1n.nya> |
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | critical | ||
Priority: | medium | CC: | andrey.grodzovsky, a_ruhier, devurandom, freedesktop, jaapbuurman, johan.gardhage, jsimek.cz, keramidasceid, mboquien, samueldgv, samuel, taijian, thomas |
Version: | DRI git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Description
dwagner
2017-08-20 22:53:09 UTC
Sadly, not only did this bug not attract any attention, it also still occurs, and seemingly even more frequently than before, on current bleeding-edge kernels from amd-staging-drm-next, and also with the now current Firefox 57 and the now current versions of Xorg, Mesa etc. from Arch Linux.

Just to mention this once again: These system crashes still occur, and way too frequently to consider the amdgpu driver stable enough for professional use. Sample dmesg output from today:

Feb 24 18:26:55 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=5430589, last emitted seq=5430591
Feb 24 18:26:55 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:26:55 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:02 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=185928, last emitted seq=185930
Feb 24 18:27:02 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:27:02 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:05 [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0] hw_done or flip_done timed out

Just for the record, others have reported similar symptoms - here is a recent example:
https://bugs.freedesktop.org/show_bug.cgi?id=106666

I was asked in https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1027705-amdgpu-on-linux-4-18-to-offer-greater-vega-power-savings-displayport-1-4-fixes?p=1027933#post1027933 to mention here that I have experienced this kind of bug only when using the "new" display code (amdgpu.dc=1).

I cannot strictly rule out that it could also happen with dc=0, since I have tried dc=0 only for short periods occasionally, but during those periods I did not see this kind of crash.

Just for the record: To rule out that my personally compiled kernels are somehow "more buggy than what others compile", I tried the current Arch-Linux-supplied Linux 4.17.2-1-ARCH kernel.
Survives about 5 minutes of Firefox-browsing between crashes with: Jun 20 00:01:11 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=1895, last em> Jun 20 00:01:11 ryzen kernel: [drm] IP block:gmc_v8_0 is hung! (4.13.* did at least survive days.) Verify you are using latest AMD firmware and up to date MESA/LLVM Firmware here (amdgpu folder) - https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ Andrey (In reply to Andrey Grodzovsky from comment #6) > Verify you are using latest AMD firmware and up to date MESA/LLVM Firmware: pacman -Q linux-firmware linux-firmware 20180606.d114732-1 ll /usr/lib/firmware/amdgpu/vega10_vce.bin -rw-r--r-- 1 root root 165344 Jun 7 08:01 /usr/lib/firmware/amdgpu/vega10_vce.bin MESA: pacman -Q mesa mesa 18.1.2-1 LLVM: pacman -Q llvm-libs llvm-libs 6.0.0-4 Is this new enough? BTW: In a forum somebody asked what the dmesg output on crash looked like if I enabled amdgpu.gpu_recovery=1 - the result is a few lines more of output, but still a fatal system crash: Jun 26 00:50:09 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=12277, last emitted seq=12279 Jun 26 00:50:09 ryzen kernel: [drm] IP block:gmc_v8_0 is hung! Jun 26 00:50:09 ryzen kernel: [drm] IP block:gfx_v8_0 is hung! Jun 26 00:50:09 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin! 
Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out (In reply to dwagner from comment #7) > (In reply to Andrey Grodzovsky from comment #6) > > Verify you are using latest AMD firmware and up to date MESA/LLVM > > Firmware: > > pacman -Q linux-firmware > linux-firmware 20180606.d114732-1 > > ll /usr/lib/firmware/amdgpu/vega10_vce.bin > -rw-r--r-- 1 root root 165344 Jun 7 08:01 > /usr/lib/firmware/amdgpu/vega10_vce.bin > > > MESA: > > pacman -Q mesa > mesa 18.1.2-1 > > > LLVM: > pacman -Q llvm-libs > llvm-libs 6.0.0-4 > > Is this new enough? The kernel and MESA seems new enough, LLVM is 6 so maybe you should try 7. The firmware also looks pretty late but I still would advise to manually override all firmware files with files from here https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu Just backup your existing firmware/amdgpu folder for any case. > > > BTW: In a forum somebody asked what the dmesg output on crash looked like if > I enabled amdgpu.gpu_recovery=1 - the result is a few lines more of output, > but still a fatal system crash: > > Jun 26 00:50:09 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* > ring gfx timeout, last signaled seq=12277, last emitted seq=12279 > Jun 26 00:50:09 ryzen kernel: [drm] IP block:gmc_v8_0 is hung! > Jun 26 00:50:09 ryzen kernel: [drm] IP block:gfx_v8_0 is hung! > Jun 26 00:50:09 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin! 
> Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_flip_done
> [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
> Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
> [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
> Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
> [drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out

It's a known issue. Try the patch I attached to resolve the deadlock, but you will probably experience other failures after that anyway.

Andrey

Created attachment 140345 [details] [review]
Deadlock fix

(In reply to Andrey Grodzovsky from comment #8)
> The kernel and MESA seems new enough, LLVM is 6 so maybe you should try 7.

LLVM 7 has not been released, and replacing LLVM 6 with the current subversion head of LLVM 7 basically means recompiling and reinstalling half of the operating system (starting at radeonsi, then Xorg, then its dependencies...)

I'm fine with using experimental new kernels to find a more stable amdgpu driver - but if a kernel driver crashes just because some user-space application (X11) utilizes a wrong compiler version at run time, then some part of the driver design is very wrong.

> The firmware also looks pretty late but I still would advise to manually
> override all firmware files with files from here
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
> tree/amdgpu

I did a "diff -r" on the git files with the ones installed by Arch, they are all binary identical.

> > Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
> > [drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out
>
> It's a known issue. Try the patch I attached to resolve the deadlock, but
> you will probably experience other failures after that anyway.

Ok, thanks for the patch, will try this next time I compile a new kernel.
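For anyone who wants to repeat the firmware check mentioned above, here is a minimal sketch of a recursive byte-for-byte comparison with "diff -r". The function name compare_firmware and the checkout location in the usage line are assumptions for illustration; only /usr/lib/firmware/amdgpu comes from this report.

```shell
#!/bin/bash
# Sketch: compare the installed amdgpu firmware against a linux-firmware checkout.
# compare_firmware is a hypothetical helper name, not a tool from the thread.
compare_firmware() {
    # $1: installed firmware directory, $2: directory from the git checkout
    # diff -rq lists every file that differs or exists on one side only.
    if diff -rq "$1" "$2"; then
        echo "identical"
    else
        echo "differences found" >&2
        return 1
    fi
}

# Usage (the second path is an assumed checkout location):
# compare_firmware /usr/lib/firmware/amdgpu "$HOME/linux-firmware/amdgpu"
```

If the function prints "identical", replacing the installed files with the git versions would change nothing, which is the result reported above.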
(In reply to Andrey Grodzovsky from comment #8) > The kernel and MESA seems new enough, LLVM is 6 so maybe you should try 7. LLVM 6 is fine. (In reply to dwagner from comment #2) > Just to mention this once again: These system crashes still occur, and way > too frequently to consider the amdgpu driver stable enough for professional > use. Sample dmesg output from today: > > Feb 24 18:26:55 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, > last signaled seq=5430589, last emitted seq=5430591 > Feb 24 18:26:55 [drm] IP block:gmc_v8_0 is hung! > Feb 24 18:26:55 [drm] IP block:gfx_v8_0 is hung! > Feb 24 18:27:02 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 > timeout, last signaled seq=185928, last emitted seq=185930 > Feb 24 18:27:02 [drm] IP block:gmc_v8_0 is hung! > Feb 24 18:27:02 [drm] IP block:gfx_v8_0 is hung! > Feb 24 18:27:05 [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* > [CRTC:43:crtc-0] hw_done or flip_done timed out Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to force CPU VM update mode and see if this helps ? (In reply to Andrey Grodzovsky from comment #12) > Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to > force CPU VM update mode and see if this helps ? Sure. Too early yet to say "hurray", but at an uptime of one hour, currently, 4.17.2 survived with amdgpu.vm_update_mode=3 already about 20 times longer than without that option before the first crash. One (probably just informal) message is emitted by the kernel: [ 19.319565] CPU update of VM recommended only for large BAR system Can you explain a little: What is a "large BAR system", and what does the vm_update_mode=3 option actually cause? Should I expect any weird side effects to look for? 
BTW: Not a result of that option, but of the kernel version, seems to be the fact that the shader clock keeps at a pretty high frequency all the time - even without any 3d or compute load, just displaying a quiet 4k/60Hz desktop image: cat pp_dpm_sclk 0: 214Mhz 1: 481Mhz 2: 760Mhz 3: 1020Mhz 4: 1102Mhz 5: 1138Mhz 6: 1180Mhz * 7: 1220Mhz Much lower shader clocks are used only if I lower the refresh rate of the screen. Is there a reason why the shader clocks should stay high even in the absence of 3d/compute load? (I would have better understood if the minimum memory clock was depending on the refresh rate, but memory clock stays as low as with the older kernels.) (In reply to dwagner from comment #13) > > Much lower shader clocks are used only if I lower the refresh rate of the > screen. Is there a reason why the shader clocks should stay high even in the > absence of 3d/compute load? > Certain display requirements can cause the engine clock to be kept higher as well. (In reply to dwagner from comment #13) > (In reply to Andrey Grodzovsky from comment #12) > > Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to > > force CPU VM update mode and see if this helps ? > > Sure. Too early yet to say "hurray", but at an uptime of one hour, > currently, 4.17.2 survived with amdgpu.vm_update_mode=3 already about 20 > times longer than without that option before the first crash. > > One (probably just informal) message is emitted by the kernel: > [ 19.319565] CPU update of VM recommended only for large BAR system > > Can you explain a little: What is a "large BAR system", and what does the > vm_update_mode=3 option actually cause? Should I expect any weird side > effects to look for? I think it just means systems with large VRAM so it will require large BAR for mapping. But I am not sure on that point. vm_update_mode=3 means GPUVM page tables update is done using CPU. By default we do it using DMA engine on the ASIC. 
The log showed a hang in this engine so I assumed there is something wrong with SDMA commands we submit. I assume more CPU utilization as a side effect and maybe slower rendering. > > > BTW: Not a result of that option, but of the kernel version, seems to be the > fact that the shader clock keeps at a pretty high frequency all the time - > even without any 3d or compute load, just displaying a quiet 4k/60Hz desktop > image: > > cat pp_dpm_sclk > 0: 214Mhz > 1: 481Mhz > 2: 760Mhz > 3: 1020Mhz > 4: 1102Mhz > 5: 1138Mhz > 6: 1180Mhz * > 7: 1220Mhz > > Much lower shader clocks are used only if I lower the refresh rate of the > screen. Is there a reason why the shader clocks should stay high even in the > absence of 3d/compute load? > > (I would have better understood if the minimum memory clock was depending on > the refresh rate, but memory clock stays as low as with the older kernels.) (In reply to Andrey Grodzovsky from comment #15) > I think it just means systems with large VRAM so it will require large BAR > for mapping. But I am not sure on that point. That's correct. the updates are done with the CPU rather than the GPU (SDMA). The default BAR size on most systems is usually 256MB for 32 bit compatibility so the window for CPU access to vram (where the page tables live) is limited. (In reply to Alex Deucher from comment #16) > (In reply to Andrey Grodzovsky from comment #15) > > I think it just means systems with large VRAM so it will require large BAR > > for mapping. But I am not sure on that point. > > That's correct. the updates are done with the CPU rather than the GPU > (SDMA). The default BAR size on most systems is usually 256MB for 32 bit > compatibility so the window for CPU access to vram (where the page tables > live) is limited. Thanks Alex. dwagner, this is obviously just a work around and not a fix. 
It points to some problem with SDMA packets, if you want to continue exploring we can try to dump some fence traces and SDMA HW ring content to examine the latest packets before the hang happened. The good news: So far no crashes during normal uptime with amdgpu.vm_update_mode=3 The bad news: System crashes immediately upon S3 resume (with messages quite different from the ones I saw with earlier S3-resume crashes) - I filed bug report https://bugs.freedesktop.org/show_bug.cgi?id=107065 on this. (In reply to Andrey Grodzovsky from comment #17) > dwagner, this is obviously just a work around and not a fix. It points to > some problem with SDMA packets, if you want to continue exploring we can try > to dump some fence traces and SDMA HW ring content to examine the latest > packets before the hang happened. If you can include some debug output into "amd-staging-drm-next" that helps finding the root cause, I might be able to provide some output - if the kernel survives long enough after the crash to write the system journal - this has not always been the case. Can you use addr2line or gdb with 'list' command to give the line number matching (In reply to dwagner from comment #18) > The good news: So far no crashes during normal uptime with > amdgpu.vm_update_mode=3 > > The bad news: System crashes immediately upon S3 resume (with messages quite > different from the ones I saw with earlier S3-resume crashes) - I filed bug > report https://bugs.freedesktop.org/show_bug.cgi?id=107065 on this. > > (In reply to Andrey Grodzovsky from comment #17) > > dwagner, this is obviously just a work around and not a fix. It points to > > some problem with SDMA packets, if you want to continue exploring we can try > > to dump some fence traces and SDMA HW ring content to examine the latest > > packets before the hang happened. 
>
> If you can include some debug output into "amd-staging-drm-next" that helps
> finding the root cause, I might be able to provide some output - if the
> kernel survives long enough after the crash to write the system journal -
> this has not always been the case.

No need to recompile, I just need to see the content of the SDMA ring buffer when the hang occurs.

Clone and build our register analyzer from here - https://cgit.freedesktop.org/amd/umr/ and once the hang happens just run

sudo umr -lb
sudo umr -R gfx[.]
sudo umr -R sdma0[.]
sudo umr -R sdma1[.]

I will probably need more info later but let's try this first.

(In reply to Andrey Grodzovsky from comment #19)
> No need to recompile, just need to see what is the content of SDMA ring
> buffer when the hang occurs.
>
> Clone and build our register analyzer from here -
> https://cgit.freedesktop.org/amd/umr/ and once the hang happens just run
>
> sudo umr -lb
> sudo umr -R gfx[.]
> sudo umr -R sdma0[.]
> sudo umr -R sdma1[.]
>
> I will probably need more info later but let's try this first.

How can I run "umr" on a crashed system? I guess those register values are retained over a press of the reset button / reboot?

(I meant to write "I guess those register values are NOT retained over a reboot, right?")

(In reply to dwagner from comment #21)
> (I meant to write "I guess those register values are NOT retained over a
> reboot, right?")

Yes, my assumption was that at least sometimes you still have SSH access to the system in those cases.

Just for the record: At this point, I can say that with amdgpu.vm_update_mode=3 4.17.2-ARCH runs at least for hours, not only the minutes it runs without this option before crashing.

I cannot, however, say that the above combination reaches the some-days-between-amdgpu-crashes uptimes that 4.13.x reached - in order to be able to test this, I would need S3 resumes to work, which is subject to bug report 107065.
Without working S3 resumes, there is no way for me to test longer uptimes because amdgpu consistently crashes (in any version I know of) if I just let the system run but switch off the display, and I do not want to keep the connected 4k TV switched on all day and night.

Can you try bisecting between 4.13 and 4.17 to find where stability went downhill for you?

(In reply to Michel Dänzer from comment #24)
> Can you try bisecting between 4.13 and 4.17 to find where stability went
> downhill for you?

A bisect like that is not likely to converge in any reasonable time, given the stochastic nature of those crashes. While the mean-time-between-driver-crashes is dramatically different, there will be occasions on which 4.13 crashes early enough to yield a false "bad", and there will be occasions on which 4.17 lasts the 20 minutes or so needed to assume a false "good".

What about the multitude of debug-options - isn't there one that could allow for some more insight on when/why the driver crashes?

Today for the first time I had a sudden "crash while just browsing with Firefox" while using the amdgpu.vm_update_mode=3 parameter with the current-as-of-today amd-staging-drm-next (bb2e406ba66c2573b68e609e148cab57b1447095) with patch https://bugs.freedesktop.org/attachment.cgi?id=140418 applied on top.
Different kernel messages than with previous crashes of this kind were emitted:

Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 146 0x0c80440c
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100190
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x0c, vmid 7, pasid 32768) at page 1048976, read from 'TC1' (0x54433100) (68)
Jul 07 01:08:25 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=75244, last emitted seq=75245
Jul 07 01:08:25 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!

Hope this helps somehow.

(In reply to dwagner from comment #26)
> Today for the first time I had a sudden "crash while just browsing with
> Firefox" [...]

That could be a Mesa issue, anyway it should probably be tracked separately from this report.

(In reply to Michel Dänzer from comment #27)
> That could be a Mesa issue, anyway it should probably be tracked separately
> from this report.

Created separate bug report https://bugs.freedesktop.org/show_bug.cgi?id=107152

(If that is a Mesa issue, no more than user processes / X11 should have crashed - but not the kernel amdgpu driver... right?)

(In reply to dwagner from comment #28)
> (In reply to Michel Dänzer from comment #27)
> > That could be a Mesa issue, anyway it should probably be tracked separately
> > from this report.
>
> Created separate bug report
> https://bugs.freedesktop.org/show_bug.cgi?id=107152
>
> (If that is a Mesa issue, no more than user processes / X11 should have
> crashed - but not the kernel amdgpu driver... right?)

Not exactly, MESA could create a bad request (faulty GPU address) which would lead to this. It can even be triggered on purpose using a debug flag from MESA.
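As a reading aid for the "VM fault" line quoted in comment 26, here is a small sketch that decodes two of its fields. The page number 1048976 is the decimal form of the FAULT_ADDR register value 0x00100190, and the client ID 0x54433100 is the ASCII tag 'TC1' plus a NUL byte. The conversion to a byte address assumes 4 KiB GPU pages; that page size and both function names are assumptions, not statements from this thread.

```shell
#!/bin/bash
# Sketch: decode fields of an amdgpu "VM fault" dmesg line (format as quoted above).
# Both helper names are hypothetical.

# Convert the decimal page number to a byte address, ASSUMING 4 KiB GPU pages.
fault_page_to_addr() {
    printf '0x%x\n' $(( $1 * 4096 ))
}

# Turn the 32-bit memory-client ID into its ASCII tag, one byte at a time.
client_id_to_tag() (
    id=$(( $1 ))
    tag=""
    for s in 24 16 8 0; do
        byte=$(( (id >> s) & 0xff ))
        # Skip NUL padding bytes; emit every other byte via an octal escape.
        if [ "$byte" -ne 0 ]; then
            tag="$tag$(printf "\\$(printf '%03o' "$byte")")"
        fi
    done
    echo "$tag"
)

# Usage with the values from the log above:
# fault_page_to_addr 1048976
# client_id_to_tag 0x54433100
```

The byte address is what a later page-table lookup would need, under the page-size assumption stated above.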
(In reply to Andrey Grodzovsky from comment #29) > > (If that is a Mesa issue, no more than user processes / X11 should have > > crashed - but not the kernel amdgpu driver... right?) > > Not exactly, MESA could create a bad request (faulty GPU address) which > would lead to this. It can even be triggered on purpose using a debug flag > from MESA. My understanding is that all parts of MESA run as user processes, outside of the kernel space. If such code is allowed to pass parameters into kernel functions that make the kernel crash, that would be a veritable security hole which attackers could exploit to stage at least denial-of-service attacks, if not worse. I got that one too and was able to track the problem down a bit further. Chrome and video with the gpu enabled will blow it up too. Interesting I was able to reproduce it consistantly with my rtl8188eu usb driver plug it in connect and wpa_supplicant will cause it to explode. I ended up due to working on a live dev cd for codexl since all my machines are memory based and use no magnetic media. Just cherry picking the code back to the last 4.16 and no problems Heres the working 4.16 . I chased this rabbit for awhile and it pops up like the dam wood chuck in caddie shack. Here is the latest as of 11 hours ago 4.19-wip https://github.com/tekcomm/linux-image-4.19-wip-generic Here is the latest as of 11 hours ago 4.16 version from three weeks ago with no woodchucks https://github.com/tekcomm/linux-kernel-amdgpu-binaries I think it may be something as stupid as a var too. (In reply to Doctor from comment #32) > Just cherry picking the code > back to the last 4.16 and no problems Heres the working 4.16 . I chased > this rabbit for awhile and it pops up like the dam wood chuck in caddie > shack. > > Here is the latest as of 11 hours ago 4.19-wip > https://github.com/tekcomm/linux-image-4.19-wip-generic I am not sure I understand what you are trying to tell us, here. 
The repository you linked does not seem to contain any relevant commits changing kernel source code.

(In reply to dwagner from comment #30)
> (In reply to Andrey Grodzovsky from comment #29)
> > > (If that is a Mesa issue, no more than user processes / X11 should have
> > > crashed - but not the kernel amdgpu driver... right?)
> >
> > Not exactly, MESA could create a bad request (faulty GPU address) which
> > would lead to this. It can even be triggered on purpose using a debug flag
> > from MESA.
>
> My understanding is that all parts of MESA run as user processes, outside of
> the kernel space. If such code is allowed to pass parameters into kernel
> functions that make the kernel crash, that would be a veritable security
> hole which attackers could exploit to stage at least denial-of-service
> attacks, if not worse.

There is no impact on the kernel. Please note that this is a GPU page fault, not a CPU page fault, so the kernel keeps working normally and doesn't hang. You might get a black screen out of this and have to reboot the graphics card or maybe the entire system to recover, but I don't see any system security or stability compromise here.

*** Bug 107311 has been marked as a duplicate of this bug. ***

In the related bug report (https://bugs.freedesktop.org/show_bug.cgi?id=107152) I noticed that this bug can be triggered very reliably and quickly by playing a video with a deliberately lowered frame rate:
"mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm"

This led me to assume this bug might be caused by the dynamic power management, which often ramps performance up/down when a video is played at such a low frame rate.
And indeed, I found this confirmed by many experiments: If I use a script like
> #!/bin/bash
> cd /sys/class/drm/card0/device
> echo manual >power_dpm_force_performance_level
> # low
> echo 0 >pp_dpm_mclk
> echo 0 >pp_dpm_sclk
> # medium
> #echo 1 >pp_dpm_mclk
> #echo 1 >pp_dpm_sclk
> # high
> #echo 1 >pp_dpm_mclk
> #echo 6 >pp_dpm_sclk
to enforce just any performance level, then the crashes do not occur anymore - also with the "low frame rate video test".

So it seems that the transition from one "dpm" performance level to another, with a certain probability, causes these crashes. And the more often the transitions occur, the sooner one will experience them.

(BTW: For an unknown reason, invoking "xrandr" or enabling a monitor after sleep causes the above settings to get lost, so one has to invoke the above script again.)

*** Bug 107152 has been marked as a duplicate of this bug. ***

(In reply to dwagner from comment #37)
> In the related bug report
> (https://bugs.freedesktop.org/show_bug.cgi?id=107152) I noticed that this
> bug can be triggered very reliably and quickly by playing a video with a
> deliberately lowered frame rate:
> "mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm"
>
> This led me to assume this bug might be caused by the dynamic power
> management, that often ramps performance up/down when a video is played at
> such a low frame rate.

I tried exactly the same: the same card model, the latest kernel, and a webm clip played with mpv the same way you did, and it didn't happen.
>
> And indeed, I found this confirmed by many experiments: If I use a script
> like
> > #!/bin/bash
> > cd /sys/class/drm/card0/device
> > echo manual >power_dpm_force_performance_level
> > # low
> > echo 0 >pp_dpm_mclk
> > echo 0 >pp_dpm_sclk
> > # medium
> > #echo 1 >pp_dpm_mclk
> > #echo 1 >pp_dpm_sclk
> > # high
> > #echo 1 >pp_dpm_mclk
> > #echo 6 >pp_dpm_sclk
> to enforce just any performance level, then the crashes do not occur anymore
> - also with the "low frame rate video test".
>
> So it seems that the transition from one "dpm" performance level to another,
> with a certain probability, causes these crashes. And the more often the
> transitions occur, the sooner one will experience them.
>
> (BTW: For unknown reason, invoking "xrandr" or enabling a monitor after
> sleep causes the above settings to get lost, so one has to invoke above
> script again.)

Created attachment 141112 [details]
.config

I uploaded my .config file - maybe something in your Kconfig flags makes this happen - you can try to rebuild the latest kernel from Alex's repository using my .config and see whether you still experience this.
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Other than that, since your system hard-hangs and you can't do any postmortem dumps, you can at least provide output from event tracing through trace_pipe to catch live logs on the fly. Maybe we can infer something from there...

So again -
Load the system and, before starting to reproduce, run the following trace command -

sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"

then cd /sys/kernel/debug/tracing && cat trace_pipe

When the problem happens just copy all the output from the terminal to a log file. Make sure your terminal app has the largest possible buffer to catch ALL the output.
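The two tracing steps above can be wrapped so that the live trace is copied to a file instead of relying on terminal scrollback. This is only a sketch: the trace-cmd arguments and the trace_pipe path are taken from the comment above, while the run/capture_trace helpers, the DRY_RUN switch and the log path are assumptions added for illustration.

```shell
#!/bin/bash
# Sketch: start the amdgpu trace events and stream trace_pipe into a log file.
# run, capture_trace and DRY_RUN are hypothetical names, not part of the thread.

# Print the command instead of executing it when DRY_RUN is set.
run() {
    if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

capture_trace() {
    # $1: file that receives a copy of everything read from trace_pipe
    # -v excludes the events listed after it (register access and IV events).
    run sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu \
        -v -e amdgpu:amdgpu_mm_rreg -e amdgpu:amdgpu_mm_wreg -e amdgpu:amdgpu_iv || return 1
    run sudo cat /sys/kernel/debug/tracing/trace_pipe | tee "$1"
}

# Usage (log path is an assumption): capture_trace /tmp/amdgpu_trace.log
```

Since the machine tends to hard-hang, a log written to the local disk may be incomplete; running the function inside an ssh session and keeping the file on the other machine avoids that.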
(In reply to Andrey Grodzovsky from comment #40) > Created attachment 141112 [details] > .config > > I uploaded my .config file - maybe something in your Kconfig flags makes > this happen - you can try and rebuild latest kernel from Alex's repository > using my .config and see if you don't experience this anymore. > https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next Did just that - but still the video test crashes after at most few minutes, and does not crash with DPM turned off. So we can rule out our .config differences (of which there are many). > Other than that, since you system hard hangs so you can't do any postmortem > dumps, you can at least provide output from events tracing though trace_pipe > to catch live logs on the fly. Maybe we can infer something from there... > > So again - > Load the system and before starting reproduce run the following trace > command - > > sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e > "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv" > > then cd /sys/kernel/debug/tracing && cat trace_pipe > > When the problem happens just copy all the output from the terminal to a log > file. Make sure your terminal app has largest possible buffer to catch ALL > the output. Will try that on next opportunity, probably tomorrow evening. Ok, did the proposed debugging session with trace-cmd, with output to a different PC over ssh. Using today's amd-staging-drm-next and btw., Arch updated the Xorg server earlier today. 
This time it took about 4 minutes until the video playback with 3 fps crashed. The symptom was the same (a one-colored blank screen and a subsequent system crash), but this time the kernel and ssh session survived the crash for some seconds, enough for me to also issue the earlier suggested "umr -O verbose -R gfx[.]" command after the amdgpu crash, so I can upload the output of that, too. This was the last command executed, though: the system crashed completely while running it (so its output may be partial).

Find attached dmesg, trace, and umr output.

Created attachment 141155 [details]
trace-cmd induced output during 3-fps-video replay and crash
Created attachment 141156 [details]
dmesg from boot to after the 3-fps-video test crash
Created attachment 141157 [details]
output of umr command after 3-fps-video test crash
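When reading the attached logs, the amdgpu_job_timedout lines give the ring name and two fence sequence numbers. The following sketch extracts them and prints how many jobs were emitted but not yet signaled. It accepts both spellings that appear in this report ("last signaled seq=" and "signaled seq="); the function names are hypothetical.

```shell
#!/bin/bash
# Sketch: summarize an amdgpu_job_timedout line from dmesg.
# parse_timeout and pending_jobs are hypothetical helper names.

# Print "<ring> <signaled> <emitted>" for a timeout line, nothing otherwise.
parse_timeout() {
    echo "$1" | sed -n 's/.*ring \([a-z0-9]*\) timeout, \(last \)\{0,1\}signaled seq=\([0-9]*\), \(last \)\{0,1\}emitted seq=\([0-9]*\).*/\1 \3 \5/p'
}

# Print the ring and the number of jobs emitted after the last signaled one.
pending_jobs() (
    set -- $(parse_timeout "$1")
    [ $# -eq 3 ] || exit 1
    echo "$1: $(( $3 - $2 )) job(s) emitted but not signaled"
)

# Usage: dmesg | grep amdgpu_job_timedout | while IFS= read -r l; do pending_jobs "$l"; done
```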
Thanks.

Created attachment 141174 [details] [review]
add_debug_info.patch

I am attaching a basic debug patch, please try to apply it. It should give a bit more info in dmesg when a VM fault happens. I wasn't able to test it on my system so it might be buggy or crash.

Reproduce again like before with trace-cmd, and once the fault happens, if possible, quickly run

sudo umr -O halt_waves -wa

and only if you still have a running system after that do

sudo umr -O verbose -R gfx[.]

The driver should be loaded with amdgpu.vm_fault_stop=2 from grub.

Also check if adding amdgpu.vm_debug=1 makes the issue reproduce more quickly.

(In reply to Andrey Grodzovsky from comment #47)
> Created attachment 141174 [details] [review] [review]
> add_debug_info.patch
>
> I am attaching a basic debug patch, please try to apply it.

Done.

> It should give a
> bit more info in dmesg when a VM fault happens.

Hmm - I could not see any additional output resulting from it.

> Reproduce again like before with trace-cmd, and once the
> fault happens, if possible, quickly run
>
> sudo umr -O halt_waves -wa
>
> and only if you still have a running system after that do
> sudo umr -O verbose -R gfx[.]
>
> The driver should be loaded with amdgpu.vm_fault_stop=2 from grub.

Did that - will attach the script "gpu_debug3.sh" and its output. This time, dmesg and trace output are in the same file; if you want to look only at the dmesg part, "grep '^\[' gpu_debug_3.txt" will get it.

I reproduced the bug 4 times. On 2 occasions no error was emitted before crashing, the 2 other times both umr commands could still run. Since the error message looked the same, I'll attach the shorter file, where the crash occurred more quickly.

> Also check if adding amdgpu.vm_debug=1 makes the issue reproduce more quickly

I used that setting, but it did not seem to make a difference for how quickly the crash occurred - still "some seconds to some minutes".

Created attachment 141189 [details]
script used to generate the gpu_debug_3.txt (when executed via ssh -t ...)
Created attachment 141190 [details]
dmesg / trace / umr output from gpu_debug3.sh
Created attachment 141191 [details]
xz-compressed output of gpu_debug3.sh - dmesg, trace, umr
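For readers without access to the attachment, the general idea of reacting to the first fault and then running the two requested umr commands can be sketched as below. This is not the attached gpu_debug3.sh; the function name, the matched substrings and the overridable UMR variable are assumptions chosen for illustration. Only the two umr invocations are taken from this thread.

```shell
#!/bin/bash
# Sketch of a crash watcher: read kernel log lines on stdin and, on the first
# amdgpu timeout or GPU fault, run the two umr commands requested above.
# NOT the attached script; UMR is overridable so the logic can be tried without a GPU.
UMR="${UMR:-sudo umr}"

watch_for_hang() {
    while IFS= read -r line; do
        case "$line" in
            *amdgpu_job_timedout*|*"GPU fault detected"*)
                echo "crash detected: $line"
                # Halt the waves first, then dump the gfx ring if still alive.
                $UMR -O halt_waves -wa
                $UMR -O verbose -R 'gfx[.]'
                return 0
                ;;
        esac
    done
    return 1
}

# Usage on the affected machine, ideally inside an ssh session:
# dmesg --follow | watch_for_hang
```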
One other experiment I made: I wrote a script to quickly toggle pp_dpm_mclk and pp_dpm_sclk while playing a 3 fps video with power_dpm_force_performance_level=manual. Could not reproduce the crashes that happen with power_dpm_force_performance_level=auto this way. Created attachment 141198 [details] [review] add_debug_info2.patch Try this patch instead, i might be missing some prints in the first one. In the last log you attached I haven't seen any UMR dumps or GPU fault prints in dmesg. THe GPU fault has to be in the log to compare the faulty address against the debug prints in the patch. (In reply to Andrey Grodzovsky from comment #53) > Created attachment 141198 [details] [review] [review] > add_debug_info2.patch > > Try this patch instead, i might be missing some prints in the first one. Can try that this evening. > In the last log you attached I haven't seen any UMR dumps or GPU fault > prints in dmesg. THe GPU fault has to be in the log to compare the faulty > address against the debug prints in the patch. In above attached file "xz-compressed output of gpu_debug3.sh" there is umr output at the time of the crash (238 seconds after the reboot): ---------------------------------------------- ... mpv/vo-897 [005] .... 235.191542: dma_fence_wait_start: driver=drm_sched timeline=gfx context=162 seqno=87 mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal: driver=drm_sched timeline=gfx context=162 seqno=87 kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled: driver=amdgpu timeline=sdma1 context=11 seqno=210 kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled: driver=amdgpu timeline=sdma1 context=11 seqno=211 [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=32624, emitted seq=32626 [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! crash detected! executing umr -O halt_waves -wa No active waves! executing umr -O verbose -R gfx[.] 
polaris11.gfx.rptr == 1792 polaris11.gfx.wptr == 1792 polaris11.gfx.drv_wptr == 1792 polaris11.gfx.ring[1761] == 0xffff1000 ... polaris11.gfx.ring[1762] == 0xffff1000 ... polaris11.gfx.ring[1763] == 0xffff1000 ... polaris11.gfx.ring[1764] == 0xffff1000 ... polaris11.gfx.ring[1765] == 0xffff1000 ... polaris11.gfx.ring[1766] == 0xffff1000 ... polaris11.gfx.ring[1767] == 0xffff1000 ... polaris11.gfx.ring[1768] == 0xffff1000 ... polaris11.gfx.ring[1769] == 0xffff1000 ... polaris11.gfx.ring[1770] == 0xffff1000 ... polaris11.gfx.ring[1771] == 0xffff1000 ... polaris11.gfx.ring[1772] == 0xffff1000 ... polaris11.gfx.ring[1773] == 0xffff1000 ... polaris11.gfx.ring[1774] == 0xffff1000 ... polaris11.gfx.ring[1775] == 0xffff1000 ... polaris11.gfx.ring[1776] == 0xffff1000 ... polaris11.gfx.ring[1777] == 0xffff1000 ... polaris11.gfx.ring[1778] == 0xffff1000 ... polaris11.gfx.ring[1779] == 0xffff1000 ... polaris11.gfx.ring[1780] == 0xffff1000 ... polaris11.gfx.ring[1781] == 0xffff1000 ... polaris11.gfx.ring[1782] == 0xffff1000 ... polaris11.gfx.ring[1783] == 0xffff1000 ... polaris11.gfx.ring[1784] == 0xffff1000 ... polaris11.gfx.ring[1785] == 0xffff1000 ... polaris11.gfx.ring[1786] == 0xffff1000 ... polaris11.gfx.ring[1787] == 0xffff1000 ... polaris11.gfx.ring[1788] == 0xffff1000 ... polaris11.gfx.ring[1789] == 0xffff1000 ... polaris11.gfx.ring[1790] == 0xffff1000 ... polaris11.gfx.ring[1791] == 0xffff1000 ... polaris11.gfx.ring[1792] == 0xc0032200 rwD trying to get ADR from dmesg output for 'umr -O verbose -vm ...' trying to get VMID from dmesg output for 'umr -O verbose -vm ...' done after crash, flashing NUMLOCK LED. amdgpu_cs:0-799 [001] .... 286.852838: amdgpu_bo_list_set: list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072 amdgpu_cs:0-799 [001] .... 286.852846: amdgpu_bo_list_set: list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072 ... 
---------------------------------------------- But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages this time. Sometimes such are emitted, sometimes not. (In reply to dwagner from comment #54) > (In reply to Andrey Grodzovsky from comment #53) > > Created attachment 141198 [details] [review] [review] [review] > > add_debug_info2.patch > > > > Try this patch instead, i might be missing some prints in the first one. > > Can try that this evening. > > > In the last log you attached I haven't seen any UMR dumps or GPU fault > > prints in dmesg. THe GPU fault has to be in the log to compare the faulty > > address against the debug prints in the patch. > > In above attached file "xz-compressed output of gpu_debug3.sh" there is umr > output at the time of the crash (238 seconds after the reboot): > > ---------------------------------------------- > ... > mpv/vo-897 [005] .... 235.191542: dma_fence_wait_start: > driver=drm_sched timeline=gfx context=162 seqno=87 > mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal: > driver=drm_sched timeline=gfx context=162 seqno=87 > kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled: > driver=amdgpu timeline=sdma1 context=11 seqno=210 > kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled: > driver=amdgpu timeline=sdma1 context=11 seqno=211 > [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 > timeout, signaled seq=32624, emitted seq=32626 > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! > > crash detected! > > executing umr -O halt_waves -wa > No active waves! Did you use amdgpu.vm_fault_stop=2 parameter ? In case a fault happened that should have froze GPUs compute units and hence the above command would produce a lot of wave info. > > > executing umr -O verbose -R gfx[.] > > polaris11.gfx.rptr == 1792 > polaris11.gfx.wptr == 1792 > polaris11.gfx.drv_wptr == 1792 > polaris11.gfx.ring[1761] == 0xffff1000 ... 
> polaris11.gfx.ring[1762] == 0xffff1000 ... > polaris11.gfx.ring[1763] == 0xffff1000 ... > polaris11.gfx.ring[1764] == 0xffff1000 ... > polaris11.gfx.ring[1765] == 0xffff1000 ... > polaris11.gfx.ring[1766] == 0xffff1000 ... > polaris11.gfx.ring[1767] == 0xffff1000 ... > polaris11.gfx.ring[1768] == 0xffff1000 ... > polaris11.gfx.ring[1769] == 0xffff1000 ... > polaris11.gfx.ring[1770] == 0xffff1000 ... > polaris11.gfx.ring[1771] == 0xffff1000 ... > polaris11.gfx.ring[1772] == 0xffff1000 ... > polaris11.gfx.ring[1773] == 0xffff1000 ... > polaris11.gfx.ring[1774] == 0xffff1000 ... > polaris11.gfx.ring[1775] == 0xffff1000 ... > polaris11.gfx.ring[1776] == 0xffff1000 ... > polaris11.gfx.ring[1777] == 0xffff1000 ... > polaris11.gfx.ring[1778] == 0xffff1000 ... > polaris11.gfx.ring[1779] == 0xffff1000 ... > polaris11.gfx.ring[1780] == 0xffff1000 ... > polaris11.gfx.ring[1781] == 0xffff1000 ... > polaris11.gfx.ring[1782] == 0xffff1000 ... > polaris11.gfx.ring[1783] == 0xffff1000 ... > polaris11.gfx.ring[1784] == 0xffff1000 ... > polaris11.gfx.ring[1785] == 0xffff1000 ... > polaris11.gfx.ring[1786] == 0xffff1000 ... > polaris11.gfx.ring[1787] == 0xffff1000 ... > polaris11.gfx.ring[1788] == 0xffff1000 ... > polaris11.gfx.ring[1789] == 0xffff1000 ... > polaris11.gfx.ring[1790] == 0xffff1000 ... > polaris11.gfx.ring[1791] == 0xffff1000 ... > polaris11.gfx.ring[1792] == 0xc0032200 rwD > > trying to get ADR from dmesg output for 'umr -O verbose -vm ...' > trying to get VMID from dmesg output for 'umr -O verbose -vm ...' > > done after crash, flashing NUMLOCK LED. > amdgpu_cs:0-799 [001] .... 286.852838: amdgpu_bo_list_set: > list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072 > amdgpu_cs:0-799 [001] .... 286.852846: amdgpu_bo_list_set: > list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072 > ... > ---------------------------------------------- > > But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages > this time. 
Sometimes such are emitted, sometimes not. (In reply to Andrey Grodzovsky from comment #55) > > In above attached file "xz-compressed output of gpu_debug3.sh" there is umr > > output at the time of the crash (238 seconds after the reboot): > > > > ---------------------------------------------- > > ... > > mpv/vo-897 [005] .... 235.191542: dma_fence_wait_start: > > driver=drm_sched timeline=gfx context=162 seqno=87 > > mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal: > > driver=drm_sched timeline=gfx context=162 seqno=87 > > kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled: > > driver=amdgpu timeline=sdma1 context=11 seqno=210 > > kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled: > > driver=amdgpu timeline=sdma1 context=11 seqno=211 > > [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 > > timeout, signaled seq=32624, emitted seq=32626 > > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! > > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! > > > > crash detected! > > > > executing umr -O halt_waves -wa > > No active waves! > > Did you use amdgpu.vm_fault_stop=2 parameter ? In case a fault happened that > should have froze GPUs compute units and hence the above command would > produce a lot of wave info. Yes I did, as can be seen from the kernel command line at the very beginning of the file I attached: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux_amd root=UUID=b5d56e15-18f3-4783-af84-bbff3bbff3ef rw cryptdevice=/dev/nvme0n1p2:root:allow-discards libata.force=1.5 video=DP-1:d video=DVI-D-1:d video=HDMI-A-1:1024x768 amdgpu.dc=1 amdgpu.vm_update_mode=0 amdgpu.dpm=-1 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2 amdgpu.vm_debug=1 Could the "amdgpu 0000:0a:00.0: GPU reset begin!" message indicate a procedure that discards whatever has been in thoses "waves" before? If yes, could amdgpu.gpu_recovery=0 prevent that from happening? 
(In reply to dwagner from comment #56) > (In reply to Andrey Grodzovsky from comment #55) > > > In above attached file "xz-compressed output of gpu_debug3.sh" there is umr > > > output at the time of the crash (238 seconds after the reboot): > > > > > > ---------------------------------------------- > > > ... > > > mpv/vo-897 [005] .... 235.191542: dma_fence_wait_start: > > > driver=drm_sched timeline=gfx context=162 seqno=87 > > > mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal: > > > driver=drm_sched timeline=gfx context=162 seqno=87 > > > kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled: > > > driver=amdgpu timeline=sdma1 context=11 seqno=210 > > > kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled: > > > driver=amdgpu timeline=sdma1 context=11 seqno=211 > > > [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 > > > timeout, signaled seq=32624, emitted seq=32626 > > > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! > > > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! > > > > > > crash detected! > > > > > > executing umr -O halt_waves -wa > > > No active waves! > > > > Did you use amdgpu.vm_fault_stop=2 parameter ? In case a fault happened that > > should have froze GPUs compute units and hence the above command would > > produce a lot of wave info. > > Yes I did, as can be seen from the kernel command line at the very beginning > of the file I attached: > [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux_amd > root=UUID=b5d56e15-18f3-4783-af84-bbff3bbff3ef rw > cryptdevice=/dev/nvme0n1p2:root:allow-discards libata.force=1.5 video=DP-1:d > video=DVI-D-1:d video=HDMI-A-1:1024x768 amdgpu.dc=1 amdgpu.vm_update_mode=0 > amdgpu.dpm=-1 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2 > amdgpu.vm_debug=1 > > Could the "amdgpu 0000:0a:00.0: GPU reset begin!" message indicate a > procedure that discards whatever has been in thoses "waves" before? If yes, > could amdgpu.gpu_recovery=0 prevent that from happening? 
Yes, missed that one. No resets. Here comes another trace log, with your info2.patch applied. Something must have changed since the last test, as it took pretty long this time to reproduce the crash. Could that have been caused by https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8 now being part of the kernel? However, the latest trace you find attached below is not much different to the last one, xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you: [ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=475104, emitted seq=475106 [ 1510.023117] [drm] GPU recovery disabled. amdgpu_cs:0-806 [012] .... 1787.493126: amdgpu_vm_bo_cs: soffs=00001001a0, eoffs=00001001b9, flags=70 amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: soffs=0000100200, eoffs=00001021e0, flags=70 amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: soffs=0000102200, eoffs=00001041e0, flags=70 amdgpu_cs:0-806 [012] .... 1787.493129: amdgpu_vm_bo_cs: soffs=000010c1e0, eoffs=000010c2e1, flags=70 amdgpu_cs:0-806 [012] .... 1787.493131: drm_sched_job: entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job count:8, hw job count:0 And later in the file you can find: ------------------------------------------------------ crash detected! executing umr -O halt_waves -wa No active waves! executing umr -O verbose -R gfx[.] polaris11.gfx.rptr == 512 polaris11.gfx.wptr == 512 polaris11.gfx.drv_wptr == 512 polaris11.gfx.ring[ 481] == 0xffff1000 ... polaris11.gfx.ring[ 482] == 0xffff1000 ... polaris11.gfx.ring[ 483] == 0xffff1000 ... polaris11.gfx.ring[ 484] == 0xffff1000 ... polaris11.gfx.ring[ 485] == 0xffff1000 ... polaris11.gfx.ring[ 486] == 0xffff1000 ... polaris11.gfx.ring[ 487] == 0xffff1000 ... polaris11.gfx.ring[ 488] == 0xffff1000 ... polaris11.gfx.ring[ 489] == 0xffff1000 ... polaris11.gfx.ring[ 490] == 0xffff1000 ... 
polaris11.gfx.ring[ 491] == 0xffff1000 ... polaris11.gfx.ring[ 492] == 0xffff1000 ... polaris11.gfx.ring[ 493] == 0xffff1000 ... polaris11.gfx.ring[ 494] == 0xffff1000 ... polaris11.gfx.ring[ 495] == 0xffff1000 ... polaris11.gfx.ring[ 496] == 0xffff1000 ... polaris11.gfx.ring[ 497] == 0xffff1000 ... polaris11.gfx.ring[ 498] == 0xffff1000 ... polaris11.gfx.ring[ 499] == 0xffff1000 ... polaris11.gfx.ring[ 500] == 0xffff1000 ... polaris11.gfx.ring[ 501] == 0xffff1000 ... polaris11.gfx.ring[ 502] == 0xffff1000 ... polaris11.gfx.ring[ 503] == 0xffff1000 ... polaris11.gfx.ring[ 504] == 0xffff1000 ... polaris11.gfx.ring[ 505] == 0xffff1000 ... polaris11.gfx.ring[ 506] == 0xffff1000 ... polaris11.gfx.ring[ 507] == 0xffff1000 ... polaris11.gfx.ring[ 508] == 0xffff1000 ... polaris11.gfx.ring[ 509] == 0xffff1000 ... polaris11.gfx.ring[ 510] == 0xffff1000 ... polaris11.gfx.ring[ 511] == 0xffff1000 ... polaris11.gfx.ring[ 512] == 0xc0032200 rwD trying to get ADR from dmesg output for 'umr -O verbose -vm ...' trying to get VMID from dmesg output for 'umr -O verbose -vm ...' done after crash. ------------------------------------------- So even without GPU reset, still no "waves". And the error message also does not state any VM fault address. Created attachment 141228 [details]
latest crash trace output, without gpu_reset
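For readers comparing several of these umr dumps: the following is a minimal sketch that pulls the rptr/wptr/drv_wptr values out of output like the above and checks whether they agree. The function names are made up for illustration. Reading "all three pointers equal" as "nothing pending on this ring" is the usual ring-buffer convention and an assumption here, not something confirmed in this thread.

```python
import re

# Matches lines such as "polaris11.gfx.rptr == 512" from "umr -O verbose -R gfx[.]".
# The ASIC prefix (polaris11) is ignored; only ring name, pointer name and value are kept.
PTR_RE = re.compile(r"\.(?P<ring>\w+)\.(?P<name>rptr|wptr|drv_wptr) == (?P<value>\d+)")


def ring_pointers(text):
    """Return {ring: {"rptr": int, "wptr": int, "drv_wptr": int}} found in text."""
    result = {}
    for m in PTR_RE.finditer(text):
        result.setdefault(m["ring"], {})[m["name"]] = int(m["value"])
    return result


def pointers_agree(ptrs):
    """True if read pointer, write pointer and driver write pointer are all equal
    (assumed here to mean the ring has no unprocessed entries)."""
    return ptrs["rptr"] == ptrs["wptr"] == ptrs["drv_wptr"]


if __name__ == "__main__":
    dump = ("polaris11.gfx.rptr == 512 polaris11.gfx.wptr == 512 "
            "polaris11.gfx.drv_wptr == 512 polaris11.gfx.ring[ 481] == 0xffff1000 ...")
    for ring, ptrs in ring_pointers(dump).items():
        print(ring, ptrs, "pointers agree" if pointers_agree(ptrs) else "pointers differ")
```

Note that the dump above is for the gfx ring, while the timeout in this log was reported for sdma0, so this check says nothing about the ring that actually timed out.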
(In reply to dwagner from comment #58) > Here comes another trace log, with your info2.patch applied. > > Something must have changed since the last test, as it took pretty long this > time to reproduce the crash. Could that have been caused by > https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/ > nbio_v7_4.c?h=amd-staging-drm- > next&id=b385925f3922faca7435e50e31380bb2602fd6b8 now being part of the > kernel? Don't think it's related. This code is more related to virtualization. > > However, the latest trace you find attached below is not much different to > the last one, xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you: > > [ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 > timeout, signaled seq=475104, emitted seq=475106 > [ 1510.023117] [drm] GPU recovery disabled. That just means you are again running with GPU VM update mode set to use SDMA. Which is seen in you dmesg (amdgpu.vm_update_mode=0) , so are again experiencing the original issue of SDMA hang. Please use amdgpu.vm_update_mode=3 to get back to VM_FAULTs issue. > > amdgpu_cs:0-806 [012] .... 1787.493126: amdgpu_vm_bo_cs: > soffs=00001001a0, eoffs=00001001b9, flags=70 > amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: > soffs=0000100200, eoffs=00001021e0, flags=70 > amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: > soffs=0000102200, eoffs=00001041e0, flags=70 > amdgpu_cs:0-806 [012] .... 1787.493129: amdgpu_vm_bo_cs: > soffs=000010c1e0, eoffs=000010c2e1, flags=70 > amdgpu_cs:0-806 [012] .... 1787.493131: drm_sched_job: > entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job > count:8, hw job count:0 > > And later in the file you can find: > ------------------------------------------------------ > crash detected! > > executing umr -O halt_waves -wa > No active waves! > > executing umr -O verbose -R gfx[.] 
> > polaris11.gfx.rptr == 512 > polaris11.gfx.wptr == 512 > polaris11.gfx.drv_wptr == 512 > polaris11.gfx.ring[ 481] == 0xffff1000 ... > polaris11.gfx.ring[ 482] == 0xffff1000 ... > polaris11.gfx.ring[ 483] == 0xffff1000 ... > polaris11.gfx.ring[ 484] == 0xffff1000 ... > polaris11.gfx.ring[ 485] == 0xffff1000 ... > polaris11.gfx.ring[ 486] == 0xffff1000 ... > polaris11.gfx.ring[ 487] == 0xffff1000 ... > polaris11.gfx.ring[ 488] == 0xffff1000 ... > polaris11.gfx.ring[ 489] == 0xffff1000 ... > polaris11.gfx.ring[ 490] == 0xffff1000 ... > polaris11.gfx.ring[ 491] == 0xffff1000 ... > polaris11.gfx.ring[ 492] == 0xffff1000 ... > polaris11.gfx.ring[ 493] == 0xffff1000 ... > polaris11.gfx.ring[ 494] == 0xffff1000 ... > polaris11.gfx.ring[ 495] == 0xffff1000 ... > polaris11.gfx.ring[ 496] == 0xffff1000 ... > polaris11.gfx.ring[ 497] == 0xffff1000 ... > polaris11.gfx.ring[ 498] == 0xffff1000 ... > polaris11.gfx.ring[ 499] == 0xffff1000 ... > polaris11.gfx.ring[ 500] == 0xffff1000 ... > polaris11.gfx.ring[ 501] == 0xffff1000 ... > polaris11.gfx.ring[ 502] == 0xffff1000 ... > polaris11.gfx.ring[ 503] == 0xffff1000 ... > polaris11.gfx.ring[ 504] == 0xffff1000 ... > polaris11.gfx.ring[ 505] == 0xffff1000 ... > polaris11.gfx.ring[ 506] == 0xffff1000 ... > polaris11.gfx.ring[ 507] == 0xffff1000 ... > polaris11.gfx.ring[ 508] == 0xffff1000 ... > polaris11.gfx.ring[ 509] == 0xffff1000 ... > polaris11.gfx.ring[ 510] == 0xffff1000 ... > polaris11.gfx.ring[ 511] == 0xffff1000 ... > polaris11.gfx.ring[ 512] == 0xc0032200 rwD > > > trying to get ADR from dmesg output for 'umr -O verbose -vm ...' > trying to get VMID from dmesg output for 'umr -O verbose -vm ...' > > done after crash. > ------------------------------------------- > > So even without GPU reset, still no "waves". And the error message also does > not state any VM fault address. > Please use amdgpu.vm_update_mode=3 to get back to VM_FAULTs issue.
The "good" news is that reproduction of the crashes with 3-fps-video-replay is very quick when using amdgpu.vm_update_mode=3.
But the bad news is that I have not been able to get useful error output when using vm_update_mode=3.
At first I also set amdgpu.vm_debug=1, and with that, in 10 crashes not a single line of error output was emitted to either the ssh channel or the system journal.
I then tried amdgpu.vm_debug=0. With that, a few error lines do get logged, but nothing really useful - see also the attached example:
[ 912.447139] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=12818, emitted seq=12819
[ 912.447145] [drm] GPU recovery disabled.
These are the only lines indicating the error. Not even the
echo "crash detected!"
after the
dmesg -w | tee /dev/tty | grep -m 1 -e "amdgpu.*GPU" -e "amdgpu.*ERROR"
gets emitted, much less the umr commands that should follow it.
What could I do to keep the kernel from dying so quickly when using amdgpu.vm_update_mode=3?
Created attachment 141243 [details]
crash trace with amdgpu.vm_update_mode=3
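For anyone triaging logs like the two lines quoted above, here is a minimal sketch (not part of the debug scripts attached to this report) that extracts the ring name and the two sequence numbers from an amdgpu_job_timedout message. The names RingTimeout and parse_timeout are hypothetical. The pattern assumes the message formats quoted in this report, including the older "last signaled seq=... last emitted seq=..." wording.

```python
import re
from typing import NamedTuple, Optional

# Timeout message as seen in this report, e.g.
#   "... *ERROR* ring gfx timeout, signaled seq=12818, emitted seq=12819"
# Older kernels quoted in this report print "last signaled" / "last emitted",
# so the "last " prefix is optional.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"(?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=(?P<emitted>\d+)"
)


class RingTimeout(NamedTuple):
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        """Number of fences emitted but not yet signaled."""
        return self.emitted - self.signaled


def parse_timeout(line: str) -> Optional[RingTimeout]:
    """Return the parsed timeout, or None if the line is not a ring timeout message."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return RingTimeout(m["ring"], int(m["signaled"]), int(m["emitted"]))


if __name__ == "__main__":
    line = ("[  912.447139] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
            "ring gfx timeout, signaled seq=12818, emitted seq=12819")
    print(parse_timeout(line))
```

Run over a whole journal, this makes it easy to see which ring (gfx, sdma0, sdma1) timed out in each crash, which is the distinction between the vm_update_mode=0 and vm_update_mode=3 runs discussed above.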
FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
(In reply to Anthony Ruhier from comment #63)
> FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
> been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
Forgot to say that I have a vega 64.
(In reply to Anthony Ruhier from comment #63)
> FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
> been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
Unluckily, I cannot confirm either observation: The current amd-staging-drm-next git head still crashes on me quickly, and the crash is still well reproducible with the 3-fps-video-replay test.
And going into S3 suspend does not work for me with the current amd-staging-drm-next either.
(In reply to dwagner from comment #65)
> (In reply to Anthony Ruhier from comment #63)
> > FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
> > been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
>
> Unluckily, I cannot confirm either observation: The current
> amd-staging-drm-next git head still crashes on me quickly, and the crash is
> still well reproducible with the 3-fps-video-replay test.
>
> And going into S3 suspend does not work for me with the current
> amd-staging-drm-next either.
Last time I tested, amd-staging-drm-next seemed to be based on 4.19-rc1, on which I had the issue too. I switched to vanilla 4.19-rc4 (now -rc5) and it was fixed.
Tried on 4.19-rc5, still crashes for me after about 2-3 days (of 6-12h use).
Tested today's current amd-staging-drm-next git head, to see if there has been any improvement over the last two months.
The bad news: The 3-fps-video-replay test still crashes the driver reproducibly after a few minutes, as long as the default automatic power management is active.
The mediocre news: At least it looks as if the linux kernel now survives the driver crash to some extent. I found messages in the journal like this:
Nov 14 00:59:36 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22008, emitted seq=22010
Nov 14 00:59:36 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:37 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=107, emitted seq=109
Nov 14 00:59:37 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:40 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22008, emitted seq=22010
Nov 14 00:59:40 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:41 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=107, emitted seq=109
... and so on, repeating for several minutes after the screen went blank.
Will test tomorrow if this means I can now collect the diagnostics outputs that were asked for earlier.
Some good news: S3 suspends/resumes are working fine right now. There are some scary messages emitted upon resume, but they do not seem to have bad consequences:
[ 281.465654] [drm:emulated_link_detect [amdgpu]] *ERROR* Failed to read EDID
[ 281.490719] [drm:emulated_link_detect [amdgpu]] *ERROR* Failed to read EDID
[ 282.006225] [drm] Fence fallback timer expired on ring sdma0
[ 282.512879] [drm] Fence fallback timer expired on ring sdma0
[ 282.556651] [drm] UVD and UVD ENC initialized successfully.
[ 282.657771] [drm] VCE initialized successfully.
As promised in the above comment, today I ran my debug script "gpu_debug4.sh" to obtain the diagnostic output after the crash as requested above.
This output is in the attached "gpu_debug4_output.txt".
Since the trace output, the "dmesg -w" output and stdout are written to the same file, they are in roughly chronological order.
If you want to look only at the dmesg output, use
> grep '^\[' gpu_debug4_output.txt
(gpu_debug4.sh is a slight variation of the earlier gpu_debug3.sh, just writing to a local log file.)
BTW: I ran the script multiple times. Crashes occurred after 5 to 300 seconds, and the diagnostic output always looked like the one in the attached gpu_debug4_output.txt.
Created attachment 142483 [details]
test script
Created attachment 142484 [details]
gpu_debug4_output.txt.gz
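For readers who want to go a step further than the grep '^\[' shown above, here is a minimal sketch that splits such a mixed log into dmesg lines and everything else, and reads the kernel timestamp from a dmesg line. The function names are hypothetical. It assumes, as in the excerpts quoted in this report, that dmesg lines are exactly the ones starting with "[".

```python
import re

# Kernel timestamp at the start of a dmesg line, e.g. "[  238.180641] amdgpu ..."
DMESG_TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def split_log(lines):
    """Split a mixed log into (dmesg_lines, other_lines).

    Same criterion as grep '^\\[': a line belongs to dmesg if it starts with '['.
    """
    dmesg, other = [], []
    for line in lines:
        line = line.rstrip("\n")
        (dmesg if line.startswith("[") else other).append(line)
    return dmesg, other


def dmesg_timestamp(line):
    """Return the seconds-since-boot timestamp of a dmesg line, or None."""
    m = DMESG_TS_RE.match(line)
    return float(m.group(1)) if m else None


if __name__ == "__main__":
    import gzip
    import sys

    # Usage (file name is whatever you saved the attachment as):
    #   python split_log.py gpu_debug4_output.txt.gz
    with gzip.open(sys.argv[1], "rt", errors="replace") as fh:
        dmesg, other = split_log(fh)
    print(f"{len(dmesg)} dmesg lines, {len(other)} trace/stdout lines")
    if dmesg:
        print("first dmesg timestamp:", dmesg_timestamp(dmesg[0]))
```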
Just for the record, since another month has passed: I can still reproduce the crash with today's git head of amd-staging-drm-next within minutes. (Also using the very latest firmware files from https://people.freedesktop.org/~agd5f/radeon_ucode/ )
Someone suggested I buy a Ryzen 2400G APU, but almost every time some network lag happens while watching a TV stream through Kodi and the FPS of that video goes to 0, the display just freezes and you have to power cycle the computer. There is no space for an external graphics card in my case and I don't want the increased power consumption, so at this point I'm just considering a switch to an Intel CPU. I have been following this case for 4 months now with hope that it would move forward a bit, but it seems stuck. I can give additional dumps and test some patches if that would help, but it seems like others have given plenty of information on how to reproduce it.
The Firefox browser requires the pulseaudio driver. Use the Alsa audio and the chrome/chromium browser. Disable hardware acceleration in browser settings.
Audio is unrelated to this bug. In my reproduction scripts, I do not output any audio at all.
The video-at-3-fps replay that I use for reproduction seems to just trigger a certain pattern of the memory and shader clocks getting increased/decreased (with dynamic power management being enabled) that makes the occurrence of this bug likely. Any other GPU usage pattern that triggers a lot of memory/shader clock changes seems to also increase the crash likelihood. Manual use of a web browser, where GPU load spikes are caused a few times per second, also seems to be a scenario where this bug is triggered now and then.
Just for the record, since another month has passed: I can still reproduce the crash with today's git head of amd-staging-drm-next within minutes.
As a bonus bug, with today's git head I also get unexplainable "minimal" memory and shader clock values - and a doubled power consumption (12W instead of 6W) for my default 3840x2160 60Hz display mode in comparison to last month's drm-next of the day: > cd /sys/class/drm/card0/device > xrandr --output HDMI-A-0 --mode 3840x2160 --rate 30 > echo manual >power_dpm_force_performance_level > echo 0 >pp_dpm_mclk > echo 0 >pp_dpm_sclk > grep -H \\* pp_dpm_mclk pp_dpm_sclk pp_dpm_mclk:0: 300Mhz * pp_dpm_sclk:0: 214Mhz * > xrandr --output HDMI-A-0 --mode 3840x2160 --rate 50 > echo manual >power_dpm_force_performance_level > echo 0 >pp_dpm_mclk > echo 0 >pp_dpm_sclk > grep -H \\* pp_dpm_mclk pp_dpm_sclk pp_dpm_mclk:1: 1750Mhz * pp_dpm_sclk:1: 481Mhz * > xrandr --output HDMI-A-0 --mode 3840x2160 --rate 60 > echo manual >power_dpm_force_performance_level > echo 0 >pp_dpm_mclk > echo 0 >pp_dpm_sclk > grep -H \\* pp_dpm_mclk pp_dpm_sclk pp_dpm_mclk:0: 300Mhz * pp_dpm_sclk:6: 1180Mhz * But that power consumption issue is negligible in comparison to the show-stopping crashes that are the topic of this bug report. Since another month has passed: I can still reproduce the crash with today's git head of amd-staging-drm-next (and an up-to-date Arch Linux) within minutes by replaying a video at 3 fps. Additional new bonus bugs this time: - system consistently hangs at soft-reboots if X11 was started before - system crashes immediately upon X11 start if vm_update_mode=3 is used - system crashes if the HDMI-connected TV is shut off while screen blanking Again, the bonus bugs are either irrelevant in comparison to the instability this report is about or have been reported already by others. Hi, I am affected by similar issues too using AMDGPU drivers on linux, and I have opened another bug, before finding this. You can have a look at my findings and the workarounds I am applying. 
So far I had good success with those, but I am interested in knowing your thoughts, recommendations, and feedback. Also if the bug I opened is a duplicate of this one, feel free to let me know and I will mark it as duplicate. https://bugs.freedesktop.org/show_bug.cgi?id=109955 Cheers Mauro I am also running into the same issue. I have two questions that might help tracking down why we are having issues, but not all people that are running a Vega graphics card. 1) What is the output of the following command for you guys? cat /sys/class/drm/card0/device/vbios_version I am running the following version: 113-D0500100-103 According to the techpowerup GPU bios database, this is a vega bios that was replaced two days (!) later by a new version. Perhaps issues were found that required another bios update? I might install Windows on a spare HDD and try to flash my Vega to see if that changes anything. 2) Memory clocking is different for people running multiple monitors. Are you guys also running multiple monitors by any chance? (In reply to Jaap Buurman from comment #79) > I am also running into the same issue. I have two questions that might help > tracking down why we are having issues, but not all people that are running > a Vega graphics card. As you can see from my initial description, I'm running an RX460, which uses not a "Vega", but a "Polaris 11" AMD GPU. > What is the output of the following command for you guys? > > cat /sys/class/drm/card0/device/vbios_version "113-BAFFIN_PRO_1606" I have not heard of any update to this from the vendor - there is just some unofficial hacked version around (which I do not use) that is said to enable some switched-off CUs. > Memory clocking is different for people running multiple monitors. Are you > guys also running multiple monitors by any chance? No, I'm using just one 3840x2160 @ 60Hz HDMI display. 
(In reply to Alex Deucher from comment #14) > (In reply to dwagner from comment #13) > > > > Much lower shader clocks are used only if I lower the refresh rate of the > > screen. Is there a reason why the shader clocks should stay high even in the > > absence of 3d/compute load? > > > > Certain display requirements can cause the engine clock to be kept higher as > well. In this bug report and another similar one (https://bugs.freedesktop.org/show_bug.cgi?id=109955), everybody having the issue seems to be using a setup that requires higher engine clocks in idle AFAIK. Either high refresh displays, or in my case, multiple monitors. Could this be part of the issue that seems to trigger this bug? I might be grasping at straws here, but I have had this problem for as long as I have this Vega64 (bought at launch), while it is 100% stable under Windows 10 in the same setup. I am also experiencing this issue. * Kernel: 5.1.3-arch2-1-ARCH * LLVM 8.0.0 * AMDVLK (dev branch pulled 20190602) * Mesa 19.0.4 * Card: XFX Radeon RX 590 I've seen this error, bug 105733, bug 105152, bug 107536, and bug 109955 all repeatable (which one each time appears to be non-deterministic) with the same process. I just launch "House Flipper" from Steam (DX11 title), with DXVK 1.2.1, on either the mesa RADV or AMDVLK vulkan implementations. At 2560x1440 resolution (both 60Hz and 144Hz refresh rates), the crash(es) occur. At 1080p@60Hz, I get no crashes, but they come back if I disable v-sync and framerate limiting. I logged power consumption with `sensors | egrep '^power' | awk '{ print $1 " " $2; }'`, and found that the crash often occurs soon after the card hits its maximum power draw at around 190W. I don't have much experience debugging or developing software at the kernel/driver level, but I'm happy to help with providing information as I go through the learning process here. I'll compile the amd-staging-drm-next kernel later tonight and post some results and logs. 
Please let me know if there's more information I could provide that may be of use here. Thanks for your hard work!
(In reply to Jaap Buurman from comment #81)
> issue seems to be using a setup that requires higher engine clocks in idle
> AFAIK. Either high refresh displays, or in my case, multiple monitors. Could
> this be part of the issue that seems to trigger this bug? I might be
> grasping at straws here, but I have had this problem for as long as I have
> this Vega64 (bought at launch), while it is 100% stable under Windows 10 in
> the same setup.
This might be true. I was running i3 with xrandr set to 144Hz when the freezes began (sometime last month; I did not "game" much before). Then I switched to icewm to test and the issue was gone. Later, when I configured icewm to also use the proper xrandr setting, the issue came back. I didn't know that could be related. Will test this tonight.
(In reply to Wilko Bartels from comment #83)
> (In reply to Jaap Buurman from comment #81)
> > issue seems to be using a setup that requires higher engine clocks in idle
> > AFAIK. Either high refresh displays, or in my case, multiple monitors. Could
> > this be part of the issue that seems to trigger this bug? I might be
> > grasping at straws here, but I have had this problem for as long as I have
> > this Vega64 (bought at launch), while it is 100% stable under Windows 10 in
> > the same setup.
>
> This might be true. I was running i3 with xrandr set to 144Hz when the
> freezes began (sometime last month; I did not "game" much before). Then I
> switched to icewm to test and the issue was gone. Later, when I configured
> icewm to also use the proper xrandr setting, the issue came back. I didn't
> know that could be related. Will test this tonight.
Never mind. It crashed at 60Hz as well (once) yesterday.
(In reply to Wilko Bartels from comment #84)
> Never mind. It crashed at 60Hz as well (once) yesterday.
It sure does.
This bug is now about two years old. During that time amdgpu has never been stable and has even gotten worse, and every contemporary kernel, whether an "official" one or one compiled from the git heads of development trees, has this very problem, which I can reproduce within minutes. I've given up hoping for a fix. I'll buy an Intel Xe GPU as soon as it hits the shelves.
I was also impacted by this bug (amdgpu hangs under random conditions with messages similar to the ones shown here) with any kernel/mesa version combination other than the ones on Debian Stretch (any other distro, or using Mesa from backports, would trigger those crashes). This was on a Ryzen 1700 platform with a B450 chipset. I had this issue with an RX480 and an RX560 (I replaced the GPU in case it was faulty, and I also replaced the motherboard). I was still impacted with Fedora 30, with recurring GPU hangs. Then I replaced the CPU/motherboard with a Core i7-9700k/Z390 platform. Since then I have not had a single GPU hang on Fedora 30. My hypothesis on this problem not being easily reproducible is that it would happen only on specific GPU/CPU combinations.
(In reply to Paul Ezvan from comment #86)
> My hypothesis on this problem not being easily reproducible is that it would
> happen only on specific GPU/CPU combinations.
... and at least a specific operating system (Linux) and a specific driver (amdgpu with dc=1).
If your hypothesis were true - do you suggest everyone plagued by this bug just buys a new main-board and an Intel CPU to evade it?
Since my Ryzen system is perfectly stable when used as a server, not displaying anything but the text console, I'm inclined to rather keep my main-board and CPU and just exchange the GPU for another brand that comes with stable drivers.
Found this thread while googling the error from the log.
AMD Ryzen 3600
Asrock B350 motherboard
ASrock RX560 Radeon GPU
Ubuntu and Xubuntu 18.04 and 19.04: both lock up, so not usable. After login, an almost immediate black screen; ssh access still possible.
Seems a newer kernel and Mesa drivers. Sometimes it takes 5 minutes, sometimes it happens after 2 seconds.

Linux Mint 19.2 seems a lot more stable: so far only 1 lockup with a black screen.

uname -a
Linux jeroenimo-amd 4.15.0-64-generic #73-Ubuntu SMP Thu Sep 12 13:16:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Last log from Mint:

Sep 25 23:01:57 jeroenimo-amd kernel: [ 4980.207322] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:43:crtc-0] flip_done timed out
Sep 25 23:01:57 jeroenimo-amd kernel: [ 4980.207331] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:45:crtc-1] flip_done timed out
Sep 25 23:02:07 jeroenimo-amd kernel: [ 4990.451366] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:43:crtc-0] flip_done timed out

I suspect I'm in the same trouble as most. Windows 10 runs flawlessly, so it's really software.

I found a way to crash the system: glmark2 crashes it almost instantly.

I managed to run glmark2 without crashing the system by manually forcing the card to its lowest frequency from a root shell:

echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 0 > /sys/class/drm/card0/device/pp_dpm_sclk

root@jeroenimo-amd:/home/jeroen# cat /sys/class/drm/card0/device/pp_dpm_sclk
0: 214Mhz *
1: 387Mhz
2: 843Mhz
3: 995Mhz
4: 1062Mhz
5: 1108Mhz
6: 1149Mhz
7: 1176Mhz
root@jeroenimo-amd:/home/jeroen#

If I go higher, e.g. to level 2 (843Mhz), I still manage to crash it, although it takes a while before it crashes. When I force the card to anything above level 4, I get an immediate crash without even starting glmark2.

I hope this helps!

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/226.
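For anyone who wants to repeat the sclk experiment from the glmark2 comment above, the two sysfs writes can be wrapped in small helper functions. This is only a sketch: the function names are made up for illustration, the device directory is passed in as an argument because the card index (card0 in the comment) may differ between systems, and writing to the real sysfs files needs a root shell. The "auto" value used to hand control back to the driver is the documented default for power_dpm_force_performance_level.

```shell
#!/bin/sh
# Hypothetical helpers around the amdgpu sysfs files used in the comment above.
# DEVICE_DIR is e.g. /sys/class/drm/card0/device (card index is an assumption).

# force_sclk_level DEVICE_DIR LEVEL
# Switch DPM to manual control, then select a single sclk level by its index.
force_sclk_level() {
    dev_dir=$1
    level=$2
    echo manual > "$dev_dir/power_dpm_force_performance_level" || return 1
    echo "$level" > "$dev_dir/pp_dpm_sclk" || return 1
}

# restore_auto DEVICE_DIR
# Hand clock selection back to the driver.
restore_auto() {
    echo auto > "$1/power_dpm_force_performance_level"
}

# current_sclk DEVICE_DIR
# Print the level line that the driver marks with '*' (the active level).
current_sclk() {
    grep '\*' "$1/pp_dpm_sclk"
}
```

Typical use from a root shell would be `force_sclk_level /sys/class/drm/card0/device 0`, then running the workload, then `restore_auto /sys/class/drm/card0/device`. Stepping the level index up one at a time is a way to repeat the observation that higher levels crash sooner.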