| Summary: | GPU fault detected: 146 / VM_CONTEXT1_PROTECTION_FAULT / ring gfx timeout | | |
|---|---|---|---|
| Product: | DRI | Reporter: | dwagner <jb5sgc1n.nya> |
| Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | normal | | |
| Priority: | medium | CC: | andrey.grodzovsky, goetzchrist |
| Version: | DRI git | | |
| Hardware: | x86-64 (AMD64) | | |
| OS: | Linux (All) | | |
Description
dwagner
2018-07-07 20:03:25 UTC
Created attachment 140497
dmesg from booting until the GPU fault a few minutes later
Created attachment 140498
Xorg.log from the session ending in the "gpu fault"
Hi,

I get this GPU hang once or twice a day, mostly when using PHPStorm (a Java-based PHP editor).

System is KDE Neon (Ubuntu 18.04 + latest KDE); I use the padoka PPA (currently 1:18.2~git180730133900.0ea243d~b~padoka0).

GPU: POLARIS11 0x1002:0x67EF 0x1458:0x230A 0xE5

[ 8004.993577] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0000620c
[ 8004.993584] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 8004.993587] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0806200C
[ 8004.993591] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 4, pasid 32771) at page 0, read from 'CBC0' (0x43424330) (98)
[ 8263.966497] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=194416, last emitted seq=194418
[ 8263.966510] [drm] IP block:gfx_v8_0 is hung!
[ 8263.966562] amdgpu 0000:01:00.0: GPU reset begin!

If you need more info, please let me know.

Krzysiek

Saw this kind of crash (still with the latest amd-staging-drm-next kernel) three times in a row today, just by playing a specific video with mpv immediately after rebooting and starting X11, before the 10-minute video ended.

The video (which just shows a static cover image) can be obtained via:

youtube-dl -f 248+251 'https://www.youtube.com/watch?v=kYKE78Pcjog'

The log messages were just like those reported above. I guess the additional "hw_done or flip_done timed out" after the "GPU reset begin!"
is not really relevant:

Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 147 0x0f580402 for process Xorg pid 793 thread amdgpu_cs:0 pid 794
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010C3EB
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02004002
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x02, vmid 1, pasid 32768) at page 1098731, read from 'TC3' (0x54433300) (4)
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 146 0x0c984424 for process Xorg pid 793 thread amdgpu_cs:0 pid 794
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100193
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04044024
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x24, vmid 2, pasid 32768) at page 1048979, read from 'TC1' (0x54433100) (68)
Jul 31 22:20:25 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=26570, emitted seq=26573
Jul 31 22:20:25 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Jul 31 22:20:35 ryzen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:44:crtc-0] hw_done or flip_done timed out

Just in case somebody aims at reproducing this with mpv, this is the content of the .config/mpv/mpv.conf file in use:

audio-device='alsa/iec958:CARD=Generic,DEV=0'
audio-delay=0.2
fs=no
vo=gpu
gpu-api=auto
profile=gpu-hq
fbo-format=rgba16f
hwdec=no
video-align-y=-1
hidpi-window-scale=no
target-prim=bt.709
tone-mapping=hable
cache=65536
cache-initial=2048
cache-secs=20

So the amdgpu HDMI audio output was not used, and neither was hardware-accelerated video decoding.

dwagner, how quickly is this reproducible?

A wild guess: what if you boot the kernel with IOMMU disabled? Add iommu=off to the grub command line.

(In reply to Andrey Grodzovsky from comment #6)
> dwagner, how quickly is this reproducible ?
With the above video playback test (which I should refer to as the "Othan" test, because that is the name of the song in the video) actually quite fast - never took more than 10 minutes so far to get to the crash.

> A wild guess, what if you boot kernel with IOMMU disabled ? Add iommu=off to
> grub command line.

Tried this: no difference. Two attempts with current amd-staging-drm-next, one with vm_update_mode=0 and one with vm_update_mode=3, both crashed in < 1 minute of replay.

Interestingly, the "Othan test" can even crash the 4.13 kernel quicker than the usual one or two days of uptime I can get with that old kernel.

There isn't really anything special about the video other than it being encoded at only 6 frames per second. And btw., the video replay crashes even with --vo=xv, so without mpv making use of OpenGL. Replay does not crash with --vo=null.

In contrast, when I replay videos with the usual 24 fps, this runs much longer without crashing.

dwagner, I think you already have all the trace tools installed from previous debug sessions, so this should be quick for you.

Update to the latest kernel from https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Load the system and, before starting to reproduce, run the following trace command:

sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"

After the VM_FAULT has happened, extract the log from /sys/kernel/debug/tracing.

Also run:

sudo umr -O verbose -R gfx[.]
sudo umr -O halt_waves -wa

Now let's say this is your crash log:

Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100190
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x0c, vmid 7, pasid 32768) at page 1048976, read from 'TC1' (0x54433100) (68)

Then run:

umr -O verbose -vm 7@100190000 1

where 7 is the vmid value and 100190000 is the VM_CONTEXT1_PROTECTION_FAULT_ADDR value with an extra '000' appended to get from the virtual page number to the actual virtual address (a left shift by 12 bits, i.e. multiplication by the 4096-byte page size).

I can look at the log then and also run it by our Mesa/LLVM experts to try and figure out what's going on.

Created attachment 140959
test script that attempted to catch useful output after crashes - but failed
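The page-to-address conversion described in comment #8 (appending '000' to the VM_CONTEXT1_PROTECTION_FAULT_ADDR value) can be double-checked with shell arithmetic. This is only a minimal sketch: the function names are hypothetical helpers and not part of umr, and the example values are the ones from the log quoted in that comment.

```shell
#!/bin/bash
# Convert a VM fault page number (hex or decimal) into a byte address
# by shifting left 12 bits, i.e. multiplying by the 4096-byte page size.
# fault_page_to_addr is a hypothetical helper name.
fault_page_to_addr() {
    local page=$1
    printf '0x%x\n' $(( page << 12 ))
}

# Build the "vmid@address" argument in the form used with "umr -vm"
# in comment #8 (address printed in hex without a 0x prefix).
# umr_vm_arg is a hypothetical helper name.
umr_vm_arg() {
    local vmid=$1 page=$2
    printf '%d@%x\n' "$vmid" $(( page << 12 ))
}

fault_page_to_addr 0x00100190   # prints 0x100190000
umr_vm_arg 7 0x00100190         # prints 7@100190000
```

The decimal page number from the "VM fault ... at page 1048976" line gives the same result, since 1048976 is 0x100190.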
(In reply to Andrey Grodzovsky from comment #8)
> dwagner, I think you already have all the trace tools installed from
> previous debug sessions so this should be quick for you

Yes, and I tried really hard (with the above attached script run as "root" on a text console while the "othan_test.sh" script played the video on the screen) to catch any useful output, but that failed for the same reason I mentioned in https://bugs.freedesktop.org/show_bug.cgi?id=102322#c20 - the system simply crashes too hard and too quickly to be able to do anything after amdgpu.ko crashes.

The output I get in gpu_result.txt stops at the "waiting for the crash" line. Only in about 1 out of 10 crashes does the syslog at least contain the error messages from the amdgpu crash; in the other 90% of cases the same crash occurs with no message being recorded at all.

If there is any method to let other processes survive for a while after amdgpu crashes, please let me know.

I did some additional experiments to understand what is so special about the "Othan" video that playing it causes amdgpu to crash relatively fast.
Since its only "odd" parameter is its "6 fps" frame rate, I tried replaying other videos, first at their normal rate (like 24 fps), which did not cause quick crashes, then at an artificially lowered rate - and indeed, that causes fast crashes regardless of which video I play.
The framerate that caused the "quickest" crashing seemed to be 3 fps, running
> mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm
was usually crashing amd-staging-drm-next within < 1 minute for me.
Just some random thought: Could the reason be some timed hysteresis in power management of the GPU?
Would there be some possibility to lock the GPU on a specific power level to then try if those crashes still occur?
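To see which power level the GPU is currently in, the active entry in pp_dpm_sclk can be picked out. This is only a rough sketch: it assumes the usual amdgpu sysfs format with one "index: clock" entry per line and a trailing asterisk on the active one. The function name and the sample clock values are hypothetical and not taken from the reporter's card.

```shell
#!/bin/bash
# Print the index of the active DPM level from pp_dpm_sclk-style input
# read on stdin. Assumes one "index: clock" entry per line, with a
# trailing "*" marking the active level.
# dpm_active_level is a hypothetical helper name.
dpm_active_level() {
    sed -n 's/^\([0-9][0-9]*\):.*\*[[:space:]]*$/\1/p'
}

# Example with made-up clock values:
printf '0: 214Mhz\n1: 481Mhz *\n2: 760Mhz\n' | dpm_active_level   # prints 1
```

On a real system the input would come from the device directory, for example `dpm_active_level < /sys/class/drm/card0/device/pp_dpm_sclk`. Polling that in a loop during video playback would show how often the level changes.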
Indeed, I found my theory confirmed by many experiments: If I use a script like
> #!/bin/bash
> cd /sys/class/drm/card0/device
> echo manual >power_dpm_force_performance_level
> # low
> echo 0 >pp_dpm_mclk
> echo 0 >pp_dpm_sclk
> # medium
> #echo 1 >pp_dpm_mclk
> #echo 1 >pp_dpm_sclk
> # high
> #echo 1 >pp_dpm_mclk
> #echo 6 >pp_dpm_sclk
to enforce just any performance level, then the crashes do not occur anymore - also with the "low frame rate video test".

So it seems that the transition from one "dpm" performance level to another, with a certain probability, causes these crashes. And the more often the transitions occur, the sooner one will experience them.

The dynamic power management issue can now be pursued with the original bug report https://bugs.freedesktop.org/show_bug.cgi?id=102322 for the vm_update_mode=0 case - there is probably not much sense in keeping this bug report open just because errors also occur with vm_update_mode=3, just less often.

(In reply to dwagner from comment #12)
> Indeed, I found my theory confirmed by many experiments: If I use a script
> like
> > #!/bin/bash
> > cd /sys/class/drm/card0/device
> > echo manual >power_dpm_force_performance_level
> > # low
> > echo 0 >pp_dpm_mclk
> > echo 0 >pp_dpm_sclk
> > # medium
> > #echo 1 >pp_dpm_mclk
> > #echo 1 >pp_dpm_sclk
> > # high
> > #echo 1 >pp_dpm_mclk
> > #echo 6 >pp_dpm_sclk
> to enforce just any performance level, then the crashes do not occur anymore
> - also with the "low frame rate video test".
>
> So it seems that the transition from one "dpm" performance level to another,
> with a certain probability, causes these crashes. And the more often the
> transitions occur, the sooner one will experience them.
>
> The dynamic power management issue can now be pursued with the original bug
> report https://bugs.freedesktop.org/show_bug.cgi?id=102322 for the
> vm_update_mode=0 case - there is probably not much sense in keeping this bug
> report open just because errors also occur with vm_update_mode=3, just less
> often.

Agreed.

*** This bug has been marked as a duplicate of bug 102322 ***
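The workaround script from comment #12 can be wrapped so that the lock is easy to undo. This is a minimal sketch under stated assumptions: the function names are hypothetical, the device directory is passed as a parameter so the functions can be tried against a scratch directory first, and writing "auto" to power_dpm_force_performance_level is assumed to hand level selection back to the driver.

```shell
#!/bin/bash
# Pin the GPU to fixed sclk/mclk DPM levels (same writes as the script
# in comment #12). dpm_lock is a hypothetical helper name.
dpm_lock() {
    local dev=$1 sclk=$2 mclk=$3
    echo manual > "$dev/power_dpm_force_performance_level" || return 1
    echo "$mclk" > "$dev/pp_dpm_mclk" || return 1
    echo "$sclk" > "$dev/pp_dpm_sclk"
}

# Return to dynamic power management (assumes "auto" is accepted here).
# dpm_unlock is a hypothetical helper name.
dpm_unlock() {
    local dev=$1
    echo auto > "$dev/power_dpm_force_performance_level"
}

# Dry run against a temporary directory instead of real sysfs
scratch=$(mktemp -d)
dpm_lock "$scratch" 6 1      # the "high" setting from comment #12
cat "$scratch/power_dpm_force_performance_level"   # prints manual
dpm_unlock "$scratch"
cat "$scratch/power_dpm_force_performance_level"   # prints auto
rm -r "$scratch"
```

Against real hardware the first argument would be /sys/class/drm/card0/device and the calls would need root, as in the original script.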