Summary: | Unigine Heaven at 4K crashes amdgpu and causes a GPU hang | ||
---|---|---|---|
Product: | DRI | Reporter: | Timur Kristóf <venemo> |
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | FD, keramidasceid |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
Attachments: |
Description
Timur Kristóf
2018-10-19 10:24:55 UTC
I have the exact same problem. The only differences are the following:
* I have the Asus Radeon RX 580 ROG Strix TOP OC 8GB GPU
* I use the Unigine Superposition to reproduce the problem quickly
* I have kernel 4.18.14 on Fedora 28
* I have mesa 18.0.5

Created attachment 142139 [details]
dmesg after the crash
Created attachment 142140 [details]
ddebug dump from unigine heaven 0
Created attachment 142141 [details]
ddebug dump from unigine heaven 1
Created attachment 142142 [details]
ddebug dump from unigine heaven 2
On freenode in #dri-devel I got the suggestion to run Unigine Heaven with GALLIUM_DDEBUG="1000", so I did that. It created 3 files, which I attached to this bug report along with the dmesg log that I took after the crash.

Some people suggested that this may in fact be an issue with radeonsi (and not amdgpu). If this is the case, please reassign this bug appropriately.

1. It was suggested that this is a thermal issue. So I monitored the GPU temperatures with GALLIUM_HUD and it was about 40 Celsius when the crash happened.
2. It was also suggested that this is a VRAM memory leak. Again with GALLIUM_HUD I could see that about 1 GB of VRAM (out of the 4 GB) is in use when the crash happens.
3. Also, just to see if this is a power consumption issue (i.e. the GPU drawing more power than can be supplied), I tried to lower the value in /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap to 80 Watts. It did not stop the crash from happening.

Upgraded to kernel 4.18.16 and mesa 18.2.3, which is supposed to fix a GPU hang. It did not help; the problem is still there.

I think I discovered a possible reason for this issue. If you look at the DDEBUG dumps, it says in several places: "This slot was corrupted in GPU memory". So I began to suspect something was wrong with the VRAM. After looking around a bit, I found that the amdgpu driver does not honor the voltage settings from the VBIOS, and sets the memory to use lower voltages instead. So basically the driver undervolts the VRAM without me asking it to do so. I guess this might be considered a feature by some people. However, when I manually edit pp_od_clk_voltage to increase the OD_MCLK voltages, the card begins to work in a stable manner and the GPU hang is gone. (Or at the very least I haven't seen a hang yet, whereas previously it used to hang in less than a minute.)

In my case, the VBIOS wants to set the MCLK voltages to 1000 mV at all frequencies, while amdgpu sets them to 750 mV, 800 mV, and 900 mV. And it turns out that 900 mV is just too low for my card at 1750 MHz.

[root@timur-xps ~]# cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 300MHz 750mV
1: 588MHz 765mV
2: 952MHz 900mV
3: 1041MHz 975mV
4: 1106MHz 1031mV
5: 1168MHz 1093mV
6: 1209MHz 1143mV
7: 1244MHz 1150mV
OD_MCLK:
0: 300MHz 750mV
1: 1000MHz 800mV
2: 1750MHz 900mV
OD_RANGE:
SCLK: 300MHz 2000MHz
MCLK: 300MHz 2250MHz
VDDC: 750mV 1150mV
[root@timur-xps ~]# cat /sys/kernel/debug/dri/0/amdgpu_vbios > mybios.rom
[root@timur-xps ~]# pbec -i mybios.rom -s -r MEMORY_CLOCK
----
[DEFAULT] ATOM_MCLK_ENTRY Array
----

Entry: 0
Frequency: 300 MHz.
Voltage:. 1000 MV
Entry: 1
Frequency: 1000 MHz.
Voltage:. 1000 MV
Entry: 2
Frequency: 1750 MHz.
Voltage:. 1000 MV
----

Here is some info about the VBIOS:
[root@timur-xps ~]# cat /sys/class/drm/card0/device/subsystem_device
0xe343
[root@timur-xps ~]# cat /sys/class/drm/card0/device/subsystem_vendor
0x1da2
[root@timur-xps ~]# cat /sys/class/drm/card0/device/vbios_version
113-D00034-S07

Created attachment 142227 [details]
Content of /sys/kernel/debug/dri/0/amdgpu_vbios
Created attachment 142228 [details]
Content of /sys/class/drm/card0/device/pp_table
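For anyone who wants to check their own card against the pp_od_clk_voltage table quoted above, the highest memory clock state can be pulled out with a small shell function. This is only a sketch: `od_mclk_top_state` is a hypothetical helper name, and it assumes the file has the same layout as the dump in this report (an `OD_MCLK:` header followed by `index: <freq>MHz <voltage>mV` rows, lowest state first).

```shell
#!/bin/sh
# Print the highest OD_MCLK state of a pp_od_clk_voltage dump as
# "<freq_mhz> <voltage_mv>". The dump is read from the file given as $1,
# for example /sys/class/drm/card0/device/pp_od_clk_voltage.
# od_mclk_top_state is a hypothetical name, not part of any existing tool.
od_mclk_top_state() {
    awk '
        /^OD_MCLK:/    { in_mclk = 1; next }   # memory clock table starts here
        /^OD_[A-Z]+:/  { in_mclk = 0 }         # any other section header ends it
        in_mclk && /^[0-9]+:/ {
            freq = $2; volt = $3
            sub(/MHz$/, "", freq)              # strip the unit suffixes
            sub(/mV$/, "", volt)
            last = freq " " volt               # assumes rows are listed lowest first
        }
        END { if (last != "") print last }
    ' "$1"
}
```

Called as `od_mclk_top_state /sys/class/drm/card0/device/pp_od_clk_voltage`, it would print `1750 900` for the table shown in this report. It prints nothing if the file has no OD_MCLK section.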
By the way, the voltage issue has already been reported against ROCm and is supposed to be already fixed. The details are here: https://github.com/RadeonOpenCompute/ROCm/issues/348

(In reply to Timur Kristóf from comment #9)
> OD_MCLK:
> 0: 300MHz 750mV
> 1: 1000MHz 800mV
> 2: 1750MHz 900mV

This is vddc.

> [DEFAULT] ATOM_MCLK_ENTRY Array
> ----
>
> Entry: 0
> Frequency: 300 MHz.
> Voltage:. 1000 MV
> Entry: 1
> Frequency: 1000 MHz.
> Voltage:. 1000 MV
> Entry: 2
> Frequency: 1750 MHz.
> Voltage:. 1000 MV
> ----

This is mvdd. These are not the same voltages. The pp_od_clk_voltage interface only allows you to adjust vddc. The vddc values match what is in the vbios for your card. I suspect the display may require additional voltage in your case, which is why you see the issue at 4k. The display requirements are not handled as finely on Linux as they are in Windows.

(In reply to Alex Deucher from comment #14)
> I suspect the display may require additional voltage in your case which is
> why you see the issue at 4k. The display requirements are not handled as
> finely on Linux as they are in windows.

Thanks Alex for explaining the difference between vddc and mvdd. After using the GPU in this way for a couple of days I can tell you that increasing the voltage definitely improves the stability of the system, but ultimately it doesn't fix the problem. The GPU can still hang with the same "ring gfx timeout"; it just takes more time before the problem occurs.

Some additional comments:
* I'd like to emphasize that the problem is not specific to 4K and will happen on 1080p, just later, i.e. the GPU hangs in a couple of minutes instead of immediately.
* echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level does not help at all.
* I also tried amdgpu.vm_update_mode=3 (found as a suggestion from another similar bug report) but it doesn't help at all.
* I also tried manually setting the sclk to a fixed, lower level (another suggestion from another bug report), which seems to improve the stability by a small margin, but it also doesn't prevent the GPU from hanging.

Hi,
After some more experimentation it seems that increasing the highest mclk voltage above 900 mV and setting all other voltages in pp_od_clk_voltage in such a way that they remain below 1000 mV is a viable workaround that makes the GPU stable. Here is what I do to achieve this:

echo "2" > /sys/class/drm/card0/device/pp_sclk_od
echo "2" > /sys/class/drm/card0/device/pp_mclk_od
echo "s 0 300 750" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 1 588 765" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 2 952 900" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 3 1041 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 4 1106 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 5 1168 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 6 1209 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 7 1244 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "m 0 300 750" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "m 1 1000 850" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "m 2 1750 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

After running this script, I can play on the GPU for several hours and I don't see the hang happening anymore.

Hi Everyone,
I've just tested Linux 5.0-rc1 and have not encountered the problem so far. Looking into it more, I think the same patch set that fixed the Sapphire RX 590 for Michael @ Phoronix also fixed my Sapphire RX 570. Assuming this is the main patch that fixed the issue: https://github.com/torvalds/linux/commit/816b6931315b641c5864cf33a9363cb89da05d0b (specifically the line that sets ucEnableApplyAVFS_CKS_OFF_Voltage).
Looking at the code, it seems a bunch of other GPUs are affected (besides the RX 590 and RX 570). Could you guys please send this patch series for inclusion into the stable kernel? Since it fixes a huge stability issue, I think it is a reasonable request.

Add amdgpu.ppfeaturemask=0xfffd7fff to the kernel command line to make powerplay work with the RX 570 at 4K60Hz.

Since this is fixed by kernel 5.0, I'm marking it as resolved fixed.

Since this is my first comment here (technically my second, but my GPU crashed before I was able to finish the first one), hi everyone. I am experiencing a very similar problem on kernel 5.2.8 on Fedora 30 with an AMD RX570 PCI Express card. I described it here: https://bugzilla.redhat.com/show_bug.cgi?id=1739766. I can run X with 'low' in /sys/class/drm/card0/device/power_dpm_force_performance_level. Basically every time the GPU has ~100% load, the card resets itself and never comes back. I can ssh into the system and switch to the text console; it is just the GPU with the X server that doesn't work. The card also crashes sometimes with the 'low' setting, usually after 30 minutes or an hour of gaming, but then it's a hard crash and I can't even switch to the text console. Thanks for any help.
Łukasz

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/564. |
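Regarding the amdgpu.ppfeaturemask=0xfffd7fff suggestion in the comments above: the value is a bit mask, so it can help to see which bits it actually turns off compared to an all-ones mask. The sketch below only reports bit positions. `cleared_mask_bits` is a hypothetical helper name, and what each bit means is defined in the kernel's amdgpu headers, not in this thread, so check the headers of the kernel version in use before relying on any interpretation.

```shell
#!/bin/sh
# List the bit positions (0-31) that are cleared in a 32-bit mask.
# cleared_mask_bits is a hypothetical name used only for illustration.
cleared_mask_bits() {
    mask=$(( $1 ))          # accepts decimal or 0x-prefixed hex input
    bit=0
    out=""
    while [ "$bit" -lt 32 ]; do
        # Test one bit at a time; collect the positions that are 0.
        if [ $(( (mask >> bit) & 1 )) -eq 0 ]; then
            out="${out:+$out }$bit"
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$out"
}
```

`cleared_mask_bits 0xfffd7fff` prints `15 17`, meaning the suggested mask disables exactly two powerplay feature bits and leaves the rest enabled.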