Bug 108493 - Unigine Heaven at 4K crashes amdgpu and causes a GPU hang
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-10-19 10:24 UTC by Timur Kristóf
Modified: 2019-11-19 08:59 UTC

Attachments
dmesg after the crash (95.66 KB, text/plain)
2018-10-22 19:26 UTC, Timur Kristóf
ddebug dumb from unigine heaven 0 (100.18 KB, text/plain)
2018-10-22 19:27 UTC, Timur Kristóf
ddebug dumb from unigine heaven 1 (79.16 KB, text/plain)
2018-10-22 19:27 UTC, Timur Kristóf
ddebug dumb from unigine heaven 2 (120.22 KB, text/plain)
2018-10-22 19:27 UTC, Timur Kristóf
Content of /sys/kernel/debug/dri/0/amdgpu_vbios (116.00 KB, application/octet-stream)
2018-10-27 05:38 UTC, Timur Kristóf
Content of /sys/class/drm/card0/device/pp_table (833 bytes, application/octet-stream)
2018-10-27 05:39 UTC, Timur Kristóf

Description Timur Kristóf 2018-10-19 10:24:55 UTC
I experience a consistent amdgpu crash when using my AMD GPU with a 4K screen.

Hardware:
* Sapphire Radeon RX 570 Pulse ITX 4GB
* Zotac AMP box mini external GPU enclosure
* Dell XPS 13 9370 laptop
* Dell U2718Q 4K display

Software:
First tried with Fedora 28; now using Fedora 29. Tried kernel versions 4.18.12, 4.18.13 and 4.19-rc7; the issue appears with all of them. Mesa version is 18.2.2, but the crash also occurs with 18.0 (on Fedora 28).

Steps to reproduce the crash:
1. Turn off the laptop
2. Attach the eGPU to the laptop
3. Attach a 4K screen to the HDMI output of the AMD GPU
4. Turn on the laptop
5. Add the following to the kernel command line: 'module_blacklist=i915 3' (blacklisting i915 ensures the Intel GPU is not used at all, and the trailing 3 boots to a text console so the graphical login won't interfere)
6. Launch the operating system
7. Log in from the console
8. Launch an X session with 'startx'
9. Start the Unigine Heaven benchmark in fullscreen 4K

Expected outcome:
Unigine Heaven should show up and run in a stable and performant manner.

Actual outcome:
Unigine Heaven shows up, runs for a couple of seconds and then the screen goes dark. I can still log into the machine with SSH, but cannot kill X or interact with the AMD GPU in any way. I can't even reboot the machine; the only thing that works is long-pressing the power key.

Relevant lines from dmesg log:
[  305.078426] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=147930, emitted seq=147933
[  305.078567] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3176, emitted seq=3178
[  305.078573] [drm] GPU recovery disabled.
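For anyone comparing logs from several machines, here is a minimal sketch that pulls the ring name and sequence numbers out of lines like the ones above. It assumes the log format is exactly as quoted here; the function name and the "pending" field are made up for illustration, and reading "emitted minus signaled" as the number of submitted but unfinished jobs is my interpretation, not something the driver documentation is quoted on here.

```python
import re

# Matches amdgpu ring-timeout lines in the form quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(dmesg_text):
    """Return one dict per ring-timeout line found in dmesg_text."""
    results = []
    for match in TIMEOUT_RE.finditer(dmesg_text):
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append({
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: jobs emitted to the ring but not yet signaled.
            "pending": emitted - signaled,
        })
    return results


# The two timeout lines from this report, copied verbatim.
SAMPLE = """\
[  305.078426] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=147930, emitted seq=147933
[  305.078567] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3176, emitted seq=3178
"""

if __name__ == "__main__":
    for entry in parse_ring_timeouts(SAMPLE):
        print(entry["ring"], entry["pending"])
```

With the sample above, both the gfx and the sdma0 ring show a small number of outstanding jobs at the moment of the timeout.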

Possible workaround:
* The crash does not happen when I disable power management with amdgpu.dpm=0; however, performance is then very poor.
* The crash also doesn't happen when I use 'echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level', with the same caveat about poor performance.
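The second workaround can be scripted. The following is only a sketch: it assumes the AMD GPU is card0 (as in the path above), that the script runs as root, and it restricts itself to the "low" and "high" values used in this report plus "auto", the driver's normal setting. The function name is hypothetical.

```python
from pathlib import Path

# sysfs file used in this report; card0 is an assumption about which
# DRM card is the AMD GPU on a given machine.
DEFAULT_PATH = Path(
    "/sys/class/drm/card0/device/power_dpm_force_performance_level"
)

# Only the values discussed in this report, plus "auto".
ALLOWED_LEVELS = ("auto", "low", "high")


def set_performance_level(level, path=DEFAULT_PATH):
    """Write a new performance level and return the previous one."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    target = Path(path)
    previous = target.read_text().strip()
    target.write_text(level + "\n")
    return previous


if __name__ == "__main__":
    # Requires root and an amdgpu device at DEFAULT_PATH.
    old = set_performance_level("low")
    print(f"performance level changed from {old} to low")
```

Returning the previous value makes it easy to restore the original setting after testing.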

Additional information:
* Note that running any other graphics-intensive application (e.g. your favourite game) will also result in the same crash, but Unigine Heaven is what I found to be the quickest way to reproduce it.
* Also note that the crash is not X-specific but again this is what I found to be the simplest way to reproduce it.
* The very same hardware works correctly on Windows without a crash. So this is probably not a hardware defect.
* The crash is almost immediate at 4K, but it also occurs at other resolutions; it just takes more time. At 1440p it takes a couple of minutes but still crashes. At 1080p I could run it for several minutes without a crash (did not test further than that).
* The problem seems to be similar to these: https://bugs.freedesktop.org/show_bug.cgi?id=105733 and https://bugs.freedesktop.org/show_bug.cgi?id=102322 - the difference is that the suggested workarounds don't help, just seem to postpone the crash by a very small margin. It still crashes in less than a minute though.
* Enabling GPU recovery does not actually manage to recover the GPU.

If you need any other kind of log or any more info, please let me know. Thank you in advance for looking into solving this problem.
Comment 1 keramidasceid 2018-10-22 19:20:13 UTC
I have the exact same problem.

The only differences are the following:
* I have the Asus Radeon RX 580 ROG Strix TOP OC 8GB GPU
* I use the Unigine Superposition to reproduce the problem quickly
* I have kernel 4.18.14 on Fedora 28
* I have mesa 18.0.5
Comment 2 Timur Kristóf 2018-10-22 19:26:52 UTC
Created attachment 142139
dmesg after the crash
Comment 3 Timur Kristóf 2018-10-22 19:27:21 UTC
Created attachment 142140
ddebug dumb from unigine heaven 0
Comment 4 Timur Kristóf 2018-10-22 19:27:33 UTC
Created attachment 142141
ddebug dumb from unigine heaven 1
Comment 5 Timur Kristóf 2018-10-22 19:27:46 UTC
Created attachment 142142
ddebug dumb from unigine heaven 2
Comment 6 Timur Kristóf 2018-10-22 19:30:16 UTC
On freenode in #dri-devel I got the suggestion to run unigine heaven with GALLIUM_DDEBUG="1000". So I just did that. It created 3 files, which I attached to this bug report along with the dmesg log that I took after the crash.

Some people suggested that this may in fact be an issue with radeonsi (and not amdgpu). If that is the case, please reassign this bug appropriately.
Comment 7 Timur Kristóf 2018-10-23 16:01:39 UTC
1. It was suggested that this is a thermal issue. So I monitored the GPU temperature with GALLIUM_HUD and it was about 40 °C when the crash happened.

2. It was also suggested that this is a VRAM memory leak. Again with GALLIUM_HUD I could see that about 1 GB of VRAM gets used (out of the 4 GB), when the crash happens.

3. Also, just to see if this is a power consumption issue (i.e. the GPU drawing more power than can be supplied), I tried to lower the value in /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap to 80 W. It did not stop the crash from happening.
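One detail worth spelling out for the power cap test: the hwmon power1_cap file takes its value in microwatts, so 80 W is written as 80000000. A minimal sketch of the conversion and the write, assuming the same hwmon0 path as above (the hwmon index can differ between machines) and root privileges; both function names are made up for illustration.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000

# Path from this report; the hwmon index is an assumption and may differ.
DEFAULT_CAP_PATH = "/sys/class/drm/card0/device/hwmon/hwmon0/power1_cap"


def watts_to_microwatts(watts):
    """Convert a power value in watts to the integer microwatts hwmon expects."""
    return int(round(watts * MICROWATTS_PER_WATT))


def set_power_cap(watts, path=DEFAULT_CAP_PATH):
    """Write the power cap for the given wattage and return the value written."""
    value = watts_to_microwatts(watts)
    Path(path).write_text(f"{value}\n")
    return value


if __name__ == "__main__":
    # Requires root and an amdgpu device exposing power1_cap.
    print(set_power_cap(80))
```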
Comment 8 Timur Kristóf 2018-10-23 16:12:40 UTC
Upgraded to kernel 4.18.16 and mesa 18.2.3 which is supposed to fix a GPU hang. Did not help, the problem is still there.
Comment 9 Timur Kristóf 2018-10-27 05:32:20 UTC
I think I discovered a possible reason for this issue. If you look at the DDEBUG dumps, it says in several places: "This slot was corrupted in GPU memory". So I began to suspect something was wrong with the VRAM.
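To check other dumps for the same message, something like the sketch below could be used. It relies only on the marker string quoted above; the rest of the dump format is not assumed, the function names are hypothetical, and the paths to the dump files have to be supplied by the caller.

```python
from pathlib import Path

# Marker string quoted from the DDEBUG dumps attached to this report.
CORRUPTION_MARKER = "This slot was corrupted in GPU memory"


def count_corrupted_slots(dump_text):
    """Count how many times the corruption marker appears in one dump."""
    return dump_text.count(CORRUPTION_MARKER)


def scan_dumps(paths):
    """Map each dump file path to its number of corruption markers."""
    return {
        str(path): count_corrupted_slots(
            Path(path).read_text(errors="replace")
        )
        for path in paths
    }
```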

After looking around a bit, I found that the amdgpu driver does not honor the voltage settings from the VBIOS, and sets the memory to use lower voltages instead. So basically the driver undervolts the VRAM without me asking to do so. I guess this might be considered a feature for some people.

However, when I manually edit pp_od_clk_voltage to increase the OD_MCLK voltages, then the card begins to work in a stable manner and the GPU hang is gone. (Or at the very least I haven't seen a hang yet, whereas previously it used to hang in less than a minute.)

In my case, the VBIOS wants to set the MCLK voltages to 1000 mV at all frequencies, while amdgpu sets them to 750 mV, 800 mV and 900 mV. And it turns out that 900 mV is just too low for my card at 1750 MHz.

[root@timur-xps ~]# cat /sys/class/drm/card0/device/pp_od_clk_voltage 
OD_SCLK:
0:        300MHz        750mV
1:        588MHz        765mV
2:        952MHz        900mV
3:       1041MHz        975mV
4:       1106MHz       1031mV
5:       1168MHz       1093mV
6:       1209MHz       1143mV
7:       1244MHz       1150mV
OD_MCLK:
0:        300MHz        750mV
1:       1000MHz        800mV
2:       1750MHz        900mV
OD_RANGE:
SCLK:     300MHz       2000MHz
MCLK:     300MHz       2250MHz
VDDC:     750mV        1150mV
[root@timur-xps ~]# cat /sys/kernel/debug/dri/0/amdgpu_vbios > mybios.rom
[root@timur-xps ~]# pbec -i mybios.rom -s -r MEMORY_CLOCK

----
[DEFAULT] ATOM_MCLK_ENTRY Array
----

Entry: 0
	Frequency: 300 MHz.
	Voltage:. 1000 MV
Entry: 1
	Frequency: 1000 MHz.
	Voltage:. 1000 MV
Entry: 2
	Frequency: 1750 MHz.
	Voltage:. 1000 MV
----


Here is some info about the VBIOS:

[root@timur-xps ~]# cat /sys/class/drm/card0/device/subsystem_device
0xe343
[root@timur-xps ~]# cat /sys/class/drm/card0/device/subsystem_vendor
0x1da2
[root@timur-xps ~]# cat /sys/class/drm/card0/device/vbios_version
113-D00034-S07
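For comparing tables between cards, here is a rough sketch that parses the pp_od_clk_voltage output shown above into Python data. It assumes the layout is exactly as printed in this comment; the OD_RANGE lines have a different shape and are deliberately skipped, so that section comes back empty. The function name is made up. Note that, per comment 13, the voltage column in this file is vddc.

```python
import re

# One state line, e.g. "2:       1750MHz        900mV".
STATE_RE = re.compile(r"^(\d+):\s+(\d+)MHz\s+(\d+)mV$")


def parse_od_table(text):
    """Parse sections into {section: [(index, mhz, mv), ...]}.

    Lines that are not state lines (such as the OD_RANGE entries)
    are ignored.
    """
    tables = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("OD_") and line.endswith(":"):
            current = line[:-1]
            tables[current] = []
            continue
        match = STATE_RE.match(line)
        if match and current is not None:
            tables[current].append(tuple(int(g) for g in match.groups()))
    return tables


# Output copied from this comment.
SAMPLE = """\
OD_SCLK:
0:        300MHz        750mV
1:        588MHz        765mV
2:        952MHz        900mV
3:       1041MHz        975mV
4:       1106MHz       1031mV
5:       1168MHz       1093mV
6:       1209MHz       1143mV
7:       1244MHz       1150mV
OD_MCLK:
0:        300MHz        750mV
1:       1000MHz        800mV
2:       1750MHz        900mV
OD_RANGE:
SCLK:     300MHz       2000MHz
MCLK:     300MHz       2250MHz
VDDC:     750mV        1150mV
"""

if __name__ == "__main__":
    for section, states in parse_od_table(SAMPLE).items():
        print(section, states)
```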
Comment 10 Timur Kristóf 2018-10-27 05:38:04 UTC
Created attachment 142227
Content of /sys/kernel/debug/dri/0/amdgpu_vbios
Comment 11 Timur Kristóf 2018-10-27 05:39:53 UTC
Created attachment 142228
Content of /sys/class/drm/card0/device/pp_table
Comment 12 Timur Kristóf 2018-10-27 05:50:30 UTC
By the way the voltage issue has already been reported against ROCm and is supposed to be already fixed. The details are here: https://github.com/RadeonOpenCompute/ROCm/issues/348
Comment 13 Alex Deucher 2018-10-29 20:41:50 UTC
(In reply to Timur Kristóf from comment #9)

> OD_MCLK:
> 0:        300MHz        750mV
> 1:       1000MHz        800mV
> 2:       1750MHz        900mV

This is vddc.

> [DEFAULT] ATOM_MCLK_ENTRY Array
> ----
> 
> Entry: 0
> 	Frequency: 300 MHz.
> 	Voltage:. 1000 MV
> Entry: 1
> 	Frequency: 1000 MHz.
> 	Voltage:. 1000 MV
> Entry: 2
> 	Frequency: 1750 MHz.
> 	Voltage:. 1000 MV
> ----

This is mvdd. These are not the same voltages. The pp_od_clk_voltage interface only allows you to adjust vddc. The vddc values match what is in the vbios for your card.
Comment 14 Alex Deucher 2018-10-29 21:01:03 UTC
I suspect the display may require additional voltage in your case, which is why you see the issue at 4K. The display requirements are not handled as finely on Linux as they are on Windows.
Comment 15 Timur Kristóf 2018-10-31 08:34:42 UTC
(In reply to Alex Deucher from comment #14)
> I suspect the display may require additional voltage in your case which is
> why you see the issue at 4k.  The display requirements are not handled as
> finely on Linux as they are in windows.

Thanks Alex for explaining the difference between vddc and mvdd.

After using the GPU in this way for a couple of days I can tell you that increasing the voltage definitely improves the stability of the system, but ultimately it doesn't fix the problem. The GPU can still hang with the same "ring gfx timeout"; it just takes more time before the problem occurs.

Some additional comments:

* I'd like to emphasize that the problem is not specific to 4K and will happen at 1080p, just later, i.e. the GPU hangs in a couple of minutes instead of immediately.
* echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level does not help at all.
* I also tried amdgpu.vm_update_mode=3 (found as a suggestion from another similar bug report) but it doesn't help at all.
* I also tried manually setting the sclk to a fixed, lower level (another suggestion from another bug report), which seems to improve the stability by a small margin, but it also doesn't prevent the GPU from hanging.
Comment 16 Timur Kristóf 2018-11-18 21:52:59 UTC
Hi,

After some more experimentation, it seems that increasing the highest mclk voltage above 900 mV, while setting all other voltages in pp_od_clk_voltage so that they remain below 1000 mV, is a viable workaround that makes the GPU stable.

Here is what I do to achieve this:

echo "2" > /sys/class/drm/card0/device/pp_sclk_od
echo "2" > /sys/class/drm/card0/device/pp_mclk_od
echo "s 0 300 750" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 1 588 765" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 2 952 900" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 3 1041 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 4 1106 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 5 1168 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 6 1209 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 7 1244 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "m 0 300 750" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "m 1 1000 850" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "m 2 1750 970" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

After running this script, I can play on the GPU for several hours and I don't see the hang happening anymore.
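The same sequence can be generated from a table instead of being typed out by hand. This is only a sketch under assumptions: the "s", "m" and "c" command strings are the ones used in the script above, the 1000 mV limit is the one described in this comment, and the function names are hypothetical. The pp_sclk_od and pp_mclk_od writes from the script are left out.

```python
DEFAULT_OD_PATH = "/sys/class/drm/card0/device/pp_od_clk_voltage"

# (index, MHz, mV) values taken from the script in this comment.
SCLK_STATES = [
    (0, 300, 750), (1, 588, 765), (2, 952, 900), (3, 1041, 970),
    (4, 1106, 970), (5, 1168, 970), (6, 1209, 970), (7, 1244, 970),
]
MCLK_STATES = [(0, 300, 750), (1, 1000, 850), (2, 1750, 970)]


def build_od_commands(sclk_states, mclk_states, limit_mv=1000):
    """Return the strings to write, in order, to pp_od_clk_voltage.

    Raises ValueError if any voltage is not below limit_mv.
    """
    for _, _, mv in list(sclk_states) + list(mclk_states):
        if mv >= limit_mv:
            raise ValueError(f"{mv} mV is not below the {limit_mv} mV limit")
    commands = [f"s {i} {mhz} {mv}" for i, mhz, mv in sclk_states]
    commands.append("c")  # commit the sclk changes
    commands += [f"m {i} {mhz} {mv}" for i, mhz, mv in mclk_states]
    commands.append("c")  # commit the mclk changes
    return commands


def apply_od_commands(commands, path=DEFAULT_OD_PATH):
    """Write each command separately, like one echo per line."""
    for command in commands:
        with open(path, "w") as handle:
            handle.write(command + "\n")


if __name__ == "__main__":
    # Requires root and overdrive support on the card.
    apply_od_commands(build_od_commands(SCLK_STATES, MCLK_STATES))
```

Each command is written with its own open and close so that the driver sees them one at a time, matching the separate echo calls in the script.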
Comment 17 Timur Kristóf 2019-01-14 16:17:41 UTC
Hi Everyone,

I've just tested Linux 5.0-rc1 and have not encountered the problem so far. Looking into it more, I think the same patch set that fixed the Sapphire RX 590 for Michael @ Phoronix also fixed my Sapphire RX 570.

I assume this is the main patch that fixed the issue: https://github.com/torvalds/linux/commit/816b6931315b641c5864cf33a9363cb89da05d0b (specifically the line that sets ucEnableApplyAVFS_CKS_OFF_Voltage). Looking at the code, it seems a bunch of other GPUs are affected (besides the RX 590 and RX 570).

Could you guys please send this patch series for inclusion into the stable kernel? Since it fixes a huge stability issue I think it is a reasonable request.
Comment 18 fin4478 2019-03-14 01:47:49 UTC
Add amdgpu.ppfeaturemask=0xfffd7fff to the kernel command line to make the powerplay work with RX 570 at 4K60Hz.
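To see what this mask actually changes, it helps to list which bits are zero in it. The sketch below assumes a 32-bit mask and compares it against all bits set; the driver's real default may differ from that, and which powerplay feature each bit position maps to depends on the kernel version, so no feature names are given here. The function name is made up.

```python
def cleared_bits(mask, width=32):
    """Return the bit positions that are 0 in mask, lowest first."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# Value from the kernel command line option in this comment.
MASK = 0xfffd7fff

if __name__ == "__main__":
    bits = cleared_bits(MASK)
    print("cleared bit positions:", bits)
    print("cleared bits as hex:", hex(sum(1 << b for b in bits)))
```

For 0xfffd7fff this gives bit positions 15 and 17, i.e. the bits 0x8000 and 0x20000.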
Comment 19 Timur Kristóf 2019-03-15 13:04:06 UTC
Since this is fixed by kernel 5.0, I'm marking it as resolved fixed.
Comment 20 Łukasz Posadowski 2019-08-25 16:26:37 UTC
Since this is my first comment (technically second, but my GPU crashed before I was able to finish it the first time), hi everyone.

I am experiencing a very similar problem with kernel 5.2.8 on Fedora 30 and an AMD RX 570 PCI Express card. I described it here: https://bugzilla.redhat.com/show_bug.cgi?id=1739766

I can run X with 'low' in /sys/class/drm/card0/device/power_dpm_force_performance_level. Basically, every time the GPU has ~100% load, the card resets itself and never comes back. I can ssh into the system and switch to the text console; it is just the GPU with the X server that doesn't work.

The card also sometimes crashes with the 'low' setting, usually after 30 minutes or an hour of gaming, but then it's a hard crash and I can't even switch to the text console.

Thanks for any help.
Łukasz
Comment 21 Martin Peres 2019-11-19 08:59:29 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/564.

