Bug 110674 - Crashes / Resets From AMDGPU / Radeon VII
Summary: Crashes / Resets From AMDGPU / Radeon VII
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
Reported: 2019-05-14 05:55 UTC by Chris Hodapp
Modified: 2019-11-26 23:13 UTC (History)
CC List: 10 users



Attachments
Kernel Log (123.29 KB, text/x-log) - 2019-05-14 05:55 UTC, Chris Hodapp
dmesg.log (123.29 KB, text/plain) - 2019-05-14 05:55 UTC, Chris Hodapp
dmesg.color.log (153.32 KB, text/plain) - 2019-05-14 05:56 UTC, Chris Hodapp
display-manager.service.log (3.55 MB, text/plain) - 2019-05-14 05:56 UTC, Chris Hodapp
display-manager.service.lastboot.log (76.93 KB, text/plain) - 2019-05-14 09:34 UTC, Chris Hodapp
dmesg.nomwait.log (114.28 KB, text/plain) - 2019-05-15 02:16 UTC, Chris Hodapp
display-manager.service.youtube.log (79.50 KB, text/plain) - 2019-05-15 03:09 UTC, Chris Hodapp
dmesg.youtube.log (152.96 KB, text/plain) - 2019-05-15 03:09 UTC, Chris Hodapp
display-manager.service.vaporwave.log (71.30 KB, text/plain) - 2019-05-15 03:10 UTC, Chris Hodapp
dmesg.vaporwave.log (152.35 KB, text/plain) - 2019-05-15 03:10 UTC, Chris Hodapp
5.1.3 crash after resume (38.41 KB, text/x-log) - 2019-05-19 22:05 UTC, Tom B
dmesg.log vega20 crash after idle (86.33 KB, text/plain) - 2019-06-04 04:21 UTC, sehellion
5.1.9 dmesg (8.89 KB, text/plain) - 2019-06-15 16:59 UTC, Tom B
5.1.9 full dmesg (103.23 KB, text/plain) - 2019-06-17 10:18 UTC, Tom B
5.2-rc2 full dmesg with amdgpu.ppfeaturemask=0xfffdbfff (79.30 KB, text/x-log) - 2019-06-22 04:20 UTC, sehellion
5.2.7 full dmesg (148.27 KB, text/plain) - 2019-08-10 13:15 UTC, Tom B
diff of vega20_hwmgr.c from 5.0.13 to 5.2.7 (37.58 KB, patch) - 2019-08-11 18:45 UTC, Tom B
5.2.7 dmesg with hard_min_level logged (7.49 KB, text/plain) - 2019-08-12 14:34 UTC, Tom B
logging anywhere the number of screens is set (9.32 KB, text/plain) - 2019-08-13 15:20 UTC, Tom B
a list of commits 5.0.13 - 5.1.0 (15.96 KB, text/plain) - 2019-08-14 15:44 UTC, Tom B
dmesg with amdgpu.dpm=2 (4.29 KB, text/plain) - 2019-08-16 22:14 UTC, Tom B
dmesgAMD2Monitors (356.14 KB, text/plain) - 2019-08-25 20:46 UTC, ReddestDream
AMDInteliGPUBoot (311.83 KB, text/plain) - 2019-08-25 20:47 UTC, ReddestDream
DebugAMD2Monitors (104.08 KB, text/plain) - 2019-08-26 03:20 UTC, ReddestDream
DebugAMDiGPU (99.60 KB, text/plain) - 2019-08-26 03:21 UTC, ReddestDream
Dmesg 5.3-rc7 w/ Two monitors (93.93 KB, text/plain) - 2019-09-03 16:46 UTC, ReddestDream
linux-mainline5.3 dmesg without patches (143.71 KB, text/plain) - 2019-09-21 15:38 UTC, Anthony Rabbito
dsmeg log with Alex's patches (138.63 KB, text/plain) - 2019-09-21 15:57 UTC, Anthony Rabbito
5.3.1 with Alex's patches and dual monitors (71.04 KB, text/x-log) - 2019-09-22 21:38 UTC, sehellion
5.3.1 with Alex's patches and dual monitors, crash (79.75 KB, text/plain) - 2019-09-23 04:11 UTC, sehellion
5.3.1 plus Alex's patches, kde wayland crash, then kde xorg crash (128.22 KB, text/x-log) - 2019-09-29 19:25 UTC, linedot
5.3.1 patched, wayland crash (108.55 KB, text/x-log) - 2019-09-29 19:28 UTC, linedot
5.3.1 patched, xorg crash (115.04 KB, text/x-log) - 2019-09-29 19:30 UTC, linedot
5.4.0-rc1 hangup (104.59 KB, text/x-log) - 2019-10-03 06:54 UTC, linedot
Freeze/Black screen/Crash on 5.3.6 (105.23 KB, text/x-log) - 2019-10-14 09:15 UTC, linedot
5.3.7: Fence fallback timer expired on ring <x> (108.06 KB, text/x-log) - 2019-10-21 08:11 UTC, linedot
5.4.0-arch1-1 GPU initialization fails (105.08 KB, text/plain) - 2019-11-26 12:03 UTC, linedot

Description Chris Hodapp 2019-05-14 05:55:02 UTC
Created attachment 144254 [details]
Kernel Log

I'm getting frequent crashes and resets. They seem to occur most often right after boot, right after login, and right after wake from standby.

See the attachments for more (recommend `less --raw` if working with the color dmesg).
Comment 1 Chris Hodapp 2019-05-14 05:55:46 UTC
Created attachment 144255 [details]
dmesg.log
Comment 2 Chris Hodapp 2019-05-14 05:56:04 UTC
Created attachment 144256 [details]
dmesg.color.log
Comment 3 Chris Hodapp 2019-05-14 05:56:25 UTC
Created attachment 144257 [details]
display-manager.service.log
Comment 4 Chris Hodapp 2019-05-14 09:34:08 UTC
Created attachment 144261 [details]
display-manager.service.lastboot.log

Added a copy of the display-manager.service log, filtered down to just the content since the last boot.
Comment 5 Alex Deucher 2019-05-14 15:32:50 UTC
Does appending idle=nomwait on the kernel command line in grub help?
Comment 6 Chris Hodapp 2019-05-15 02:15:45 UTC
I use systemd-boot but I doubt that matters very much here.

I tried adding idle=nomwait to the kernel command line but it seemed not to affect the problem (I actually had a crash the very first time I tried adding it). I'll attach a dmesg log just in case you want to double-check.
Comment 7 Chris Hodapp 2019-05-15 02:16:25 UTC
Created attachment 144272 [details]
dmesg.nomwait.log
Comment 8 Chris Hodapp 2019-05-15 03:05:50 UTC
I've actually found another crash which triggers pretty promptly whenever I play (presumably-accelerated) YouTube videos. I'll attach dmesg and display-manager.service logs for that crash here but I'm happy to file them as a separate bug upon request.

I'm also going to go ahead and upload *another* set of logs, from an event that happens from time to time where I get vaporwave-looking colored blocks mixed with garbled past images (presumably hanging around in freed memory). I didn't include this before because I thought the dmesg output was the same as for the original, less visually striking lock-ups that I described up front; on reflection, I'm not sure that assumption was wrong, but I'm not sure it was right either. Anyway, like I said, I'm going to post these logs too and, once again, I'm happy to move them to a separate bug upon request.


I'll also describe my technique for capturing these logs, in case it matters:
When these crashes happen, my first response is to try to switch to a different (text-mode) virtual console to capture the logs. When that works, I save off the logs and then reboot. However, sometimes the crash is so bad that I'm not able to switch to a text-mode virtual console, in which case I have to hard-cut the power and capture the logs retroactively with `journalctl` and a negative boot-number argument. I was able to capture the original log by switching virtual consoles, but both of these new ones were captured after a reboot with `journalctl`.
Comment 9 Chris Hodapp 2019-05-15 03:09:31 UTC
Created attachment 144273 [details]
display-manager.service.youtube.log
Comment 10 Chris Hodapp 2019-05-15 03:09:47 UTC
Created attachment 144274 [details]
dmesg.youtube.log
Comment 11 Chris Hodapp 2019-05-15 03:10:14 UTC
Created attachment 144275 [details]
display-manager.service.vaporwave.log
Comment 12 Chris Hodapp 2019-05-15 03:10:30 UTC
Created attachment 144276 [details]
dmesg.vaporwave.log
Comment 13 Chris Hodapp 2019-05-19 09:36:12 UTC
So! It turns out that things are stable with 5.0.X kernels (despite there still being some amdgpu errors in the kernel log). It's slow going because the search space is so big but I'm trying to figure out where in the commit history things actually broke.
Comment 14 Hameer Abbasi 2019-05-19 09:39:51 UTC
I have additional information to report: 5.1.3 fixes this somewhat, but not completely. For example, login is mostly fine, but restarting from the login screen causes crashes.

I also agree that things were fine on 5.0.x
Comment 15 Tom B 2019-05-19 14:27:30 UTC
I'd been running 5.0 since release without issue, but I upgraded this morning and got crashes as described here within a few seconds of boot.

5.1.3 also fixed it for me, however I am still seeing powerplay errors in dmesg:

[    6.198409] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    6.198411] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[    7.396661] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    7.396662] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
[    8.587385] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    8.587386] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[    9.779135] amdgpu: [powerplay] Failed to send message 0x26, response 0x0
[    9.779136] amdgpu: [powerplay] Failed to set soft min gfxclk !
[    9.779136] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!


The GPU seems to be boosting as expected so I don't think there is any major issue.
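For anyone comparing notes, these powerplay failures are easy to pull out of a saved kernel log with grep. A minimal sketch follows; the sample file path and its contents are illustrative stand-ins, not from my system (in practice, feed it saved `dmesg` or `journalctl -k -b -1` output):

```shell
# Count amdgpu powerplay failures in a captured kernel log.
# Sample log inlined for illustration only.
cat > /tmp/dmesg.sample <<'EOF'
[    6.198409] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    6.198411] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[    7.000000] usb 1-2: new high-speed USB device number 3
EOF
grep -c 'amdgpu: \[powerplay\]' /tmp/dmesg.sample   # prints 2 for this sample
```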
Comment 16 Chris Hodapp 2019-05-19 17:52:27 UTC
Hrm, 5.1.3 does not truly fix things for me. Would you folks mind rebooting a few times and then maybe playing a couple Youtube videos and reporting back?
Comment 17 Hameer Abbasi 2019-05-19 20:30:27 UTC
Hmm. 5.1.3 had issues for me too, on login and when launching Evolution (the GNOME mail client). Seems the success was intermittent.

One additional piece of information: I have two 1440p 144 Hz Freesync displays with audio... I'm not sure if anything about that is a contributing factor.
Comment 18 Hameer Abbasi 2019-05-19 20:53:17 UTC
Hmm, I'm fairly certain at this point that the issue happened between 5.0.13 and 5.1.0. Those are the ones available in the Arch repos, I lack the knowledge to build the kernel myself.

I restarted thrice on 5.0.13, no issue.
Restarted once on 5.1.0, there was an issue.
Comment 19 Tom B 2019-05-19 22:04:21 UTC
I've just resumed from suspend (5.1.3). Had complete graphical corruption and a frozen system. I couldn't switch TTY and had to do a hard reset.

First two reboots froze, third is working fine. Youtube was fine for my 30 second test, as is running unigine-heaven to try GPU load. 


I'll attach my journal from after suspend.
Comment 20 Tom B 2019-05-19 22:05:27 UTC
Created attachment 144303 [details]
5.1.3 crash after resume

Journal output from suspend to crash on resume
Comment 21 Tom B 2019-05-19 22:14:20 UTC
Ignore my last post, I just tried unigine-heaven again and it crashed instantly. I don't know why it worked once. 

It took me 5 hard resets to be able to log in. It seems like if it lasts long enough to log in then it's fine until the GPU is intermittently under load.

As such, it's probably worth mentioning that I'm using SDDM and KDE.
Comment 22 Tom B 2019-05-19 22:19:38 UTC
I was just able to run heaven again without issue by setting high performance mode. 

Can anyone get a crash after running

# echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level


> One additional piece of information: I have two 1440p 144 Hz Freesync displays with audio... I'm not sure if anything about that is a contributing factor.

I am running two 4k 60hz monitors without Freesync. The only common factor there is that we are both running two displays. Both of mine are DisplayPort.
Comment 23 Chris Hodapp 2019-05-19 22:28:50 UTC
I've been building kernels like a fiend the last couple of days. Making matters more difficult is the fact that most of the intermediate commits produce kernels that get stuck waiting for devices to come up (something that went in during the run-up to 5.1 disagrees with my machine, but it was fixed for the actual 5.1 release). No conclusive results yet, except that both 5.0 and 5.0.17 work (so I'm assuming the whole 5.0.X series works) but nothing in the 5.1 series has truly been stable for me.

It's interesting that you both have two monitors. I am also trying to run two monitors over displayport. For what it's worth, I'm also using KDE with sddm, though if KDE or sddm is making the GPU reset then that is on the graphics driver, not the userspace programs that are trying to use it.
Comment 24 Hameer Abbasi 2019-05-19 22:37:39 UTC
> I restarted thrice on 5.0.13, no issue.

Scratch that... I get minor corruption even on 5.0.13 on accelerated video playback during high CPU usage (such as compilation). The videos freeze, glitch out, and sometimes there's graphical corruption.

> I am running two 4k 60hz monitors without Freesync. The only common factor there is that we are both running two displays. Both of mine are DisplayPort.

Not the only thing... We're both doing similar amounts of pixels per second, in a sense, more or less. Mine are DisplayPort too, so that's also common for us.

I'm on GNOME desktop and the entire GNOME suite.
Comment 25 Tom B 2019-05-19 23:02:27 UTC
On 5.1.3 (and presumably all 5.1 kernels) I am seeing a strange power profile.

Can everyone else run sensors (after sensors-detect, if you don't have the amdgpu device showing)?

I'm seeing this:

amdgpu-pci-4400
Adapter: PCI adapter
vddgfx:       +1.11 V  
fan1:           0 RPM  (min =    0 RPM, max = 3850 RPM)
temp1:        +33.0°C  (crit = +118.0°C, hyst = -273.1°C)
power1:      135.00 W  (cap = 250.00 W)


Even at idle my GPU is running at 1100 mV (my default base voltage) and constantly drawing 135 W.


My output of cat /sys/kernel/debug/dri/0/amdgpu_pm_info shows the same thing:

Clock Gating Flags Mask: 0x36974f
        Graphics Medium Grain Clock Gating: On
        Graphics Medium Grain memory Light Sleep: On
        Graphics Coarse Grain Clock Gating: On
        Graphics Coarse Grain memory Light Sleep: On
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: On
        Graphics Run List Controller Light Sleep: Off
        Graphics 3D Coarse Grain Clock Gating: On
        Graphics 3D Coarse Grain memory Light Sleep: On
        Memory Controller Light Sleep: On
        Memory Controller Medium Grain Clock Gating: On
        System Direct Memory Access Light Sleep: On
        System Direct Memory Access Medium Grain Clock Gating: Off
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: On
        Unified Video Decoder Medium Grain Clock Gating: Off
        Video Compression Engine Medium Grain Clock Gating: Off
        Host Data Path Light Sleep: On
        Host Data Path Medium Grain Clock Gating: Off
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: On
        Rom Medium Grain Clock Gating: On
        Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
        351 MHz (MCLK)
        0 MHz (SCLK)
        1373 MHz (PSTATE_SCLK)
        1001 MHz (PSTATE_MCLK)
        1106 mV (VDDGFX)
        135.0 W (average GPU)

GPU Temperature: 33 C
GPU Load: 0 %

SMC Feature Mask: 0x0000000000c0c002
UVD: Disabled

VCE: Disabled


It's locked at 135 W and 1106 mV. Are you guys seeing similar? Apologies for the multiple posts, but I'll post again in a second after running unigine to see if it tries to boost before it crashes.
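A quick way to pull just the voltage and wattage lines out of that file; the pm_info excerpt below is a hypothetical sample inlined so the snippet is self-contained, whereas on a live system you would read /sys/kernel/debug/dri/0/amdgpu_pm_info as root:

```shell
# Extract VDDGFX and average-GPU-power lines from an amdgpu_pm_info dump.
# Sample inlined for illustration; real data comes from
# /sys/kernel/debug/dri/0/amdgpu_pm_info (root required).
cat > /tmp/pm_info.sample <<'EOF'
GFX Clocks and Power:
        351 MHz (MCLK)
        0 MHz (SCLK)
        1106 mV (VDDGFX)
        135.0 W (average GPU)
EOF
awk '/VDDGFX|average GPU/ {print $1, $2}' /tmp/pm_info.sample
# prints:
# 1106 mV
# 135.0 W
```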
Comment 26 Tom B 2019-05-19 23:05:56 UTC
Ok, running unigine-heaven and watching /sys/kernel/debug/dri/0/amdgpu_pm_info, the wattage and voltage never change. It also never boosts to 1800 MHz as it should; it sticks at 1373 MHz.

I should have mentioned in my last post that previously the card went down to about 23w when idle.

I'm guessing the crash occurs when the GPU needs more than the 135w that it's getting. 

Chris Hodapp, did you come across any commits referencing power profiles? That looks to be the cause of the issue.
Comment 27 Tom B 2019-05-19 23:18:06 UTC
Boost is definitely the problem.

Idle 5.0.13:

$ cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Clock Gating Flags Mask: 0x36974f
        Graphics Medium Grain Clock Gating: On
        Graphics Medium Grain memory Light Sleep: On
        Graphics Coarse Grain Clock Gating: On
        Graphics Coarse Grain memory Light Sleep: On
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: On
        Graphics Run List Controller Light Sleep: Off
        Graphics 3D Coarse Grain Clock Gating: On
        Graphics 3D Coarse Grain memory Light Sleep: On
        Memory Controller Light Sleep: On
        Memory Controller Medium Grain Clock Gating: On
        System Direct Memory Access Light Sleep: On
        System Direct Memory Access Medium Grain Clock Gating: Off
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: On
        Unified Video Decoder Medium Grain Clock Gating: Off
        Video Compression Engine Medium Grain Clock Gating: Off
        Host Data Path Light Sleep: On
        Host Data Path Medium Grain Clock Gating: Off
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: On
        Rom Medium Grain Clock Gating: On
        Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
        351 MHz (MCLK)
        809 MHz (SCLK)
        1373 MHz (PSTATE_SCLK)
        1001 MHz (PSTATE_MCLK)
        737 mV (VDDGFX)
        23.0 W (average GPU)

GPU Temperature: 31 C
GPU Load: 0 %

SMC Feature Mask: 0x0000000019f0e3cf
UVD: Disabled

VCE: Disabled


Load 5.0.13:


Clock Gating Flags Mask: 0x36974f
        Graphics Medium Grain Clock Gating: On
        Graphics Medium Grain memory Light Sleep: On
        Graphics Coarse Grain Clock Gating: On
        Graphics Coarse Grain memory Light Sleep: On
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: On
        Graphics Run List Controller Light Sleep: Off
        Graphics 3D Coarse Grain Clock Gating: On
        Graphics 3D Coarse Grain memory Light Sleep: On
        Memory Controller Light Sleep: On
        Memory Controller Medium Grain Clock Gating: On
        System Direct Memory Access Light Sleep: On
        System Direct Memory Access Medium Grain Clock Gating: Off
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: On
        Unified Video Decoder Medium Grain Clock Gating: Off
        Video Compression Engine Medium Grain Clock Gating: Off
        Host Data Path Light Sleep: On
        Host Data Path Medium Grain Clock Gating: Off
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: On
        Rom Medium Grain Clock Gating: On
        Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
        1001 MHz (MCLK)
        1802 MHz (SCLK)
        1373 MHz (PSTATE_SCLK)
        1001 MHz (PSTATE_MCLK)
        1068 mV (VDDGFX)
        191.0 W (average GPU)

GPU Temperature: 63 C
GPU Load: 0 %

SMC Feature Mask: 0x0000000019f0e3cf
UVD: Disabled

VCE: Disabled



On 5.1, the same clocks, voltage and wattage are used, it never changes power states. On 5.0 it idles at 23w low clocks and boosts to 191w with 1802mhz.
Comment 28 Tom B 2019-05-19 23:49:39 UTC
As a complete amateur looking at the commit history, this looks like a possible culprit: https://github.com/torvalds/linux/commit/e9c5b46e3c50f58403aeca6d6419b9235d2518b2#diff-db8ff8bb932e2ba3f89c9402f6856661

It has a block specifically for Vega20 and deals with power states. 

Though the related tag is 5.2-rc1, it's from back in January, so that seems unlikely.
Comment 29 Chris Hodapp 2019-05-21 07:38:23 UTC
Tom B, as far as I can tell, that commit didn't get merged in until relatively recently and is not in 5.1.
Comment 30 Chris Hodapp 2019-05-21 08:11:34 UTC
Some interesting findings:

First, I think I may have identified the problematic commit (or at least the most-problematic one): d1a3e239a6016f2bb42a91696056e223982e8538 (drm/amd/powerplay: drop the unnecessary uclk hard min setting). I eventually gave up on doing a normal bisect since so many of the commits between 5.0 and 5.1 were non-viable. Instead, I made a list of all the commits that touched vega20-related files. I then started repeatedly picking out the non-tested commit with the most related-sounding message, checking out the v5.1 tag, and reverting the commit in order to test it as the culprit. When I revert that one, my system boots reliably. I still see 133.0 watts of power draw, though.

This brings me to the second thing: When looking through the commits, I noticed that there were multiple commits that claim to prevent or reduce crashing in high-resolution situations (one references 5k displays, another references 3+ 4k displays). I want to note that we all seem to have relatively demanding display setups: Hameer has two 144hz 1440p displays, Tom B has two 60hz 4k displays, and I have two 120hz 4k displays. Putting these together I decided to try unplugging one of my displays. Imagine my surprise when things booted completely smoothly on a stock 5.1 kernel: glitch-free boot, *no powerplay errors in the kernel log*, and 25 watts of power draw when usage is low. So I think it is safe to say that one "workaround" is to unplug a monitor if you can stand to work that way.

I actually have access to another Radeon VII so I may try running one per monitor tomorrow.
Comment 31 Tom B 2019-05-21 09:42:05 UTC
That's interesting, because a single one of your 120 Hz 4k displays would require the same bandwidth as both of my 60 Hz 4k displays together. That means the issue is either related only to resolution and not bandwidth, or it's something to do with having two displays connected at the same time.
Comment 32 Tom B 2019-05-30 16:15:12 UTC
This is still an issue in 5.1.5. It seems slightly more stable but I'm still getting the high power usage and no boost clocks. 

On a successful boot I see the following in dmesg:

[    3.628369] [drm] amdgpu: 16368M of VRAM memory ready
[    3.628371] [drm] amdgpu: 16368M of GTT memory ready.
[    3.629241] amdgpu 0000:44:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2
[    3.629243] amdgpu 0000:44:00.0: psp v11.0: Failed to load firmware "amdgpu/vega20_ta.bin"
[    4.260631] fbcon: amdgpudrmfb (fb0) is primary device
[    4.376861] amdgpu 0000:44:00.0: fb0: amdgpudrmfb frame buffer device
[    4.410360] amdgpu 0000:44:00.0: ring gfx uses VM inv eng 0 on hub 0
[    4.410363] amdgpu 0000:44:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    4.410365] amdgpu 0000:44:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    4.410367] amdgpu 0000:44:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    4.410369] amdgpu 0000:44:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    4.410371] amdgpu 0000:44:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    4.410372] amdgpu 0000:44:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    4.410374] amdgpu 0000:44:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    4.410376] amdgpu 0000:44:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    4.410378] amdgpu 0000:44:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    4.410380] amdgpu 0000:44:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    4.410382] amdgpu 0000:44:00.0: ring page0 uses VM inv eng 1 on hub 1
[    4.410383] amdgpu 0000:44:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    4.410385] amdgpu 0000:44:00.0: ring page1 uses VM inv eng 5 on hub 1
[    4.410386] amdgpu 0000:44:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    4.410388] amdgpu 0000:44:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    4.410390] amdgpu 0000:44:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    4.410391] amdgpu 0000:44:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    4.410392] amdgpu 0000:44:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    4.410393] amdgpu 0000:44:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    4.410394] amdgpu 0000:44:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    4.410396] amdgpu 0000:44:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    4.410397] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    5.088344] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:44:00.0 on minor 0
[    5.247245] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    5.247247] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[    6.092850] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    6.092851] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
[    6.939351] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    6.939351] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[    7.784543] amdgpu: [powerplay] Failed to send message 0x26, response 0x0
[    7.784544] amdgpu: [powerplay] Failed to set soft min gfxclk !
[    7.784545] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[    7.842345] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.143759] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.143761] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
[    8.159090] amdgpu: [powerplay] Failed to send message 0x26, response 0xff
[    8.159091] amdgpu: [powerplay] Failed to set soft min socclk!
[    8.159092] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[    8.245063] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.825759] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.825760] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
[    8.825919] amdgpu: [powerplay] Failed to send message 0x26, response 0xff
[    8.825919] amdgpu: [powerplay] Failed to set soft min socclk!
[    8.825920] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[    8.826116] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.842518] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.842519] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
[    8.842691] amdgpu: [powerplay] Failed to send message 0x26, response 0xff
[    8.842692] amdgpu: [powerplay] Failed to set soft min socclk!
[    8.842692] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[    8.885751] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.892421] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.892422] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
[    8.892614] amdgpu: [powerplay] Failed to send message 0x26, response 0xff
[    8.892614] amdgpu: [powerplay] Failed to set soft min socclk!
[    8.892615] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[    8.892741] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.893595] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.893732] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.920997] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.921135] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.941712] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    8.941834] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    9.153837] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    9.154359] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    9.166532] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    9.170008] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    9.211796] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[    9.227359] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[   15.447508] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
[   15.449293] amdgpu: [powerplay] Failed to send message 0x28, response 0xff
Comment 33 Tom B 2019-06-03 11:39:01 UTC
Is this likely to be fixed in 5.2 or before? It's a showstopping bug for those affected.
Comment 34 Anton Herzfeld 2019-06-03 14:57:16 UTC
I think this is not just affecting Vega 20; Vega 10 is now also stuck on memclock pstate 0 (167 MHz) since kernel 5.1.

I assume this is related to fclk and dcefclk.
Comment 35 sehellion 2019-06-04 04:19:26 UTC
Vega20 is affected by these or similar bugs, too. On 5.0.x kernels the primary monitor drops out. Starting with 5.1.x, the GPU hangs and resets right after login to the X session, or after DPMS kicks in. This is not fixed in 5.2-rc2 yet. But yesterday I successfully booted and worked with two monitors; problems appeared only after idle time. https://bugzilla.kernel.org/show_bug.cgi?id=203781
Comment 36 sehellion 2019-06-04 04:21:41 UTC
Created attachment 144438 [details]
dmesg.log vega20 crash after idle
Comment 37 Tom B 2019-06-15 16:58:59 UTC
5.1.9 makes this bug even worse. It now crashes as soon as the display server is started.

Running sensors now gives an error:


ERROR: Can't get value of subfeature fan1_input: I/O error
ERROR: Can't get value of subfeature power1_average: I/O error
iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +37.0°C  

k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +34.8°C  (high = +70.0°C)
Tctl:         +61.8°C  

amdgpu-pci-4400
Adapter: PCI adapter
vddgfx:       +0.74 V  
fan1:             N/A  (min =    0 RPM, max = 3850 RPM)
temp1:        +39.0°C  (crit = +118.0°C, hyst = -273.1°C)
power1:           N/A  (cap = 250.00 W)

k10temp-pci-00cb
Adapter: PCI adapter
Tdie:         +33.2°C  (high = +70.0°C)
Tctl:         +60.2°C  



I can't even see the wattage now. 

# cat /sys/kernel/debug/dri/0/amdgpu_pm_info

Clock Gating Flags Mask: 0x860200
	Graphics Medium Grain Clock Gating: Off
	Graphics Medium Grain memory Light Sleep: Off
	Graphics Coarse Grain Clock Gating: Off
	Graphics Coarse Grain memory Light Sleep: Off
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: Off
	Graphics Run List Controller Light Sleep: Off
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: Off
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: Off
	System Direct Memory Access Medium Grain Clock Gating: Off
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: Off
	Unified Video Decoder Medium Grain Clock Gating: Off
	Video Compression Engine Medium Grain Clock Gating: Off
	Host Data Path Light Sleep: Off
	Host Data Path Medium Grain Clock Gating: Off
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: On
	Rom Medium Grain Clock Gating: On
	Data Fabric Medium Grain Clock Gating: On

GFX Clocks and Power:
	1373 MHz (PSTATE_SCLK)
	1001 MHz (PSTATE_MCLK)
	737 mV (VDDGFX)

GPU Temperature: 39 C

UVD: Disabled

VCE: Disabled


No clocks or wattage! 

I'm guessing 34d07ce3d6a120056e4763ae9a3db0d769ab7c63 ("fix ring test failure issue during s3 in vce 3.0 (V2)") is to blame, as dmesg (attached in next post) says:


[   20.584937] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=25, emitted seq=27

It would be nice to see some acknowledgement from AMD on this.
Comment 38 Tom B 2019-06-15 16:59:47 UTC
Created attachment 144554 [details]
5.1.9 dmesg
Comment 39 Chris Hodapp 2019-06-15 22:15:02 UTC
The fact that amdgpu is getting less functional over time with this high-end part _is_ definitely annoying, but let's all keep in mind that this is not an official support channel for AMD; it's the issue tracker for an open source project that AMD contributes to. AMD doesn't actually owe us anything through this channel. Instead, the way to pressure them for concrete answers is to choose an option from https://www.amd.com/en/support/contact.
Comment 40 Alex Deucher 2019-06-16 16:08:23 UTC
Please attach your full dmesg output.  Are you passing any parameters to the driver?
Comment 41 Tom B 2019-06-17 10:18:19 UTC
Created attachment 144569 [details]
5.1.9 full dmesg

Interestingly, I just reinstalled 5.1.9 and I'm not seeing the same immediate crash. It may have been another package, as I tried three boots and all had the same issue of immediately crashing when SDDM started. The only way I managed to get dmesg output was by switching TTY immediately; switching back to tty1, where SDDM was running, caused the immediate crash.

After reinstalling I'm getting the same issue as with earlier 5.1 kernels: it freezes the PC under load and is stuck in the same power state. Oddly, I'm seeing a constant 137w in 5.1.9 where I was getting 135w in 5.1.3, though I didn't test 5.1.3 multiple times; it might reach a wattage on boot and then stick to it.

I have attached the full dmesg anyway.

> Are you passing any parameters to the driver?

I have nothing related to amdgpu in /etc/modprobe.d and my kernel command line is:

[    0.364597] Kernel command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=fc6ad741-d52d-47eb-b6a6-0026f27b29f3 rw quiet
Comment 42 Matt Coffin 2019-06-21 20:17:34 UTC
For what it's worth, I've experienced a bunch of issues similar to this with OVERDRIVE enabled. You can try disabling it by setting the following in modprobe.d or your kernel launch line

amdgpu.ppfeaturemask=0xfffdbfff
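For reference, a minimal sketch of the modprobe.d variant (the file name is my choice, not a requirement; the mask is the value suggested above, and you should check your kernel's default ppfeaturemask before changing it):

```
# /etc/modprobe.d/amdgpu.conf
options amdgpu ppfeaturemask=0xfffdbfff
```

After adding the file, regenerate your initramfs if amdgpu is loaded early on your distro.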
Comment 43 Matt Coffin 2019-06-21 20:18:39 UTC
(In reply to Matt Coffin from comment #42)
> For what it's worth, I've experienced a bunch of issues similar to this with
> OVERDRIVE enabled. You can try disabling it by setting the following in
> modprobe.d or your kernel launch line
> 
> amdgpu.ppfeaturemask=0xfffdbfff

Also worth noting that I've found that using `fancontrol` creates a race condition if you have the OD fuzzy fan control enabled, so try just maxing out the fans via the sysfs hwmon interface instead just as a test.
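To illustrate what I mean by the sysfs hwmon interface, here's a rough sketch, assuming an amdgpu hwmon node exists and that pwm1_enable/pwm1 are exposed for your card (run as root; the hwmon numbering varies per machine and boot):

```shell
# Find the amdgpu hwmon node and force the fan to maximum (as a test only).
HWMON=""
for hw in /sys/class/hwmon/hwmon*; do
    [ -r "$hw/name" ] && [ "$(cat "$hw/name" 2>/dev/null)" = "amdgpu" ] && HWMON="$hw"
done
if [ -n "$HWMON" ]; then
    echo 1   > "$HWMON/pwm1_enable"   # 1 = manual fan control
    echo 255 > "$HWMON/pwm1"          # maximum duty cycle
    RESULT="fan forced to max via $HWMON"
else
    RESULT="no amdgpu hwmon node found on this machine"
fi
echo "$RESULT"
```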
Comment 44 sehellion 2019-06-22 04:19:12 UTC
(In reply to Matt Coffin from comment #42)
> For what it's worth, I've experienced a bunch of issues similar to this with
> OVERDRIVE enabled. You can try disabling it by setting the following in
> modprobe.d or your kernel launch line
> 
> amdgpu.ppfeaturemask=0xfffdbfff


It doesn't seem that the problem is with OVERDRIVE in this case. I will attach a full dmesg log with amdgpu.ppfeaturemask=0xfffdbfff.
Comment 45 sehellion 2019-06-22 04:20:46 UTC
Created attachment 144611 [details]
5.2-rc2 full dmesg with amdgpu.ppfeaturemask=0xfffdbfff
Comment 46 Tom B 2019-07-08 12:29:56 UTC
Has anyone tested 5.3 yet? I noticed there are a lot of powerplay changes.

Since this bug messes up the card's power profile, how safe is testing new kernels? Is there any danger of my card being damaged due to wrong voltages if the powerplay code is as buggy or worse than it has been since 5.1?
Comment 47 ReddestDream 2019-07-25 05:36:25 UTC
(In reply to Tom B from comment #46)
> Has anyone tested 5.3 yet? I noticed there are a lot of powerplay changes.
> 
> Since this bug messes up the card's power profile, how safe is testing new
> kernels? Is there any danger of my card being damaged due to wrong voltages
> if the powerplay code is as buggy or worse than it has been since 5.1?

I've tested 5.3-rc-1 and no dice. I still get the PowerPlay Failed to send message errors in dmesg when I have more than one monitor connected to Radeon VII. :(

My current workaround is to connect my second monitor to the iGPU before boot. Then the PowerPlay errors do not happen. As long as I don't get the PowerPlay errors in dmesg, graphics are stable. If the errors do appear, graphics will be unstable. It's a pretty clear connection . . .
Comment 48 Anthony Rabbito 2019-07-26 01:19:39 UTC
I'm able to run dual monitors with one HDMI and one DP.

Running 3 monitors (2 DP 1 HDMI) at 1440p 144Hz causes all the issues noted here. Linux 5.2.2
Comment 49 Tom B 2019-07-26 01:24:46 UTC
Unfortunately iGPU isn't an option for me as I don't have one. 

> I'm able to run dual monitors with one HDMI and one DP.

> Running 3 monitors (2 DP 1 HDMI) at 1440p 144Hz causes all the issues noted here. Linux 5.2.2

That's interesting, as I was originally using HDMI + DP but it caused its own set of similar issues as reported here: https://bugs.freedesktop.org/show_bug.cgi?id=110510 

I wonder whether 5.1+ reversed it so that HDMI+DP now works; I'll test it when I get a chance.
Comment 50 ReddestDream 2019-07-26 03:19:22 UTC
(In reply to Anthony Rabbito from comment #48)
> I'm able to run dual monitors with one HDMI and one DP.
> 
> Running 3 monitors (2 DP 1 HDMI) at 1440p 144Hz causes all the issues noted
> here. Linux 5.2.2

Hmm. That's very interesting. I have not tried HDMI. All my testing was done with just 2 DP monitors.
Comment 51 ReddestDream 2019-07-28 05:20:03 UTC
Also, just FYI, it does look like there are some fixes to display type detection on AMD GPUs coming in 5.3-rc2. These might fix or at least improve the multimonitor issue on Radeon VII:

https://github.com/torvalds/linux/commit/e2921f9f95f1c1355a39e54dc038ad95b6e032be
Comment 52 Peter Hercek 2019-07-29 10:52:09 UTC
I'm getting hangs-up with kernels 5.2.3 (often) and 5.1.15 (less often).
Radeon VII with 3 monitors. Each monitor connected through DP.
Comment 53 Anthony Rabbito 2019-07-29 19:25:28 UTC
Interesting: on 5.2.x with 2 monitors hooked up via HDMI and DP it behaves 75% of the time, with most issues coming from xinit or sleep. Hopefully 5.3 will contain fixes.
Comment 54 ReddestDream 2019-07-29 21:40:21 UTC
(In reply to Peter Hercek from comment #52)
> I'm getting hangs-up with kernels 5.2.3 (often) and 5.1.15 (less often).
> Radeon VII with 3 monitors. Each monitor connected through DP.

I hear that 5.0.0.13 is from before this regression and should work without issue if you are willing to downgrade:

https://bbs.archlinux.org/viewtopic.php?id=247733

(In reply to Anthony Rabbito from comment #53)
> Interesting, on 5.2.x with 2 monitors hooked up via HDMI and DP it behaves
> 75% of the time with most issues coming from xinit or sleep. Hopefully 5.3
> will contain fixes

Would be interesting if it turns out that using HDMI+DP fixes the issue. Not that HDMI doesn't come with its own issues sometimes with color. I do have some faith that 5.3 will fix it since AMDGPU is getting a lot of work for Navi. I plan to try out 5.3-rc2 (or whatever mainline is at) sometime this week.
Comment 55 Anthony Rabbito 2019-07-31 15:37:56 UTC
(In reply to ReddestDream from comment #54)
> (In reply to Peter Hercek from comment #52)
> > I'm getting hangs-up with kernels 5.2.3 (often) and 5.1.15 (less often).
> > Radeon VII with 3 monitors. Each monitor connected through DP.
> 
> I hear that 5.0.0.13 is from before this regression and should work without
> issue if you are willing to downgrade:
> 
> https://bbs.archlinux.org/viewtopic.php?id=247733
> 
> (In reply to Anthony Rabbito from comment #53)
> > Interesting, on 5.2.x with 2 monitors hooked up via HDMI and DP it behaves
> > 75% of the time with most issues coming from xinit or sleep. Hopefully 5.3
> > will contain fixes
> 
> Would be interesting if it turns out that using HDMI+DP fixes the issue. Not
> that HDMI doesn't come with its own issues sometimes with color. I do have
> some faith that 5.3 will fix it since AMDGPU is getting a lot of work for
> Navi. I plan to try out 5.3-rc2 (or whatever mainline is at) sometime this
> week.

I will check my package cache to see if I still have kernel 5.0.0.13; otherwise I'll build it. I'll report back how it goes. I miss my third monitor.
Comment 56 Peter Hercek 2019-07-31 17:09:10 UTC
I have been using 5.0.13 for 3 days. It works OK so far, but 3 days is too little to tell. E.g. 5.1.15 hung after about 5 days, but from then on it always hung after I launched two YouTube videos just after login. I probably did not launch YouTube videos that early in my session in the first days of using 5.1.15. Kernel 5.0.13 can handle this situation.
Comment 57 Tom B 2019-07-31 17:13:24 UTC
5.0.13 works fine, I've been using it since I first encountered the problem. 5.1+ introduces this issue.

The way to tell whether it's working correctly is to run sensors and check the power1 number. The bug causes the GPU to be stuck in a high power state (for me 135w) where in previous kernels it idles at 23w.

Alternatively, run cat /sys/kernel/debug/dri/0/amdgpu_pm_info, which will show the same thing: it will be stuck at 1.1v/135w and the clocks will be maxed rather than clocked down when idle.
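For anyone following along, a sketch of that check as a script; the hwmon path is an assumption that varies between machines and boots, and power1_average is reported in microwatts:

```shell
# Read the GPU power draw the same way `sensors` does.
POWER_W="unknown"
for hw in /sys/class/hwmon/hwmon*; do
    if [ -r "$hw/name" ] && [ "$(cat "$hw/name" 2>/dev/null)" = "amdgpu" ] \
        && [ -r "$hw/power1_average" ]; then
        POWER_W=$(( $(cat "$hw/power1_average") / 1000000 ))
    fi
done
# A value stuck around 135 at idle indicates the bug; ~23 is a healthy idle.
echo "GPU power draw: $POWER_W W"
```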
Comment 58 Peter Hercek 2019-08-03 12:10:10 UTC
It is probably not related to changes from 5.0 to 5.1:
I have got the hang-up with 5.0.13 as well as with 4.20.11.
It may just be less common with older kernels.

In my case, it is triggered mostly by playing a video stream in parallel with some other activity. My logs with 5.1 and 5.2 kernels look just like Chris' log: first amdgpu_job_timedout, then an attempt to reset the GPU, followed by an endless stream of parser initialization failures.

I did not check the logs with older kernels, but it all looked the same at the user level. The video subsystem is hung; the rest of the machine (e.g. an ssh session) works OK.

My /sys/class/hwmon/hwmon1/power1_average reported normal values around 25W after the hang-up. I'm not seeing unusually high power values like Tom B.
Comment 59 Tom B 2019-08-03 12:31:28 UTC
@Peter Hercek, do you see the wattage/voltage change at all? Perhaps it hits a power state and then can't change; for me it's stuck at 135w, and for you it's stuck at 25w.
Comment 60 Peter Hercek 2019-08-03 13:35:27 UTC
The power value changes from 24 W to about 75 W (when I tried xonotic). I checked the power value twice after the hang-up. It was 25 W both times. It does not change after the video subsystem hangs.
Comment 61 ReddestDream 2019-08-08 14:37:35 UTC
This issue is still not fixed with 5.3-rc3, at least not with two DisplayPort monitors.

I am not able to test with DP+HDMI configuration.
Comment 62 Peter Hercek 2019-08-10 12:10:08 UTC
OK, I started using the 5.2.5 kernel after my last hang-up with 4.20.11. It worked fine for 1 week. I'm trying 5.2.7 now.

It is possible something was fixed in 5.2.5, because there was one commit which seemed related (drm/amdgpu: Reserve shared fence for eviction fence dd68722c427d5b33420dce0ed0c44b4881e0a416). But there are reasons to think I was just lucky for the week: the commit seems to relate to some VM support and I have had crashes without using a VM, and ReddestDream reported the problem in 5.3-rc3 as well.
Comment 63 Tom B 2019-08-10 13:02:39 UTC
I've just done some testing with 5.2.7

- I still get the 135w/1.1v constant power state and crashing with DP+DP.

- HDMI+DP works, but this was my original setup when I got the VII. Unfortunately  I get random flickering and black screens on the HDMI monitor every 3-5 minutes as described in https://bugs.freedesktop.org/show_bug.cgi?id=110510
Comment 64 Tom B 2019-08-10 13:14:23 UTC
Scratch that, I just rebooted with HDMI+DP and it froze as soon as SDDM started. I was eventually able to switch TTY and the voltages looked correct (it had clocked down), but I was never able to log in to KDE as SDDM was frozen. Restarting SDDM allowed me to enter my password, but it froze as soon as I logged in. Not that HDMI is an optimal solution anyway, as I get the flickering, and I've tried 3 different cables.

Back to 5.0.13, which works mostly fine. I do get a crash very occasionally: the machine will appear to wake from sleep with a black screen and a cursor. It's very rare, once a week or so, and only when the sleep/resume cycle has been run multiple times.

Peter Hercek mentioned virtual machines, so I tried with IOMMU enabled and disabled in the BIOS. It didn't make any difference, but I thought it was worth reporting to save others the time of trying it.
Comment 65 Tom B 2019-08-10 13:15:53 UTC
Created attachment 145018 [details]
5.2.7  full dmesg

Full dmesg from 5.2.7 with 2x DisplayPort monitors. The error that keeps repeating is:

*ERROR* Failed to initialize parser -125!
Comment 66 Tom B 2019-08-10 13:29:41 UTC
One thing I haven't mentioned is that I don't have a GPU fan installed, as my VII is water cooled. It's unlikely, but perhaps this explains why my card behaves differently to others'.
Comment 67 Tom B 2019-08-10 16:39:55 UTC
I had a look around at similar bugs and came across this:

https://bugs.freedesktop.org/show_bug.cgi?id=110822

It's for a 580, not a VII, but the problems started at 5.1 and it gives a similar powerplay-related crash.

The suggested fix there is to revert ad51c46eec739c18be24178a30b47801b10e0357.

I just tried this and after 4 reboots I can report it has two effects:

1. I don't have any crashing at all and my card boosts GPU clocks, voltages and wattages. I can run unigine-heaven for several minutes without the system freezing.

2. The memory is forced to 351mhz, limiting performance.

If I run 

cat /sys/class/drm/card0/device/pp_dpm_mclk 

it shows:

0: 351Mhz *
1: 801Mhz 
2: 1001Mhz 


This looks correct for idle, but it never boosts to the next memory clock, even under load. It also can't be set manually:


echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 2 >  /sys/class/drm/card0/device/pp_dpm_mclk
-bash: echo: write error: Invalid argument


While this isn't a proper fix, it does give us some valuable insight. If anyone wants to run at 351Mhz memory with a stable card and 2 screens, they can. It would be nice if someone could verify my findings, as my card seems to behave differently to others' for some reason.

This bug may be related to https://bugs.freedesktop.org/show_bug.cgi?id=110822. Alternatively, it's possible the crash occurs when the memory clock changes, which might mean it's related to https://bugs.freedesktop.org/show_bug.cgi?id=102646, as there are issues with memory clock changes there. There seem to be several powerplay-related issues which may share the same root cause.


I'm now going to:

1. Revert to the stock kernel and set the mclk to 1001 manually before starting SDDM and see if the crash occurs.

2. See if I can manage to get stability and the mclk stuck at 1001mhz as this would be an acceptable compromise, even if not ideal.
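A sketch of step 1, assuming card0 is the VII and that state 2 is the 1001Mhz level shown in the pp_dpm_mclk listing above (the second write may well fail with the same Invalid argument error as before):

```shell
CARD=/sys/class/drm/card0/device
if [ -w "$CARD/power_dpm_force_performance_level" ]; then
    echo manual > "$CARD/power_dpm_force_performance_level"
    # Try to select state 2 (1001Mhz); this is the write that returned EINVAL above.
    echo 2 > "$CARD/pp_dpm_mclk" 2>/dev/null || echo "mclk write rejected"
    cat "$CARD/pp_dpm_mclk"
    STATUS="attempted"
else
    STATUS="no writable amdgpu dpm interface on this machine"
fi
echo "$STATUS"
```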
Comment 68 Tom B 2019-08-10 19:00:17 UTC
Apologies for the multiple replies/emails. I think I must just have got lucky: it worked for several boots in a row and now only works very occasionally. I think it was just coincidence that it worked a few times after I installed that kernel, sorry guys.

During my tests with 5.2.7 I have noticed some interesting findings about the wattage, though. It will indeed get stuck on a specific wattage; I've had 33, 24, 45, 133 and 134, and on several wattages there is some fluctuation, e.g. 33-34.

Higher wattages are significantly more stable: 133w lasts quite a while before it crashes, while 33w crashes instantly. I'm assuming this is because the card just doesn't have enough power to do what's required.

When the wattage gets stuck, if you force the performance mode:

# echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level

it confuses the driver, and sensors then shows

ERROR: Can't get value of subfeature power1_average: I/O error

despite having worked until the power state was set manually. Once it reaches this state, there doesn't seem to be any way, short of rebooting, to get back to a state where sensors shows the wattage.


The inconsistent nature of this bug, and the fact that it sometimes doesn't appear, suggests a race condition. I'd assume something else on the system happens earlier or later than amdgpu expects.

Is there any way to delay loading the amdgpu driver and manually load it after everything else?
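One hypothetical way to do this (untested; assumes amdgpu is built as a module and kept out of the initramfs, and the unit name/path are my own inventions): blacklist the module with modprobe.blacklist=amdgpu on the kernel command line, then load it from a late systemd unit:

```
# /etc/systemd/system/amdgpu-late.service (hypothetical)
[Unit]
Description=Load amdgpu after the rest of the system is up
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/sbin/modprobe amdgpu

[Install]
WantedBy=multi-user.target
```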
Comment 69 ReddestDream 2019-08-11 01:15:48 UTC
>The inconsistent nature of this bug and the fact that it sometimes doesn't appear suggests a race condition. I'd assume something else on the system happens before or after amdgpu is expecting.

>Is there any way to delay loading the amdgpu driver and manually loading it after everything else?

Based on all the data you (Tom B) and others have provided, as well as my own tests, my current suspicion is that there is a bug in the display mode/type detection and enumeration, leading to the driver losing state consistency and eventually losing contact with the hardware entirely.

I think the clock dysregulation and excessive voltage/wattage are symptoms of the underlying disease rather than the cause. If something is wrong between what the driver thinks the hardware state is and what the hardware state actually is, it's only a matter of time before this inconsistency leads to dysregulation, instability, and crashing. For this reason, I'm not convinced there is any better workaround than "just use one monitor." Pushing up the clocks only seems to at best prolong the inevitable. :(

I'm also not convinced there is one commit in particular to point to here. Rather it was probably in the restructuring of something between 5.0 and 5.1 that it became fundamentally broken while it was always somewhat flawed before.

Unfortunately, Radeon VII probably isn't really being tested by kernel developers anymore and it's likely that multimonitor with this card on Linux was never fully tested at all. It also seems like AMD's kernel development has moved on to Navi and that the upcoming new Vega card, Arcturus, won't have display outs at all, so work on that can't fix this issue.

As this card is fairly uncommon and expensive, the only real hope for a fix seems to be to get the card into the hands of someone who has the skill to fix graphics drivers and a willingness/need to test multimonitor.

Perhaps someone like gnif who has been able to solve the infamous Vega Reset Bug on Vega 10 cards might be able to fix it. It's likely he will encounter our issue while testing Radeon VII with Looking Glass and such. Someone has already offered to lend him a Radeon VII as he states in the video, so there's some hope that his work will lead to a solution.

https://www.youtube.com/watch?v=1ShkjXoG0O0
Comment 70 Tom B 2019-08-11 15:26:13 UTC
> Based on all the data you (Tom B) and others have provided as well as my own tests, my current suspicion is that there is a bug in the display mode/type detection and enumeration, leading to the driver losing state consistency and eventually contact entirely with the hardware.

I looked through the commits and the code trying to find anything that dealt with multiple displays as that seems to be the trigger but couldn't find anything that looked promising.

It's probably worth noting what I tried and found, even though I was unsuccessful, as it may help someone. I'm fairly sure that the problem must be in this file: https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/powerplay/vega20_ppt.c. There is a variable called NumOfDisplays and related code. Maybe someone who understands driver development can point me in the right direction:

Line 2049 seems promising.

smu_send_smc_msg_with_param(smu, SMU_MSG_NumOfDisplays, 0);
ret = vega20_set_uclk_to_highest_dpm_level(smu,
					   &dpm_table->mem_table);
if (ret)
	pr_err("Failed to set uclk to highest dpm level");

Although that error message is not displayed in dmesg, this function deals with multiple displays and the power levels. Unfortunately, I cannot find documentation for the driver code. What does smu_send_smc_msg_with_param do? Here the last argument is 0, but in the next function, vega20_display_config_changed, the final argument is the number of displays:

smu_send_smc_msg_with_param(smu,
			    SMU_MSG_NumOfDisplays,
			    smu->display_config->num_display);


The next point of interest is line 2091. I don't think it's the cause of the bug but:

disable_mclk_switching = ((1 < smu->display_config->num_display) &&
				  !smu->display_config->multi_monitor_in_sync) || vblank_too_short;


disable_mclk_switching is set if the number of displays is more than 1 and "multi_monitor_in_sync" is not set (whatever that is, possibly mirrored displays?), or if "vblank_too_short" is set. I don't believe this is the problem, because the code has existed since January, presumably for the February release, but perhaps the contents of the different variables have changed so this code runs differently.

I only mention this because it's the only point in the code I found where it does something different if more than one display is connected. 
					    
My questions for the driver devs:

1. Why is smu_send_smc_msg_with_param called with zero in the function vega20_pre_display_config_changed but the number of displays in the next function?
2. Is num_displays an index (so 0 is actually the first display and we're assuming 1 display in index 0) or is it actually 0, no displays?
3. Is there any way to see which code appears in which kernel version? The tags are definitely incorrect: the first commit for that file (https://github.com/torvalds/linux/commit/74e07f9d3b77034cd1546617afce1d014a68d1ca#diff-2575675126169f3c0c971db736852af9) says 5.2 but was made in December last year, so I can't imagine this file isn't used.



However, as a customer this is very frustrating. I bought the VII instead of an nvidia card because AMD were supporting open source drivers.

As it stands:

- The AMDGPU driver worked for 4 months after the VII's release and now we've had nearly the same amount of time where it hasn't worked with the latest kernel.
- The AMDGPU-Pro driver only supports Ubuntu; I've never managed to get it to run successfully on Arch, and the latest version only supports the RX 5700 cards anyway.

I emailed AMD technical support about this bug over a month ago and never got a reply.

The VII appears to be completely unsupported beyond the initial driver release when the card came out. I'll be going back to nvidia next time, and although I had intended to keep the VII for several years, it looks like that won't be possible, as I can't run an old kernel forever.
Comment 71 Sylvain BERTRAND 2019-08-11 17:00:25 UTC
On Sun, Aug 11, 2019 at 01:15:48AM +0000, bugzilla-daemon@freedesktop.org wrote:
> I think the clock dysregulation and excessive voltage/wattage are symptoms of

Is there a way to configure the smu block to keep the memory clock at its max with the appropriate power/voltage? If the smu block does configure some of the vram arbiter block priorities, could we tell it to keep the dc[en]x at max priority and ignore display vram watermarks? (Due to the realtime requirement of monitor data transmission, I still don't understand the existence of watermarks in the first place; I would need data to prove me wrong.)

On my AMD TAHITI XT, the memory clock seems to be locked to the max (only 1 full HD 144Hz monitor). I recall dce6 has fancy inner-block configuration: I simplified it in my custom driver (something about availability of display clocks and memory bandwidth). Maybe the smu breaks while clock/power managing due to this dc[en]x "fancy" inner-block configuration.

Additionally, I have never heard of 2 displays driven by a common display block being in sync. Is the sync dependent on the monitors and not the display block? What am I missing? The nasty DisplayPort MST thing? I would always set this to false.
Comment 72 Tom B 2019-08-11 18:43:48 UTC
> The nasty displayport mst thingy? I would always set this to false.

I don't believe mst is being used here, it's two monitors both with separate cables.


Here's some additional investigation.

"[SetUclkToHightestDpmLevel] Set hard min uclk failed!" appears as one of the first errors in dmesg. It comes from vega20_hwmgr.c:3354 and is triggered by:


	PP_ASSERT_WITH_CODE(!(ret = smum_send_msg_to_smc_with_parameter(hwmgr,
			PPSMC_MSG_SetHardMinByFreq,
			(PPCLK_UCLK << 16) | dpm_table->dpm_state.hard_min_level)),
			"[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
			return ret);

hard_min_level is adjusted if disable_mclk_switching is set on line 3497.


	disable_mclk_switching = ((1 < hwmgr->display_config->num_display) &&
				  !hwmgr->display_config->multi_monitor_in_sync) ||
				  vblank_too_short;

	/* Hardmin is dependent on displayconfig */
	if (disable_mclk_switching) {
		dpm_table->dpm_state.hard_min_level = dpm_table->dpm_levels[dpm_table->count - 1].value;
		for (i = 0; i < data->mclk_latency_table.count - 1; i++) {
			if (data->mclk_latency_table.entries[i].latency <= latency) {
				if (dpm_table->dpm_levels[i].value >= (hwmgr->display_config->min_mem_set_clock / 100)) {
					dpm_table->dpm_state.hard_min_level = dpm_table->dpm_levels[i].value;
					break;
				}
			}
		}
	}


Interestingly, this also checks for the presence of multiple displays, so we at least have a connection between the code, the error message and the trigger of the bug (multiple displays). As a very crude test, I tried forcing it on and compiling with

disable_mclk_switching = true;

No difference, so I also tried:

disable_mclk_switching = false;

Again, it didn't help. I will note that this code is identical in 5.0.13, so my test was really only checking for an incorrect value being set elsewhere in hwmgr->display_config->multi_monitor_in_sync or hwmgr->display_config->num_display. In 5.0.13 I do get mclk boosting: it idles at 351Mhz and boosts to 1001Mhz, so I don't think forcing the memory to max clock all the time is the correct solution.


I also diff'd vega20_hwmgr.c between 5.0.13 and 5.2.7 (I'll attach it). Here are a few things I noticed:


In vega20_init_smc_table, this line was added by commit https://github.com/torvalds/linux/commit/f5e79735cab448981e245a41ee6cbebf0e334f61:

+	data->vbios_boot_state.fclock = boot_up_values.ulFClk;

I don't know what fclock is, but this was never set in 5.0.13.


in vega20_setup_default_dpm_tables:

@@ -710,8 +729,10 @@ static int vega20_setup_default_dpm_tables(struct pp_hwmgr *hwmgr)
 		PP_ASSERT_WITH_CODE(!ret,
 				"[SetupDefaultDpmTable] failed to get fclk dpm levels!",
 				return ret);
-	} else
-		dpm_table->count = 0;
+	} else {
+		dpm_table->count = 1;
+		dpm_table->dpm_levels[0].value = data->vbios_boot_state.fclock / 100;
+	}


In 5.0.13, dpm_table->count is set to 0; in 5.2.7 it's set to 1 and a dpm_level is added based on fclock. fclock appears throughout as a new addition. I don't think this is the cause, but the addition of fclock may be worth exploring.
Comment 73 Tom B 2019-08-11 18:45:21 UTC
Created attachment 145026 [details] [review]
diff of vega20_hwmgr.c from 5.0.13 to 5.2.7
Comment 74 Sylvain BERTRAND 2019-08-11 22:31:24 UTC
Forcing the memory clock and voltage is not enough: the dc[en]x memory requests
should be given also the highest priority in the arbiter block. I don't recall
how it interacts with the dc[en]x watermarks, but they should be "disabled" or
"maxed out". Basically, whatever the 3D/compute/(vcn|vce/uvd) load, the dc[en]x
will always come first (due to the realtime nature of display data transmission
to monitors). Oh and of course, the smu/smc should not manage the dc[en]x. Very
probably, there are some smc/smu commands to do that.

If the GPU did not crash with dpm disabled as a whole, the proper way to
proceed would be to start from there and step by step add dpm features and see
when it starts crashing. It's not a small task since dpm code paths may be
scattered all over the code.
Comment 75 ReddestDream 2019-08-11 23:44:16 UTC
>Here's some additional investigation.

>[SetUclkToHightestDpmLevel] Set hard min uclk failed! Appears as one of the first errors in dmesg. This is from vega20_hwmgr.c:3354 and triggered by:

I agree that [SetUclkToHightestDpmLevel] is probably the key to all this as it always seems to be the first thing that fails after dysregulation occurs. The "Failed to send message 0x28, response 0x0" errors show that the driver is sending wrong or at least wrongly timed commands to the GPU that eventually cascade into complete failure.

>Again, it didn't help. I will note that this code is identical in 5.0.13 

I have also been unable to find changed code since 5.0 that could be directly connected to display detect/init/enumeration issues on Radeon VII/Vega 20. This is why I've come to suspect the error is triggered indirectly in a way that will probably not be obvious and by code that was likely flawed from the beginning of Radeon VII/Vega 20 support.

This is also why I was hopeful that 5.3-rc2 would fix this issue since it has commits that do seem to affect display detection on AMD GPUs. Alas, it did not. :(

>If the GPU did not crash with dpm disabled as a whole, the proper way to
proceed would be to start from there and step by step add dpm features and see
when it starts crashing. It's not a small task since dpm code paths may be
scattered all over the code.

Unfortunately, it does look like going through and slowly disabling features and/or bisecting might be the only way to find how this issue got started. At least if we could narrow it down, we might be in better shape. :/

I must admit I don't have much experience with graphics drivers and when I tell other people about this issue, they immediately want to blame X or Mesa until I explain that I can get these errors w/o starting any graphics at all. lol.

In any case, I really appreciate your testing Tom B. And any advice you might have on debugging, Sylvain BERTRAND, is greatly appreciated. :)
Comment 76 Sylvain BERTRAND 2019-08-12 03:12:57 UTC
> Unfortunately, it does look like going through and slowing disabling features
> and/or bisecting might be the only way to find how this issue got started. At
> least if we could narrow it down, we might be in better shape. :/

I guess, you are good for a bisection if you have a "working" kernel.
Comment 77 ReddestDream 2019-08-12 03:29:22 UTC
>I guess, you are good for a bisection if you have a "working" kernel.

The thing is, based on everything here, I'm not convinced that 5.0.13 has zero issues, only that it seems to have fewer. But yeah, I don't see anywhere else to go but bisection from 5.0.13 to 5.1. That should at least find something . . .
Comment 78 Chris Hodapp 2019-08-12 05:18:18 UTC
> I don't see anywhere else to go but bisection from 5.0.13 to 5.1. That should at least find something . . .

I tried something like that before, but a huge portion of the commits in that range won't build kernels that can boot (at least on my system). I ended up resorting to reverting individual vega20-affecting commits out of 5.1. See my results far above in the thread (though someone else willing to spend more time on a deeper analysis of the code could probably take my approach much further).
Comment 79 ReddestDream 2019-08-12 05:58:56 UTC
>I tried something like that before but a huge portion of the commits in that range won't build kernels that can boot (at least on my system).

It's interesting that you found d1a3e239a6016f2bb42a91696056e223982e8538 to improve the issue:

https://github.com/torvalds/linux/commit/d1a3e239a6016f2bb42a91696056e223982e8538#diff-0bc07842bc28283d64ffa6dd2ed716de

From Tom B.'s and my review of the code, it seems very likely that somehow a failure to set a hard minimum properly is at the heart of the issue. 

>This brings me to the second thing: When looking through the commits, I noticed that there were multiple commits that claim to prevent or reduce crashing in high-resolution situations (one references 5k displays, another references 3+ 4k displays).

Yeah. I have 2 4K displays as well. But I don't think it should really be straining the card. These commits are probably overzealous for Radeon VII. Rather it could be that at least part of the issue, especially the excessive power draw at idle, is just due to these commits artificially setting minimums very high. In fact, that could be why it's stable at all with just one monitor, since the code to set the minimums up is only being triggered when there are more monitors connected.

I'd suspect a boottime configuration issue too, but others have reported instability even when the monitors are hotplugged later on. So, it seems like maybe the monitor detect might at least partially be okay, but the follow-through with raising the clock minimums is broken. I suspect the issue is in the code calculating the minimum to set, so the driver gets stuck trying to send incomplete/incorrect values to the card.

https://bbs.archlinux.org/viewtopic.php?id=247733

It does make me wonder if it's worth testing like 2 simple 1080p 60 Hz displays. Maybe that wouldn't trigger this issue. Not that that would really be of use to me. But it might help distinguish between just monitor detect generally being broken and "high monitor load" being broken . . .
Comment 80 Tom B 2019-08-12 13:21:03 UTC
> I tried something like that before but a huge portion of the commits in that range won't build kernels that can boot (at least on my system). I ended up resorting to trying reverting individual vega20-affecting  commits out of 5.1. See my results far above in the thread (though someone else willing to spend more time doing a deeper analysis of the code could probably take my approach much further).

That's why my focus has been finding places in the code where something different happens based on the number of displays. Though this may be a futile avenue of exploration, as it could just be an issue of additional memory bandwidth requirements, or even something that should be done differently with 2 displays but isn't.

> It does make me wonder if it's worth testing like 2 simple 1080p 60 Hz displays. Maybe that wouldn't trigger this issue. Not that that would really be of use to me. But it might help distinguish between just monitor detect generally being broken and "high monitor load" being broken . . .

This would be an interesting test, but I think 1080p 60 Hz monitors with DisplayPort are fairly uncommon and I don't have any to test with. My guess is that anyone with a Radeon VII, a high-end card with 16 GB of VRAM, is likely to have a high-end display, which could equally explain why there are no reports here of people running 1080p 60 Hz displays.

My next test is going to be logging dpm_table->dpm_state.hard_min_level on line 3354 (just before it's sent to the smc) on both 5.0.13 and 5.2.7 to see if the same hard_min_level value is sent to the smc on both kernels. This will at least let us know whether it's something that's incorrectly setting hard_min_level or something that prevents the smc from accepting the value. My hunch from my previous tests is that it's the latter, but I'll try it and report back.

I know nothing about driver development so I have no idea how this stuff should work, I can only compare the differences between 5.0.13 and later kernels.

Anyway, thanks everyone for your input. Any information, even on things that you tried and didn't work, is valuable as it can help us narrow down the problem.
Comment 81 Tom B 2019-08-12 14:34:52 UTC
Created attachment 145038 [details]
5.2.7 dmesg with hard_min_level logged

As mentioned in the previous post, I started logging the value of hard_min_level. I hadn't realised that vega20_set_uclk_to_highest_dpm_level would be called so many times.

Here's what I found: The value of hard_min_level is 1001 in both 5.0.13 and 5.2.7 so the issue is not the value from the dpm table. The dpm table is probably correct. Something prevents smum_send_msg_to_smc_with_parameter accepting the value.

However, what is interesting is that it doesn't always fail.


[    4.082105] amdgpu: [powerplay] hard_min_level: 1001
[    4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:44:00.0 on minor 0
[    4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!

Each hard_min_level line in the log is from vega20_set_uclk_to_highest_dpm_level and there are multiple calls to it, which don't fail, before the card is initialised.


This is from 5.2.7:

[    3.698907] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    4.082105] amdgpu: [powerplay] hard_min_level: 1001
[    4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:44:00.0 on minor 0
[    4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[    5.361482] amdgpu: [powerplay] Failed to send message 0x28, response 0x0


And the same from 5.0.13:

[    3.352380] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    3.722422] amdgpu: [powerplay] hard_min_level: 1001
[    3.766269] amdgpu: [powerplay] hard_min_level: 1001
[    4.029679] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:44:00.0 on minor 0


There are a couple of things here:

1. vega20_set_fclk_to_highest_dpm_level is called twice between the "ring vce2" line and "Initialized"

2. My patched code looks like this:

		pr_err("hard_min_level: %d\n",
					dpm_table->dpm_state.hard_min_level);

		PP_ASSERT_WITH_CODE(!(ret = smum_send_msg_to_smc_with_parameter(hwmgr,
				PPSMC_MSG_SetHardMinByFreq,
				(PPCLK_UCLK << 16 ) | dpm_table->dpm_state.hard_min_level)),
				"[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
				return ret);

Yet the log shows:

- My debug line 
- Initialized amdgpu 3.32.0 20150101 for 0000:44:00.0 on minor 0
- [SetUclkToHightestDpmLevel] Set hard min uclk failed!

So initialization is happening between (and possibly as a result of) sending the message and getting the response.
Comment 82 Tom B 2019-08-12 15:34:46 UTC
In addition, I will note that the file vega20_baco.c has been added in 5.1 

details: https://www.phoronix.com/scan.php?page=news_item&px=AMD-Vega-12-BACO


commit: https://github.com/torvalds/linux/commit/0c5ccf14f50431d0196b96025c878ae9f45676a9#diff-c2d82e6f1326b5b4e0a09c9cb42cbcc2 


This seems like quite a large change, and it requires a special "workaround" for Vega 20. Unfortunately, it's also a substantial restructure of the driver, so I cannot just revert that single commit.

I mention this because part of the problem I am seeing is the wrong wattage reading. I wonder whether BACO wrongly tries to turn off a part of the card that is required for a secondary monitor and as such puts the card in an invalid state.

I'm going to see if I can disable/revert BACO entirely to at least rule it out.
Comment 83 ReddestDream 2019-08-12 15:42:17 UTC
> Here's what I found: The value of hard_min_level is 1001 in both 5.0.13 and 5.2.7 so the issue is not the value from the dpm table. The dpm table is probably correct. 

Fantastic! Glad you tested this. I had suspected that hard_min_level was bogus and that the card was rejecting the bogus value, which is why it was failing. Glad to know that's not the case.

> However, what is interesting is that it doesn't always fail.

Yeah. I've had boots where I have my 2 4K DP monitors in and I don't get powerplay error on boot. In fact, it can go a bit and seem stable. But then the powerplay errors suddenly (not related to some high load on the card) start showing up again and the graphics become unstable. Similarly others have reported that on hotplugging a second monitor after boot, the powerplay errors will start showing up.

So, maybe there is a timing problem involved with sending the message. It's generally a question of when rather than if it's going to fail.

> 1. vega20_set_fclk_to_highest_dpm_level is called twice between the "ring vce2" line and "Initialized"

Is it always called twice? Even on 5.2.7? Because it looks like it might get called two times right before "Initialized" on 5.0.13 but then only once on 5.2.7 before "Initialized" kicks in. Maybe "Initialized" is interrupting on 5.2.7 but not on 5.0.13. It's possible that Initialization of the card is messing up values that powerplay needs to read off the card or making the card unavailable for receiving messages or something . . .

> So initialization is happening between (and possibly a result of) sending the message and getting the response

Yeah. Something is definitely happening while vega20_set_uclk_to_highest_dpm_level is running . . . Not 100% sure that's really problematic tho . . . But it could be an atomicity issue. Need to figure out what exactly is generating the line "[drm] Initialized amdgpu 3.27.0 20150101 for 0000:44:00.0 on minor 0." Looks like it's coming from the drm core rather than amdgpu specifically.

> I'm going to see if I can disable/revert BACO entirely to at least rule it out.

I thought BACO was reverted for Vega 20 here:

https://github.com/torvalds/linux/commit/7db329e57b90ddebcb58fc88eedbb3082d22a957#diff-8a4d25be8ad5d9c3ff27bb54b678dab2

Your commit seems to have been introduced in 5.2-rc1, not 5.1.
Comment 84 ReddestDream 2019-08-12 15:53:55 UTC
>Need to figure out what exactly what is generating the line "[drm] Initialized amdgpu 3.27.0 20150101 for 0000:44:00.0 on minor 0."

That "Initialized amdgpu" message seems to be coming from here:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_drv.c#L994
Comment 85 Tom B 2019-08-12 15:56:32 UTC
> Yeah. I've had boots where I have my 2 4K DP monitors in and I don't get powerplay error on boot. In fact, it can go a bit and seem stable.

In addition to that, vega20_set_fclk_to_highest_dpm_level is called several times before the card is initialized, and even on 5.2.7 those calls work. Something happens during or just before the initialization stage that stops smum_send_msg_to_smc_with_parameter from accepting 1001 as a valid value, as it does until that point.

I think you're right about BACO, it was worth looking at but I applied a quick hack to ensure it's disabled:

int vega20_baco_set_state(struct pp_hwmgr *hwmgr, enum BACO_STATE state)
{
	return 0;
}

int vega20_baco_get_capability(struct pp_hwmgr *hwmgr, bool *cap)
{
	*cap = false;
	return 0;
}

No difference, I still get the errors and wrong wattage so unless BACO is somehow on by default and only turned off in the proper version of this code, we can rule it out.
Comment 86 ReddestDream 2019-08-12 16:32:01 UTC
>In addition to that, vega20_set_fclk_to_highest_dpm_level is called several times before the card is initialized, and even on 5.2.7 those calls work. Something happens during or just before the initialization stage that stops smum_send_msg_to_smc_with_parameter from accepting 1001 as a valid value, as it does until that point.

Could be we've got a race condition between the powerplay setup and amdgpu handing off the card to drm_dev_register to advertise it for normal use.

drm_dev_register is responsible for the "[drm] Initialized" message:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_drv.c#L994

And it seems like amdgpu calls it here:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L1054

Odd that it's doing this if powerplay still has more work to do. And that might be why vega20_set_uclk_to_highest_dpm_level fails that last time.
Comment 87 Tom B 2019-08-12 16:38:54 UTC
> Could be we've got a race condition between the powerplay setup and amdgpu
handing off the card to drm_dev_register to advertise it for normal use.

The question then becomes: why doesn't the race condition happen with only one screen? Perhaps it's a matter of speed. With a single display, the driver can detect the displays, read/parse the EDID data, and initialize in time. But then that doesn't explain why the crash still occurs if you boot with one DisplayPort monitor and attach another after X is running.

One thing I've been trying to work out is the difference between vega20_ppt.c and vega20_hwmgr.c, as they both contain slightly different or identical versions of the same functions. It looks like the functions in vega20_hwmgr.c take precedence, but it's strange to see this duplication, and both files are worked on in the commit history.

Take a look at vega20_set_uclk_to_highest_dpm_level and vega20_apply_clocks_adjust_rules in both for examples.
Comment 88 ReddestDream 2019-08-12 16:47:31 UTC
>The question then becomes: Why doesn't the race condition happen with only one screen? Perhaps it's a matter of speed. With a single display, the driver detect the displays, read/parse the EDID data, initialize in time. But then that doesn't explain why the crash still occurs if you boot with one DisplayPort monitor and attach another after X is running.

I do suspect it's a matter of speed and complexity when you have more monitors. Also maybe the clock it tries to set (the value of hard_min_level) is different if you only have one monitor and somehow that takes more time (resetting it away from some default).

I do wonder if maybe in:

"[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
				return ret);

It should return -EINVAL instead. Maybe then it would reset and try again instead of just ignoring it and continuing with initialization anyway, leading to instability.

>One thing I've been trying to work out is the difference between vega20_ppt.c and vega20_hwmgr.c, as they both contain slightly different or identical versions of the same functions. It looks like the functions in vega20_hwmgr.c take precedence but it's strange to see this duplication and both files are worked on in the commit history.

Hmm. That is interesting. I'll take a look.
Comment 89 Tom B 2019-08-12 16:57:16 UTC
> It should return -EINVAL instead. Maybe then it would reset and try again instead of just ignoring it and continuing with initialization anyway, leading to instability.

If you look at vega20_send_msg_to_smc_with_parameter: 

static int vega20_send_msg_to_smc_with_parameter(struct pp_hwmgr *hwmgr,
		uint16_t msg, uint32_t parameter)
{
	struct amdgpu_device *adev = hwmgr->adev;
	int ret = 0;

	vega20_wait_for_response(hwmgr);

	WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0);

	WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, parameter);

	vega20_send_msg_to_smc_without_waiting(hwmgr, msg);

	ret = vega20_wait_for_response(hwmgr);
	if (ret != PPSMC_Result_OK)
		pr_err("Failed to send message 0x%x, response 0x%x\n", msg, ret);

	return (ret == PPSMC_Result_OK) ? 0 : -EIO;
}


It returns 0 on success and -EIO on failure, which is then in turn returned from vega20_set_fclk_to_highest_dpm_level. Where did you see the check/retry on -EINVAL? Perhaps -EIO should be -EINVAL?
Comment 90 Tom B 2019-08-12 17:40:12 UTC
I'm not sure this is helpful but I managed to somewhat test the race condition theory.

If you follow the callstack:

vega20_set_fclk_to_highest_dpm_level -> smum_send_msg_to_smc_with_parameter -> vega20_send_msg_to_smc_with_parameter -> vega20_wait_for_response -> phm_wait_for_register_unequal you find this code in smu_helper.c:

int phm_wait_on_register(struct pp_hwmgr *hwmgr, uint32_t index,
			 uint32_t value, uint32_t mask)
{
	uint32_t i;
	uint32_t cur_value;

	if (hwmgr == NULL || hwmgr->device == NULL) {
		pr_err("Invalid Hardware Manager!");
		return -EINVAL;
	}

	for (i = 0; i < hwmgr->usec_timeout; i++) {
		cur_value = cgs_read_register(hwmgr->device, index);
		if ((cur_value & mask) == (value & mask))
			break;
		udelay(1);
	}

	/* timeout means wrong logic*/
	if (i == hwmgr->usec_timeout)
		return -1;
	return 0;
}


The timeout there is interesting. I increased it.


	for (i = 0; i < hwmgr->usec_timeout*10; i++) {
		cur_value = cgs_read_register(hwmgr->device, index);
		if ((cur_value & mask) == (value & mask))
			break;
		udelay(1);
	}


The PC takes significantly longer to boot (10 or so seconds when it's usually instant) and the error still occurs. So I'm not sure it's just a matter of waiting.
Comment 91 ReddestDream 2019-08-12 18:37:40 UTC
>It returns 0 on success and -EIO on failure, which is then in turn returned from vega20_set_fclk_to_highest_dpm_leve. Where did you see the check/retry on EINVAL? Perhaps -EIO should be -EINVAL?

I didn't find check/retry code. It was more just a thought that maybe we could keep vega20_set_uclk_to_highest_dpm_level from just returning despite the error and allowing further initialization to proceed. Even if it crashed, that might even be helpful, since it's not clear whether it's the initialization (drm_dev_register) or something else silent in the logs that is changing something and causing vega20_set_uclk_to_highest_dpm_level to fail where we know it succeeded so many times before.

>I'm not sure this is helpful but I managed to somewhat test the race condition theory.

If there is a race, I'm not sure it's in the time the driver waits for the hardware registers to respond and/or the value to set. But it's still enlightening.

At this point it seems more likely that something else we aren't seeing in the logs is breaking vega20_set_uclk_to_highest_dpm_level in the last moments (unlikely to be due to the dpm_state.hard_min_level value); it falls through, drm_dev_register runs, and the initialization message prints. amdgpu doesn't consider "[SetUclkToHightestDpmLevel] Set hard min uclk failed!" to be a significant enough error to stop initialization. But maybe it should . . .
Comment 92 ReddestDream 2019-08-13 03:15:19 UTC
>If you follow the callstack:

I've been thinking all this over. The only thing unfortunately that really sticks out at me still is how Chris Hodapp says that reverting this commit:

https://github.com/torvalds/linux/commit/d1a3e239a6016f2bb42a91696056e223982e8538#diff-0bc07842bc28283d64ffa6dd2ed716de

Seems to improve things. Considering that we now know from Tom B.'s work that dpm_state.hard_min_level is apparently calculated correctly and stable the entire time, it doesn't make sense that reverting this commit could fix anything. 

The code seems very similar to what we see in vega20_notify_smc_display_config_after_ps_adjustment, near where we get the "[SetHardMinFreq] Set hard min uclk failed!" error. Maybe this smum_send_msg_to_smc_with_parameter call gets through where others fail because of the formatting or something?

Thanks again Tom B. for all your testing. I'd like to do some tests of my own, but time's just not permitting for me ATM. Hoping to be more free next weekend. :/
Comment 93 Chris Hodapp 2019-08-13 03:33:37 UTC
Note: It might be good for someone else to double-check my conclusion before too much stock is put into it. Scientific method and all that.
Comment 94 Tom B 2019-08-13 13:05:17 UTC
Reverting d1a3e239a6016f2bb42a91696056e223982e8538 didn't fix it for me. But that commit may give some insight because it is related to uclk which is the first error we get.

I also tried globally increasing usec_timeout, as it's used in a few places (patch below). This makes the PC take about a minute to boot, so clearly the GPU is in an invalid state before these timeouts are hit, and each subsequent call to smum_send_msg_to_smc_with_parameter then adds a delay because each call times out. Whatever happens puts the card into a state that it can't recover from.

The next step is to try to find where vega20_set_uclk_to_highest_dpm_level is called from and see what happens just before the call to this function.



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f4ac632a87b2..9b878c74b17e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2418,7 +2418,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	adev->pdev = pdev;
 	adev->flags = flags;
 	adev->asic_type = flags & AMD_ASIC_MASK;
-	adev->usec_timeout = AMDGPU_MAX_USEC_TIMEOUT;
+	adev->usec_timeout = AMDGPU_MAX_USEC_TIMEOUT*10;
 	if (amdgpu_emu_mode == 1)
 		adev->usec_timeout *= 2;
 	adev->gmc.gart_size = 512 * 1024 * 1024;
diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
index a7e8340baf90..a6b2bc4277ef 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
@@ -84,7 +84,7 @@ int hwmgr_early_init(struct pp_hwmgr *hwmgr)
 	if (!hwmgr)
 		return -EINVAL;

-	hwmgr->usec_timeout = AMD_MAX_USEC_TIMEOUT;
+	hwmgr->usec_timeout = AMD_MAX_USEC_TIMEOUT*10;
 	hwmgr->pp_table_version = PP_TABLE_V1;
 	hwmgr->dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
 	hwmgr->request_dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
Comment 95 Tom B 2019-08-13 13:35:11 UTC
So here's something interesting. In 5.0.13 there is no function vega20_display_config_changed.  This function issues smu_send_smc_msg_with_param(smu, SMU_MSG_NumOfDisplays, 0);

In fact, in 5.0.13 there is no reference at all to SMU_MSG_NumOfDisplays anywhere in the amdgpu driver. 

Which means the way the number of displays is configured changed after 5.0.13, or in 5.0.13 it was done with a hardcoded value instead of a constant.
Comment 96 Tom B 2019-08-13 15:20:05 UTC
Created attachment 145047 [details]
logging anywhere the number of screens is set

Again, no closer to a fix but another thing to rule out. In addition to SMU_MSG_NumOfDisplays, PPSMC_MSG_NumOfDisplays is also used.

I put a debug message anywhere PPSMC_MSG_NumOfDisplays or SMU_MSG_NumOfDisplays is set and put else blocks in places where it may have been set:

	if ((data->water_marks_bitmap & WaterMarksExist) &&
	    data->smu_features[GNLD_DPM_DCEFCLK].supported &&
	    data->smu_features[GNLD_DPM_SOCCLK].supported) {

		pr_err("vega20_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to %d\n", hwmgr->display_config->num_display);

		result = smum_send_msg_to_smc_with_parameter(hwmgr,
			PPSMC_MSG_NumOfDisplays,
			hwmgr->display_config->num_display);
	}
	else {
		pr_err("vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays\n");
	}

	return result;
}


Here's what I found:

- The functions dealing with screens in vega20_ppt.c are never used (vega20_display_config_changed, vega20_pre_display_config_changed) and can be ignored for our further tests

- The line: 

result = smum_send_msg_to_smc_with_parameter(hwmgr,
		PPSMC_MSG_NumOfDisplays, hwmgr->display_config->num_display);

Is never executed, it always triggers the else block so PPSMC_MSG_NumOfDisplays is never set using num_display.

- The same thing happens in 5.0.13. When I saw the above result, I had hoped that the problem was that smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_NumOfDisplays, hwmgr->display_config->num_display); was never called with the correct number of displays. Unfortunately the behaviour is the same on 5.0.13: PPSMC_MSG_NumOfDisplays is only ever set to zero in both versions of the kernel.


Unfortunately this doesn't get us any closer.


The instruction is sent a lot more in 5.0.13 though. 

5.0.13:

[    3.475471] amdgpu 0000:44:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    3.475472] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    3.475508] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
[    3.794037] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
[    3.800180] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
[    3.833502] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
[    3.833647] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
[    4.153232] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:44:00.0 on minor 0
[    4.664044] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0


5.2.7
[    3.711028] amdgpu 0000:44:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    3.711028] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    4.086310] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
[    4.385470] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:44:00.0 on minor 0
[    4.522398] amdgpu: [powerplay] Failed to send message 0x28, response 0x0

Notice that vega20_pre_display_configuration_changed_task is run 5 times between the ring lines and the initialization line in 5.0.13 and only once in 5.2.7.

This might not mean anything, but it could be another clue that initialization is happening before the card is really ready.
Comment 97 Tom B 2019-08-13 17:11:35 UTC
I've been investigating this:

https://github.com/torvalds/linux/commit/94ed6d0cfdb867be9bf05f03d682980bce5d0036

Because vega20 doesn't export display_configuration_change, it jumps to the newly added else block and calls smu_display_configuration_change. This didn't happen in 5.0.13. It's not the cause of this, as I commented it out and it still breaks.
I'll also note that pp_display_cfg->display_count is correct at this point; it shows 2 for me with 2 screens connected. But why doesn't vega20 export display_configuration_change? It has display_config_changed, and I can't find where that's called from, so I wonder whether display_config_changed should be the one called at this point.
Comment 98 Sylvain BERTRAND 2019-08-13 18:33:04 UTC
> The code seems very similar to what we see in
> vega20_notify_smc_display_config_after_ps_adjustment near where we get the "
> [SetHardMinFreq] Set hard min uclk failed!" Maybe this
> smum_send_msg_to_smc_with_parameter get through where others fail because of
> the formatting or something?

It seems there is a patch from amd about smu v11 and this smc/smu command.
I may be wrong though.
Comment 99 Tom B 2019-08-14 15:44:45 UTC
Created attachment 145062 [details]
a list of commits 5.0.13 - 5.1.0

Attached is a list of all amdgpu and powerplay commits from 5.0.13 - 5.1.0. 

I have tried reverting the following which looked most likely culprits:

919a94d8101ebc29868940b580fe9e9811b7dc86 drm/amdgpu: fix CPDMA hang in PRT mode for VEGA20

f7b1844bacecca96dd8d813675e4d8adec02cd66 drm/amdgpu: Update gc golden setting for vega family

d25689760b747287c6ca03cfe0729da63e0717f4 drm/amdgpu/display: drm/amdgpu/display: Keep malloc ref to MST port  -- A change to the way displayport connectors are handled, looked promising.

db64a2f43c1bc22c5ff2d22606000b8c3587d0ec drm/amd/powerplay: fix possible hang with 3+ 4K monitors


I also looked at that last one in detail as it seems very close to this bug. Nothing in the code looks for 3+ monitors or even 4k. It only actually looks for > 1 monitor.

Although it's based on disable_mclk_switching, I also tried forcing disable_fclk_switching to true and false; neither had any effect. The result is that mclk would be calculated based on screens but fclk would be forced on/off. It didn't help, but I can't help thinking that this commit is a little too close to this issue to be irrelevant.
Comment 100 Tom B 2019-08-14 17:30:55 UTC
I've been trying to work backwards to find the place where screens get initialised and eventually call vega20_pre_display_configuration_changed_task.

vega20_pre_display_configuration_changed_task is exported as pp_hwmgr_func::display_config_changed

Which is called from hardwaremanager.c:phm_pre_display_configuration_changed

phm_pre_display_configuration_changed is called from hwmgr.c:hwmgr_handle_task:

	switch (task_id) {
	case AMD_PP_TASK_DISPLAY_CONFIG_CHANGE:
		ret = phm_pre_display_configuration_changed(hwmgr);
		

pp_dpm_dispatch_tasks is exported as amd_pm_funcs::dispatch_tasks, which is called from amdgpu_dpm_dispatch_task, which is called in amdgpu_pm.c:


void amdgpu_pm_compute_clocks(struct amdgpu_device *adev)
{
	int i = 0;

	if (!adev->pm.dpm_enabled)
		return;

	if (adev->mode_info.num_crtc)
		amdgpu_display_bandwidth_update(adev);

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];
		if (ring && ring->sched.ready)
			amdgpu_fence_wait_empty(ring);
	}

	if (is_support_sw_smu(adev)) {
		struct smu_context *smu = &adev->smu;
		struct smu_dpm_context *smu_dpm = &adev->smu.smu_dpm;
		mutex_lock(&(smu->mutex));
		smu_handle_task(&adev->smu,
				smu_dpm->dpm_level,
				AMD_PP_TASK_DISPLAY_CONFIG_CHANGE);
		mutex_unlock(&(smu->mutex));
	} else {
		if (adev->powerplay.pp_funcs->dispatch_tasks) {
			if (!amdgpu_device_has_dc_support(adev)) {
				mutex_lock(&adev->pm.mutex);
				amdgpu_dpm_get_active_displays(adev);
				adev->pm.pm_display_cfg.num_display = adev->pm.dpm.new_active_crtc_count;
				adev->pm.pm_display_cfg.vrefresh = amdgpu_dpm_get_vrefresh(adev);
				adev->pm.pm_display_cfg.min_vblank_time = amdgpu_dpm_get_vblank_time(adev);
				/* we have issues with mclk switching with refresh rates over 120 hz on the non-DC code. */
				if (adev->pm.pm_display_cfg.vrefresh > 120)
					adev->pm.pm_display_cfg.min_vblank_time = 0;
				if (adev->powerplay.pp_funcs->display_configuration_change)
					adev->powerplay.pp_funcs->display_configuration_change(
									adev->powerplay.pp_handle,
									&adev->pm.pm_display_cfg);
				mutex_unlock(&adev->pm.mutex);
			}
			amdgpu_dpm_dispatch_task(adev, AMD_PP_TASK_DISPLAY_CONFIG_CHANGE, NULL);
		} else {
			mutex_lock(&adev->pm.mutex);
			amdgpu_dpm_get_active_displays(adev);
			amdgpu_dpm_change_power_state_locked(adev);
			mutex_unlock(&adev->pm.mutex);
		}
	}
}


This is the only place I can see AMD_PP_TASK_DISPLAY_CONFIG_CHANGE being called from, which eventually is where vega20_pre_display_configuration_changed_task gets called.

Presumably the code:

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];
		if (ring && ring->sched.ready)
			amdgpu_fence_wait_empty(ring);
	}



is what generates 


[    3.683718] amdgpu 0000:44:00.0: ring gfx uses VM inv eng 0 on hub 0
[    3.683719] amdgpu 0000:44:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    3.683720] amdgpu 0000:44:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    3.683720] amdgpu 0000:44:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    3.683721] amdgpu 0000:44:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    3.683722] amdgpu 0000:44:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    3.683722] amdgpu 0000:44:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    3.683723] amdgpu 0000:44:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    3.683724] amdgpu 0000:44:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    3.683724] amdgpu 0000:44:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    3.683725] amdgpu 0000:44:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    3.683726] amdgpu 0000:44:00.0: ring page0 uses VM inv eng 1 on hub 1
[    3.683726] amdgpu 0000:44:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    3.683727] amdgpu 0000:44:00.0: ring page1 uses VM inv eng 5 on hub 1
[    3.683728] amdgpu 0000:44:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    3.683728] amdgpu 0000:44:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    3.683729] amdgpu 0000:44:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    3.683730] amdgpu 0000:44:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    3.683730] amdgpu 0000:44:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    3.683731] amdgpu 0000:44:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    3.683731] amdgpu 0000:44:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    3.683732] amdgpu 0000:44:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    3.683733] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1

In dmesg. I'll add a pr_err() to verify this.  If so, it means our issue is introduced somewhere between that for loop and amdgpu_dpm_dispatch_task in this function. 


amdgpu_pm_compute_clocks is called from amdgpu_dm_pp_smu.c:dm_pp_apply_display_requirements, which is called in dce_clk_mgr.c in two places: dce_pplib_apply_display_requirements and dce11_pplib_apply_display_requirements. I don't know which is used for the VII; I'll add some logging to verify.

But here's something that may be relevant to this bug. In dce11_pplib_apply_display_requirements there's a check for the number of displays:


	/* TODO: is this still applicable?*/
	if (pp_display_cfg->display_count == 1) {
		const struct dc_crtc_timing *timing =
			&context->streams[0]->timing;

		pp_display_cfg->crtc_index =
			pp_display_cfg->disp_configs[0].pipe_idx;
		pp_display_cfg->line_time_in_us = timing->h_total * 10000 / timing->pix_clk_100hz;
	}
	
	
So there's something that is different when more than one display is connected. That's as far as I got walking backwards through the code. I'll note that this check was also present in 5.0.1, but it could be that something now relies on crtc_index or line_time_in_us, which wasn't a problem previously, as these values only appear to be set when there is a single display.
Comment 101 ReddestDream 2019-08-16 05:58:51 UTC
Grasping at straws a bit here, but it occurred to me that maybe Linux kernel testing on Radeon VII was done on an early VBIOS that didn't have full UEFI support yet. We know that AMD had to issue a VBIOS update for Radeon VII to fix UEFI support shortly after the launch. So maybe enabling the CSM/Legacy Support in the BIOS, which does impact early GPU initialization, might have some effect on the multimonitor problem? Something I plan to test, but I wanted to share the idea in case someone else has a chance first.

>This might not mean anything, but it could be another clue that initialization is happening before the card is really ready.

Also, I considered that both of my monitors have audio out support. I wonder if audio initialization might be the missing piece to the puzzle, the thing that interrupts/changes the state of the card and prevents smu_send_smc_msg_with_param from working where it did before. I know that in the past with previous AMD cards, display audio has been buggy . . .
Comment 102 Tom B 2019-08-16 10:10:26 UTC
> Grasping at straws a bit here, but it occurred to me that maybe Linux kernel testing on Radeon VII was done on an early VBIOS that didn't have full UEFI support yet. We know that AMD had to issue a VBIOS update for Radeon VII to fix UEFI support shortly after the launch. So maybe enabling the CSM/Legacy Support in the BIOS, which does impact early GPU initialization, might have some effect on the multimonitor problem? Something I plan to test, but I wanted to share the idea in case someone else has a chance first.

I had already tried that unfortunately, I tried the following BIOS options:

CSM on/off
IOMMU on/off
PCIE speed 16x/4x (the only options my motherboard allowed for some reason)

Having said that, I didn't try booting via GRUB in BIOS mode as I didn't want to change my partition table, so it's possible that although I had enabled CSM, it only provided legacy support and the system was still booting in UEFI mode.
Comment 103 Peter Hercek 2019-08-16 10:35:47 UTC
I boot in BIOS mode and I'm still getting these errors. Though they are rare in my case with the "better" kernels (around once a week).

Just a note: there were tearing issues in the Windows drivers for the Radeon VII too. One of the causes was different refresh rates on different monitors; the recommendation was to set all refresh rates to 60 Hz (or a multiple of it) until the bug was fixed. In my case that is not completely possible: one monitor supports 60 Hz, but the other two support only 59.95 Hz, so there is a slight difference in the frequencies.
Comment 104 Tom B 2019-08-16 10:41:18 UTC
I did get very similar crashing when I was running HDMI + DP at different refresh rates (see https://bugs.freedesktop.org/show_bug.cgi?id=110510). I switched to DP + DP because HDMI + DP wasn't stable, so it could be related.

The tl;dr from that bug report (this was on 5.0.9):

- HDMI alone at 60hz works but the screen flickers off every 3-5 minutes
- HDMI alone works at 59.9hz without any flickering
- HDMI 60hz + DP 60hz works, but the HDMI screen flickers off every 3-5 minutes
- HDMI 59.94hz + DP 60hz freezes the PC instantly.

Unfortunately my monitors don't support displayport at 59.94hz so I couldn't test that combination as I think it would have worked. 

Still, it does tell us that these could be related and the issue could be syncing between the two displays.
Comment 105 Tom B 2019-08-16 13:10:14 UTC
> Also, I considered that both of my monitors have audio out support. I wonder if audio initialization might be the missing piece to the puzzle, the thing that interrupts/changes the state of the card and prevents smu_send_smc_msg_with_param from working where it did before. I know that in the past with previous AMD cards, display audio has been buggy . 

I just tried setting amdgpu.audio=0 and it didn't help. It doesn't rule out audio entirely though: the audio backend is probably still used as part of the connection to the monitor; I'd imagine the option just prevents the card from appearing as an output device.
Comment 106 Tom B 2019-08-16 13:18:37 UTC
Booting with amdgpu.dpm=0 on 5.2.7 works.

Performance is poor and, as expected, I cannot get any information about power states because /sys/kernel/debug/dri/0/amdgpu_pm_info doesn't exist. I'm guessing it runs at minimum clocks, as I get ~10-17fps in unigine-heaven instead of ~60-100.

It is a DPM issue of some kind, so although my earlier tests showed that hard_min_level was set correctly, there could still be an issue elsewhere in the DPM table.
Comment 107 ReddestDream 2019-08-16 14:17:31 UTC
> Booting with amdgpu.dpm=0 on 5.2.7 works.

> It is a DPM issue of some kind so although my earlier tests showed that hard_min_level was set correctly, it still could be an issue elsewhere in the DPM table.

Great news! At least now we have a better place to investigate . . .
Comment 108 ReddestDream 2019-08-16 21:06:15 UTC
> Booting with amdgpu.dpm=0 on 5.2.7 works.

Tom B., did you try booting with amdgpu.dpm=1 or amdgpu.dpm=2 (default is generally -1 for automatic)? Seems like one of those might enable the new experimental SW SMU v11 feature on Vega20 . . .

https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

https://lists.freedesktop.org/archives/amd-gfx/2019-January/030788.html?print=anzwix
Comment 109 Tom B 2019-08-16 22:14:48 UTC
Created attachment 145080 [details]
dmesg with amdgpu.dpm=2

> Tom B., did you try booting with amdgpu.dpm=1 or amdgpu.dpm=2 (default is generally -1 for automatic)? Seems like one of those might enable the new experimental SW SMU v11 feature on Vega20 . . .

Now that is interesting. dpm=-1 is the same as the default, and the default is 1 (enabled), so dpm=1 is what we've been using all along. But dpm=2 and the patch you linked to are interesting.

I tried it; it didn't help the crashing issue and I was stuck at 30W. As soon as I started sddm the system froze. I've attached my dmesg from the amdgpu.dpm=2 boot. It doesn't fix the issue but it does help answer a few questions I had:


1. The functions in vega20_ppt.c are used with this new patch, so that answers my earlier question: that's what the file is for and why it contains similar/identical functions.

2. It explains the difference I found in comment 97: This commit https://github.com/torvalds/linux/commit/94ed6d0cfdb867be9bf05f03d682980bce5d0036 has the new else block for smu_display_configuration_change which we now know is the software version of this function.


More importantly, though, knowing that enabling DPM causes the crash, this tells us either:

A) The bug is present in both versions of the vega20 code: vega20_hwmgr.c and vega20_ppt.c or..

B) The card reaches an invalid state before DPM is initialised and the card is fine until it receives a DPM change.

Given that two different versions of the code produce the same result, my hunch is that the problem is B. The card is not in a state where it's able to receive power changes.
Comment 110 ReddestDream 2019-08-16 23:19:25 UTC
> 1. The functions in vega20_ppt.c are used with this new patch so that answers my question from earlier, that's what this file is for and why it contains similar/identical functions.

I was hoping this was the case as the duplicated functions were confusing me too. Glad we got this figured out! :)

> I tried it, it didn't help the crashing issue and I was stuck at 30w. As soon as I started sddm the system froze. I've attached my dmesg from amdgpu.dpm=2 boot. It doesn't fix the issue but it does help answer a few questions I had:

This is disappointing tho. I was hoping that setting amdgpu.dpm=2 would use the more "actively developed" path and that would fix the issue. :/

> Given that two different versions of the code produce the same result, my hunch is that the problem is B. The card is not in a state where it's able to receive power changes.

I tend to agree, but it's still not clear why or how the card ends up in a bad state when commands to it via smu_send_smc_msg_with_param seem to just suddenly stop working. And given the amount of same/similar functions in vega20_hwmgr.c and vega20_ppt.c it's hard to rule out A entirely.

Since amdgpu.dpm=0 resolves the issue (albeit at the cost of being stuck at minimum clocks inherited from the VBIOS/GOP/UEFI/firmware), it seems that the card is starting out in a reasonable state and then being thrown into a bad state later by bad driver code. And that code is part of the DPM (Dynamic Power Management) system. We are pretty confident that dpm_state.hard_min_level is stable the whole time, so that's probably not what's throwing the card into a bad state. But perhaps another value in the DPM table is . . . 

It doesn't make intuitive sense that the soft min/max values would be problematic, since they are presumably "more flexible," but it's possible that they get calculated out of spec or something; logging them should be possible, like how dpm_state.hard_min_level was logged.
Comment 111 ReddestDream 2019-08-17 01:47:11 UTC
A few other ideas to ponder:

1. Looking into DPM, I found this commit for 5.1-rc1 that looks interesting:

https://github.com/torvalds/linux/commit/7ca881a8651bdeffd99ba8e0010160f9bf60673e

Looks like it exposes a "ppfeatures" interface on Vega 10 and later GPUs, including some code for Vega 20.

2. I also found two interesting commits that pertain to "doorbell" register initialization on Vega 20, also from 5.1-rc1. They might be related to setting up the GPU ASICs. I must admit I'm not exactly sure what these do . . .

https://github.com/torvalds/linux/commit/fd4855409f6ebe015406cd2b2ffa4fee4cd1f4a7

https://github.com/torvalds/linux/commit/828845b7c86c5338f6ca02aaaaf4b525718f31b2
Comment 112 ReddestDream 2019-08-17 02:15:16 UTC
More ideas:

3. Looking through the crash in sehellion's comment 45:

gfx_v9_0_ring_test_ring+0x19e/0x230 [amdgpu]
amdgpu_ring_test_helper+0x1e/0x90 [amdgpu]
gfx_v9_0_hw_fini+0x299/0x690 [amdgpu]
amdgpu_device_ip_suspend_phase2+0x6c/0xa0 [amdgpu]
amdgpu_device_ip_suspend+0x44/0x80 [amdgpu]
amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
amdgpu_device_gpu_recover+0x7b/0x7a3 [amdgpu]
amdgpu_job_timedout+0xfc/0x120 [amdgpu]

We see gfx_v9_0_ring_test and gfx_v9_0_hw_fini which both come from:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c

There's a 5.1-rc1 commit in this file pertaining to a "wave ID mismatch" that could cause deadlocks.

https://github.com/torvalds/linux/commit/41cca166cc57e75e94d888595a428d23a3bf4e36

Along with updated "golden values" for Vega in 5.1-rc1:

https://github.com/torvalds/linux/commit/919a94d8101ebc29868940b580fe9e9811b7dc86

https://github.com/torvalds/linux/commit/f7b1844bacecca96dd8d813675e4d8adec02cd66
Comment 113 ReddestDream 2019-08-17 02:37:53 UTC
4. 

> Given that two different versions of the code produce the same result, my hunch is that the problem is B. The card is not in a state where it's able to receive power changes.

Something to consider: In pretty much all the dmesg logs we see, amdgpu attempts to reset the GPU, sometimes successfully, and yet it still can't properly message the GPU afterward and we see the same sequence of failures starting with "amdgpu: [powerplay] Failed to send message 0x28, response 0x0 amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!"

Eventually we start to see: "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!"

This comes from:

https://github.com/torvalds/linux/commits/master/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

I'm not sure what the -125 error code indicates. My guess is ECANCELED (Operation Canceled), since -125 would be the negated errno 125.

https://github.com/torvalds/linux/blob/master/include/uapi/asm-generic/errno.h
Comment 114 ReddestDream 2019-08-17 03:16:29 UTC
5. Tom B., it is probably worth getting a full dmesg with your two monitors connected, on a relatively new 5.2.x kernel, using at least: amdgpu.dc_log=1 drm.debug=0x1e log_buf_len=2M

And anything else you might think of. Just to try to get more debug info. Thx!
Comment 115 Tom B 2019-08-17 13:37:15 UTC
I should have noted it earlier, but I had already tried reverting both "golden values" commits. I've no idea what they do, but reverting them didn't fix this crash.

One thing that would be insightful would be logging every call to smum_send_msg_to_smc_with_parameter and printing out message/parameter:

int smum_send_msg_to_smc_with_parameter(struct pp_hwmgr *hwmgr,
					uint16_t msg, uint32_t parameter)
{

This would cause a very busy log but we could see the last successful message that was sent and with the same log in 5.0.13 see if there are any obvious differences. It might be that the previous message causes the invalid state so knowing what that is could lead us towards the solution.

I don't think I have time to try it today, but if anyone is recompiling the code, adding

pr_err("msg: %d / parameter: %d\n", msg, parameter); 

to this function in smumgr.c would be a useful addition.

Also, for anyone who wants to try recompiling, here's a quick guide for Arch:

1. Get the kernel sources using asp as described here: https://wiki.archlinux.org/index.php/Kernel/Arch_Build_System then navigate to the created linux/repos/core-x86_64 directory.

2. You will need to run makepkg -s once to get it to download the sources.

3. You can set the kernel version in PKGBUILD: e.g. _srcver=5.2.7-arch1 or _srcver=5.0.13-arch1

4. If you want to revert one or more commits, put the revert commands in the prepare() block before the local src line:

  echo "$_kernelname" > localversion.20-pkgname

  git revert db64a2f43c1bc22c5ff2d22606000b8c3587d0ec --no-edit
  git revert f5e79735cab448981e245a41ee6cbebf0e334f61 --no-edit

  local src

git revert will open your editor; if you don't want to use vi, set the EDITOR environment variable (e.g. EDITOR=nano) before running makepkg.


5. For making changes to the code you need to make a patch. Open the src/archlinux-linux directory. The files you're interested in are in drivers/gpu/drm/amd/powerplay, most likely hwmgr/vega20_hwmgr.c. Make your changes to the code. You can't just re-run makepkg, as it checks out the original version of the code. After making changes, navigate to the archlinux-linux directory and run git diff > ../../vii.patch

6. Add your patch to PKGBUILD source: 

source=(
  "$_srcname::git+https://git.archlinux.org/linux.git?signed#tag=v$_srcver"
  config         # the main kernel config file
  60-linux.hook  # pacman hook for depmod
  90-linux.hook  # pacman hook for initramfs regeneration
  linux.preset   # standard config files for mkinitcpio ramdisk
  vii.patch
)

7. I've been cheating with makepkg and getting it to skip hash checks as otherwise you have to generate the sha256sums for each patch you create. This is an extra step that only slows down testing. To compile/install run makepkg -si --skipinteg

Because of the way makepkg works, it keeps the compiled code in the src directory. That means that although the first compile will take a few minutes, subsequent compiles will be a lot faster as it'll probably only be recompiling vega20_hwmgr.c
Comment 116 ReddestDream 2019-08-25 20:46:19 UTC
Created attachment 145153 [details]
dmesgAMD2Monitors

I've been doing a few tests. I looked into and compiled 5.3-rc5 along with these patches, but nothing seemed to resolve our multimonitor issue. :/

https://phoronix.com/scan.php?page=news_item&px=AMDGPU-Multi-Monitor-vRAM-Clock

I've also gotten some dmesg output with 5.2.9 with amdgpu.dc_log=1 drm.debug=0x1e log_buf_len=2M. Turns out that amdgpu.dc_log=1 does nothing on this kernel, but I didn't know this when I ran the tests. The interesting added data appears to be coming from drm.debug=0x1e.

I have two (physically) identical LG 24UD58-B 4K60 monitors connected via DP. One test was done with both monitors connected to Radeon VII, and the other was done using my stable Intel+Radeon VII setup where one monitor is connected to Radeon VII and the other is connected to the Intel iGPU (HD 630, also via DP at 4K60).

These dmesg dumps were taken with all DMs/DEs/Graphics disabled in order to limit interference. The system was booted to a text commandline at native resolution.

Since 5.3 isn't changing anything, I plan to do a recompile of 5.2.9 (or 5.2.10 if it's out for Arch) with the smum_send_msg_to_smc_with_parameter patch suggested by Tom B.
Comment 117 ReddestDream 2019-08-25 20:47:33 UTC
Created attachment 145154 [details]
AMDInteliGPUBoot

Also find my stable Intel iGPU + AMD Graphics config dmesg here.
Comment 118 ReddestDream 2019-08-25 23:01:48 UTC
So, this is a crazy idea, but ironically I think it might be getting closer to the truth.

Tom B. attempted reverting ad51c46eec739c18be24178a30b47801b10e0357, which was known to cause some issue with an RX 580. He found that doing so fixed the multimonitor crash but locked the card to the lowest possible memory speed, which really isn't acceptable.

Perhaps our issue is connected to insufficient or improperly calculated PCIe bandwidth/speed. Speed mismatches can and will cause messages to not go through to the peripheral. It's also well known that Radeon VII was originally a PCIe 4.0 card that AMD locked down to 3.0 speeds . . .

What if when using multiple monitors and/or higher clock speeds Radeon VII uses more bandwidth than Linux expects, causing the loss of communication?

Something else I plan to investigate.
Comment 119 ReddestDream 2019-08-26 03:20:33 UTC
Created attachment 145158 [details]
DebugAMD2Monitors

>I don't think I have time to try it today but if anyone is recompiling the code adding
>pr_err("msg: %d / parameter: %d\n", msg, parameter); 
>to this function in smumgr.c would be a useful addition.


So, I've done just this. I also added a speed/width check to amdgpu_device_get_min_pci_speed_width in amdgpu_device.c to check the values of cur_speed and cur_width.

I ran two checks with 5.2.9, one with two monitors on Radeon VII and another with my stable 1 monitor on each Radeon VII and Intel iGPU.

Please find them attached.

Thanks so much for all your help!
Comment 120 ReddestDream 2019-08-26 03:21:14 UTC
Created attachment 145159 [details]
DebugAMDiGPU

Also here is the AMD + iGPU one.
Comment 121 ReddestDream 2019-08-26 03:47:41 UTC
Some observations:

1. Nothing at all seems to be up with cur_speed and cur_width. They get set several times in a row in both runs, but the values are all the same in both.

2. I can't really see anything up with msg/parameter either. When I compare them to each other nothing seems particularly wacky. And we also have an instance in my AMD+iGPU run where we see msg/parameter after "[drm] Initialized amdgpu", so the theory that all messages have to be sent before initialization is complete must be wrong.

Now the real question is whether we can decode what these msg/parameter values mean. But it looks more likely to me that vega20_hwmgr.c and vega20_ppt.c are just bugged somewhere (probably in the same way, since they seem to be alternate versions of each other) and that the rest of the amdgpu code is (relatively) fine.

I'm thinking we'll have to go through and knock out/debug pretty much everything in those files until we figure out where the breakage is. That's about 3000-4000 lines of code in each of those two files tho. So any thoughts anyone has about where we should start would be helpful. My focus will probably be on UCLK (since it seems to break first), SCLK (since it gets set to 0 MHz when there's multiple displays), DCEFCLK, and basically anything else that smells like it might control the memory clock and/or be affected by multiple monitors.

Thanks!
Comment 122 ReddestDream 2019-08-27 21:56:33 UTC
Tested 5.3-rc6. Still has the same issues. Maybe it's actually worse, because I lose the display completely when I use amdgpu.dpm=2 with Radeon VII multimonitor on 5.3-rc6, whereas on 5.2.9 I just got the same or similar errors as the default.

I'm working on a kernel fork of 5.3-rc6 where I'm reverting various things and adding things in from Vega 10/12 and Navi to see if it helps. I haven't compiled and tested it yet, but since I know 5.3-rc6 itself compiles, boots, and demonstrates the issue, I guess it's a good base until 5.3 releases.

https://github.com/ReddestDream/linux

Any ideas anyone has are appreciated.

For now I actually find that amdgpu.dpm=0 with both 4K monitors on the Radeon VII gives a much snappier desktop than my previous AMD+iGPU setup. It's amazing how well this card runs 4K displays without any proper memory clock management at all. I'm sure the gaming performance would be pretty bad tho, but I have Windows for that for now . . .
Comment 123 ReddestDream 2019-08-31 00:11:19 UTC
A few interesting fixes that touch vega20_hwmgr.c have rolled in from drm-fixes:

The first is likely the most interesting for our issues, as it touches min/maxes (tho only the soft ones it seems). The other two are related to SMU versions.

https://github.com/torvalds/linux/commit/83e09d5bddbee749fc83063890244397896a1971

https://github.com/torvalds/linux/commit/21649c0b6b7899f4fa3099c46d3d027f60b107ec

https://github.com/torvalds/linux/commit/23b7f6c41d4717b1638eca47e09d7e99fc7b9fd9

I haven't tested them out yet, but it does give me some hope that someone is still looking at Vega 20/Radeon VII . . .
Comment 124 ReddestDream 2019-09-03 16:46:22 UTC
Created attachment 145254 [details]
Dmesg 5.3-rc7 w/ Two monitors

This issue is still not fixed on 5.3-rc7. I guess we will probably have to wait until 5.4 (the next LTS) before more people take a look at this issue. :(
Comment 125 Adrian Brown 2019-09-18 09:52:51 UTC
I am also getting frequent crashes with a Radeon VII on Kubuntu 19.10 (kernel 5.0.0-29-generic). I see there is some discussion in this thread about it possibly being related to multiple monitors. But I don't think that's the case. I have a single monitor but it is old with only a dual link DVI connection. So I am using displayport on the GPU but connected to an active adapter to convert DP to a dual link DVI connection (my monitor is a Dell 3007WFP running at 2560x1600).

I often get crashes soon after boot. They tend to happen in clusters so it crashes a few times, then stays stable for a short time and then crashes again. I don't get these crashes on the same system when dual booted into Windows 10 so the hardware itself seems good. 

One thing worth mentioning is that on Windows 10 I occasionally get a black screen and the monitor goes off for a couple of seconds. It then comes back to life. Apparently this is not uncommon and the suspicion in the Windows community is that AMD drivers sometimes crash but Windows recovers (I never had this with my Vega 64, only with the Radeon VII). It most likely is a completely different issue of course, but thought it worth mentioning.

Still hoping for a fix at some point. Also happy to help test any fix.
Comment 126 ReddestDream 2019-09-18 11:36:31 UTC
@Adrian Brown Your Linux issue is potentially related to the active adapter. Have you tried w/o it?

On Windows, the flickering on/around login, at least for me, has been mostly resolved by using the latest AMD driver + Windows 10 1903 and all the recent updates. There was a Windows update about a month ago that resolved a lot of flickering issues by fixing a bug in Windows's 10-bit color support.

Also, if you are using Ubuntu, it might be worth downgrading to 18.04.3 so that you can use the Radeon Software for Linux Driver:

https://www.amd.com/en/support/graphics/amd-radeon-2nd-generation-vega/amd-radeon-2nd-generation-vega/amd-radeon-vii

Currently, I hear that using AMD's driver + a supported distro is the best way to get stability out of Radeon VII. And it's something I will probably end up trying myself if there's no resolution to the issues forthcoming with 5.4, which will be the new LTS.
Comment 127 Alex Deucher 2019-09-20 19:12:35 UTC
(In reply to Tom B from comment #15)
> Have been running 5.0 since release without issue but upgraded this morning
> and got crashes as described here within a few seconds of boot. 
>

Can you bisect between 5.0 and 5.1 and see what commit caused the regression?
Comment 129 Tom B 2019-09-21 15:02:08 UTC
Thank you Alex! That has fixed it! The card is now correctly setting its voltages and clocks. I applied the patch to 5.3.1

However, I've noticed a few very minor problems that are probably worth reporting.

1. I still get this in dmesg:


[    6.307005] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    6.307006] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[    9.225192] amdgpu 0000:44:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
[   10.238621] amdgpu 0000:44:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on page0 (-110).
[   10.532004] amdgpu: [powerplay] Failed to send message 0x26, response 0x0
[   10.532005] amdgpu: [powerplay] Failed to set soft min gfxclk !
[   10.532006] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!


Though this doesn't really matter now; earlier in the thread we were focusing on these messages because it looked like `Set hard min uclk failed!` was the cause of the problem, but obviously it isn't.

2. This repeats indefinitely in dmesg:

[  332.575747] [drm] schedsdma0 is not ready, skipping
[  332.582657] [drm] schedsdma0 is not ready, skipping
[  332.582864] [drm] schedsdma0 is not ready, skipping
[  332.708848] [drm] schedsdma0 is not ready, skipping
[  332.715975] [drm] schedsdma0 is not ready, skipping
[  332.716229] [drm] schedsdma0 is not ready, skipping
[  332.756987] [drm] schedsdma0 is not ready, skipping
[  332.763970] [drm] schedsdma0 is not ready, skipping
[  332.764169] [drm] schedsdma0 is not ready, skipping


As you can see, this gets written to dmesg several dozen times a second. This might be because the patches are intended for 5.4?

3. The lowest wattage now seems to be 33w rather than 23w which means increased idle power usage and temps. This isn't really a problem but I thought it was worth mentioning and is a fair tradeoff for stability.
Comment 130 Anthony Rabbito 2019-09-21 15:12:58 UTC
(In reply to Alex Deucher from comment #128)
> Do these patches help?
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-
> fixes&id=c46e5df4ac898108da66a880c4e18f69c74f6c1b
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-
> fixes&id=c02d6a161395dfc0c2fdabb9e976a229017288d8

I will try to apply these patches in a few hours. Though I must say, in 5.3 things have been much better. Not perfect, and I haven't tried triple monitors yet, but definitely an improvement.
Comment 131 Tom B 2019-09-21 15:25:27 UTC
In addition to my previous comment: the indefinitely repeating [drm] schedsdma0 is not ready, skipping messages stop after a suspend/resume. After the machine is resumed they stop appearing, and the machine does suspend and resume correctly.
Comment 132 Anthony Rabbito 2019-09-21 15:38:10 UTC
Created attachment 145458 [details]
linux-mainline5.3 dmesg without patches

Here's my current dmesg with two out of three monitors running without the patches Alex provided. I'm currently compiling the kernel with his patches to look at the differences and see if I can get my third monitor to boot up.
Comment 133 Anthony Rabbito 2019-09-21 15:57:06 UTC
Created attachment 145459 [details]
dsmeg log with Alex's patches

Here's my dmesg with Alex's patches. Going to mess around and see what I can find.
Comment 134 Anthony Rabbito 2019-09-21 15:59:37 UTC
Wow! All three of my monitors are working again: 2560x1440 @ 144Hz.
Comment 135 Adrian Brown 2019-09-21 19:54:08 UTC
@reddestdream Thanks. I don't think the active adapter is the problem as it works perfectly with my Vega 64. However I will try 18.04 and AMD's driver as suggested.
Comment 136 tom91136 2019-09-21 20:04:31 UTC
Been following this thread for a while now as I just got 3 4k 60Hz monitors connected to the 3 DP ports on my Radeon VII. 
I'm getting the exact same errors discussed in this report with matching dmesg outputs.

I've applied the patches to Fedora 31's 5.3.0-3 kernel and everything now works perfectly!

Just a few notes:

* Idle power draw before the patch was 22W in lm_sensors; now it's reading 28W, which makes sense as the memory is now properly clocked. This also loosely matches @Tom B's results.

* I did not get the repeated `[drm] schedsdma0 is not ready, skipping` in dmesg, however, it is still possible to trigger a freeze by toggling dpms:

    xset dpms force off

Resulting in:

[  155.431068] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[  155.431070] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[  161.334003] amdgpu: [powerplay] Failed to send message 0x26, response 0x0
[  161.334004] amdgpu: [powerplay] Failed to set soft min gfxclk !
[  161.334005] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[  164.622060] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[  164.622062] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!


Previously, without the patch, the machine would hang. With the patch, the display freezes for a few seconds and then powers off. Mouse movement correctly turns all screens back on and everything is back to normal.
Comment 137 sehellion 2019-09-22 21:36:05 UTC
(In reply to Alex Deucher from comment #128)
> Do these patches help?
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-
> fixes&id=c46e5df4ac898108da66a880c4e18f69c74f6c1b
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-
> fixes&id=c02d6a161395dfc0c2fdabb9e976a229017288d8

Yes, these patches fix the problem. 

amdgpu: [powerplay] Failed to send message 0x28, response 0x0
amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on page0 (-110).
amdgpu: [powerplay] Failed to send message 0x26, response 0x0
amdgpu: [powerplay] Failed to set soft min gfxclk !
amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma1 (-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).

In general the system is stable.
Comment 138 sehellion 2019-09-22 21:38:31 UTC
Created attachment 145461 [details]
5.3.1 with Alex's patches and dual monitors
Comment 139 sehellion 2019-09-23 04:09:53 UTC
Today, when trying to wake up the monitors, the system crashed again. 

WARNING: CPU: 4 PID: 32 at drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link_dp.c:1720 decide_link_settings+0xe0/0x2a0 [amdgpu]

The full dmesg log has been updated.
Comment 140 sehellion 2019-09-23 04:11:00 UTC
Created attachment 145463 [details]
5.3.1 with Alex's patches and dual monitors, crash
Comment 141 Alex Deucher 2019-09-23 14:20:09 UTC
(In reply to sehellion from comment #140)
> Created attachment 145463 [details]
> 5.3.1 with Alex's patches and dual monitors, crash

That's not a crash, it's just a warning.
Comment 142 sehellion 2019-09-23 15:40:14 UTC
(In reply to Alex Deucher from comment #141)
> (In reply to sehellion from comment #140)
> > Created attachment 145463 [details]
> > 5.3.1 with Alex's patches and dual monitors, crash
> 
> That's not a crash, it's just a warning.

But the system hangs afterwards. Today it happened twice. When I try to resume work, the monitors turn on, then the secondary shows that there is no signal and the primary shows a black screen. But perhaps this is not related to this bug. I can connect via SSH and collect logs when this happens, if necessary.
Comment 143 Tom B 2019-09-23 15:43:55 UTC
I'm not sure how KDE handles monitor power behind the scenes, but I now have an uptime of 2 days since applying the patches. With KDE, I've let it turn off the monitors at least 6 or 7 times and suspended/resumed 3 times without issue.
Comment 144 sehellion 2019-09-23 16:04:25 UTC
I also think this is strange. Since yesterday, the monitors have turned off and on many times without any problems. Most likely it's connected with something else, but I don't know where to look.
Comment 145 tom91136 2019-09-24 09:44:20 UTC
@Alex any plans for the patches to be merged for 5.4 or even backported to 5.3 at some point?
Comment 146 Alex Deucher 2019-09-27 14:46:11 UTC
(In reply to tom91136 from comment #145)
> @Alex any plans for the patches to be merged for 5.4 or even backported to
> 5.3 at some point?

Already merged to 5.4.  I'll take a look at older kernels as well.
Comment 147 ReddestDream 2019-09-27 15:12:34 UTC
> Already merged to 5.4.  I'll take a look at older kernels as well.

@Alex Deucher Thanks so much for all your help! :)
Comment 148 Anthony Rabbito 2019-09-27 15:13:58 UTC
Everyone's contribution is very much appreciated! I can finally go back to using my workstation. Alex, thank you.
Comment 149 linedot 2019-09-29 19:25:49 UTC
Created attachment 145581 [details]
5.3.1 plus Alex's patches, kde wayland crash, then kde xorg crash

This issue is not fixed for me with Alex's patches.

I use only a single monitor via DP. Running a patched 5.3.1 kernel. Attached is a dmesg log: First a wayland KDE session crashes, I kill all user processes and restart sddm and start a KDE Xorg session, which later also crashes.
Comment 150 linedot 2019-09-29 19:28:34 UTC
Created attachment 145582 [details]
5.3.1 patched, wayland crash

Sorry, the file got messed up, here is the wayland crash
Comment 151 linedot 2019-09-29 19:30:24 UTC
Created attachment 145583 [details]
5.3.1 patched, xorg crash

And here is a dmesg of just an X session crashing
Comment 152 ReddestDream 2019-09-30 20:20:22 UTC
Kernel 5.4-rc1, the first kernel version that includes the Vega 20 patches noted by Alex Deucher, is now out and available as linux-mainline in the Arch Linux AUR. :)

I plan to do some testing of this version over the next few days, and it might be worth it for people who are still having issues to confirm on this version as well. Thanks!
Comment 153 ReddestDream 2019-10-01 23:44:38 UTC
Just FYI, it appears that kernel 5.3.2 does not have the Vega 20 fix commits that Alex Deucher mentioned.
Comment 154 linedot 2019-10-03 06:54:21 UTC
Created attachment 145623 [details]
5.4.0-rc1 hangup

dmesg with 5.4.0-rc1.

System freezes and becomes unresponsive to input like before
Comment 155 ReddestDream 2019-10-04 12:43:36 UTC
So, I've done some tests with 5.4-rc1 and it seems like I'm getting similar results to linedot@xcpp.org and sehellion@gmail.com. I'm using GNOME with Wayland (which works fine with only 1 display). Sometimes it works for a while. Sometimes I can't see the mouse cursor. Sometimes I get glitches all over the screen containing pieces of previous framebuffers. But it's better than 5.3, which was so bad I could never see anything and would get stuck on a black screen. At least on 5.4-rc1 I've been able to manually switch to a virtual console and reboot rather than force a reboot with the power button.

Still hoping for some fix for this, but it's become less important to me as further improvements to GNOME and Mesa have made the Radeon VII + iGPU setup I've been using run smoother. I've also discovered further issues on Windows regarding the high memory clock when using multiple monitors with the Radeon VII, and it's been affecting performance there too. I'm considering just sticking with 1 monitor for this machine/card. lol
Comment 156 Tom B 2019-10-06 14:16:56 UTC
This is strange because with a patched 5.3.1 I have perfect stability: an uptime of over a week and no issues. Are you saying that the issue comes back in 5.4? Hopefully not, as Linux 5.4 + Mesa 19.3 looks to have a nice performance bump on the VII.

With the patches, do you see the card boosting correctly? Do the wattage, voltage and clocks change under load? Asking an obvious question here, but is the crash temperature-related? Maybe the patches increase power draw and cause overheating. If so, it might explain why I'm not affected, as my card is water-cooled.
Comment 157 ReddestDream 2019-10-06 16:39:09 UTC
@Tom B. Well, some good news. Kernel 5.3.4 should have the patches for Radeon VII included now. I'll do some more tests on that ...
Comment 158 ReddestDream 2019-10-06 17:06:19 UTC
More good news. It seems that 5.3.4 does work for me and doesn't (at least not immediately, since I'm typing this from it right now) fall apart into a glitchy mess.

I'm still not really sure about overall stability, though, because we do still see our old friend in dmesg: "amdgpu: [powerplay] Failed to send message 0x28, response 0x0" followed by "amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!". So, AFAICT, there's still something wrong; it's just more stable than it was before.

But yeah. This is the first time since I've gotten this card that I've been able to boot to a DE w/o crashing and w/o disabling dpm. :)
Comment 159 ReddestDream 2019-10-06 17:07:52 UTC
Oh. Also,

cat /sys/kernel/debug/dri/0/amdgpu_pm_info

now works on 5.3.4 with more than one monitor connected. It doesn't report nonsense values like 0 watts like it did before. :)
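For scripting a before/after comparison of that file, a tiny parser for the power line can help. The "(average GPU)" field layout is an assumption based on typical amdgpu_pm_info output (e.g. "26.0 W (average GPU)"), so treat this as a sketch and adjust for your kernel version:

```shell
# read_gpu_power: extract the numeric "(average GPU)" power value from
# amdgpu_pm_info-style text on stdin, e.g.
#   sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info | read_gpu_power
read_gpu_power() {
  awk '/\(average GPU\)/ { print $1; exit }'
}
```

A reading stuck at 0 with multiple monitors attached would match the broken behavior described earlier in the thread.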
Comment 160 ReddestDream 2019-10-10 12:50:44 UTC
Well, today I had a hard freeze using more than one display with Radeon VII. Back to Radeon VII + iGPU . . . :(
Comment 161 Gargoyle 2019-10-12 23:34:19 UTC
Hi there. I've been trying to solve some lockups and pauses with my system and have just read this entire thread. 

The good news is that I am another Radeon VII owner having the same problems and I am willing to do whatever I can to help.

My current situation is:

- I'm running dual 2560x1440@60Hz via display port.

- I am running the beta of Ubuntu 19.10 (Linux ryzen1910 5.3.0-18-generic #19-Ubuntu SMP Tue Oct 8 20:14:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux).

- I don't push the R:VII at all under Linux. I boot into Windows 10 to play games.

- I have disabled IOMMU in BIOS/EFI. With IOMMU enabled things are MUCH worse.

- My system is mostly stable. If the displays blank, sometimes after waking them I get a 15-30 second freeze, then the "amdgpu [powerplay] Failed..." messages, and then everything continues OK. I can semi-reliably recreate this with the "xset dpms force off" command someone posted earlier. I've not managed to find any kind of pattern yet, but 8 out of 10 times, running that command and then waking the system with a keypress/mouse click will cause the freeze.

- I use X11 and not Wayland. Not sure whether that is significant, but with Ubuntu 19.10 it seems Wayland is started temporarily and then stopped during boot / while starting GDM. If I enable IOMMU, my GDM login screen is completely corrupt. However, if I press Enter (to select my user) and type my password, my X11 GNOME session starts, although there are LOTS of pauses, warnings, and errors all over the place in "journalctl -f".
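The "xset dpms force off" reproduction described above can be scripted. A minimal sketch (XSET is overridable purely so the sequence can be dry-run; it needs a running X session to actually blank the displays):

```shell
# dpms_cycle: blank the displays via DPMS, wait, then wake them, mimicking
# the monitor-sleep path that triggers the freeze for some reporters here.
# Usage: dpms_cycle [seconds-to-stay-off]
dpms_cycle() {
  ${XSET:-xset} dpms force off
  sleep "${1:-10}"
  ${XSET:-xset} dpms force on
}
```

Running it in a loop while watching `journalctl -f` over ssh should show whether the "[powerplay] Failed..." messages line up with each wake.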
Comment 162 linedot 2019-10-14 09:15:30 UTC
Created attachment 145730 [details]
Freeze/Black screen/Crash on 5.3.6

Apologies, I have been on vacation and thus away from my main system.

Attached is the dmesg log of another crash with kernel version 5.3.6. Here is a description of what the crash looked like:
1) Successfully booted up to login manager
2) Logged into a graphical session
3) Shortly after, the screen freezes
4) Screen flashes to black (~5-10 sec)
5) Screen flashes back to the frozen desktop (~5-10 sec)
6) Screen goes black (not off), no response to input, switching to tty doesn't work. I was able to ssh into the machine from a laptop and get the dmesg output.
Comment 163 Tom B 2019-10-14 10:39:05 UTC
Gargoyle, linedot, can you confirm whether this crash is with both patches applied?

I'm still on 5.3.1 patched and haven't had a single crash.
Comment 164 linedot 2019-10-14 11:37:45 UTC
(In reply to Tom B from comment #163)
> Gargoyle, linedot, can you confirm whether this crash is with both patches
> applied?
> 
> I'm still on 5.3.1 patched and haven't had a single crash.

For 5.3.1 I built the kernel with the Arch build system, manually added lines to the PKGBUILD to apply the two patches, and saw them being applied in the build log.

For 5.3.6 I've checked that the patches are already applied.
Comment 165 Tom B 2019-10-14 17:05:37 UTC
I just tried 5.3.5 (which is the latest in the arch repo) and it's working fine for me.

I do have an issue on Wayland. If the screen turns off, Wayland crashes and I have to hard reset. The log shows 

Oct 14 17:48:56 desktop kernel: amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to send message 0x26, response 0x0
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to set soft min gfxclk !
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to upload DPM Bootup Levels!


But this also shows on boot, so I'm not sure it's a problem, and it seems to be Wayland that segfaults rather than an issue with amdgpu.

I do still get `kernel: [drm] schedsdma0 is not ready, skipping` repeating forever in my journal.
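When a line like that repeats forever, collapsing consecutive duplicates makes the journal readable again. A small sketch; the timestamp-stripping pattern assumes the usual "Oct 14 17:48:56 host " syslog prefix:

```shell
# collapse_repeats: strip the syslog timestamp/hostname prefix, then squash
# consecutive identical lines into one line with a count (uniq -c style).
# Pipe in `journalctl -k` or similar output.
collapse_repeats() {
  sed 's/^[A-Z][a-z]\{2\} [ 0-9][0-9] [0-9:]\{8\} [^ ]* //' | uniq -c
}
```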
Comment 166 Peter Hercek 2019-10-19 17:35:46 UTC
I tried 5.3.6-arch1-1 on Arch Linux with 3 DP monitors. It should contain the patches, based on the comment from linedot@xcpp.org.

I got the crash after 4 days of use. It looks the same as before:
ring sdma0 timeout, GPU reset (allegedly successful), many skipped IBs, and failure to initialize the parser forever.

In my experience the situation looked like this: with each new kernel the error got worse and worse; 5.3.6 improved things a lot, but it is still not fixed.
Comment 167 Alex Deucher 2019-10-20 18:27:33 UTC
(In reply to Peter Hercek from comment #166)
> I got the crash after 4 days of use. It looks the same as before:
> ring sdma0 timeout, gpu reset (allegedly successful), many skipped IBs, and
> failure to initialize parser for ever.

The parser error just means you need to restart your desktop environment. At the moment no desktop environments properly handle GPU resets (i.e., recreate their contexts and buffers), so you need to restart your desktop to get it back.
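Until compositors handle resets, the practical recovery described here is restarting the display manager, e.g. over ssh. A sketch; the service name varies by distro (sddm, gdm, lightdm, or the display-manager alias), and SYSTEMCTL is overridable only so the command can be previewed:

```shell
# restart_desktop: restart the display manager so GL clients recreate their
# contexts and buffers after a GPU reset. Logs you out of the session.
# Usage: restart_desktop [service-name]
restart_desktop() {
  ${SYSTEMCTL:-systemctl} restart "${1:-display-manager}"
}
```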
Comment 168 linedot 2019-10-21 08:11:44 UTC
Created attachment 145784 [details]
5.3.7: Fence fallback timer expired on ring <x>

Here is a freeze which went a bit differently.
This time the system froze without any blinking, and there are tons of messages like:

[ 2940.919451] [drm] Fence fallback timer expired on ring page1

This is on 5.3.7-arch1-1

(Also, I'm using only a single monitor connected through DP, as opposed to the others.)
Comment 169 picard12 2019-11-10 16:36:00 UTC
I am using a Radeon VII with Arch Linux and a 1440p 144Hz monitor plus a 4K 60Hz monitor, and I had crashes similar to the others here if I ran the 1440p monitor at 144Hz; at 60Hz it was stable. This behavior persisted all the way from kernel 5.0 up to 5.3 and only stopped when I started using kernel 5.4.0 (5.4.0-rc6-mainline right now). Now I can run it at 144Hz without crashes.

The driver still isn't working that well, as games seem very stuttery, but at least it doesn't crash anymore.
Comment 170 Peter Hercek 2019-11-10 17:45:37 UTC
Maybe this helps, since there is a stack trace. The GUI stopped responding, so I shut the system down over ssh. This is a kernel crash during that shutdown on 5.3.6-arch1-1-ARCH, even with amdgpu.dpm=0, which is the option that is supposed to work. The kernel has both the patches and amdgpu.dpm=0.

Nov 04 17:38:58 phnm kernel: ------------[ cut here ]------------
Nov 04 17:38:58 phnm kernel: WARNING: CPU: 6 PID: 640 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5804 amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Nov 04 17:38:58 phnm kernel: Modules linked in: fuse xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter tun bridge cfg80211 rfkill 8021q garp mrp stp llc intel_rapl_msr intel_rapl_common amdgpu x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel hid_microsoft radeon mousedev input_leds joydev ff_memless kvm gpu_sched snd_hda_codec_realtek snd_hda_codec_generic i2c_algo_bit irqbypass ledtrig_audio ttm crct10dif_pclmul snd_hda_intel crc32_pclmul hid_generic ghash_clmulni_intel cdc_acm drm_kms_helper snd_hda_codec aesni_intel usbhid iTCO_wdt iTCO_vendor_support snd_hda_core wmi_bmof aes_x86_64 hid crypto_simd cryptd mxm_wmi snd_hwdep glue_helper drm intel_cstate snd_pcm agpgart r8169 syscopyarea intel_uncore sysfillrect realtek sysimgblt snd_timer pcspkr i2c_i801 fb_sys_fops e1000e intel_rapl_perf
Nov 04 17:38:58 phnm kernel:  mei_me snd libphy mei soundcore lpc_ich wmi evdev mac_hid sg ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel firewire_ohci xhci_pci xhci_hcd firewire_core ehci_pci crc_itu_t ehci_hcd sr_mod cdrom sd_mod ahci libahci libata scsi_mod
Nov 04 17:38:58 phnm kernel: CPU: 6 PID: 640 Comm: Xorg Not tainted 5.3.6-arch1-1-ARCH #1
Nov 04 17:38:58 phnm kernel: Hardware name: System manufacturer System Product Name/P9X79, BIOS 4502 10/15/2013
Nov 04 17:38:58 phnm kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Nov 04 17:38:58 phnm kernel: Code: c7 c7 08 1e db c0 e8 0f 59 a0 db 0f 0b 41 83 7c 24 08 00 0f 85 92 ff f1 ff e9 ad ff f1 ff 48 c7 c7 08 1e db c0 e8 f0 58 a0 db <0f> 0b e9 32 f5 f1 ff 48 8b 85 00 fd ff ff 4c 89 f2 48 c7 c6 0d 0f
Nov 04 17:38:58 phnm kernel: RSP: 0018:ffffa98c410475a0 EFLAGS: 00010046
Nov 04 17:38:58 phnm kernel: RAX: 0000000000000024 RBX: ffff894125e06000 RCX: 0000000000000000
Nov 04 17:38:58 phnm kernel: RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00000000ffffffff
Nov 04 17:38:58 phnm kernel: RBP: ffffa98c410478c0 R08: 000016b622fb648e R09: ffffffff9deb3254
Nov 04 17:38:58 phnm kernel: R10: 0000000000000616 R11: 000000000001d890 R12: 0000000000000286
Nov 04 17:38:58 phnm kernel: R13: ffff8940f30b0400 R14: ffff894129c20000 R15: ffff894075ba6a00
Nov 04 17:38:58 phnm kernel: FS:  00007fbf9c35c500(0000) GS:ffff89413fb80000(0000) knlGS:0000000000000000
Nov 04 17:38:58 phnm kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 04 17:38:58 phnm kernel: CR2: 0000559991d31420 CR3: 000000082a644002 CR4: 00000000000606e0
Nov 04 17:38:58 phnm kernel: Call Trace:
Nov 04 17:38:58 phnm kernel:  ? commit_tail+0x3c/0x70 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  commit_tail+0x3c/0x70 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  drm_client_modeset_commit_atomic+0x1e8/0x200 [drm]
Nov 04 17:38:58 phnm kernel:  drm_client_modeset_commit_force+0x50/0x150 [drm]
Nov 04 17:38:58 phnm kernel:  drm_fb_helper_pan_display+0xc2/0x200 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  fb_pan_display+0x83/0x100
Nov 04 17:38:58 phnm kernel:  fb_set_var+0x1e8/0x3d0
Nov 04 17:38:58 phnm kernel:  fbcon_blank+0x1dd/0x290
Nov 04 17:38:58 phnm kernel:  do_unblank_screen+0x98/0x130
Nov 04 17:38:58 phnm kernel:  vt_ioctl+0xeff/0x1290
Nov 04 17:38:58 phnm kernel:  tty_ioctl+0x37b/0x900
Nov 04 17:38:58 phnm kernel:  ? preempt_count_add+0x68/0xa0
Nov 04 17:38:58 phnm kernel:  do_vfs_ioctl+0x43d/0x6c0
Nov 04 17:38:58 phnm kernel:  ? syscall_trace_enter+0x1f2/0x2e0
Nov 04 17:38:58 phnm kernel:  ksys_ioctl+0x5e/0x90
Nov 04 17:38:58 phnm kernel:  __x64_sys_ioctl+0x16/0x20
Nov 04 17:38:58 phnm kernel:  do_syscall_64+0x5f/0x1c0
Nov 04 17:38:58 phnm kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 04 17:38:58 phnm kernel: RIP: 0033:0x7fbf9d7b425b
Nov 04 17:38:58 phnm kernel: Code: 0f 1e fa 48 8b 05 25 9c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 9b 0c 00 f7 d8 64 89 01 48
Nov 04 17:38:58 phnm kernel: RSP: 002b:00007ffe21162798 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 04 17:38:58 phnm kernel: RAX: ffffffffffffffda RBX: 000055d93ebf5180 RCX: 00007fbf9d7b425b
Nov 04 17:38:58 phnm kernel: RDX: 0000000000000000 RSI: 0000000000004b3a RDI: 000000000000000c
Nov 04 17:38:58 phnm kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
Nov 04 17:38:58 phnm kernel: R10: fffffffffffff4b4 R11: 0000000000000246 R12: ffffffffffffffff
Nov 04 17:38:58 phnm kernel: R13: 000055d93ebfa4a0 R14: 00007ffe21162968 R15: 0000000000000000
Nov 04 17:38:58 phnm kernel: ---[ end trace 40ade9cecd96ffc0 ]---
Comment 171 linedot 2019-11-26 12:03:01 UTC
Created attachment 146026 [details]
5.4.0-arch1-1 GPU initialization fails

With kernel version 5.4.0-arch1-1 the GPU can no longer be initialized at all.

My system is now completely unusable with the current kernel.

Does this mean anything specific?
[   15.575361] amdgpu: [powerplay] smu driver if version = 0x00000013, smu fw if version = 0x00000012, smu fw version = 0x00282d00 (40.45.0)
[   15.575362] amdgpu: [powerplay] SMU driver if version not matched
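On that warning: the driver is comparing the SMU interface version it expects (0x13 here) with the one the firmware reports (0x12) and flags the difference. A small log-parsing sketch of that comparison, with the message format taken from the line above:

```shell
# smu_if_check: read dmesg-style text on stdin and print "mismatch" or
# "match" for each smu interface-version line found.
smu_if_check() {
  sed -n 's/.*smu driver if version = \(0x[0-9a-f]*\), smu fw if version = \(0x[0-9a-f]*\).*/\1 \2/p' |
  while read -r drv fw; do
    if [ "$drv" = "$fw" ]; then echo match; else echo mismatch; fi
  done
}
```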
Comment 172 linedot 2019-11-26 23:13:27 UTC
I had dpm=2 set as a module option. The GPU initialization failure does not occur without dpm=2.
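For anyone else checking for a leftover option like this: module parameters typically come either from the kernel command line (amdgpu.dpm=2) or from a modprobe.d file. A hypothetical example of where such a line would live (path and contents illustrative only):

```
# /etc/modprobe.d/amdgpu.conf
# Remove or comment this line (and regenerate the initramfs if the module is
# included there) to return to the driver's default dpm behavior.
options amdgpu dpm=2
```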

