I'm diagnosing a Radeon VII that I may have damaged while trying to improve its cooling. Before any of this, the card had no major issues; it just ran too hot while mining (around 80-90 °C even with the fan maxed out via Radeon Profile). That temperature, along with the noise, was beyond acceptable and was the main reason I wanted to improve the cooling in the first place.
The GPU now automatically drops to some kind of "safe clock" of 700/350 (as observed in Radeon Profile) under heavy load such as mining (using the ROCm backend on Manjaro/Arch), and it cannot return to its normal clocks on its own. I can force the default clocks back using Radeon Profile, but if the card is still under load, the screen immediately becomes corrupted and a few seconds later the system hard resets. After that, the GPU is not detected on subsequent boots (the display gets routed to the BMC on the motherboard instead of the video card) until I power cycle the machine (manually or via IPMI).
After some failed attempts to mod the stock cooler (during which the symptoms began), I eventually replaced the cooler altogether with an Alphacool Eiswolf for this card. Although the thermals have improved greatly (the card can still run Unigine Heaven at full clocks for a short while without issues, at an acceptable 60 °C), the drop to "safe clock" while mining has not gone away.
After some testing I was able to get a usable under-load GPU clock of 1150MHz with Radeon Profile (the card runs at around 40 °C under load), but the situation has only gotten worse: now I can only hold a stable clock at around 1000MHz without the card entering "safe clock" too quickly. The "safe clock" can still kick in when I'm doing something else on the system while mining, but as long as the clocks are kept within that reduced range, I do not get a system lockup/reset when I force the clocks back (by reapplying the settings).
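To watch for the drop without Radeon Profile, the active core clock level can be read from the amdgpu sysfs interface. The following is only a minimal sketch: it assumes the card is enumerated as card0 and that the driver exposes pp_dpm_sclk with lines of the form "N: XXXMhz", where the active level is marked with "*". The function names are my own, not part of any tool.

```python
import re
from pathlib import Path

# Assumption: the Radeon VII is card0; adjust the path if it enumerates differently.
SCLK_PATH = Path("/sys/class/drm/card0/device/pp_dpm_sclk")

# One level per line, e.g. "1: 1000Mhz *" (the '*' marks the active level).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return a list of (level, mhz, is_active) tuples parsed from pp_dpm_* text."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match:
            level = int(match.group(1))
            mhz = int(match.group(2))
            levels.append((level, mhz, match.group(3) is not None))
    return levels


def active_clock_mhz(text):
    """Return the clock (MHz) of the level marked active, or None if none is marked."""
    for _level, mhz, is_active in parse_dpm_levels(text):
        if is_active:
            return mhz
    return None


if __name__ == "__main__" and SCLK_PATH.exists():
    print(active_clock_mhz(SCLK_PATH.read_text()))
```

Polling this once a second while mining and logging the value with a timestamp should at least show exactly when the clock changes relative to whatever else is running.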
I haven't been able to get detailed logs yet, as I haven't switched on any debug parameters for amdgpu, but I recently captured one occurrence in which the log ended with "ring timeout" and "GPU reset begin" before the system hard reset.
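Since the machine hard resets, the interesting lines are in the log of the previous boot. A minimal sketch for pulling them out of a saved kernel log follows; it assumes the log was dumped to a text file first, and it only matches the two substrings I actually saw. The helper names and the file name in the note below are hypothetical.

```python
# Substrings taken from the one occurrence I captured; nothing else is matched.
MARKERS = ("ring timeout", "GPU reset begin")


def find_reset_events(lines):
    """Return (line_number, text) for every line that contains one of MARKERS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a saved kernel log file and return the matching lines."""
    with open(path, "r", errors="replace") as handle:
        return find_reset_events(handle)
```

With a persistent journal, the previous boot's kernel messages can be saved with `journalctl -k -b -1 > prev-boot.log` and then passed to `scan_file("prev-boot.log")`.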
I don't know where to start investigating what triggers the "safe clock" and, in case the card really is damaged, which CUs are causing the problem (so that I can disable them, as I just found out that CUs can be disabled using boot parameters). I'm also not sure which debug parameters would give me the information I need.
The current PSU installed on the system is an EVGA Supernova 750 P2 (750W 80+ Platinum) and I have both power connectors on the video card connected. The power supply should be sufficient and shouldn't be a problem.
All in all, the experience with this card has raised a lot of questions that I had previously neglected, especially regarding cooling, such as which kind of thermal compound/pads to use and where and how to apply them. Personally, cooling has never been this hard to get right, even with some of the very power-hungry CPUs I currently have.
A little update:
I was able to trigger a system lockup/hard reset even at a low clock speed (around 1000MHz) last night. It happened when I reapplied the clock settings after the card entered safe clock again, most likely because I was also doing something else on the system (which is what caused the card to enter safe clock in the first place).
I still couldn't find a way to increase the log verbosity, but I've tried the following:
1. Disabling CUs to match a Vega 56 (amdgpu.disable_cu=0.0.14,0.0.15,1.0.14,1.0.15,2.0.14,2.0.15,3.0.14,3.0.15, found in the Phoronix post). Unsurprisingly, this had no effect, as I don't yet know which CUs I should actually disable, or whether the problem can be worked around this way at all.
2. Disabling DC (amdgpu.dc=0): The system won't boot at all.
3. Setting amdgpu.vm_update_mode=3: The system freezes a short while after startup, but it doesn't hard reset. I can just press the reset button and the card still works after reboot, without having to do a power cycle.
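For trying other CU combinations from item 1, the amdgpu.disable_cu value is a comma-separated list of se.sh.cu triples (shader engine, shader array, compute unit). Below is a small sketch for generating the string rather than typing it by hand, which is how the stray space crept into my first attempt. The helper names are hypothetical, and the layout of 4 shader engines with a single shader array each is only inferred from the Phoronix list.

```python
def build_disable_cu(triples):
    """Build an amdgpu.disable_cu boot parameter from (se, sh, cu) triples."""
    parts = []
    for se, sh, cu in triples:
        if se < 0 or sh < 0 or cu < 0:
            raise ValueError(f"indices must be non-negative: {(se, sh, cu)}")
        parts.append(f"{se}.{sh}.{cu}")
    return "amdgpu.disable_cu=" + ",".join(parts)


def same_cus_in_every_engine(num_se, cus, sh=0):
    """Return triples that disable the same CU indices in each shader engine.

    Assumption: one shader array (sh=0) per engine, as in the Phoronix list.
    """
    return [(se, sh, cu) for se in range(num_se) for cu in cus]


if __name__ == "__main__":
    # Reproduces the list from item 1: CUs 14 and 15 in each of 4 shader engines.
    print(build_disable_cu(same_cus_in_every_engine(4, (14, 15))))
```

Disabling one shader engine's CUs at a time and repeating the mining test might narrow down where the fault is, assuming it is tied to specific CUs at all.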
The system is currently on the latest stable kernel (5.3), but the problem has existed for quite a while (on every kernel version I have used, so it is most likely due to the hardware itself).
NOTE: Separately, the card seemed to cause intermittent stutters on the system, possibly related to some corrected AER errors showing up in the system log, but that problem went away after I set the PCIe slot the card is in to Gen2 speed.