Bug 91880 - Radeonsi on Grenada cards (r9 390) exceptionally unstable and poorly performing
Summary: Radeonsi on Grenada cards (r9 390) exceptionally unstable and poorly performing
Status: NEEDINFO
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high critical
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
URL:
Whiteboard:
Keywords:
Duplicates: 92495 93288
Depends on:
Blocks:
 
Reported: 2015-09-04 15:00 UTC by Lauri Gustafsson
Modified: 2018-06-06 05:33 UTC



Attachments
An xorg log on Linux 4.2 and DPM disabled (53.02 KB, text/plain)
2015-09-04 18:40 UTC, Lauri Gustafsson
My MSI r9 390 VBIOS ROM (64.00 KB, application/octet-stream)
2015-09-04 18:46 UTC, Lauri Gustafsson
Dmesg from 4.2 kernel DPM enabled before crash/hang (61.37 KB, text/plain)
2015-09-04 19:05 UTC, Lauri Gustafsson
modinfo_radeon (47.10 KB, text/plain)
2015-10-17 11:17 UTC, John Frei
XFX R9 390 vbios.rom (64.00 KB, application/octet-stream)
2015-11-18 14:36 UTC, Jonas
Output of dmesg command (68.69 KB, text/plain)
2015-11-26 20:06 UTC, haverland.n
Xorg log (54.58 KB, text/plain)
2015-11-26 20:07 UTC, haverland.n
dmesg output (95.16 KB, text/plain)
2015-12-01 19:07 UTC, Jonas
Xorg log (50.15 KB, text/plain)
2015-12-01 19:08 UTC, Jonas
Xorg log (no longer crashes) (78.89 KB, text/plain)
2015-12-09 16:28 UTC, Jonas
dmesg output (no longer crashes) (97.46 KB, text/plain)
2015-12-09 16:30 UTC, Jonas
dmesg with amdgpu crash (97.80 KB, text/plain)
2015-12-12 16:12 UTC, John Frei
dmesg output latest Arch Linux (107.17 KB, text/plain)
2015-12-20 18:29 UTC, Jonas
MSI R9 390x Grenada vbios.rom [VER015.048.000.062.000000] (64.00 KB, application/octet-stream)
2016-01-14 06:52 UTC, Thomas DEBESSE
XFX R9 DD Black Edition vbios.rom (64.00 KB, application/octet-stream)
2016-01-14 20:47 UTC, Harald Judt
dmesg after boot (63.72 KB, text/plain)
2016-01-14 20:49 UTC, Harald Judt
dmesg after resume (76.61 KB, text/plain)
2016-01-14 20:49 UTC, Harald Judt
Club3D R9 390 Vbios (64.00 KB, text/plain)
2016-01-29 23:12 UTC, Ioannis Panagiotopoulos
attachment-9713-0.html (7.22 KB, text/html)
2016-07-20 10:01 UTC, John Bridgman
PowerColor R9 390 PCS+ vbios (64.00 KB, application/octet-stream)
2016-10-31 19:49 UTC, Marcel Schaal
kernel patch: set "high" default DPM profile instead of "auto" for 0x67B1 variant (3.69 KB, patch)
2017-09-10 05:13 UTC, Thomas DEBESSE
kernel patch: set "high" default DPM level instead of "auto" for 0x67B0/0x67B1 variants (3.81 KB, patch)
2017-09-10 05:39 UTC, Thomas DEBESSE
kernel patch: set "high" default DPM level instead of "auto" for 0x67B0/0x67B1 variants (3.81 KB, patch)
2017-09-10 08:11 UTC, Thomas DEBESSE
dmesg capture (56.93 KB, text/plain)
2017-09-18 23:06 UTC, garththeisen
Xorg.0.log (86.77 KB, text/x-log)
2018-03-13 05:50 UTC, Chris Heald
dmesg output (80.94 KB, text/x-log)
2018-03-13 05:50 UTC, Chris Heald
attachment-9725-0.html (2.39 KB, text/html)
2018-05-11 22:13 UTC, Sandeep

Description Lauri Gustafsson 2015-09-04 15:00:09 UTC
I have an r9 390, and I had to disable DPM (on Linux 4.2) to get a stable desktop. The performance is less than half that of an r9 290 (which is almost the same GPU) on the same driver.

How is Grenada support planned, and can I get an estimate of its arrival?
Comment 1 Alex Deucher 2015-09-04 15:03:22 UTC
It's supported.  Please attach your xorg log, dmesg output and a copy of your vbios.  To get a copy of your vbios:
(as root)
(use lspci to get the bus id)
cd /sys/bus/pci/devices/<pci bus id>
echo 1 > rom
cat rom > /tmp/vbios.rom
echo 0 > rom
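
The steps above can be wrapped in a small shell function so that the ROM is always disabled again after the copy. This is only a sketch: `dump_vbios` is a hypothetical helper name, and the device directory (for example `/sys/bus/pci/devices/0000:01:00.0`) is an assumption that has to be adapted to the bus id reported by lspci. It needs to run as root.

```shell
# Sketch: copy the video BIOS of one PCI device out of sysfs.
# dump_vbios is a hypothetical helper name; the first argument is the
# device directory in sysfs, the second the output file.
dump_vbios() {
    dev_dir=$1
    out_file=$2
    if [ ! -e "$dev_dir/rom" ]; then
        echo "no rom file under $dev_dir" >&2
        return 1
    fi
    echo 1 > "$dev_dir/rom" || return 1   # enable reading of the ROM
    cat "$dev_dir/rom" > "$out_file"
    status=$?
    echo 0 > "$dev_dir/rom"               # disable it again, even if cat failed
    return "$status"
}
```

Usage would look like `dump_vbios /sys/bus/pci/devices/0000:01:00.0 /tmp/vbios.rom`.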
Comment 2 Lauri Gustafsson 2015-09-04 18:40:27 UTC
Created attachment 118081 [details]
An xorg log on Linux 4.2 and DPM disabled
Comment 3 Lauri Gustafsson 2015-09-04 18:46:18 UTC
Created attachment 118082 [details]
My MSI r9 390 VBIOS ROM
Comment 4 Lauri Gustafsson 2015-09-04 18:50:26 UTC
(In reply to Alex Deucher from comment #1)
> It's supported.  Please attach your xorg log, dmesg output and a copy of
> your vbios.  To get a copy of your vbios:
> (as root)
> (use lspci to get the bus id)
> cd /sys/bus/pci/devices/<pci bus id>
> echo 1 > rom
> cat rom > /tmp/vbios.rom
> echo 0 > rom

Thanks for responding. I added the log and the ROM.
Still, even on the newest Linux from Git I get a display crash/hang without disabling DPM on the kernel command line, so I'm not sure if I can supply an Xorg log with DPM switched on.
Performance on this freshly built kernel (without DPM) is still the same as before: somewhat low.
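
To confirm whether DPM was switched off on the kernel command line for a given boot, the `radeon.dpm` parameter can be read back from `/proc/cmdline`. A minimal sketch, assuming the parameter was passed on the command line and not through a modprobe configuration file; `radeon_dpm_param` is a hypothetical helper name.

```shell
# Sketch: report the value of radeon.dpm found on a kernel command line
# read from standard input. Prints "unset" when the parameter is absent.
# If the parameter appears more than once, the last occurrence is printed.
radeon_dpm_param() {
    value=$(tr ' ' '\n' | sed -n 's/^radeon\.dpm=//p' | tail -n 1)
    echo "${value:-unset}"
}
```

Usage would look like `radeon_dpm_param < /proc/cmdline`, which should print `0` on a boot where DPM was disabled this way.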
Comment 5 Lauri Gustafsson 2015-09-04 19:05:26 UTC
Created attachment 118083 [details]
Dmesg from 4.2 kernel DPM enabled before crash/hang
Comment 6 Lauri Gustafsson 2015-09-04 19:08:48 UTC
An interesting observation is that lspci shows my card as "Hawaii PRO [Radeon R9 290]" (even though it's not), and dmesg says
"[    6.072219] radeon 0000:01:00.0: Invalid ROM contents"
Comment 7 Lauri Gustafsson 2015-09-08 13:13:19 UTC
I did get Unigine Heaven running with DPM on for a few seconds (before hanging), showing way better frame rates than before.
It really seems that
- The performance issues are most likely caused by the lack of DPM
- DPM for Grenada GPUs is not supported completely
Or:
- My card has a really weird (stock) VBIOS that Radeon(si?) cannot interpret/work with.
Comment 8 Alex Deucher 2015-10-16 15:01:33 UTC
*** Bug 92495 has been marked as a duplicate of this bug. ***
Comment 9 Alex Deucher 2015-10-16 15:13:59 UTC
Can you try the latest smc and mc ucode from here:
http://people.freedesktop.org/~agd5f/radeon_ucode/hawaii/
Make sure to update your initrd if you are using one.
Comment 10 John Frei 2015-10-16 21:20:10 UTC
Wow, thank you for this quick reply!

Here's what I've done now. Please correct me if I've done something wrong.

//Removing the old ucode.
-rm /usr/lib/firmware/radeon/hawaii*
-rm /usr/lib/firmware/radeon/HAWAII*

Then I downloaded all 11 files from the link and moved them to /usr/lib/firmware/radeon/.

Finally I executed 'sudo dracut -f' to generate a new initramfs.


Sadly it doesn't change the behavior. I still get a freeze/black screen. :-(
Comment 11 Alex Deucher 2015-10-16 22:04:31 UTC
(In reply to John Frei from comment #10)
> Wow, thank you for this quick reply!
> 
> Here's what i've done now. Please correct me, if I've done something wrong.
> 
> //Removing the old ucode.
> -rm /usr/lib/firmware/radeon/hawaii*
> -rm /usr/lib/firmware/radeon/HAWAII*

Are you sure that is the correct directory?  Most distros look for the firmware in /lib/firmware/radeon/
Comment 12 John Frei 2015-10-16 22:17:01 UTC
Thank you for your quick response.

'ls -al /' tells me:
[...]
lrwxrwxrwx.   1 root root    7 16. Aug 2014  lib -> usr/lib
lrwxrwxrwx.   1 root root    9 16. Aug 2014  lib64 -> usr/lib64
[...]

So /lib should be just a symbolic link to /usr/lib in Fedora 22.

If you need any other data/logs from my system, just ask!
Comment 13 John Frei 2015-10-17 11:17:08 UTC
Created attachment 118933 [details]
modinfo_radeon

Here's the output of 'modinfo radeon'.

As you can see, the uppercase 'HAWAII_*' firmware is still listed, too, although these files were deleted from the /lib/firmware/radeon/ directory.

Did I do something wrong when rebuilding the initramfs?
Comment 14 Alex Deucher 2015-10-17 14:06:44 UTC
(In reply to John Frei from comment #13)
> Created attachment 118933 [details]
> modinfo_radeon
> 
> Here's the output of 'modinfo radeon'.
> 
> As you can see, the uppercase 'HAWAII_*' firmware is still listed, too,
> although these files were deleted from the /lib/firmware/radeon/ directory.

The upper case ones were the original versions without the header information.  They are only used for backwards compatibility.  All you need is the lower case ones.
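
A quick way to check that every lower-case file the module asks for is actually installed is to compare the firmware list from `modinfo` with the firmware directory. A minimal sketch, assuming the files are stored uncompressed under the given root; `missing_hawaii_firmware` is a hypothetical helper name.

```shell
# Sketch: read firmware paths (one per line, e.g. "radeon/hawaii_smc.bin")
# from standard input and print the lower-case hawaii files that do not
# exist under the given firmware root. Upper-case legacy names are ignored.
missing_hawaii_firmware() {
    fw_root=$1
    while IFS= read -r fw; do
        case $fw in
            radeon/hawaii_*)
                [ -e "$fw_root/$fw" ] || echo "$fw"
                ;;
        esac
    done
}
```

Usage would look like `modinfo -F firmware radeon | missing_hawaii_firmware /lib/firmware`; no output means nothing is missing.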
Comment 15 Lauri Gustafsson 2015-10-18 11:59:45 UTC
I installed the new firmware files. My screens still go black when running GPU-intensive tasks, but the display soon restores itself and keeps running with an unimpressive framerate. /proc/kmsg had this:

<3>[  137.063440] radeon 0000:01:00.0: ring 0 stalled for more than 10120msec
<4>[  137.063450] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000002637 last fence id 0x000000000000263a on ring 0)
<6>[  137.117833] radeon 0000:01:00.0: Saved 89 dwords of commands on ring 0.
<6>[  137.117886] radeon 0000:01:00.0: GPU softreset: 0x00000009
<6>[  137.117889] radeon 0000:01:00.0:   GRBM_STATUS=0xF5D04028
<6>[  137.117891] radeon 0000:01:00.0:   GRBM_STATUS2=0x52000008
<6>[  137.117892] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0xEC400000
<6>[  137.117894] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0xEE400000
<6>[  137.117896] radeon 0000:01:00.0:   GRBM_STATUS_SE2=0xEC400000
<6>[  137.117898] radeon 0000:01:00.0:   GRBM_STATUS_SE3=0xEC400000
<6>[  137.117900] radeon 0000:01:00.0:   SRBM_STATUS=0x20000A40
<6>[  137.117901] radeon 0000:01:00.0:   SRBM_STATUS2=0x00000000
<6>[  137.117903] radeon 0000:01:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
<6>[  137.117905] radeon 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
<6>[  137.117907] radeon 0000:01:00.0:   CP_STAT = 0x84228600
<6>[  137.117909] radeon 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000c00
<6>[  137.117910] radeon 0000:01:00.0:   CP_STALLED_STAT2 = 0x40000000
<6>[  137.117912] radeon 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000400
<6>[  137.117914] radeon 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x00000006
<6>[  137.117916] radeon 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000001
<6>[  137.117918] radeon 0000:01:00.0:   CP_CPF_STATUS = 0x80000063
<6>[  137.117919] radeon 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000000
<6>[  137.117921] radeon 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
<6>[  137.117923] radeon 0000:01:00.0:   CP_CPC_STATUS = 0x00000000
<6>[  137.117925] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
<6>[  137.117927] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
<6>[  137.145522] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00010001
<6>[  137.145576] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
<6>[  137.146731] radeon 0000:01:00.0:   GRBM_STATUS=0x00003028
<6>[  137.146733] radeon 0000:01:00.0:   GRBM_STATUS2=0x00000008
<6>[  137.146735] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000006
<6>[  137.146738] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000006
<6>[  137.146740] radeon 0000:01:00.0:   GRBM_STATUS_SE2=0x00000006
<6>[  137.146742] radeon 0000:01:00.0:   GRBM_STATUS_SE3=0x00000006
<6>[  137.146744] radeon 0000:01:00.0:   SRBM_STATUS=0x20000040
<6>[  137.146745] radeon 0000:01:00.0:   SRBM_STATUS2=0x00000000
<6>[  137.146747] radeon 0000:01:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
<6>[  137.146749] radeon 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
<6>[  137.146751] radeon 0000:01:00.0:   CP_STAT = 0x00000000
<6>[  137.146753] radeon 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000000
<6>[  137.146754] radeon 0000:01:00.0:   CP_STALLED_STAT2 = 0x00000000
<6>[  137.146756] radeon 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000000
<6>[  137.146758] radeon 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x00000000
<6>[  137.146760] radeon 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
<6>[  137.146762] radeon 0000:01:00.0:   CP_CPF_STATUS = 0x00000000
<6>[  137.146763] radeon 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000000
<6>[  137.146765] radeon 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
<6>[  137.146767] radeon 0000:01:00.0:   CP_CPC_STATUS = 0x00000000
<6>[  137.146788] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
<3>[  137.366496] [drm:ci_dpm_enable [radeon]] *ERROR* ci_start_dpm failed
<3>[  137.366509] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed
<6>[  137.366513] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
<6>[  137.366516] [drm] PCIE gen 2 link speeds already enabled
<6>[  137.369071] [drm] PCIE GART of 2048M enabled (table at 0x0000000000324000).
<6>[  137.369180] radeon 0000:01:00.0: WB enabled
<6>[  137.369184] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000200000c00 and cpu addr 0xffff8802312b9c00
<6>[  137.369186] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000200000c04 and cpu addr 0xffff8802312b9c04
<6>[  137.369188] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000200000c08 and cpu addr 0xffff8802312b9c08
<6>[  137.369190] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000200000c0c and cpu addr 0xffff8802312b9c0c
<6>[  137.369191] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000200000c10 and cpu addr 0xffff8802312b9c10
<6>[  137.369616] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000076c98 and cpu addr 0xffffc90001036c98
<6>[  137.369769] radeon 0000:01:00.0: fence driver on ring 6 use gpu addr 0x0000000200000c18 and cpu addr 0xffff8802312b9c18
<6>[  137.369771] radeon 0000:01:00.0: fence driver on ring 7 use gpu addr 0x0000000200000c1c and cpu addr 0xffff8802312b9c1c
<6>[  137.371302] [drm] ring test on 0 succeeded in 3 usecs
<6>[  137.371383] [drm] ring test on 1 succeeded in 3 usecs
<6>[  137.371395] [drm] ring test on 2 succeeded in 2 usecs
<6>[  137.371541] [drm] ring test on 3 succeeded in 5 usecs
<6>[  137.371548] [drm] ring test on 4 succeeded in 5 usecs
<6>[  137.417580] [drm] ring test on 5 succeeded in 2 usecs
<6>[  137.437595] [drm] UVD initialized successfully.
<6>[  137.539615] [drm] ring test on 6 succeeded in 981 usecs
<6>[  137.539626] [drm] ring test on 7 succeeded in 3 usecs
<6>[  137.539627] [drm] VCE initialized successfully.
<3>[  137.539677] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed
<6>[  137.551432] [drm] ib test on ring 0 succeeded in 0 usecs
<6>[  137.551465] [drm] ib test on ring 1 succeeded in 0 usecs
<6>[  137.551492] [drm] ib test on ring 2 succeeded in 0 usecs
<6>[  137.551558] [drm] ib test on ring 3 succeeded in 0 usecs
<6>[  137.551611] [drm] ib test on ring 4 succeeded in 0 usecs
<6>[  138.070169] [drm] ib test on ring 5 succeeded
<6>[  138.091061] [drm] ib test on ring 6 succeeded
<6>[  138.091815] [drm] ib test on ring 7 succeeded
Comment 16 John Frei 2015-10-19 11:16:41 UTC
In my case a hard reset is the only option.
The computer doesn't restore itself.
But the screen still gets a signal from the graphics card, which is just black with red dotted noise on the left.
Comment 17 Jonas 2015-11-18 14:35:00 UTC
Hi. I have the same card, although it is from XFX, and I get the same behaviour (apparently random crash when gaming/video). The card is also recognized as Hawaii PRO by "lspci". I'm uploading my "vbios.rom" in case it is useful. I hope I can help in some way.
Comment 18 Jonas 2015-11-18 14:36:19 UTC
Created attachment 119907 [details]
XFX R9 390 vbios.rom
Comment 19 haverland.n 2015-11-26 20:06:05 UTC
Created attachment 120151 [details]
Output of dmesg command

I have the same issue on my system with the MSI 390. I made a copy of the requested files.
Comment 20 haverland.n 2015-11-26 20:07:00 UTC
Created attachment 120152 [details]
Xorg log

Xorg log
Comment 21 Jonas 2015-12-01 19:07:21 UTC
Created attachment 120236 [details]
dmesg output

This is my "dmesg" after successful reboot, after crash.
Comment 22 Jonas 2015-12-01 19:08:30 UTC
Created attachment 120237 [details]
Xorg log

This is Xorg log.
Comment 23 Jonas 2015-12-09 11:58:13 UTC
I have some news. Yesterday I installed the latest Debian Sid, which includes a two-day-old firmware-amd-graphics (https://packages.debian.org/sid/firmware-amd-graphics), and I see a pretty big improvement overall. I'm not home right now, but yesterday I could play every Valve game I own in good conditions.

Difference is that it no longer crashes for me, although sometimes it still stutters a little. It's like ~70% of the time it plays smooth, and ~30% it stutters. I tried to make it crash, but I couldn't, even with high details and some options pushed to max.

So, dpm seems to fail sometimes, but it is definitely improving for Grenada cards (although system still recognises GPU as Hawaii and uses Hawaii firmware). Many thanks to the developers that take care of this, I really appreciate your work and effort.

I hope you guys get the same results. It seems we finally get to play with our shiny R9 390!
Comment 24 Marek Olšák 2015-12-09 12:27:29 UTC
Yeah, old firmware versions can be problematic. Can this bug be closed then?
Comment 25 Ernst Sjöstrand 2015-12-09 12:28:35 UTC
Is there newer firmware than http://people.freedesktop.org/~agd5f/radeon_ucode/ ?
Comment 26 John Frei 2015-12-09 15:05:16 UTC
I will try the new firmware on weekend and share my experiences.
What do you mean by 'dpm seems to fail'? Is it "just" poor performance or still a system crash?
Comment 27 Jonas 2015-12-09 16:23:54 UTC
I just retried Portal 2, Team Fortress 2 and L4D2, and it seems that when games launch I get a big hang (which looks like it's going to crash), but then it restores itself and works great. When I talk about "dpm that seems to fail" I mean that when the GPU must use more power, the game stutters. But at least it's working, I just tried to make it crash without success. I'm attaching dmesg and xorg, because I think there are some differences.

As far as I'm concerned, it seems to be working pretty well. I'm looking forward to seeing if someone can confirm this.
Comment 28 Jonas 2015-12-09 16:28:56 UTC
Created attachment 120436 [details]
Xorg log (no longer crashes)
Comment 29 Jonas 2015-12-09 16:30:17 UTC
Created attachment 120437 [details]
dmesg output (no longer crashes)

I guess what looks different is there:
[ 1976.736987] [drm:ci_dpm_enable [radeon]] *ERROR* ci_start_dpm failed
[ 1976.737005] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed
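
The two errors quoted above, together with the "GPU lockup" message reported earlier in this bug, are easy to count in a saved log when comparing kernels or firmware versions. A minimal sketch that matches only those three message strings; `summarize_radeon_failures` is a hypothetical helper name.

```shell
# Sketch: count radeon failure messages in a kernel log read from
# standard input. Only the three strings seen in this bug are matched.
summarize_radeon_failures() {
    awk '
        /GPU lockup/          { lockups++ }
        /ci_start_dpm failed/ { dpm_start++ }
        /dpm resume failed/   { dpm_resume++ }
        END {
            printf "lockups=%d dpm_start_failed=%d dpm_resume_failed=%d\n",
                   lockups, dpm_start, dpm_resume
        }
    '
}
```

Usage would look like `dmesg | summarize_radeon_failures`.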
Comment 31 Thomas DEBESSE 2015-12-10 01:08:42 UTC
Hi, it looks like I'm experiencing the same bug (see #93288 ).

I own a Radeon R9 390X manufactured by MSI. I just tried the latest firmware from the git repository, but it fixed nothing. (Tip for Ubuntu users: the package from the upcoming xenial release currently ships the latest hawaii firmware; I checked the files against the latest master: https://launchpad.net/ubuntu/+source/linux-firmware ).

I can send you my dmesg if needed.
Comment 32 Alex Deucher 2015-12-10 05:40:04 UTC
Can you try the code in this branch:
http://cgit.freedesktop.org/~agd5f/linux/log/?h=new_smc
and the new ucode from here:
http://people.freedesktop.org/~agd5f/radeon_ucode/k/
Comment 33 John Frei 2015-12-12 16:11:48 UTC
I've tried the new stuff by the following steps:
-cloning the code from the new_smc branch...
-I enabled the "Enable amdgpu support for CIK parts" due to a recommendation on reddit.
-compiling and installing the kernel+modules...
-I placed the ucode to /lib/firmware/radeon and /lib/firmware/amdgpu
-The rebuild of the modules was done by 'sudo dracut -f'. <-- I hope this is the correct command.

-Then I rebooted the system with 'rdblacklist=0 amdgpu.enable_scheduler=1'.
-I was able to log in and use the system for a short time until the screen went black. (~1min)
-BUT: I pressed ALT+CTRL+F4 and 'blindly' typed my login credentials and was able to 
 dump the dmesg output to file (attachment). The number lock could be switched on/off for a few seconds.
 Even though the screen remained black.
-Then I booted with unmodified kernel parameters.
-A few seconds after the login screen appeared the screen switched to black.
-Unlike the first try with amdgpu I wasn't able to execute a 'dmesg >> file'.
 The number lock couldn't be switched on/off anymore.
Comment 34 John Frei 2015-12-12 16:12:51 UTC
Created attachment 120486 [details]
dmesg with amdgpu crash
Comment 35 Alex Deucher 2015-12-14 15:35:10 UTC
The patches have no effect on amdgpu.  Please test radeon.
Comment 36 John Frei 2015-12-14 17:36:57 UTC
I assume that on the second try (boot with unmodified kernel parameters) the radeon module was used.
The only thing I can try is to compile the kernel without the 'amdgpu CIK parts', but this should not affect the radeon module, right?
Comment 37 Alex Deucher 2015-12-14 17:39:18 UTC
(In reply to John Frei from comment #36)
> I assume that the second try (boot with unmodified kernel parameters) the
> module radeon was used.
> The only thing I can try is to compile the kernel without the 'amdgpu CIK
> parts', but this should not affect the radeon module, right?

Either disable CIK support in amdgpu or blacklist the amdgpu module (e.g., modprobe.blacklist=amdgpu on the kernel command line in grub) to keep the amdgpu driver from loading on CIK hardware.
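
After rebooting, it can be worth confirming which driver actually bound to the card before testing. A minimal sketch that reads the `driver` symlink in sysfs; `bound_driver` is a hypothetical helper name, and the device path used in the usage line is an assumption based on the 0000:01:00.0 address that appears in the logs in this bug.

```shell
# Sketch: print the name of the kernel driver bound to a PCI device, or
# "none" when no driver is bound. The argument is the device directory
# in sysfs.
bound_driver() {
    dev_dir=$1
    if [ -L "$dev_dir/driver" ]; then
        basename "$(readlink "$dev_dir/driver")"
    else
        echo none
    fi
}
```

Usage would look like `bound_driver /sys/bus/pci/devices/0000:01:00.0`, which should print `radeon` when amdgpu has been kept away from the card.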
Comment 38 John Frei 2015-12-18 19:42:52 UTC
I explicitly blacklisted amdgpu. Sadly, I got an instant crash on the login screen with radeon, too. (kernel 4.4-rc3+)
Comment 39 Jonas 2015-12-20 18:29:54 UTC
Created attachment 120607 [details]
dmesg output latest Arch Linux

After some testing under Arch Linux I noticed that the crash after ~10 sec of gaming (sometimes it's instant) is still happening. However the dmesg output I provide here seems to be very different than others so far. It happened while in Big Picture mode on Steam, no game launched, only playing with menus. I got a crash on tty1 (the X.org one) but I could get dmesg out of another tty. However once I got back to tty1 I couldn't do anything else and had to reboot.

Other than that, playing videos on Firefox or mpv still crashes after some time (~30 sec), but I can't get dmesg output.
Comment 40 Marek Olšák 2015-12-21 13:34:57 UTC
Jonas,

Can you please try latest LLVM and Mesa git? LLVM has some fixes for VM faults.
Comment 41 Jonas 2015-12-21 17:45:38 UTC
I installed a [mesa-git] repository, which includes LLVM and mesa packages. I also installed lib32 versions, but I'm still facing the same VM errors when I'm able to get a dmesg output. I also tried latest [testing] repository but I get the same result.

Thanks for your help.
Comment 42 lainlives 2015-12-25 22:50:02 UTC
I worked around this by setting performance & high BEFORE Xorg is able to start.  I seem to only have this problem if it switches power modes while X is running.
Comment 43 Lauri Gustafsson 2015-12-26 09:06:50 UTC
(In reply to lainlives from comment #42)
> I worked around this by setting performance & high BEFORE Xorg is able to
> start.  I seem to only have this problem if it switches power modes while X
> is running.

My card still hangs after a little use even when setting power_dpm_state to performance before startx.
http://imgur.com/GQf2NJY
Comment 44 John Frei 2015-12-27 00:32:51 UTC
I found out an interesting fact on my system:

If I start in textmode (runlevel 3) with no specific kernel parameter like 'radeon.dpm=0', set power_dpm_force_performance_level to 'high' and power_dpm_state to 'performance' and finally start the gdm service, the system works without crash. I was even able to play some games.

If I boot in textmode with radeon.dpm=0 and try to set power_profile to 'high', I get a crash immediately. I cannot even start gdm in this case.
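
The first sequence described above (set both values from a text console, then start the display manager) can be sketched as a small function. This is an illustration under assumptions, not a tested fix: `set_dpm_profile` is a hypothetical helper name, and `/sys/class/drm/card0/device` in the usage line assumes the card is card0. It needs to run as root.

```shell
# Sketch: force a fixed DPM level and state for one card.
# The first argument is the card's device directory in sysfs.
set_dpm_profile() {
    dev_dir=$1
    level=$2    # value for power_dpm_force_performance_level, e.g. high
    state=$3    # value for power_dpm_state, e.g. performance
    # Check both attributes first so nothing is changed if one is missing.
    for attr in power_dpm_force_performance_level power_dpm_state; do
        if [ ! -w "$dev_dir/$attr" ]; then
            echo "cannot write $dev_dir/$attr" >&2
            return 1
        fi
    done
    echo "$level" > "$dev_dir/power_dpm_force_performance_level" || return 1
    echo "$state" > "$dev_dir/power_dpm_state"
}
```

Usage would look like `set_dpm_profile /sys/class/drm/card0/device high performance`, run before the gdm service is started.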
Comment 45 lainlives 2015-12-27 02:34:42 UTC
(In reply to John Frei from comment #44)
> I found out some interesting fact on my system:
> 
> If I start in textmode (runlevel 3) with no specific kernel parameter like
> 'radeon.dpm=0', set power_dpm_force_performance_level to 'high' and
> power_dpm_state to 'performance' and finally start the gdm service, the
> system works without crash. I was even able to play some games.
> 
> If I boot in textmode with radeon.dpm=0 and try to set power_profile to
> 'high', I get a crash immediately. I cannot even start gdm in this case.

Your experience sounds exactly like mine. Setting a udev rule to do that upon boot works fine as a workaround that persists across reboots.
Comment 46 Jonas 2015-12-27 10:46:37 UTC
My case is completely different. I just tried setting dpm "state" to "performance" and "level" to "high" while in Plasma 5 session, and I got no problem at all. I didn't use any special kernel parameter, so I have dpm on. I could even play some games without crash, but still with some stutter. In fact, "dmesg" shows nothing about radeon or dpm failing, I guess because it's not regulating voltage and frequency (which is what seems to make it crash).

So in my case it seems I can switch profiles while in active X session without any problem. I have kernel 4.2.5, latest mesa, llvm, firmware and libdrm.
Comment 47 Lauri Gustafsson 2015-12-28 17:13:47 UTC
(In reply to John Frei from comment #44)
> I found out some interesting fact on my system:
> 
> If I start in textmode (runlevel 3) with no specific kernel parameter like
> 'radeon.dpm=0', set power_dpm_force_performance_level to 'high' and
> power_dpm_state to 'performance' and finally start the gdm service, the
> system works without crash. I was even able to play some games.
> 
> If I boot in textmode with radeon.dpm=0 and try to set power_profile to
> 'high', I get a crash immediately. I cannot even start gdm in this case.

Ah, setting the force_performance_level was the key. Now I can get >70fps in Unigine Heaven without any crashes! I hope dpm gets fixed and the fix trickles down to Arch Linux soon; it's a bit of a waste to be running my card at full power all the time...
Comment 48 Julian 2016-01-12 15:39:02 UTC
I have the same issue. R9 390, running DPM=1 will end up in an eventual freeze. Using DPM=0 works fine but is obviously suboptimal.

I could provide another xorg log/vbios/dmesg but they seem to be quite similar to the thread starter's.

Two things to add:

* I've tried with Ubuntu's default Mesa version (which is 11.0.8 I think) and the newest (11.1) and the newer release seems to cause the freeze much faster. On 11.0.8 I'd be able to use the system for hours until it froze. 11.1 never lasted for more than 20 minutes.

* A quick way to force the freeze to happen is to use Google Maps. Half a minute of scrolling around and zooming in/out is enough to make it happen.
Comment 49 Thomas DEBESSE 2016-01-14 06:46:45 UTC
Hey, it's awesome, the workaround Lauri Gustafsson found is working well!

I just do this after my computer starts up:

```
service 'gdm' stop

echo 'high' > '/sys/class/drm/card0/device/power_dpm_force_performance_level'
echo 'performance' > '/sys/class/drm/card0/device/power_dpm_state'

service 'gdm' start
```

Then it works! I was able to run the Unigine Valley and Heaven benchmarks for the first time with dpm enabled, without crashing at startup, and with decent performance.

If I don't do that and keep the dpm default config, my system crashes some minutes after startup (and for sure when running on heavy benchmarks like those ones).

I own an MSI R9 390X. In the past I wrote some details on bug 93288 . I'll upload my vbios soon. There are some strings like these inside:

```
05/21/15 03:39
MS-V30823-F2
GRENADA
PCI_EXPRESS
GDDR5
113-MSITV308MH.201
(C) 1988-2010, Advanced Micro Devices, Inc.
ATOMBIOSBK-AMD VER015.048.000.062.000000
V30830SC.bin
MSI_GRENADA_V30823XT_C67201_A0_GDDR5_8GB_30S\config.h
AMD ATOMBIOS
```
Comment 50 Thomas DEBESSE 2016-01-14 06:52:22 UTC
Created attachment 121014 [details]
MSI R9 390x Grenada vbios.rom [VER015.048.000.062.000000]
Comment 51 Thomas DEBESSE 2016-01-14 13:34:44 UTC
*** Bug 93288 has been marked as a duplicate of this bug. ***
Comment 52 Harald Judt 2016-01-14 20:47:23 UTC
Created attachment 121045 [details]
XFX R9 DD Black Edition vbios.rom

I have the same DPM problems. OpenGL seems to work fine, though, and I have not experienced any hangs so far, but I only started using this new graphics card yesterday.

I am using the newest radeon ucode from here:
http://people.freedesktop.org/~agd5f/radeon_ucode/hawaii/
However, I have renamed the files from http://people.freedesktop.org/~agd5f/radeon_ucode/k/ instead of using the code from the new_smc branch.

I will attach dmesg and dmesg after resume too.

XFX R9 DD Black Edition
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290] (rev 80) (prog-if 00 [VGA controller])
        Subsystem: XFX Pine Group Inc. Hawaii PRO [Radeon R9 290]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 25
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at f0000000 (64-bit, prefetchable) [size=8M]
        Region 4: I/O ports at e000 [size=256]
        Region 5: Memory at f7e00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at f7e40000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0100c  Data: 4191
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [200 v1] #15
        Capabilities: [270 v1] #19
        Capabilities: [2b0 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable-, Smallest Translation Unit: 00
        Capabilities: [2c0 v1] #13
        Capabilities: [2d0 v1] #1b
        Kernel driver in use: radeon
Comment 53 Harald Judt 2016-01-14 20:49:22 UTC
Created attachment 121046 [details]
dmesg after boot

dmesg after boot
Comment 54 Harald Judt 2016-01-14 20:49:52 UTC
Created attachment 121047 [details]
dmesg after resume

dmesg after resuming from hibernation
Comment 55 Thomas DEBESSE 2016-01-15 01:47:07 UTC
Hi, I ran some tests and found that my R9 390X works very well as long as I never load the "auto" `power_dpm_force_performance_level` profile; both "low" and "high" work. All three `power_dpm_state` values ("balanced", "battery", "performance") work as well.

So I wrote a little systemd service that loads the "low/battery" profile at startup (just before multi-user.target, so it's loaded before the login manager starts).

I discovered I can switch the `power_dpm_force_performance_level` profile to the value I want, and the same for `power_dpm_state`, without any crash, even while a heavy task (like Unigine Valley Benchmark) is running.

The fault is on the "auto" `power_dpm_force_performance_level` profile, and only that. All other options work.

If you want to work around the bug, you can use my service:
https://github.com/illwieckz/dpm-query/

Just install the service (it will load "low/battery" DPM profile at startup), then use the `dpm-query` tool to set "high/performance" DPM profile when you need it.
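
For anyone who would rather not install the service, the same two sysfs writes can be sketched in a few lines of Python. This is only an illustration and not how dpm-query is implemented: the helper name `set_dpm_profile` and its `device_dir` parameter are made up here, the path assumes the GPU is card0, and writing to it needs root.

```python
from pathlib import Path

# Assumption: the GPU is card0; adjust on multi-GPU systems.
DEFAULT_DEVICE_DIR = Path("/sys/class/drm/card0/device")

# Values mentioned in this report.
LEVELS = {"auto", "low", "high"}
STATES = {"battery", "balanced", "performance"}


def set_dpm_profile(level, state, device_dir=DEFAULT_DEVICE_DIR):
    """Write a level/state pair to the radeon DPM sysfs files (hypothetical helper)."""
    if level not in LEVELS:
        raise ValueError(f"unknown performance level: {level!r}")
    if state not in STATES:
        raise ValueError(f"unknown DPM state: {state!r}")
    device_dir = Path(device_dir)
    # Same order as writing the two files by hand: level first, then state.
    (device_dir / "power_dpm_force_performance_level").write_text(level + "\n")
    (device_dir / "power_dpm_state").write_text(state + "\n")
```

Calling `set_dpm_profile("low", "battery")` as root would correspond to the "low/battery" profile; passing another `device_dir` lets it be tried against a scratch directory first.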
Comment 56 Julian 2016-01-15 12:13:36 UTC
(In reply to Thomas DEBESSE from comment #55)
> Hi, I made some tests, I discovered that my R9 390X works very well if I
> never load the "auto" `power_dpm_force_performance_level` profile, both
> "low" and "high" works. Also, both `power_dpm_state` "balanced", "battery",
> "performance" works.
> 
> ...

Thank you for that. It seems like my system reacts the same way - I've re-enabled DPM and set the perf level to something that isn't auto. It hasn't crashed on me for about an hour now, which is better than before.

I've also tried your tool - the service doesn't actually set the dpm parameters to 'low' and 'battery' for me. Maybe there's some permissions issue involved. But dpm-query works fine and is great for convenience. Thanks for that as well.
Comment 57 Andy Furniss 2016-01-15 12:30:39 UTC
I have different h/w from you (TONGA).

Forcing high can help me, but be aware that, for me at least, it may change back to auto as soon as some display change happens, e.g. coming out of dpms or changing a mode.
Comment 58 Julian 2016-01-16 14:00:55 UTC
In case anyone is interested, I've hacked together a GNOME extension that adds a status icon showing the current DPM setting and a popup menu to easily switch: https://github.com/JuBan1/radeon-dpm-control.git

It only handles three combinations of settings: low/battery, auto/balanced, high/performance. I don't think the others are particularly useful.

It makes use of Thomas DEBESSE's dpm-query script from a few comments earlier.
Comment 59 Thomas DEBESSE 2016-01-18 04:23:46 UTC
Hi, since the bug is narrowed to the power_dpm_force_performance_level's auto profile, is there something we can do to help more?
Comment 60 Alex Deucher 2016-01-19 15:13:22 UTC
Can you see if disabling different dpm options helps?  Try to narrow down which one(s) are problematic.  E.g., this patch will disable all of them.


diff --git a/drivers/gpu/drm/radeon/ci_dpm.c b/drivers/gpu/drm/radeon/ci_dpm.c
index 4a09947..60ed634 100644
--- a/drivers/gpu/drm/radeon/ci_dpm.c
+++ b/drivers/gpu/drm/radeon/ci_dpm.c
@@ -5707,10 +5707,10 @@ int ci_dpm_init(struct radeon_device *rdev)
 
        pi->mclk_activity_target = CISLAND_MCLK_TARGETACTIVITY_DFLT;
 
-       pi->sclk_dpm_key_disabled = 0;
-       pi->mclk_dpm_key_disabled = 0;
-       pi->pcie_dpm_key_disabled = 0;
-       pi->thermal_sclk_dpm_enabled = 0;
+       pi->sclk_dpm_key_disabled = 1;
+       pi->mclk_dpm_key_disabled = 1;
+       pi->pcie_dpm_key_disabled = 1;
+       pi->thermal_sclk_dpm_enabled = 1;
 
        /* mclk dpm is unstable on some R7 260X cards with the old mc ucode */
        if ((rdev->pdev->device == 0x6658) &&
Comment 61 Julian 2016-01-20 01:13:18 UTC
(In reply to Alex Deucher from comment #60)
> Can you see if disabling different dpm options helps?  Try to narrow down
> which one(s) are problematic.  E.g., this patch will disable all of them.
> 
> ...

So, I've never done this before but I managed to recompile the kernel with the attached patch and boot from it.

I've been running on the performance_level auto setting for about an hour now and haven't encountered the freeze. I'll be using it for a few hours tomorrow to make sure but considering it usually takes only ~20 minutes to freeze, I'd consider this version fixed.

On this note, I guess there are ways to apply a patch to the radeon driver without recompiling the whole kernel? That'd make things less time-consuming.
Comment 62 Thomas DEBESSE 2016-01-20 04:32:05 UTC
(In reply to Julian from comment #61)
> I'd consider this version fixed.

Beware, no, it's not fixed: Alex's patch is not a patch to fix something, it's a patch to test something. Applying the patch as-is only helps to show that the error is in one of these 4 lines, but all 4 of these are features that must stay enabled. Saying this patch fixes something is like saying that not using the "auto" level profile fixes the driver, which it does not. :D

But thanks a lot for that test, now we know the error is in one of these 4 options. We need to test each of them separately now. :-)
Comment 63 Julian 2016-01-20 15:29:27 UTC
Well, "fixed" :p

I've used the patched module for 5-6 hours without issues now. So this version checks out.

Next is with sclk_dpm_key_disabled = 0 and mclk_dpm_key_disabled = 0;

My guess is that this'll make the freeze happen again. I've dug through the sources in an attempt to understand what's going on. My hypothesis is that the lockup happens under rare circumstances when the driver switches from one performance_level to another. Forcing the highest or lowest perf level is fine, but 'auto' switches perf levels often enough that the bug will happen relatively soon.

At first I thought it was an issue with the writeback feature that caches certain register values because it also caches an rptr value that is used in the driver's gpu_lockup_check and is, to my knowledge, never actually written to.

Buuut using radeon.no_wb=1 doesn't help. So if I've found a bug it is not the culprit of the lockups.
Comment 64 Alex Deucher 2016-01-20 17:12:34 UTC
(In reply to Julian from comment #63)
> At first I thought it was an issue with the writeback feature that caches
> certain register values because it also caches an rptr value that is used in
> the driver's gpu_lockup_check and is, to my knowledge, never actually
> written to.

More likely it never gets written because the GPU has hung due to something else.

> 
> Buuut using radeon.no_wb=1 doesn't help. So if I've found a bug it is not
> the culprit of the lockups.

The no_wb option isn't really applicable on newer chips and most likely won't work.  Newer hw does not support the necessary features to not support wb.  It's mainly a leftover from the early radeons.
Comment 65 Julian 2016-01-20 17:37:13 UTC
I've now tested the following configurations:

sclk_dpm_key_disabled = 0;
mclk_dpm_key_disabled = 0;
pcie_dpm_key_disabled = 0;
thermal_sclk_dpm_enabled = 0;
==> LOCKS UP - this is the default configuration

sclk_dpm_key_disabled = 1;
mclk_dpm_key_disabled = 1;
pcie_dpm_key_disabled = 1;
thermal_sclk_dpm_enabled = 1;
==> NO LOCKUP

sclk_dpm_key_disabled = 0;
mclk_dpm_key_disabled = 0;
pcie_dpm_key_disabled = 1;
thermal_sclk_dpm_enabled = 1;
==> LOCKS UP

sclk_dpm_key_disabled = 0;
mclk_dpm_key_disabled = 1;
pcie_dpm_key_disabled = 1;
thermal_sclk_dpm_enabled = 1;
==> LOCKS UP

sclk_dpm_key_disabled = 1;
mclk_dpm_key_disabled = 0;
pcie_dpm_key_disabled = 1;
thermal_sclk_dpm_enabled = 1;
==> NO LOCKUP - however, I periodically checked /sys/kernel/debug/dri/0/radeon_pm_info and mclk never actually changed (always 150000). Maybe mclk usually changes based on sclk? With sclk locked, mclk wouldn't change either, even if unlocked.
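
The five runs above can be tabulated to see which field on its own lines up with the lockup. The sketch below only encodes the values listed in this comment and checks, per field, whether its value cleanly splits the locking runs from the stable ones; the names `RUNS` and `flags_that_separate` are made up for illustration and nothing here touches the driver.

```python
# Each entry: (field values used for the run, whether the GPU locked up).
# Values are copied from the five configurations listed above.
RUNS = [
    ({"sclk_dpm_key_disabled": 0, "mclk_dpm_key_disabled": 0,
      "pcie_dpm_key_disabled": 0, "thermal_sclk_dpm_enabled": 0}, True),
    ({"sclk_dpm_key_disabled": 1, "mclk_dpm_key_disabled": 1,
      "pcie_dpm_key_disabled": 1, "thermal_sclk_dpm_enabled": 1}, False),
    ({"sclk_dpm_key_disabled": 0, "mclk_dpm_key_disabled": 0,
      "pcie_dpm_key_disabled": 1, "thermal_sclk_dpm_enabled": 1}, True),
    ({"sclk_dpm_key_disabled": 0, "mclk_dpm_key_disabled": 1,
      "pcie_dpm_key_disabled": 1, "thermal_sclk_dpm_enabled": 1}, True),
    ({"sclk_dpm_key_disabled": 1, "mclk_dpm_key_disabled": 0,
      "pcie_dpm_key_disabled": 1, "thermal_sclk_dpm_enabled": 1}, False),
]


def flags_that_separate(runs):
    """Return the fields whose value alone splits lockup runs from stable runs."""
    result = []
    for flag in runs[0][0]:
        lockup_values = {cfg[flag] for cfg, locked in runs if locked}
        stable_values = {cfg[flag] for cfg, locked in runs if not locked}
        # A field separates the groups if no value appears on both sides.
        if lockup_values and stable_values and not (lockup_values & stable_values):
            result.append(flag)
    return result


if __name__ == "__main__":
    print(flags_that_separate(RUNS))
```

On these five runs only sclk_dpm_key_disabled separates the two groups. The pcie and thermal fields were never changed independently of each other, so this says nothing definite about them.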




(In reply to Alex Deucher from comment #64)
> (In reply to Julian from comment #63)
> > At first I thought it was an issue with the writeback feature that caches
> > certain register values because it also caches an rptr value that is used in
> > the driver's gpu_lockup_check and is, to my knowledge, never actually
> > written to.
> 
> More likely it never gets written because the GPU has hung due to something
> else.

Sorry, I meant that I've searched the radeon source for the lines of code where the value is accessed. I've only found instances of it being read, no instances of it being written to. But I won't put too much stock into that since I hardly know the code base at all. I plan to mess with the code a little now that I know how to compile single modules. If I find something substantial suggesting that this might be a bug, I'll open a new issue about it.


> > 
> > Buuut using radeon.no_wb=1 doesn't help. So if I've found a bug it is not
> > the culprit of the lockups.
> 
> The no_wb option isn't really applicable on newer chips and most likely
> won't work.  Newer hw does not support the necessary features to not support
> wb.  It's mainly a leftover from the early radeons.

Thanks. Good to know.
Comment 66 Alex Deucher 2016-01-20 17:51:58 UTC
(In reply to Julian from comment #65)
> (In reply to Alex Deucher from comment #64)
> > (In reply to Julian from comment #63)
> > > At first I thought it was an issue with the writeback feature that caches
> > > certain register values because it also caches an rptr value that is used in
> > > the driver's gpu_lockup_check and is, to my knowledge, never actually
> > > written to.
> > 
> > More likely it never gets written because the GPU has hung due to something
> > else.
> 
> Sorry, I meant that I've searched the radeon source for the lines of code
> where the value was accessed from. I've only found instances of it being
> read, no instances of it being written to. But I won't put too much stock
> into that since I hardly know the code base at all. I plan to mess with the
> code a little now that I know how to compile single modules. If I find
> something substantial that this might be a bug, I'll open a new issue about
> it.
> 

The gpu writes to it.  The driver only reads from it.  The GPU shadows the value in system memory so that the driver doesn't have to read the register directly.
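
As a rough picture of that arrangement, here is a toy model in Python. It is not the radeon code: the class and function names are invented, and the lockup check is reduced to "did the read pointer move since the last sample", leaving out any timeout handling.

```python
class ToyRing:
    """Toy model of a ring whose read pointer is shadowed in system memory."""

    def __init__(self):
        self.rptr_register = 0  # lives on the GPU
        self.rptr_shadow = 0    # copy in system memory, written by the GPU

    def gpu_process(self, n):
        """GPU side: consume n entries and write the new rptr to the shadow."""
        self.rptr_register += n
        self.rptr_shadow = self.rptr_register

    def driver_read_rptr(self):
        """Driver side: only ever reads the shadow, never writes it."""
        return self.rptr_shadow


def looks_hung(ring, last_rptr):
    """Simplified lockup check: no progress since the last sample."""
    return ring.driver_read_rptr() == last_rptr
```

In this model the driver-side code contains no write to the shadow at all, which matches what a search of the driver source would turn up; if the GPU side stops calling `gpu_process`, the shadow simply stops moving.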
Comment 67 Alex Deucher 2016-01-20 17:53:46 UTC
(In reply to Julian from comment #65)
> sclk_dpm_key_disabled = 1;
> mclk_dpm_key_disabled = 0;
> pcie_dpm_key_disabled = 1;
> thermal_sclk_dpm_enabled = 1;
> ==> NO LOCKUP - however, I periodically checked
> /sys/kernel/debug/dri/0/radeon_pm_info and mclk never actually changed
> (always 150000). Maybe mclk usually always changes based on sclk? With slck
> locked, mclk wouldn't change either, even if unlocked.

What are you testing with?  You may not be producing enough memory load.
Comment 68 Julian 2016-01-20 18:11:31 UTC
(In reply to Alex Deucher from comment #67)
> (In reply to Julian from comment #65)
> > sclk_dpm_key_disabled = 1;
> > mclk_dpm_key_disabled = 0;
> > pcie_dpm_key_disabled = 1;
> > thermal_sclk_dpm_enabled = 1;
> > ==> NO LOCKUP - however, I periodically checked
> > /sys/kernel/debug/dri/0/radeon_pm_info and mclk never actually changed
> > (always 150000). Maybe mclk usually always changes based on sclk? With slck
> > locked, mclk wouldn't change either, even if unlocked.
> 
> What are you testing with?  You may not be producing enough memory load.

Every time, I've run a couple of 3D games, used Google Maps (which I've previously used to cause the freeze; a few minutes are usually enough), and run some video streams. Usually all at once.

Here's a screenshot of what radeontop looks like: http://i.imgur.com/yQHXoFI.png
Comment 69 Julian 2016-01-20 18:15:39 UTC
P.S. Here's also a screengrab of radeon_pm_info: http://i.imgur.com/zvQeXRR.png

I've modified it to show the settings I've tested, to double-check that they are as they should be.
Comment 70 Julian 2016-01-20 18:31:18 UTC
PPS:

Never mind. The mclk does actually change. It just jumps to 150000 (which is the max my card is rated for) very easily. If I close all applications that could tax the graphics card at all, it jumps back down to 15000. Sorry for the confusion.
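
For watching the clocks without eyeballing the file, something like the following could be used. It is a sketch under one assumption: that radeon_pm_info contains text of the form `sclk: <n> mclk: <n>`. That layout may differ between kernels, and `parse_clocks` and `read_clocks` are names invented for this example. Reading debugfs normally needs root.

```python
import re

# Path quoted earlier in this report; requires debugfs to be mounted.
PM_INFO = "/sys/kernel/debug/dri/0/radeon_pm_info"


def parse_clocks(text):
    """Return the 'sclk' and 'mclk' values found in text as a dict of ints."""
    pairs = re.findall(r"\b(sclk|mclk):\s*(\d+)", text)
    return {name: int(value) for name, value in pairs}


def read_clocks(path=PM_INFO):
    """Read the pm_info file once and return the parsed clocks."""
    with open(path) as f:
        return parse_clocks(f.read())
```

Calling `read_clocks()` in a loop with a short sleep while starting and stopping a GPU-heavy application should show whether mclk moves between its low and high values.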
Comment 71 Alex Deucher 2016-01-20 19:17:56 UTC
Can you also test with the code and firmware from comment 32?
Comment 72 Julian 2016-01-20 20:48:43 UTC
(In reply to Alex Deucher from comment #71)
> Can you also test with the code and firmware from comment 32?

To clarify, do you mean only with the code from http://cgit.freedesktop.org/~agd5f/linux/commit/?h=new_smc or the whole kernel under agd5f?
Comment 73 Alex Deucher 2016-01-21 19:40:32 UTC
(In reply to Julian from comment #72)
> (In reply to Alex Deucher from comment #71)
> > Can you also test with the code and firmware from comment 32?
> 
> To clarify, do you mean only with the code from
> http://cgit.freedesktop.org/~agd5f/linux/commit/?h=new_smc or the whole
> kernel under agd5f?

Either use the kernel from that branch or cherry-pick the top 4 commits to whatever kernel you want to use and then install the new firmware files.
Comment 74 Julian 2016-01-22 17:01:49 UTC
Alright, I've applied the patch to cik.c and copied over the new firmware.

It's been running for two hours without problems now, with all reclocking enabled like normal. I'll test it for a day or two to be absolutely certain, but the freeze is long overdue by now.
Comment 75 Ioannis Panagiotopoulos 2016-01-25 22:10:22 UTC
(In reply to Alex Deucher from comment #73)
> (In reply to Julian from comment #72)
> > (In reply to Alex Deucher from comment #71)
> > > Can you also test with the code and firmware from comment 32?
> > 
> > To clarify, do you mean only with the code from
> > http://cgit.freedesktop.org/~agd5f/linux/commit/?h=new_smc or the whole
> > kernel under agd5f?
> 
> Either use the kernel from that branch or cherry-pick the top 4 commits to
> whatever kernel you want to use and then install the new firmware files.

Also tried this since I have the same problem, compiled the new_smc branch, installed the kernel and the firmware, however the problem persists if I boot it without radeon.dpm=0.
Comment 76 Julian 2016-01-25 23:12:02 UTC
(In reply to Ioannis Panagiotopoulos from comment #75)
> Also tried this since I have the same problem, compiled the new_smc branch,
> installed the kernel and the firmware, however the problem persists if I
> boot it without radeon.dpm=0.

What card do you have?

You can try this as well: download the hawaii_k_smc.bin firmware image, move it into /lib/firmware/radeon, and rename it to hawaii_smc.bin (back up or delete the existing file first). Boot into your normal unpatched kernel and see if it works. That's what I'm doing right now and it works for me.

Just keep in mind that this might break your Linux install if it doesn't work (so having a live USB/CD and the backed-up firmware file is handy).
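
The swap with a backup can be sketched like this. It is a minimal illustration, not a tested installer: the function name `install_k_firmware` and the `.bak` suffix are choices made for this example, the default directory is the one named above, and it has to run as root on a real system.

```python
import shutil
from pathlib import Path


def install_k_firmware(new_fw, fw_dir="/lib/firmware/radeon",
                       target_name="hawaii_smc.bin"):
    """Copy new_fw over target_name in fw_dir, keeping a .bak of the old file."""
    fw_dir = Path(fw_dir)
    target = fw_dir / target_name
    backup = fw_dir / (target_name + ".bak")
    # Keep the first original only, so repeated runs don't overwrite it.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(new_fw, target)
    return backup
```

To undo the change, copy the `.bak` file back over hawaii_smc.bin, from a live USB/CD if the system no longer boots to a usable state.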
Comment 77 Thomas DEBESSE 2016-01-26 00:22:05 UTC
Hi, I haven't found the time to recompile anything yet, but I confirm that http://people.freedesktop.org/~agd5f/radeon_ucode/k/hawaii_k_smc.bin works with my MSI Radeon R9 390x. I just renamed it to "hawaii_smc.bin" so the vanilla code loads it without modification. I'm running the "auto/balanced" DPM profile right now and was able to run the Unigine Valley Benchmark without issues (if there were issues, I would have had a lockup before anything was displayed).

Sorry for not having tested it before, it looks like I never received a notification for comment 32.
Comment 78 Ioannis Panagiotopoulos 2016-01-26 02:21:57 UTC
(In reply to Julian from comment #76)
> (In reply to Ioannis Panagiotopoulos from comment #75)
> > Also tried this since I have the same problem, compiled the new_smc branch,
> > installed the kernel and the firmware, however the problem persists if I
> > boot it without radeon.dpm=0.
> 
> What card do you have?
> 
> You can try this as well: download and move the hawaii_k_smc.bin firmware
> image into /lib/firmware/radeon and rename it to hawaii_smc.bin (backup or
> delete the existing file). Boot into your normal unpatched Kernel and see if
> it works. That's what I'm doing right now and it works for me.
> 
> Just keep in mind that this might break your Linux install if it doesn't
> work (so having a live USB/CD and the backuped firmware file is handy).

I replaced the firmware file and ran update-initramfs -u.
The system still hangs when it starts the X server on boot.
My GPU is a Club3D R9 390.
Comment 79 Ioannis Panagiotopoulos 2016-01-29 18:15:21 UTC
OK, after further testing, it turns out that it might be an X server driver problem. I downgraded to version 7.3.0 of the X server driver and the system booted with dpm enabled; however, the X server did not have hardware acceleration. Upgrading to the latest (7.6.x) makes the X server fail to start with dpm enabled.
Comment 80 Thomas DEBESSE 2016-01-29 21:40:25 UTC
(In reply to Ioannis Panagiotopoulos from comment #78)
> I replaced the firmware file and run update-initramfs -u
> The system still hangs when it starts X-server on boot.
> My gpu is Club3D R9 390

I don't know if it can help upstream, but Alex asked in comment #1 for a vbios dump, so perhaps he will need yours too! (He wrote how to do that.)

(In reply to Ioannis Panagiotopoulos from comment #79)
> Ok, after further testing, It ends up that it might be a xserver driver
> problem. I downgraded to ver 7.3.0 x-server driver and the system booted
> with dpm enabled, however the Xserver did not have Hardware acceleration.
> Upgrading to the latest (7.6.x), makes the x-server fail to start with
> enabled dpm.

Well, if your Xserver is not able to get DRI working, you probably do not have the opportunity to trigger the bug. It's something you must fix first. ^_^

To prevent the auto profile from hanging your GPU while you're fixing your X server with dpm enabled, do something like this; these look like safe options for some people like me:

echo 'low' > '/sys/class/drm/card0/device/power_dpm_force_performance_level'
echo 'battery' > '/sys/class/drm/card0/device/power_dpm_state'

After you get your xserver working with direct rendering enabled, try to do:

echo 'auto' > '/sys/class/drm/card0/device/power_dpm_force_performance_level'

and see if it still crashes.
Comment 81 Ioannis Panagiotopoulos 2016-01-29 23:12:32 UTC
Created attachment 121402 [details]
Club3D R9 390 Vbios
Comment 82 Ioannis Panagiotopoulos 2016-01-29 23:12:49 UTC
I used your method, but it worked in only 2 of my 5 attempts. To be precise, I booted in text mode, set the values as you wrote and then ran startx. On successful starts, the environment was still slow, despite the Xorg.log stating that acceleration was on. The successful starts did not last very long, however, as they crashed after 1-2 minutes.

I attached my GPU Vbios.
Comment 83 Thomas DEBESSE 2016-01-30 08:56:21 UTC
(In reply to Ioannis Panagiotopoulos from comment #82)
> the environment was still slow, despite the Xorg.log
> stating that acceleration was on.

Yeah, that was a "low" profile, so it was expected to be slow… You can do this to get a "high" profile:

echo 'high' > '/sys/class/drm/card0/device/power_dpm_force_performance_level'
echo 'performance' > '/sys/class/drm/card0/device/power_dpm_state'

But by the way:

> The successful starts did not last very
> long however, as they crashed after 1-2 minutes.

If you still get a hang using a profile other than "auto", it looks like you are facing an issue I never hit myself… It means the bug is wider than expected, which is not good news. I can't help more at this point; I hope someone else will have an answer.
Comment 84 Ioannis Panagiotopoulos 2016-01-30 20:20:31 UTC
(In reply to Thomas DEBESSE from comment #83)
I did further testing. I used a script in init.d to assign the values before the X server starts, and managed to get a working system with the parameters set to low/battery. I created 2 scripts to run after the system boots with dpm enabled, to toggle from low/battery to high/performance and vice versa. Changing from low to high succeeds and the X server runs well. I even ran a 3D game for about 10 minutes without problems. However, when using the script to change back to low/battery, the system crashes with a black screen. I tested this case 3 times and it crashed all 3 times when set from high to low.
Comment 85 Jonas 2016-01-31 13:33:55 UTC
I just tried the hawaii_k_smc.bin as in comment #77, and it worked great. I could play every game I have installed without problem. Some games still don't work as they should (worse performance than my old 7770), but most of them work great.

Now I'm using latest .bin files as pointed out in comment #32 and it still works great.

So, it's definitely getting better, at least for some of us :). Thanks for your hard work!
Comment 86 Zentdayn 2016-03-15 17:57:23 UTC
I also had this issue but comment #77 fixed it for me.

My Ubuntu install is now usable.
Comment 87 Zentdayn 2016-03-15 18:12:19 UTC
(In reply to Zentdayn from comment #86)
> I also had this issue but comment #77 fixed it for me.
> 
> My Ubuntu install is now usable.

Nevermind, probably luck. Just froze, going back to low profile.
Comment 88 Harald Judt 2016-03-18 11:44:41 UTC
What is more annoying than dpm not working is the problems with hibernating and resuming. Is anyone here able to hibernate and resume correctly (multiple times)? Standby seems to work fine, except dpm of course.
Comment 89 Orlando Nigro 2016-03-21 17:31:07 UTC
Hi! I have been reading this bug report for several days, since it seems to be very close to my situation!

I have a MSI r9 390. 
I have debian jessie with kernel 4.3, latest Mesa, and I also installed 
"firmware-amd-graphics" as jonas wrote in Comment 23. 

As it is now I can start Debian with both MATE and GNOME. GNOME freezes much faster, of course, while it takes a while for MATE, partly because, for some reason, all the graphics functionality is deactivated in MATE; Docky doesn't allow me to choose 3D icons, for instance.
GNOME looks much better with all its window effects and nice graphics, and it makes me think that, apart from the freezing, the open source drivers work fine.

I have run lots of tests, using different distributions with different drivers, kernels and DEs, but nothing helped; same problem every time! I get black screens while working. I haven't tried changing the DPM settings yet, and I think that could be the solution (I hope)! I intend to do it now and run some tests, but I have a couple of questions first.
Should I download the ucode, as recommended by many, like this

 http://people.freedesktop.org/~agd5f/radeon_ucode/

or this:

people.freedesktop.org/~agd5f/radeon_ucode/k/hawaii_k_smc.bin

or will the ones I have got in the debian package be enough?

Do I only need to run this command to make the tests: 
echo 'high' > '/sys/class/drm/card0/device/power_dpm_force_performance_level'

Or do I need to stop the X server first? (Would that be gdm? I chose lightdm when I installed GNOME alongside MATE.)

I apologize for all the questions, which should be obvious after all the great work you have done in this report, but I'm afraid of doing something wrong and I'm not much of an expert.
Comment 90 Jonas 2016-03-27 19:47:58 UTC
Comment #32 is really all you need right now. I tried it in Arch Linux and it also works like a charm. Just follow the instructions in comment #77 for the "k" firmware and you're good to go. You shouldn't need to change anything in "/sys/class/drm/card0/device/power_dpm_force_performance_level", since that was only a workaround to avoid the problem. Now the default behaviour (the automated dpm one) works with the latest firmware.

I hope you can get it to work too :).
Comment 91 Orlando Nigro 2016-03-31 18:01:56 UTC
It didn't work :(. I downloaded the firmware (the k one), changed the name and replaced the old one. I rebooted, and without changing the DPM value it freezes after a while.
Now I'm back to using the echo command when I log in, and it's the only way to make the GPU work. Do I maybe have to do more after replacing the file? Run some command?
Comment 92 Alex Deucher 2016-03-31 18:03:45 UTC
(In reply to Orlando Nigro from comment #91)
> it didn't work :(. I downloaded the firmware (the k one) changed the name
> and replaced the old one. I reboot and without change the DPM value it
> freezes after a while. 
> Now I'm back using the echo-command when I login and it's the only way to
> make the GPU work. Do I maybe have to do more after replacing the file? Run
> some command?

If you are using an initrd, you'll need to update the copy of the firmware in your initrd.
Comment 93 C 2016-03-31 18:34:59 UTC
(In reply to Orlando Nigro from comment #91)
> it didn't work :(. I downloaded the firmware (the k one) changed the name
> and replaced the old one. I reboot and without change the DPM value it
> freezes after a while. 
> Now I'm back using the echo-command when I login and it's the only way to
> make the GPU work. Do I maybe have to do more after replacing the file? Run
> some command?

I think only MSI 390X and XFX 390 users confirmed that hawaii_k_smc.bin works for them. I also own an R9 390 from MSI and it does not work for me either. Tried in both Arch and Fedora, with updated initrd.

@Alex, is the firmware in the radeon_ucode/k/ folder the same as the one that was just updated in the linux-firmware git tree?
http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/commit/?id=6e767c2b85c62fb7325fdc00f51b90f6747c13ab
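
Whether two downloaded firmware files are byte-for-byte identical can be checked locally by comparing hashes. A minimal sketch, using only Python's standard hashlib; the function names are made up for this example and the files to compare are whatever was downloaded:

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so the whole file never has to sit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_firmware(path_a, path_b):
    """True if both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

This only tells whether the two images are the same file; it says nothing about which one contains a fix.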
Comment 94 Alex Deucher 2016-03-31 18:37:13 UTC
(In reply to Christoffer from comment #93)
> @Alex, is the firmware in radeon_ucode/k/ folder the same that was just
> updated in the linux-firmware git tree?
> http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/
> commit/?id=6e767c2b85c62fb7325fdc00f51b90f6747c13ab

No, they are different.
Comment 95 Orlando Nigro 2016-04-03 17:34:26 UTC
(In reply to Christoffer from comment #93)
> (In reply to Orlando Nigro from comment #91)
> > it didn't work :(. I downloaded the firmware (the k one) changed the name
> > and replaced the old one. I reboot and without change the DPM value it
> > freezes after a while. 
> > Now I'm back using the echo-command when I login and it's the only way to
> > make the GPU work. Do I maybe have to do more after replacing the file? Run
> > some command?
> 
> I think only MSI 390X and XFX 390 users confirmed that hawaii_k_smc.bin
> works for them. I also own R9 390 from MSI and it does not work for me
> either. Tried in both Arch and Fedora, with updated initrd.
> 
> @Alex, is the firmware in radeon_ucode/k/ folder the same that was just
> updated in the linux-firmware git tree?
> http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/
> commit/?id=6e767c2b85c62fb7325fdc00f51b90f6747c13ab

I see. I also own an MSI r9 390, which would explain why it's not working. But I didn't update the initrd (how do I do that, and how do I check whether I need to? Sorry about my ignorance :S), though apparently it won't work anyway.
Comment 96 Jan Ziak 2016-04-14 06:43:31 UTC
(In reply to Alex Deucher from comment #94)
> (In reply to Christoffer from comment #93)
> > @Alex, is the firmware in radeon_ucode/k/ folder the same that was just
> > updated in the linux-firmware git tree?
> > http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/
> > commit/?id=6e767c2b85c62fb7325fdc00f51b90f6747c13ab
> 
> No, they are different.

I do not mean to sound overly impatient, but this bug was reported on 2015-09-04 15:00 UTC. Today it is 2016-04-14 06:15 UTC.

Also, http://people.freedesktop.org/~agd5f/radeon_ucode/k/hawaii_k_smc.bin has modification date 2015-12-10 05:36.

If hawaii_k_smc.bin is better than hawaii_smc.bin, then why hasn't it been uploaded to git.kernel.org firmware tree? 

Why is http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/commit/radeon/hawaii_smc.bin?id=6e767c2b85c62fb7325fdc00f51b90f6747c13ab different from hawaii_k_smc.bin? 

Does, or doesn't, hawaii_smc.bin uploaded to git.kernel.org on 2016-03-31 01:15:57 GMT contain the fix for this freedesktop.org bug?

Could the problem be that hawaii_k_smc.bin for R9 390 is incompatible with R9 290? In that case the solution would be to create grenada_smc.bin.


There are too many contradictions for me to understand the situation.
Comment 97 Jonas 2016-04-20 18:59:47 UTC
Actually, it sure is annoying to have to change files on every firmware update. Today's firmware update in Arch Linux broke DPM again. After going back to the firmware files from 16 October 2015, everything is fine again.
Comment 98 Lauri Gustafsson 2016-04-23 13:48:04 UTC
Yup, can confirm that http://people.freedesktop.org/~agd5f/radeon_ucode/k/hawaii_k_smc.bin partially fixes the issue. Screen(s) are a bit flickery when running GPU heavy applications but no crashes or hangups.
Comment 99 Brian 2016-05-02 04:05:47 UTC
Just wanted to throw my chip on the pile..

I'm running an R7 360 (BONAIRE, uses GCN 1.1, just like HAWAII) and I'm having the exact same issues. Forcing DPM to performance mode fixes the issue for me.

Using kernel 4.6rc5 on openSUSE Tumbleweed.
Comment 100 John Frei 2016-05-04 22:38:55 UTC
Unfortunately the new ucode (hawaii_k_smc.bin) file doesn't work for me.

I don't know if it matters, but when I tried to install a Hackintosh system recently I experienced EXACTLY the same symptom (black screen, unresponsive system) after a couple of minutes.

One could argue that it's a completely different setup, but the fact is that if I disable the GraphicsEnabler, the black screen issue vanishes.
So maybe one can assume that the dpm (via hardware) is corrupt.

Is it possible to try (experimentally) the new amdgpu-pro driver for r9 390 on Ubuntu 16.04?
Or can we expect to hit the same issue as with the all-open stack either way?
Comment 101 Thomas DEBESSE 2016-05-04 23:36:33 UTC
> Is it possible to try (experimentally) the new amdgpu-pro driver for r9 390 on Ubuntu 16.04?

I did:

1. some wrong firmware issue message
2. very poor performance (worse than radeon with the "low/battery" profile), probably a reclocking issue
3. EDID issue (you have to add modeline for your screen resolution by hand)
4. Proprietary advanced stuff based on amdgpu is broken (like OpenCL not working due to wrong vram size reporting)
5. etc.

The amdgpu driver has a very big problem: it's a driver without users. I mean, it works almost only on APUs, because that's the current market for GCN 1.2. So it works for people who have a laptop and do neither OpenCL nor Vulkan, while people owning powerful GPUs still have GCN 1.1 hardware, since that's still the market for powerful GPUs.
Comment 102 Parker Reed 2016-05-09 02:40:02 UTC
(In reply to Brian from comment #99)
> Just wanted to throw my chip on the pile..
> 
> I'm running an R7 360 (BONAIRE, uses GCN 1.1, just like HAWAII)and I'm
> having the exact same issues. And forcing DPM to performance mode fixes the
> issue for me.
> 
> Using kernel 4.6rc5 on openSUSE Tumbleweed.

What did you do to get it working? I keep getting these errors on linux-git builds https://bugzilla.kernel.org/show_bug.cgi?id=117151
Comment 103 Parker Reed 2016-05-09 02:53:23 UTC
(In reply to Parker Reed from comment #102)
> (In reply to Brian from comment #99)
> > Just wanted to throw my chip on the pile..
> > 
> > I'm running an R7 360 (BONAIRE, uses GCN 1.1, just like HAWAII)and I'm
> > having the exact same issues. And forcing DPM to performance mode fixes the
> > issue for me.
> > 
> > Using kernel 4.6rc5 on openSUSE Tumbleweed.
> 
> What did you do to get it working? I keep getting these errors on linux-git
> builds https://bugzilla.kernel.org/show_bug.cgi?id=117151

And I just realized this thread pertains to Radeonsi... oh well. Here's to hoping something comes of the kernel bug report.
Comment 104 Ernst Sjöstrand 2016-06-02 07:23:37 UTC
https://patchwork.freedesktop.org/series/8116/
Comment 105 Jan Ziak 2016-06-02 14:54:14 UTC
(In reply to Ernst Sjöstrand from comment #104)
> https://patchwork.freedesktop.org/series/8116/

GPU: R9 390
GPU manufacturer: Gigabyte
Kernel: 4.5.5
Kernel module: radeon.ko

I applied the kernel patches and let Metro Last Light benchmark run for about an hour with DPM enabled while I went shopping.

Before the patch: The benchmark would lock the GPU after a while

After the patch: Running OK for about an hour

I checked that 'new_smc' variable in radeon/cik.c gets set to 1.

----

Remaining issues on R9 390 after the patch:

Heavy screen flickering, not capturable on a screenshot. It is related to mclk transitions. Forcing mclk=1.5GHz, and letting sclk be controlled by DPM, removes the flickering.

I created a new bugzilla entry related to the flickering: http://bugs.freedesktop.org/show_bug.cgi?id=96326

----

GPU lockup in http://bugs.freedesktop.org/show_bug.cgi?id=92302 might have the same cause as this issue.
Comment 106 Thomas DEBESSE 2016-06-30 00:41:48 UTC
If the latest amdgpu-pro release (16.30.3) includes the patches from comment 104, then these patches fix the bug: it's the first amdgpu driver I can run without my system hanging and without any workaround. So the same patches probably work for radeon too, if they do exactly the same thing on the radeon side.
Comment 107 Chris Waters 2016-07-17 06:26:55 UTC
Since this bug makes the R9 390 either completely unusable or extremely badly performing (by disabling DPM), shouldn't this bug have a higher severity/importance rating?
Comment 108 John Bridgman 2016-07-17 19:00:46 UTC
Chris, are you using the -k firmware files and associated kernel patches, updated initrd if needed, etc... ?

My impression is that a few different issues are being discussed in this one ticket, and that is hampering progress. Not quite sure what the right split would be (if we were to close this and replace with N more focused tickets) but probably one of them should be for stability problems when running the -k firmware and checking whether locking dpm level to performance makes a difference.
Comment 109 Jan Ziak 2016-07-17 20:18:15 UTC
(In reply to John Bridgman from comment #108)
> Chris, are you using the -k firmware files and associated kernel patches,
> updated initrd if needed, etc... ?
> 
> My impression is that a few different issues are being discussed in this one
> ticket, and that is hampering progress. Not quite sure what the right split
> would be (if we were to close this and replace with N more focused tickets)
> but probably one of them should be for stability problems when running the
> -k firmware and checking whether locking dpm level to performance makes a
> difference.

My question would be: Why isn't the _k patch already in linux-git?

I tested the patch on my R9 390, so the Hawaii-specific part of the patch works on at least one machine in the world outside of freedesktop.org.

Or is there a reason to delay patch submission to linux-git?

_k firmware files are already available in Gentoo Linux for example.
Comment 110 Chris Waters 2016-07-19 21:39:53 UTC
(In reply to John Bridgman from comment #108)
> Chris, are you using the -k firmware files and associated kernel patches,
> updated initrd if needed, etc... ?

I have not for two reasons.

1. As Christoffer mentions, the _k firmware only seems good for the 390X

2. I'm unfamiliar with the process of doing such and the steps for doing it are spread across 100+ comments with no real sign of what steps are the right ones.


> My impression is that a few different issues are being discussed in this one
> ticket, and that is hampering progress. Not quite sure what the right split
> would be (if we were to close this and replace with N more focused tickets)
> but probably one of them should be for stability problems when running the
> -k firmware and checking whether locking dpm level to performance makes a
> difference.

Still, these issues make using a 390 or 390X on Linux a nightmare. This issue isn't "minor" and its importance rating should reflect that. This is a major (or even critical) bug report, IMO.

One that has been open for 10 months, I might add.
Comment 111 John Bridgman 2016-07-20 01:39:27 UTC
(In reply to Jan Ziak from comment #109)
> My question would be: Why isn't the _k patch already in linux-git?
> 
> I tested the patch on my R9 390, so the Hawaii-specific part of the patch
> works on at least one machine in the world outside of freedesktop.org.
> 
> Or is there a reason to delay patch submission to linux-git?
> 
> _k firmware files are already available in Gentoo Linux for example.

The patch is queued up for 4.8. It's the usual chicken-and-egg problem... if we can't get enough users testing a patch early in a kernel cycle then it ends up having to wait for the next merge window. 

(In reply to Chris Waters from comment #110)
> (In reply to John Bridgman from comment #108)
> > Chris, are you using the -k firmware files and associated kernel patches,
> > updated initrd if needed, etc... ?
> 
> I have not for two reasons.
> 
> 1. As Christoffer mentions, the _k firmware only seems good for the 390X

I think that is part of the "few different issues" point. The -k microcode may not be the only change required but AFAIK it is one part of the solution. 

In other words if the patch and new ucode don't fix the problem that's still useful information, and it doesn't mean the new ucode isn't worth running. 

> 
> 2. I'm unfamiliar with the process of doing such and the steps for doing it
> are spread across 100+ comments with no real sign of what steps are the
> right ones.

Yep, that's a fair point. I was just trying to make sure we were collecting good data. Thanks.
Comment 112 Chris Waters 2016-07-20 03:22:33 UTC
> Yep, that's a fair point. I was just trying to make sure we were collecting
> good data. Thanks.

I'm more than willing to help test, just need directions. Having everything split up over all these comments is messy and makes this nigh impossible to figure out for those not familiar with the process.
Comment 113 John Bridgman 2016-07-20 10:01:16 UTC
Created attachment 125155 [details]
attachment-9713-0.html

Yep, agree. Will see if I can get that documented. Thanks !!

Comment 114 pc.jago1337 2016-07-23 01:57:59 UTC
This bug really should be marked as MAJOR, seeing as how it literally runs the same as or worse than the R9 270X, which is something like $200 cheaper.
Comment 115 Jonas 2016-07-25 10:22:58 UTC
I'll try to sum up the workaround, in case some people are confused about the process.


1) Go here: https://people.freedesktop.org/~agd5f/radeon_ucode/hawaii
2) Download every file.
3) Go here https://people.freedesktop.org/~agd5f/radeon_ucode/k/
4) Download hawaii_k_smc.bin
5) Rename this file to hawaii_smc.bin (to replace the one you have from step 2)
6) Put all those files in your "firmware" folder (in my case /usr/lib/firmware/radeon), make a backup of those before, if you want.
7) Reboot & enjoy.

For some people it seems necessary to update initrd, but in my case (Arch Linux), only a reboot gives me good DPM support and it almost never crashes. On every firmware update on your system, you might have to repeat all the steps (until those files are used on your distro by default).
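
The steps above can be sketched as a small shell function. This is only a sketch under assumptions: install_k_firmware is a made-up name, the files from steps 2 and 4 are assumed to be already downloaded into one directory, and the firmware directory is passed in because it differs between distros (mine is /usr/lib/firmware/radeon).

```shell
#!/bin/sh
# Hypothetical helper, not part of any package:
#   install_k_firmware DOWNLOAD_DIR FIRMWARE_DIR
# DOWNLOAD_DIR is assumed to already hold the files from steps 2 and 4.
install_k_firmware() {
    src=$1
    dst=$2
    # Refuse to continue if the _k SMC image was not downloaded (step 4).
    if [ ! -f "$src/hawaii_k_smc.bin" ]; then
        echo "missing $src/hawaii_k_smc.bin" >&2
        return 1
    fi
    mkdir -p "$dst" || return 1
    # Step 6: back up the current firmware files, but only once, so a second
    # run does not overwrite the original backup with already replaced files.
    if [ ! -d "$dst.bak" ]; then
        mkdir "$dst.bak" || return 1
        cp -p "$dst"/* "$dst.bak"/ 2>/dev/null
    fi
    # Steps 2 and 6: copy every downloaded file except the _k image itself.
    for f in "$src"/*.bin; do
        if [ "$f" = "$src/hawaii_k_smc.bin" ]; then
            continue
        fi
        cp "$f" "$dst"/ || return 1
    done
    # Step 5: the _k image takes the place of hawaii_smc.bin.
    cp "$src/hawaii_k_smc.bin" "$dst/hawaii_smc.bin"
}
```

Run as root it would look like: install_k_firmware ~/hawaii_fw /usr/lib/firmware/radeon (the download directory name is just an example), followed by the reboot from step 7.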

I hope it is somewhat easier to get it working.
Comment 116 Calico Bass 2016-08-02 19:19:18 UTC
I found this bug report because I was having problems extracting my vbios and parsing the extracted rom. I notice that all the vbios.rom files attached here are 64K.

When parsing my rom I found it was truncated to 64K. A coincidence or a symptom of an unrelated or related problem?
Comment 117 Yuxuan Shui 2016-08-04 06:29:18 UTC
This _k firmware seems to fix the problem for me as well (have been running without lockup for a while), but only with the radeon driver. With amdgpu, I still got lock ups, though less often.
Comment 118 pc.jago1337 2016-08-11 09:03:11 UTC
Using radeon, the microcode 'fix' doesn't fix the problem for me. Perhaps there are multiple bugs, and not just the one?
Comment 119 PhilipW 2016-08-31 11:07:53 UTC
I also suspect this - microcode does not fix the issue on a Powercolor r9 390 pcs+.
Comment 120 DesiOtaku 2016-09-19 02:36:35 UTC
I hate to "me too" here, but I can confirm that the steps outlined in comment #115 do not resolve the problem. However, the power_dpm_force_performance_level and power_dpm_state trick outlined in comment #29 does prevent it from freezing.

I have an ASUS Radeon R9 390X running Kubuntu 16.04. I am willing to test out any potential fixes.
Comment 121 emilio.moretti 2016-10-14 21:20:13 UTC
Finally working:
I just upgraded to ubuntu 16.10  (kernel 4.8.0-22-generic) and I've been using the pc all day long without problems (this was impossible before).
I had to use the kernel parameter radeon.dpm=0 in grub in order to get a stable desktop, and I can confirm I don't need it any more.
Comment 122 Lauri Gustafsson 2016-10-28 19:25:09 UTC
Seconded: the 4.8 kernel on Arch has fixed the instability. Still getting rather bad artifacts from time to time because of the dynamic memory clock.
Comment 123 Marcel Schaal 2016-10-31 19:47:58 UTC
Still no luck with PowerColor R9 390 PCS+ on Fedora 25 (Kernel 4.8, mesa 12.0.3). Computer freezes after a few seconds. It seems to last a bit longer with dpm=0, but less than 5-10 minutes. Attaching vbios since it was never provided.
Comment 124 Marcel Schaal 2016-10-31 19:49:04 UTC
Created attachment 127648 [details]
PowerColor R9 390 PCS+ vbios
Comment 125 Christoph Seifert 2016-11-01 18:53:34 UTC
For me, switching power states also results in a system freeze. With radeon.dpm = 0 everything works properly but slowly. If I manually switch to another power profile (e.g. echo high > /sys/class/drm/card0/device/power_profile), I get a freeze as well. With radeon.dpm = 1 the freeze happens after a few seconds of video playback or some other load, i.e. when the card changes its power profile.

The workaround (k firmware) from comment #115 does not work for me. Same behaviour as without it.

So I tried disabling specific DPM features as suggested by Alex Deucher in comment #60. Disabling mclk (pi->mclk_dpm_key_disabled = 1) does the trick for my card, but the performance is similar to what I get with the radeon.dpm = 0 kernel parameter.

With radeon.dpm = 1 and only mclk disabled, the sclk (core clock) adjusts just fine: high under load, low when idle. The mclk (memory clock) is just fixed at 150 MHz.

So the freezes seem to be caused by switching the memory clock.


Any hints for digging deeper?

My card is a MSI Radeon R9 390 too.
Linux Kernel 4.8.6
Mesa 13.0.0rc2
Comment 126 Jan Ziak 2016-11-02 09:52:39 UTC
Resolution of this bug is critical for CONFIG_DRM_AMDGPU_CIK to move from experimental to non-experimental state.
Comment 127 Chris Waters 2016-11-03 05:52:33 UTC
I've noticed that on my Win7 install, if I raise the memory clock a good bit (1700MHz+), I get the exact same artifacts as I do in Linux. Maybe the driver is trying to switch the memory to too high a clock?

Second question, is there any way to just force the memory clock to stay at 1500MHz? If the bug is caused by memory switching (and not the memory being clocked way too high), then forcing the memory to stay at the max standard clock should give us the stability of 150MHz.
Comment 128 Jan Ziak 2016-11-03 09:45:56 UTC
(In reply to Chris Waters from comment #127)
> Second question, is there any way to just force the memory clock to stay at
> 1500MHz? If the bug is caused by memory switching (and not the memory being
> clocked way too high), then forcing the memory to stay at the max standard
> clock should give us the stability of 150MHz.

Some months ago, I thought it was just mclk switching. But today I am not so sure about it.

Executing the following steps on my machine with an R9 390:

- lock mclk to 1500MHz
- start playing Mad Max via Steam
- switch to the game map via gamepad or keyboard (https://www.google.sk/search?q=mad+max+game+map&tbm=isch)

results in noticeable screen image flickering.

If I disable the lowest 300 MHz shader clock, so that the minimum shader clock is 500 MHz and maximum is 1000 MHz, the flickering disappears.
Comment 129 Chris Waters 2016-11-03 22:46:57 UTC
> If I disable the lowest 300 MHz shader clock, so that the minimum shader
> clock is 500 MHz and maximum is 1000 MHz, the flickering disappears.

What about lower settings than 500 MHz?  Did you try 400 MHz or even 350?

What is the performance like?
Comment 130 Chris Waters 2016-11-30 13:42:46 UTC
This bug has been marked as 'critical' for over a month now, has there been any work on this in that time?

The bug seems related to the shader clock, possibly the driver isn't setting correct values?  Has anyone checked to see if the driver is properly setting the clock?
Comment 131 Harald Judt 2016-11-30 14:12:29 UTC
I do not have problems so far except there are sometimes red pixels flickering on screen. Still need to investigate why that happens.
Comment 132 Filip Brygidyn 2016-12-04 23:11:56 UTC
I am experiencing the same freezes on my updated Arch box (kernel 4.8.11-2, mesa 13.0.2-2) with my MSI R9 390.
Without radeon.dpm=0 I observe the following:
- I can get into and stay in a tty with no crashes
- I experience weird flickering/'image jumping'. (Not artifacts)
- With 2 monitors connected the flickering is unbearable and the image jumps so much it's hard to see anything. (and it gets worse over time)
- After I start a window manager (xfce in my case) the system will freeze the moment I cause any major load. This would support the ongoing suspicion that the problem may be caused by clock or voltage changes.

With radeon.dpm=0 I have yet to crash a single time in a similar way.

This may be unrelated, but I experience similar (although much less frequent) crashes under Windows.
The system runs perfectly fine under load or idle for a long time, but when the load is changing the system sometimes freezes or crashes with a BSOD.
A good example is when I am running a game and alt-tab a lot, changing the load drastically from 100% to idle and back. That is when crashes occur most frequently.


I am yet to try other firmware, mclk or other suggestions mentioned in here.
Will post when I get enough time and motivation to tinker...
Comment 133 Chris Waters 2016-12-08 22:59:53 UTC
I decided to record how my system behaves when using Linux and my 390 together.

This is what I see https://www.youtube.com/watch?v=9uVIzHFlTZk

This is Ubuntu MATE 16.10 with Firefox open¹, something that is not at all graphically intensive. On KDE Plasma 5 it's even worse, my entire system will sometimes crash. It's completely unusable and I'm stuck on Windows because of this.

¹ I know there is a tab that says "Chrome Experiments", but it's unloaded. I tried loading one to see what would happen if I tried something a bit intensive, but my entire system crashed. I ended up having to restart and recover my session in Firefox.

Does this look the same for other 390 users?
Comment 134 Jan Ziak 2016-12-08 23:25:32 UTC
(In reply to Chris Waters from comment #133)
> I decided to record how my system behaves when using Linux and my 390
> together.
> 
> This is what I see https://www.youtube.com/watch?v=9uVIzHFlTZk
> 
> This is Ubuntu MATE 16.10 with Firefox open¹, something that is not at all
> graphically intensive. On KDE Plasma 5 it's even worse, my entire system
> will sometimes crash. It's completely unusable and I'm stuck on Windows
> because of this.
> 
> ¹ I know there is a tab that says "Chrome Experiments", but it's unloaded. I
> tried loading one to see what would happen if I tried something a bit
> intensive, but my entire system crashed. Ended up having to restart
> recovered my session in Firefox.
> 
> This look the same for other 390 users?

The video is similar to what I am seeing, though in the video it is more severe.

----

If the following commands are executable on your machine:

$ ls -d /sys/class/drm/card?
/sys/class/drm/card0  /sys/class/drm/card1

$ cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 300Mhz *
1: 500Mhz 
2: 698Mhz 
3: 858Mhz 
4: 899Mhz 
5: 935Mhz 
6: 969Mhz 
7: 1000Mhz

then (question) does the flickering go away if you run:

$ echo 1234567 > /sys/class/drm/card1/device/pp_dpm_sclk
Comment 135 Jan Ziak 2016-12-08 23:27:51 UTC
(In reply to Jan Ziak from comment #134)
> (In reply to Chris Waters from comment #133)
> > I decided to record how my system behaves when using Linux and my 390
> > together.
> > 
> > This is what I see https://www.youtube.com/watch?v=9uVIzHFlTZk
> > 
> > This is Ubuntu MATE 16.10 with Firefox open¹, something that is not at all
> > graphically intensive. On KDE Plasma 5 it's even worse, my entire system
> > will sometimes crash. It's completely unusable and I'm stuck on Windows
> > because of this.
> > 
> > ¹ I know there is a tab that says "Chrome Experiments", but it's unloaded. I
> > tried loading one to see what would happen if I tried something a bit
> > intensive, but my entire system crashed. Ended up having to restart
> > recovered my session in Firefox.
> > 
> > This look the same for other 390 users?
> 
> The video is similar to what I am seeing, in the video it is more severe.
> 
> ----
> 
> If the following commands are executable on your machine:
> 
> $ ls -d /sys/class/drm/card?
> /sys/class/drm/card0  /sys/class/drm/card1
> 
> $ cat /sys/class/drm/card1/device/pp_dpm_sclk
> 0: 300Mhz *
> 1: 500Mhz 
> 2: 698Mhz 
> 3: 858Mhz 
> 4: 899Mhz 
> 5: 935Mhz 
> 6: 969Mhz 
> 7: 1000Mhz
> 
> then (question) does the flickering go away if you run:
> 
> $ echo 1234567 > /sys/class/drm/card1/device/pp_dpm_sclk

Correction:

$ echo manual > sys/class/drm/card1/device/power_dpm_force_performance_level
$ echo 1234567 > /sys/class/drm/card1/device/pp_dpm_sclk
Comment 136 Chris Waters 2016-12-09 01:15:33 UTC
Can this be done without rebooting?  I'd like to test this on a liveCD since, as I've said, I'm currently stuck on Windows. The idea of needing to install a distro just to test this is a bit unappealing to me.
Comment 137 Thomas DEBESSE 2016-12-09 04:36:54 UTC
(In reply to Chris Waters from comment #136)
> Can this be done without rebooting?  I'd like to test this on a liveCD
> since, as I've said, I'm currently stuck on Windows. The idea of needing to
> install a distro just to test this is a bit unappealing to me.

Not only can it be done without rebooting, it must be. The entries in /sys are live settings: even on an installed distro, you must expect to lose them on reboot. It's a pseudo file system, so writing there does not write anything to your hard disk. Reading and writing there is just reading and writing bits in memory, presented as a file system for convenience.

Beware, there is a little mistake in Jan Ziak's directives (missing a leading slash before “sys”), this is ok:

$ echo manual > /sys/class/drm/card1/device/power_dpm_force_performance_level
$ echo 1234567 > /sys/class/drm/card1/device/pp_dpm_sclk

Do that *and do not reboot* or you'll lose the changes so you will never test them. The way to test them is to apply these changes at runtime and doing stuff without rebooting. If you reboot you'll lose the change.
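
If you don't want to type the level list by hand, here is a small sketch that builds it from the pp_dpm_sclk listing and writes it back. The function names are made up for illustration, and it assumes the "N: <clock>" line format shown in Jan Ziak's output above. Since the levels are concatenated the same way as in "1234567", it only makes sense with fewer than ten levels.

```shell
#!/bin/sh
# Print the indices of every sclk level except level 0, e.g. "1234567".
# Argument: a file in the format of pp_dpm_sclk ("0: 300Mhz *", "1: 500Mhz", ...).
sclk_mask_without_lowest() {
    awk -F: '$1 ~ /^[0-9]+$/ && $1 != "0" { printf "%s", $1 } END { print "" }' "$1"
}

# Hypothetical wrapper: DEVICE_DIR is e.g. /sys/class/drm/card1/device.
# Writing to sysfs needs root, and the setting is lost on reboot.
disable_lowest_sclk() {
    dev=$1
    mask=$(sclk_mask_without_lowest "$dev/pp_dpm_sclk") || return 1
    # Nothing to write if no level above 0 was listed.
    [ -n "$mask" ] || return 1
    echo manual > "$dev/power_dpm_force_performance_level" || return 1
    echo "$mask" > "$dev/pp_dpm_sclk"
}
```

For example, as root: disable_lowest_sclk /sys/class/drm/card1/device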
Comment 138 Chris Waters 2016-12-10 21:13:09 UTC
Update:

Finally had time to boot a liveCD (Ubuntu MATE 16.10) and try this all out.

I have multiple card listings in /sys/class/drm, but only one that is a card# folder (ie card0), the rest are all variations on card0 plus  port type (eg card0-HDMI-A-1).

I have no pp_dpm_sclk in /sys/class/drm/card0/device/ so I'm unable to do the first command.

I have power_dpm_force_performance_level and I was going to at least try setting it to manual just to see, but echo says "write error: Invalid argument" when doing it as root and bash complains about lack of permissions when not root (sudo makes no difference).

Mesa version is 12.0.3, so I'm grabbing Manjaro since that is running on Mesa 13 and will test there.
Comment 139 Chris Waters 2016-12-11 05:55:42 UTC
Tried Manjaro and an install of Ubuntu MATE with a ppa for mesa-git drivers. No pp_dpm_sclk in either.

Am I missing something?
Comment 140 Jan Ziak 2016-12-11 11:12:49 UTC
(In reply to Chris Waters from comment #139)
> Tried Manjaro and an install of Ubuntu MATE with a ppa for mesa-git drivers.
> No pp_dpm_sclk in either.
> 
> Am I missing something?

Manjaro (both stable and development) is running on Linux kernel 4.4.

Ubuntu MATE 16.10 is running on Linux kernel 4.8.0, but it has CONFIG_DRM_AMDGPU_CIK disabled in kernel configuration and is loading radeon.ko instead of amdgpu.ko.

The following 3 commands can be used to check whether amdgpu+CIK are enabled:

$ uname -r
4.8.0 (or later version)

$ lsmod | grep amdgpu
amdgpu

$ zgrep CIK /proc/config.gz
CONFIG_DRM_AMDGPU_CIK=y

An alternative form of the last command:

$ grep CIK /boot/config-$(uname -r)
CONFIG_DRM_AMDGPU_CIK=y
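
The last two checks can be combined into one sketch. The function names are made up for illustration, and it assumes the kernel config is available either as /proc/config.gz or as /boot/config-$(uname -r), which is not the case on every distro.

```shell
#!/bin/sh
# Succeeds if the given kernel config file (plain text or gzip-compressed)
# contains CONFIG_DRM_AMDGPU_CIK=y.
cik_enabled_in_config() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep '^CONFIG_DRM_AMDGPU_CIK=y$' > /dev/null
}

# Hypothetical wrapper: report the config option and whether amdgpu is loaded.
check_amdgpu_cik() {
    if [ -r /proc/config.gz ]; then
        cfg=/proc/config.gz
    else
        cfg=/boot/config-$(uname -r)
    fi
    if cik_enabled_in_config "$cfg"; then
        echo "CIK support: enabled in $cfg"
    else
        echo "CIK support: not enabled in $cfg"
    fi
    if lsmod | grep '^amdgpu ' > /dev/null; then
        echo "amdgpu: loaded"
    else
        echo "amdgpu: not loaded"
    fi
}
```

Calling check_amdgpu_cik prints one line per check. The kernel version still has to be checked separately with uname -r.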
Comment 141 Chris Waters 2016-12-11 19:39:21 UTC
At what point did we go from talking about radeon, the driver this bug report is about, to amdgpu?

Isn't support and performance for GCN 1.1 cards rather bad on amdgpu compared to radeon?
Comment 142 Jan Ziak 2016-12-11 21:39:48 UTC
(In reply to Chris Waters from comment #141)
> At what point did we go from talking about radeon, the driver this bug
> report is about, to amdgpu?

This freedesktop.org bug is a Mesa bug. The bug title is "Radeonsi ...".

If the visual artifacting issue is solved in radeon.ko, the patch will probably be applied to amdgpu.ko shortly afterwards, and vice versa.

> Isn't support and performance for GCN 1.1 cards rather bad on amdgpu
> compared to radeon?

Support for R9 390/390X (Grenada) in amdgpu.ko is basically equivalent to radeon.ko (at least on my machine). Performance is very similar too.

amdgpu.ko and radeon.ko are loading the same firmware files for R9 390/390X (/lib/firmware/radeon/{hawaii,HAWAII}*.bin).
Comment 143 Harald Judt 2016-12-19 14:07:15 UTC
I have merged https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.10-wip into 4.9 and am now using amdgpu with the r9 390, and everything works great so far except that the cursor is no longer visible after hibernating and resuming, but switching to SWCursor resolves this, so this is a viable workaround. I will open another bug for this, but it is only a minor issue for me. With radeon, DPM still fails to resume properly for the 390 after hibernation.
Comment 144 Michel Dänzer 2016-12-20 02:13:09 UTC
(In reply to Harald Judt from comment #143)
> With radeon, DPM still fails to resume properly for the 390 after hibernation.

Make sure the bonaire_uvd.bin firmware file is up to date, see bug 98988.
Comment 145 Harald Judt 2017-01-11 09:55:57 UTC
You are certainly right and the files differ, so I assume it could be a problem with the firmware. But since amdgpu works fine for me now, I will simply use this and bother no longer.
Comment 146 Marek Olšák 2017-01-14 14:36:14 UTC
All Hawaii cards should have a TDP switch on the side of the card. Can you flip the switch while the computer is powered off and do the testing again? You can google some info about that switch.
Comment 147 John Boero 2017-01-17 12:10:09 UTC
Would love to see this get pushed through.  Frustrating having to manually build kernels each release.  It looks like history of the Fedora kernels has this being disabled and re-enabled by default a few times.  Much easier to blacklist a module than have to build a whole new kernel every time.  My $0.02
Comment 148 alvarex 2017-03-16 11:37:00 UTC
(In reply to Chris Waters from comment #133)
> I decided to record how my system behaves when using Linux and my 390
> together.
> 
> This is what I see https://www.youtube.com/watch?v=9uVIzHFlTZk
> 
> This is Ubuntu MATE 16.10 with Firefox open¹, something that is not at all
> graphically intensive. On KDE Plasma 5 it's even worse, my entire system
> will sometimes crash. It's completely unusable and I'm stuck on Windows
> because of this.
> 
> ¹ I know there is a tab that says "Chrome Experiments", but it's unloaded. I
> tried loading one to see what would happen if I tried something a bit
> intensive, but my entire system crashed. Ended up having to restart
> recovered my session in Firefox.
> 
> This look the same for other 390 users?

Thanks for the video. Maybe I should open another bug, but I have the same artifacts with an RX 460. It happens randomly when resuming from standby, with symptoms similar to those described in this bug. I will try patching the kernel and disabling some DPM features as suggested. I'm not sure if it started happening with 4.9 or 4.10. Setting power_dpm_force_performance_level to manual or high doesn't help.
Comment 149 alvarex 2017-03-16 11:49:56 UTC
I forgot to mention that I had a similar problem on Windows, also with a 260x on the same motherboard, so I think in my case it has something to do with the motherboard.
I tried modifying some options in grub but that didn't help. I added

iommu=pt acpi_osi=Linux acpi=force acpi_enforce_resources=lax acpi_osi='!Windows 2013' acpi_osi='!Windows 


The iommu table is bugged on this motherboard; maybe that has something to do with it? If I disable iommu completely in the BIOS, USB devices will not power up.
Comment 150 Alfredo Mendez 2017-04-18 07:58:55 UTC
I have an ASUS Strix R9 390 and attempted to get it to work, in a nutshell for those looking for redemption... it didn't work.

I tried to switch drivers from radeon to amdgpu, and both equally fail randomly. Setting the DPM to zero reduces the chances of black screening, but expect to encounter them either way. Simply put, the drivers are still bad for the 390's.

This bug is about to become three years old, and while 17.04 already came out with the same issue, this should be set to its highest level; it's been long overdue with no real progress. I would love to collaborate in fixing this issue, but I am simply below the average Linux user standard.

I hope more people can volunteer in finding the culprit, but for now it points out to be more power related.
Comment 151 Marek Olšák 2017-04-28 18:49:46 UTC
Alfredo, did you try to switch the TDP switch on the side of the card?
Comment 152 Alfredo Mendez 2017-05-19 22:25:19 UTC
(In reply to Marek Olšák from comment #151)
> Alfredo, did you try to switch the TDP switch on the side of the card?

Yeah, and the system eventually was hit with a blackscreen.
Comment 153 garththeisen 2017-05-20 18:21:01 UTC
I too have toggled my TDP switch (XFX R9 390) and continue to get black screen freezes at non-deterministic times with a non-stressful workload. Running 4.11.1 with the latest git firmware (radeon).

My work-around is to set the power_dpm_force_performance_level and the black screen behaviour is mitigated.

Let me know if I might assist in any way collecting logs etc.
Comment 154 Stefan 2017-05-30 20:50:32 UTC
R9 390 owner here. Just replaced my old Ubuntu 14.04 + fglrx with Fedora 25, then 26 + Mesa and hit this problem as well. It took me a lot of fiddling and searching to understand why my 3 years newer system has become unusable.

The only thing that worked for me was radeon.dpm=0 but, well, (expletive), my 400€ card has been reduced to a 40€ card thanks to this.

I can report that I tried #115 and it didn't solve it for me. What it did, though, was delay the crash from 10 seconds to about 30-60 seconds. I'm getting a full hard crash with black screen, dead I/O and all fans maxing out, regardless of what I'm doing. If I'm fast I can get Unigine Valley to run for a bit; if I'm not, it can crash while the desktop is sitting idle.

I'm not very Linux-handy, but if anybody needs my system as a guinea pig for hopefully solving this, please let me know, I'm willing to help.

And if someone has knowledge whether a fix is (still) under work, please update us here.
Comment 155 john 2017-06-10 17:28:33 UTC
Running a gigabyte AMD R9 380 card and getting similar problems.

the system boots, loads the driver and the fans briefly spin to 100% before the video goes completely dark and the monitor turns off.  the card is detected as a Tonga PRO 285/380.  

this issue is going to become much more critical in the next week, as SteamOS just switched their AMD proprietary drivers for amdgpu open source drivers.
Comment 156 John Bridgman 2017-07-08 18:29:50 UTC
#155 sounds like a completely different problem - john@dev1ce.com can you please start a new ticket ? Please indicate in that ticket whether this is a regression (worked before, stopped working when you did <xyz>) or a system that has never worked ?
Comment 157 john 2017-07-08 19:56:12 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=101377

this has been open for about a month with no activity.
Comment 158 janweber560 2017-08-13 15:57:09 UTC
asus strix 390 owner.

gentoo kernel 4.12.3 and amdgpu 1.3.0.


get crash after 10-30seconds dwm.

only thing that helps: "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level" but you have to do this every time the system starts and before launching the X server.


i found this here (at bottom):
https://wiki.archlinux.org/index.php/AMDGPU
Comment 159 Jon Doane 2017-09-09 11:29:04 UTC
Is there going to be any work on this? I've literally been doing:
"echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
every day to boot my machine for over a year hoping that this might get fixed but, I've seen practically no progress made on this bug. After attempting to use kernel 4.13, it becomes unstable and crashes faster than I can duck out into a TTY so, it's almost like the problem is getting worse as time goes on. This is literally a crippling bug and I find it astonishing that something, like locking VRAM clocks and making 500Mhz the minimum core clock for all 390s hasn't been implemented as a stopgap measure to at least get the thing stable. For crying out loud, this bug has been open for 2 years now and has been reproduced by several different people, or do we have to wait another 2 years before we're told that it won't be fixed?
Comment 160 Stefan 2017-09-09 12:01:19 UTC
Jon Doane, to alleviate your pains, set radeon.dpm=0 as a boot option.
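
To verify that the option actually reached the kernel after rebooting, something like the following sketch can be used. has_kernel_param is a made-up helper name. It only checks the boot command line and says nothing about how your bootloader is configured.

```shell
#!/bin/sh
# Hypothetical helper: has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Split the command line into one word per line, then look for an exact match.
    tr ' ' '\n' < "$file" | grep -Fx -- "$param" > /dev/null
}
```

Usage: has_kernel_param radeon.dpm=0 && echo "DPM is disabled for radeon"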
Comment 161 garththeisen 2017-09-09 16:33:01 UTC
Not sure how well a bug bounty might work for this issue, but would there be any interest? Maybe a fund for hardware purchase (R9 390) to send to someone with the expertise to diagnose/fix work against. 

I'm willing to chip in with the hope of seeing some progress and resolution.
Comment 162 Chris Waters 2017-09-09 18:47:52 UTC
One of the issues I've seen mentioned by one of the AMD guys on Reddit (/u/bridgmanAMD) is that they have been unable to reproduce the issue.  It seems tied to a small subset of 390s.
Comment 163 Thomas DEBESSE 2017-09-09 23:55:24 UTC
> It seems tied to a small subset of 390s.

Many precise models were listed. It's easy to get one of them.
For example this one is known to be buggy: https://www.amazon.com/dp/B00ZGF158A/

It's weird if AMD developers can't get their hands on AMD hardware.
Comment 164 Thomas DEBESSE 2017-09-10 00:00:58 UTC
(In reply to Jon Doane from comment #159)
> I've literally been doing:
> "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
> every day to boot my machine for over a year

See comment 55: if your system is running systemd, you can use this service:

https://github.com/illwieckz/dpm-query/

It will do it for you at startup. It's painless, and I've been using it for 19 months without any issue.
Comment 165 Thomas DEBESSE 2017-09-10 03:45:29 UTC
Note that all the attached dmesg logs report a HAWAII 0x1002:0x67B1 PCI ID, except mine, which reports a HAWAII 0x1002:0x67B0 PCI ID (I also checked the dmesg logs attached to duplicates of this bug). So the affected subset seems to be well known.

In fact I haven't checked in a while if my GPU is still affected or not, I remember some firmware could change things. I will recheck, if my GPU works on auto, it means only 0x67B1 would be affected.
Comment 166 garththeisen 2017-09-10 03:50:35 UTC
As does mine, my dmesg ...

[drm] initializing kernel modesetting (HAWAII 0x1002:0x67B1 0x1682:0x9390 0x80).
Comment 167 Thomas DEBESSE 2017-09-10 05:13:26 UTC
Created attachment 134124 [details] [review]
kernel patch: set "high" default DPM profile instead of "auto" for 0x67B1 variant

So, using my 0x67B0 variant, new firmware and "auto" profile I was able to run the vkQuake Vulkan game, the Unvanquished OpenGL game, the Unigine Superposition OpenGL benchmark, the Luxmark Hotel OpenCL benchmark, and did some OpenCL tasks with Darktable photo software. In fact in the past, just running Gnome Shell (even the one run by GDM) was enough to take down the computer.

I remember having tried the "auto" profile with new firmware in the past, but I had to return to the "low" or "high" profile because at that time the firmware update just made the hang less immediate: it still happened randomly, only later. The bug appears to be gone now on my end.

Everything looks fine on my 0x67B0 variant so it looks like the firmware and some kernel updates did the trick for me (I'm now running Linux 4.12).

So, the only remaining variant known to be faulty is the 0x67B1 one.

So forget what I've said about the MSI R9 390X, it's now a model known to work.

Since the only affected model looks to be the 0x67B1 variant, I wrote this small patch, which should set "auto" as the DPM profile on AMD GPUs except for the 0x67B1 HAWAII variant, which will use "high" as default. This patch does not prevent the user from forcing the "auto" DPM profile themselves, but "high" is now set by default on this known variant.

I'm not able to test this patch since my GPU is the 0x67B0 one and looks to not be affected by the bug anymore.

Before testing this patch, please check that you have the latest firmware for your card. If not, update your firmware first as explained in previous comments and check if it fixes the issue for you. If the latest firmware and a recent kernel are not enough for you, well, perhaps we will have to mainline this kernel change if AMD is not able to provide a fix.

This patch targets the 4.12 kernel tree but is so simple it should work on some other versions too.

A review by someone at AMD like Marek Olšák, Alex Deucher or John Bridgman, who have participated in this thread, would be very appreciated.
Comment 168 Thomas DEBESSE 2017-09-10 05:39:53 UTC
Created attachment 134125 [details] [review]
kernel patch: set "high" default DPM level instead of "auto" for 0x67B0/0x67B1 variants

So, I had to wait two hours, but I got the crash on my 0x67B0 variant too. It's still better than two years ago, when it happened a few seconds after X started, but it's still there. So I updated my patch to apply the trick to both variants.
Comment 169 Thomas DEBESSE 2017-09-10 08:11:07 UTC
Created attachment 134127 [details] [review]
kernel patch: set "high" default DPM level instead of "auto" for 0x67B0/0x67B1 variants

Sorry, there was a stupid typo in the patch, this is now fixed.

I'm running on that code and it works.
Comment 170 Jon Doane 2017-09-10 11:47:25 UTC
(In reply to Stefan from comment #160)
> Jon Doane, to alleviate your pains, set radeon.dpm=0 as a boot option.

Crippling GPU performance is not a solution and doesn't alleviate pains because it basically forces me to not do anything 3d-related. I would rather boot with X disabled so I can force the perf level to high. This is what I used to do and it's not an acceptable solution.

(In reply to Thomas DEBESSE from comment #164)
> (In reply to Jon Doane from comment #159)
> > I've literally been doing:
> > "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
> > every day to boot my machine for over a year
> 
> See comment 55 if your system is running systemd, you can use this service:
> 
> https://github.com/illwieckz/dpm-query/
> 
> It will do it for you at startup, it's painless and I'm using it since 19
> months without any issue.

This sounds a lot like what I've been doing manually which sounds nice. Thanks for the input. I honestly would like a solution that doesn't cause my machine to draw an additional 90 watts at idle though. As I said, I've been doing this for well over a year now and I'd prefer a solution, not a hack, considering how old this issue is.
Comment 171 Thomas DEBESSE 2017-09-10 16:16:00 UTC
> This sounds a lot like what I've been doing manually which sounds nice.
> Thanks for the input. I honestly would like a solution that doesn't cause my
> machine to draw an additional 90 watts at idle though.

Unlike the kernel patch above, that systemd service sets the GPU to "low battery" by default, which is the most energy-saving profile. The provided `dpm-query` tool allows you to change that at any time. That's what I'm doing: at init, my GPU is set to the "low battery" profile, and when I need to do some heavy task, I run this:

dpm-query set all high performance

And then once the heavy task is done, I do that to save energy again:

dpm-query set all low battery

With the default config for the service, you just have to add your own user to the "video" group to have the right to change the profile as user.

So, even if the patch above gets merged one day, this service and tool are still useful: they are an easy way to change the default profile, whatever the default is.

Notice that the kernel patch above only sets the level to "high" but keeps the state at "balanced", so it's still adaptive. What "high balanced" does is set the shader and memory frequencies to the max, which draws more power than the default, but you will notice the fans are still idle and stopped if you do nothing, because it's still saving a lot of energy. If you set "high performance", the fans will start almost instantly because there is no saving anymore. So "high balanced" saves less energy than "auto balanced", but still saves a lot because it does not have to cool the chip while idle (meaning the chip does nothing demanding enough to get hot).
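To make the level/state distinction concrete, here is a small sketch that reads both values back from sysfs. It assumes the level lives in `power_dpm_force_performance_level` and the state in `power_dpm_state` under the card's device directory; the function name `show_dpm_profile` is hypothetical and this is not how `dpm-query` itself is implemented.

```shell
# Print the current DPM level and state of one card as "level=... state=...".
# Usage: show_dpm_profile [DEVICE_DIR]
# DEVICE_DIR defaults to /sys/class/drm/card0/device (assumption: card0).
show_dpm_profile() {
    sdp_dir=${1:-/sys/class/drm/card0/device}
    # The level is one of low, auto, high; the state is one of
    # battery, balanced, performance (as used in this thread).
    sdp_level=$(cat "$sdp_dir/power_dpm_force_performance_level") || return 1
    sdp_state=$(cat "$sdp_dir/power_dpm_state") || return 1
    printf 'level=%s state=%s\n' "$sdp_level" "$sdp_state"
}
```

With the kernel patch applied and nothing else changed, one would expect this to report the "high" level together with the "balanced" state.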
Comment 172 Jon Doane 2017-09-13 22:41:15 UTC
(In reply to Thomas DEBESSE from comment #171)
> > This sounds a lot like what I've been doing manually which sounds nice.
> > Thanks for the input. I honestly would like a solution that doesn't cause my
> > machine to draw an additional 90 watts at idle though.
> 
> Unlike the kernel patch above, that systemd service is setting the GPU to
> "low battery" by default, which is the most energy saving profile. The
> provided `dpm query` tool allows you to change that at any time. That's what
> I'm doing: at init, my GPU is set to "low battery" profile, and when I need
> to do some heavy time, I do that:
> 
> dpm-query set all high performance
> 
> And then once the heavy task is done, I do that to save energy again:
> 
> dpm-query set all low battery
> 
> With the default config for the service, you just have to add your own user
> to the "video" group to have the right to change the profile as user.
> 
> So, even if the patch above get merged one day, this service and tool is
> still useful, it's an easy way to change the default profile, whatever the
> default is.
> 
> Notice that the kernel patch above only set the level to "high", but keep
> the state to "balanced", so it's still adaptative. What "high balanced" does
> is setting the shader and memory frequencies to the max, which is drawing
> more power than default, but you will notice the fan are still idling and
> stopped if you do nothing because it's still saving a lot of energy. If you
> set "high performance" the fan will almost instantaneously start because
> there is no saving anymore. So "high balanced" is less energy saving that
> "auto balanced", but is still saving a lot of energy because it does not
> have to cold the chip while doing nothing (meaning the chip does nothing
> strong enough to get hot).

Unless something has changed with how the dpm state is handled, I don't expect that to make the system completely stable. They're more stable than balanced, but not stable enough to prevent a crash. I tried by starting off with:
echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level

The only method that I've had luck with while retaining clock scaling is this:
echo 234567 > /sys/class/drm/card0/device/pp_dpm_sclk

This disables the 300MHz clock step, which seems to work. However, I've observed that doing this also forces memory clocks to full tilt instead of idle, so I'm uncertain whether the memory clock or the core clock is responsible.
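For anyone who wants to experiment with a different floor, the mask string can be generated rather than typed by hand. This is a sketch under two assumptions: the card exposes its sclk levels numbered from 0 (eight of them, 0 to 7, in the example), and the kernel accepts the digit-string form used above (other kernel versions may expect space-separated indices instead). The helper name `sclk_mask` is hypothetical.

```shell
# Build the digit-string mask that keeps sclk levels MIN..(COUNT-1).
# Usage: sclk_mask MIN COUNT
sclk_mask() {
    sm_i=$1
    sm_out=
    # Append each level index from MIN up to COUNT-1 to the mask.
    while [ "$sm_i" -lt "$2" ]; do
        sm_out=$sm_out$sm_i
        sm_i=$((sm_i + 1))
    done
    printf '%s\n' "$sm_out"
}
```

`sclk_mask 2 8` produces the same string as the echo above; the list of levels a given card actually has can be read from `pp_dpm_sclk` itself.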

Something I've observed: if my machine crashes and I use the reset button to restart it, then when X loads without me forcing clocks up, it always crashes, and part of the old image that was on the screen when it initially crashed gets displayed. It's rather garbled but still identifiable, which makes me think the problem is related to the memory clock or to how GPU memory is managed.

One way or another, I have ways around the problem, but these are hacks that would be considered intolerable solutions by a regular user.
Comment 173 Thomas DEBESSE 2017-09-14 05:13:29 UTC
(In reply to Jon Doane from comment #172)
> Unless something has changed with how the dpm state is handled, I don't
> expect that to make the system completely stable. They're more stable than
> balanced but, it's not stable enough to prevent a crash.

Hmm, until now the discussion has only been about the level (low, auto, high) and not the state (battery, balanced, performance). The "auto" level was suspected of being faulty, but no one has yet suspected the "balanced" state. It's currently assumed that any state works but one level does not (auto). Perhaps that's a wrong assumption, by the way. Have you specifically experienced an issue due to the "balanced" state and not due to the "auto" level that is commonly used with it?

> The only method that I've had luck with while retaining clock scaling is
> this:
> echo 234567 > /sys/class/drm/card0/device/pp_dpm_sclk
> 
> This disables the 300Mhz clock step which seems to work however

Oh yes I forgot this trick because on my end using "low" or "high" level is enough so I never had to mess with that. By the way when I'm on "low" level I'm running at 300MHz and it runs nicely for weeks (i.e. until I reboot for something unrelated like a kernel upgrade).

> One way or another, I have ways around the problem but, these are hacks that
> would be considered intolerable solutions by a regular user.

Sure, but if no one is going to fix that, it would be better to have these hacks applied by default rather than expecting users to apply them by hand. For more than two years now, running a LiveCD to install Linux on a system with an R9 390X has led to a crash during installation… A by-default hack would be better than nothing if no one is going to fix it.

I still don't understand why it's so hard for AMD employees to get their hands on AMD hardware to work on a fix, when we know the faulty models (0x67B0, 0x67B1) and many commercial names were listed.

Their AMDGPU-PRO driver looks to not be affected by the bug, so they have a fix somewhere. Why can't this fix make its way to the open driver?
Comment 174 Alex Deucher 2017-09-14 13:08:37 UTC
(In reply to Thomas DEBESSE from comment #173)
> 
> Their AMDGPU-PRO driver looks to not be affected by the bug, so they have a
> fix somewhere. Why this fix can't make it's way to the open driver?

The pro stack and the open stack share the same amdgpu kernel driver.
Comment 175 Thomas DEBESSE 2017-09-14 15:14:40 UTC
(In reply to Alex Deucher from comment #174)
> (In reply to Thomas DEBESSE from comment #173)
> > 
> > Their AMDGPU-PRO driver looks to not be affected by the bug, so they have a
> > fix somewhere. Why this fix can't make it's way to the open driver?
> 
> The pro stack and the open stack share the same amdgpu kernel driver.

Yes I know, so what?

Is this a packaging issue like the firmware delivered not being the good one or things like that?
Comment 176 Jon Doane 2017-09-14 15:18:01 UTC
(In reply to Thomas DEBESSE from comment #175)
> (In reply to Alex Deucher from comment #174)
> > (In reply to Thomas DEBESSE from comment #173)
> > > 
> > > Their AMDGPU-PRO driver looks to not be affected by the bug, so they have a
> > > fix somewhere. Why this fix can't make it's way to the open driver?
> > 
> > The pro stack and the open stack share the same amdgpu kernel driver.
> 
> Yes I know, so what?
> 
> Is this a packaging issue like the firmware delivered not being the good one
> or things like that?

I've experienced this issue with both AMDGPU-Pro and AMDGPU, with the same visual artifacts out of the box. The radeon driver doesn't seem to have the same kind of visual artifacts that AMDGPU(-Pro) has, but it's just as unstable without intervention, at least for me. Changing the firmware didn't make enough of an impact to make my machine even close to stable by itself.
Comment 177 garththeisen 2017-09-18 23:06:43 UTC
Created attachment 134324 [details]
dmesg capture

Logged this problem against 4.13.2. Started up an accelerated program (game) and within seconds the screen went black.

In the attached dmesg output the amdgpu emits a timeout *ERROR*, but I was unable to recover the session/switch DPM parameters to force recovery.

>[   85.983053] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, last signaled seq=2103, last emitted seq=2105
>[   85.983125] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=7839, last emitted seq=7841
>[   85.983129] [drm] No hardware hang detected. Did some blocks stall?
>[   85.983130] [drm] No hardware hang detected. Did some blocks stall?
Comment 178 Andrea Zanoni 2017-09-25 13:19:03 UTC
I have the same issues with an MSI R9 390 running on a MSI Z170A GAMING M5 motherboard.

The only way I can even use the card, avoiding constant black screens and OS lockups, is to use the DPM Query service (https://github.com/illwieckz/dpm-query/) and force the dpm level to "high" and the state profile to "performance".

If I try to revert to the "low" level, I get this: 

ERROR: card0/power_dpm_force_performance_level does not accept "low"
Comment 179 Thomas DEBESSE 2017-09-26 20:36:34 UTC
Andrea Zanoni, can you print the output of this command?

lspci -nn | grep VGA
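If it helps, the vendor:device ID can be pulled out of that output and compared against the two variants named in this thread. This is only a sketch: the function names are hypothetical, it assumes `lspci -nn` prints the ID as a bracketed `[vvvv:dddd]` pair on the VGA line, and it relies on `grep -o`, which GNU and BSD grep provide.

```shell
# Read `lspci -nn` output on stdin and print the vendor:device ID of each
# VGA controller, lowercased, one per line.
vga_pci_ids() {
    grep -i 'VGA' \
        | grep -oiE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' \
        | sed 's/[][]//g' \
        | tr 'A-F' 'a-f'
}

# Succeed if the given ID is one of the two Hawaii variants reported here.
is_reported_variant() {
    case $1 in
        1002:67b0|1002:67b1) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage would be along the lines of `lspci -nn | vga_pci_ids`, feeding each printed ID to `is_reported_variant`.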
Comment 180 xuka 2018-01-11 14:56:14 UTC
From watching this bug progress over the last two years and not really improving, I guess this problem is never going to get fixed, at least not too soon.

From my testing last year, it would seem that the majority of those who have this problem own an MSI-branded card. Both the 390 and the 390X have the problem, and there isn't a proper workaround besides forcing the memory (maybe core) clock high, so it would seem that the clock is not being set correctly, or the "sensor" isn't realizing that the clock should be increased.

I have tried every kernel from the past year in the hope that this problem would be solved, with no successful test. I have tried both the AMDGPU and radeon modules, blacklisting the other each time to make sure the problem was not specific to one of them, and still no success.

On Windows 10, there is never an issue with artifacts, or crashes.

If there is any way to get this on a fast track to being fixed, please let everyone know so that we can help. That is what community is for... I think.
Comment 181 Sandeep 2018-03-03 04:44:19 UTC
I was able to reliably reproduce the bug with openarena on my XFX R9 390 - here's a link to a trace file (1.8 GB) - https://drive.google.com/file/d/1YbOtWheR9RJdqnwya1rMNw1NphVptAUX/view?usp=sharing
Comment 182 Chris Heald 2018-03-13 05:49:27 UTC
Just adding a data point here, I've got an MSI R9 390 running on Ubuntu 18.04 on 4.15.0-10-generic - I haven't had any stability issues, but I have had maddening screen flickering/corruption.

I'm running dual 2560x1440 monitors off the card, which forces the memory clock to 1500MHz. However, when the GPU clock is at the 300MHz level, I get horrendous artifacting any time an accelerated portion of the screen is drawn. I can easily reproduce the issue by mousing over certain KDE widgets which are accelerated. Interestingly, running glxgears doesn't cause the issue.

* Setting power_dpm_force_performance_level -> high fixes it (but runs the clock up to its max, obviously)
* Setting power_dpm_force_performance_level -> manual and then `echo 1234567 > pp_dpm_sclk` fixes it, with the GPU clock fixed to 500MHz.
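The second workaround is a two-step sequence (switch to manual, then write the mask), which can be sketched as one function. The name `apply_sclk_mask` and the optional directory argument are hypothetical, and card0 is assumed; it needs root on a real system.

```shell
# Switch the card to manual DPM control, then restrict sclk to the given levels.
# Usage: apply_sclk_mask MASK [DEVICE_DIR]
# DEVICE_DIR defaults to /sys/class/drm/card0/device (assumption: card0).
apply_sclk_mask() {
    asm_dir=${2:-/sys/class/drm/card0/device}
    # The level must be "manual" first, otherwise the mask is not honoured.
    printf 'manual\n' > "$asm_dir/power_dpm_force_performance_level" || return 1
    printf '%s\n' "$1" > "$asm_dir/pp_dpm_sclk"
}
```

`apply_sclk_mask 1234567` corresponds to the second bullet above.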

I've been up and down this issue with both radeon and amdgpu drivers; neither seems to make a difference.

# lspci -nnk | grep -iA2 vga
06:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] [1002:67b1] (rev 80)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Hawaii PRO [Radeon R9 290/390] [1462:2015]
        Kernel driver in use: amdgpu

# cat vbios_version
MS-V30823-F6

(Interestingly, this BIOS is newer than anything on techpowerup. I was hoping a BIOS flash would fix it, but I can't find anything newer)

I'll attach dmesg and Xorg logs as well. If I can provide extra data points, I'd like to help.
Comment 183 Chris Heald 2018-03-13 05:50:13 UTC
Created attachment 138058 [details]
Xorg.0.log
Comment 184 Chris Heald 2018-03-13 05:50:44 UTC
Created attachment 138059 [details]
dmesg output
Comment 185 Sandeep 2018-03-13 14:47:30 UTC
I have that display/corruption issue too, on the XFX R9 390.

Usually happens after suspend/resume. Will test if it gets fixed with above steps.
Comment 186 Chris Heald 2018-03-18 20:31:41 UTC
I've been doing a lot of experimentation, and I've found a few more things that I feel are probably related:

* I can force a system hard-lock by doing anything which disables a monitor. Notably, going full-screen under KDE/Xorg does this, but I can trigger it just as easily by disabling a monitor with xrandr. Fullscreen under gnome doesn't seem to trigger the issue, which I suspect is due to gnome's using mutter for screen management.

* Occasionally, the system boots up and gets stuck with a 150MHz memory clock, rather than clocking up to the 1500MHz state. This causes the display corruption even if the sclk is set to 500MHz+. Setting the mclk mask manually fixes display corruption.

* I've been experimenting with different kernels ranging from 4.4 to 4.16rc5. Earlier kernels feel more susceptible to hard-locking, though the later kernels aren't immune to it.

* I tried a fresh Ubuntu 16.04 LTS install, and while it did NOT exhibit the artifacting behavior, the system hard-locked within a few minutes of light desktop usage.
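A quick way to check for the stuck-memory-clock case is to read which level is currently active. This sketch assumes the `pp_dpm_mclk` and `pp_dpm_sclk` files list one level per line as `index: clock` and mark the active one with a trailing `*`; the helper name `active_clock` is hypothetical.

```shell
# Read pp_dpm_sclk or pp_dpm_mclk content on stdin and print the clock of the
# active level, i.e. the line whose last field is "*".
active_clock() {
    awk '$NF == "*" { print $2 }'
}
```

For example, `active_clock < /sys/class/drm/card0/device/pp_dpm_mclk` would print the low clock when the card is stuck in that state.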

I've had a few classes of exceptions show up in kern.log:

On 4.4, my kde/wayland session hard-froze when moving a window, and produced a log like this:

kernel: [  116.904013] radeon 0000:06:00.0: GPU fault detected: 146 0x0d8e040c
kernel: [  116.904017] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001776C
kernel: [  116.904019] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E10400C
kernel: [  116.904021] VM fault (0x0c, vmid 7) at page 96108, read from 'TC3' (0x54433300) (260)
kernel: [  127.306156] radeon 0000:06:00.0: ring 0 stalled for more than 10404msec
kernel: [  127.306164] radeon 0000:06:00.0: GPU lockup (current fence id 0x0000000000002419 last fence id 0x0000000000002431 on ring 0)
kernel: [  127.357942] radeon 0000:06:00.0: Saved 2200 dwords of commands on ring 0.
kernel: [  127.357961] radeon 0000:06:00.0: GPU softreset: 0x00000009
kernel: [  127.357963] radeon 0000:06:00.0:   GRBM_STATUS=0xF5D01028
kernel: [  127.357965] radeon 0000:06:00.0:   GRBM_STATUS2=0x50000008
kernel: [  127.357968] radeon 0000:06:00.0:   GRBM_STATUS_SE0=0xEC400002
kernel: [  127.357970] radeon 0000:06:00.0:   GRBM_STATUS_SE1=0xEC400002
kernel: [  127.357972] radeon 0000:06:00.0:   GRBM_STATUS_SE2=0x08000002
kernel: [  127.357974] radeon 0000:06:00.0:   GRBM_STATUS_SE3=0xEC000002
kernel: [  127.357976] radeon 0000:06:00.0:   SRBM_STATUS=0x20000040
kernel: [  127.357978] radeon 0000:06:00.0:   SRBM_STATUS2=0x00000000
kernel: [  127.357980] radeon 0000:06:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
kernel: [  127.357982] radeon 0000:06:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
kernel: [  127.357984] radeon 0000:06:00.0:   CP_STAT = 0x84228600
kernel: [  127.357986] radeon 0000:06:00.0:   CP_STALLED_STAT1 = 0x00000c00
kernel: [  127.357988] radeon 0000:06:00.0:   CP_STALLED_STAT2 = 0x40000000
kernel: [  127.357991] radeon 0000:06:00.0:   CP_STALLED_STAT3 = 0x00000400
kernel: [  127.357993] radeon 0000:06:00.0:   CP_CPF_BUSY_STAT = 0x00000006
kernel: [  127.357995] radeon 0000:06:00.0:   CP_CPF_STALLED_STAT1 = 0x00000003
kernel: [  127.357997] radeon 0000:06:00.0:   CP_CPF_STATUS = 0x80000063
kernel: [  127.357999] radeon 0000:06:00.0:   CP_CPC_BUSY_STAT = 0x00000000
kernel: [  127.358001] radeon 0000:06:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
kernel: [  127.358003] radeon 0000:06:00.0:   CP_CPC_STATUS = 0x00000000
kernel: [  127.358005] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
kernel: [  127.358007] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
kernel: [  127.404670] radeon 0000:06:00.0: GRBM_SOFT_RESET=0x00010001
kernel: [  127.404725] radeon 0000:06:00.0: SRBM_SOFT_RESET=0x00000100
kernel: [  127.405874] radeon 0000:06:00.0:   GRBM_STATUS=0x00003028
kernel: [  127.405876] radeon 0000:06:00.0:   GRBM_STATUS2=0x00000008
kernel: [  127.405878] radeon 0000:06:00.0:   GRBM_STATUS_SE0=0x00000006
kernel: [  127.405880] radeon 0000:06:00.0:   GRBM_STATUS_SE1=0x00000006
kernel: [  127.405882] radeon 0000:06:00.0:   GRBM_STATUS_SE2=0x00000006
kernel: [  127.405884] radeon 0000:06:00.0:   GRBM_STATUS_SE3=0x00000006
kernel: [  127.405885] radeon 0000:06:00.0:   SRBM_STATUS=0x20000A40
kernel: [  127.405887] radeon 0000:06:00.0:   SRBM_STATUS2=0x00000000
kernel: [  127.405889] radeon 0000:06:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
kernel: [  127.405891] radeon 0000:06:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
kernel: [  127.405893] radeon 0000:06:00.0:   CP_STAT = 0x00000000
kernel: [  127.405895] radeon 0000:06:00.0:   CP_STALLED_STAT1 = 0x00000000
kernel: [  127.405896] radeon 0000:06:00.0:   CP_STALLED_STAT2 = 0x00000000
kernel: [  127.405898] radeon 0000:06:00.0:   CP_STALLED_STAT3 = 0x00000000
kernel: [  127.405900] radeon 0000:06:00.0:   CP_CPF_BUSY_STAT = 0x00000000
kernel: [  127.405902] radeon 0000:06:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
kernel: [  127.405903] radeon 0000:06:00.0:   CP_CPF_STATUS = 0x00000000
kernel: [  127.405905] radeon 0000:06:00.0:   CP_CPC_BUSY_STAT = 0x00000000
kernel: [  127.405907] radeon 0000:06:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
kernel: [  127.405909] radeon 0000:06:00.0:   CP_CPC_STATUS = 0x00000000
kernel: [  127.405929] radeon 0000:06:00.0: GPU reset succeeded, trying to resume
kernel: [  127.658172] [drm:ci_dpm_enable [radeon]] *ERROR* ci_start_dpm failed
kernel: [  127.658189] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed
kernel: [  127.658194] [drm] probing gen 2 caps for device 1022:1453 = 733903/e
kernel: [  127.658197] [drm] PCIE gen 3 link speeds already enabled
kernel: [  127.664213] [drm] PCIE GART of 2048M enabled (table at 0x0000000000326000).
kernel: [  127.664341] radeon 0000:06:00.0: WB enabled
kernel: [  127.664344] radeon 0000:06:00.0: fence driver on ring 0 use gpu addr 0x0000000200000c00 and cpu addr 0xffff8807f3799c00
kernel: [  127.664346] radeon 0000:06:00.0: fence driver on ring 1 use gpu addr 0x0000000200000c04 and cpu addr 0xffff8807f3799c04
kernel: [  127.664347] radeon 0000:06:00.0: fence driver on ring 2 use gpu addr 0x0000000200000c08 and cpu addr 0xffff8807f3799c08
kernel: [  127.664349] radeon 0000:06:00.0: fence driver on ring 3 use gpu addr 0x0000000200000c0c and cpu addr 0xffff8807f3799c0c
kernel: [  127.664350] radeon 0000:06:00.0: fence driver on ring 4 use gpu addr 0x0000000200000c10 and cpu addr 0xffff8807f3799c10
kernel: [  127.664772] radeon 0000:06:00.0: fence driver on ring 5 use gpu addr 0x0000000000078b30 and cpu addr 0xffffc90003c38b30
kernel: [  127.664933] radeon 0000:06:00.0: fence driver on ring 6 use gpu addr 0x0000000200000c18 and cpu addr 0xffff8807f3799c18
kernel: [  127.664934] radeon 0000:06:00.0: fence driver on ring 7 use gpu addr 0x0000000200000c1c and cpu addr 0xffff8807f3799c1c
kernel: [  127.666482] [drm] ring test on 0 succeeded in 2 usecs
kernel: [  127.666568] [drm] ring test on 1 succeeded in 2 usecs
kernel: [  127.666586] [drm] ring test on 2 succeeded in 2 usecs
kernel: [  127.666735] [drm] ring test on 3 succeeded in 3 usecs
kernel: [  127.666745] [drm] ring test on 4 succeeded in 3 usecs
kernel: [  127.692636] [drm] ring test on 5 succeeded in 1 usecs
kernel: [  127.712543] [drm] UVD initialized successfully.
kernel: [  127.813896] [drm] ring test on 6 succeeded in 708 usecs
kernel: [  127.813920] [drm] ring test on 7 succeeded in 3 usecs
kernel: [  127.813921] [drm] VCE initialized successfully.
kernel: [  127.814029] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed

On 4.15.10-041510-generic, I left my computer running overnight and came back to it frozen with this in kern.log:

Mar 18 04:25:10 Gaia kernel: [  559.092721] BUG: stack guard page was hit at 000000001ecd1fa8 (stack is 0000000020941864..00000000cf703fbf)
Mar 18 04:25:10 Gaia kernel: [  559.092729] kernel stack overflow (page fault): 0000 [#1] SMP NOPTI
Mar 18 04:25:10 Gaia kernel: [  559.092733] Modules linked in: nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay xfrm_user xfrm4_tunnel tunnel4 l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel ipcomp xfrm_ipcomp udp_tunnel esp4 pppox ah4 af_key xfrm_algo xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables devlink iptable_filter binfmt_misc snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel edac_mce_amd snd_hda_codec snd_usb_audio snd_hda_core snd_usbmidi_lib kvm_amd snd_hwdep kvm uvcvideo snd_seq_midi irqbypass snd_seq_midi_event snd_rawmidi crct10dif_pclmul videobuf2_vmalloc crc32_pclmul
Mar 18 04:25:10 Gaia kernel: [  559.092784]  videobuf2_memops videobuf2_v4l2 snd_seq ghash_clmulni_intel videobuf2_core snd_pcm pcbc videodev snd_seq_device media snd_timer joydev aesni_intel aes_x86_64 snd crypto_simd input_leds glue_helper serio_raw soundcore cryptd ccp k10temp shpchp mac_hid wmi_bmof sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu chash radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_piix4 r8169 ahci mii libahci wmi gpio_amdpt gpio_generic
Mar 18 04:25:10 Gaia kernel: [  559.092832] CPU: 5 PID: 7352 Comm: tail Tainted: G        W        4.15.10-041510-generic #201803152130
Mar 18 04:25:10 Gaia kernel: [  559.092834] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F10 12/01/2017
Mar 18 04:25:10 Gaia kernel: [  559.092881] RIP: 0010:amdgpu_get_pp_num_states+0x88/0x120 [amdgpu]
Mar 18 04:25:10 Gaia kernel: [  559.092884] RSP: 0018:ffffb3cb8a837ca8 EFLAGS: 00010282
Mar 18 04:25:10 Gaia kernel: [  559.092888] RAX: 00000000000000d4 RBX: ffffb3cb8a837cac RCX: 0000000000000001
Mar 18 04:25:10 Gaia kernel: [  559.092890] RDX: 0000000000000000 RSI: ffffffffc087a88c RDI: 0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092893] RBP: ffffb3cb8a837d20 R08: ffffffffc087a865 R09: ffff88c9ecebd98b
Mar 18 04:25:10 Gaia kernel: [  559.092895] R10: 0000000000000000 R11: ffff88c9ecebd98a R12: ffff88c9ecebd000
Mar 18 04:25:10 Gaia kernel: [  559.092898] R13: ffffffffc087a858 R14: 00000000000000d4 R15: 0000000000000993
Mar 18 04:25:10 Gaia kernel: [  559.092901] FS:  00007fccb1787540(0000) GS:ffff88c9fe740000(0000) knlGS:0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092904] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 18 04:25:10 Gaia kernel: [  559.092906] CR2: ffffb3cb8a838000 CR3: 00000004a30d0000 CR4: 00000000003406e0
Mar 18 04:25:10 Gaia kernel: [  559.092909] Call Trace:
Mar 18 04:25:10 Gaia kernel: [  559.092918]  ? tty_insert_flip_string_fixed_flag+0x86/0xe0
Mar 18 04:25:10 Gaia kernel: [  559.092925]  dev_attr_show+0x23/0x60
Mar 18 04:25:10 Gaia kernel: [  559.092931]  sysfs_kf_seq_show+0xa3/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092935]  kernfs_seq_show+0x27/0x30
Mar 18 04:25:10 Gaia kernel: [  559.092939]  seq_read+0xe5/0x430
Mar 18 04:25:10 Gaia kernel: [  559.092943]  kernfs_fop_read+0x137/0x180
Mar 18 04:25:10 Gaia kernel: [  559.092948]  __vfs_read+0x3a/0x170
Mar 18 04:25:10 Gaia kernel: [  559.092954]  ? security_file_permission+0xa1/0xc0
Mar 18 04:25:10 Gaia kernel: [  559.092958]  vfs_read+0x8e/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092962]  SyS_read+0x55/0xc0
Mar 18 04:25:10 Gaia kernel: [  559.092967]  do_syscall_64+0x73/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092973]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Mar 18 04:25:10 Gaia kernel: [  559.092976] RIP: 0033:0x7fccb12b5081
Mar 18 04:25:10 Gaia kernel: [  559.092978] RSP: 002b:00007ffc17d84d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092982] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007fccb12b5081
Mar 18 04:25:10 Gaia kernel: [  559.092984] RDX: 0000000000002000 RSI: 00007ffc17d84db0 RDI: 0000000000000003
Mar 18 04:25:10 Gaia kernel: [  559.092986] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fccb1313b40
Mar 18 04:25:10 Gaia kernel: [  559.092988] R10: 00000000fffffff3 R11: 0000000000000246 R12: 00007ffc17d84db0
Mar 18 04:25:10 Gaia kernel: [  559.092991] R13: 0000000000000003 R14: ffffffffffffffff R15: 000055e8f3b747e0
Mar 18 04:25:10 Gaia kernel: [  559.092994] Code: c7 c2 7a a8 87 c0 be 00 10 00 00 4c 89 e7 e8 d0 08 90 d1 41 89 c7 8b 45 8c 85 c0 74 72 48 8d 5d 8c 45 31 f6 49 c7 c5 58 a8 87 c0 <42> 8b 44 b3 04 44 89 f1 4d 89 e8 83 f8 0a 74 2d 83 f8 02 49 c7
Mar 18 04:25:10 Gaia kernel: [  559.093080] RIP: amdgpu_get_pp_num_states+0x88/0x120 [amdgpu] RSP: ffffb3cb8a837ca8
Mar 18 04:25:10 Gaia kernel: [  559.093084] ---[ end trace dbba232a9ca4c5c7 ]---

Possibly related, if I `cat pp_num_states` from a terminal, I get a segmentation fault:

root@Gaia:~# cat /sys/class/drm/card0/device/pp_num_states
Segmentation fault

I'm going to continue to dig. Let me know what logs/tests/whatnot I can provide that would be useful.
Comment 187 Lauri Gustafsson 2018-03-19 16:23:50 UTC
About "I can force a system hard-lock by doing anything which disables a monitor", on my system it used to crash more frequently the less desktop area I had. Small resolution monitor crashes often, multi monitor setup less often. But I don't have the hardware any more so I can't test if it still works that way.
Comment 188 chris 2018-05-03 23:11:27 UTC
I suffered pretty much all of the issues listed in this thread for many weeks after upgrading to 3 1440p monitors.

I have a Club 3D R9 390 Royal Queen.

Comment #182 best describes the problem I faced and what I had to do to work around it.

I am happy to announce that kernel 4.16.7 has solved this issue for me.

My system has no more issues booting. KDE is stable. No flickering at all. Even when forcing power_dpm_force_performance_level to 'low'.

I'm running Gentoo with:
mesa-18.1.0
libdrm-2.4.91
xf86-video-amdgpu-18.0.1
xorg-drivers-1.19

kernel params include: radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.modeset=1 amdgpu.dc=1 amdgpu.dpm=1

If anyone needs more info please ask.
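
For anyone who wants to repeat the forced-'low' test mentioned above, here is a minimal sketch of writing a level to power_dpm_force_performance_level. The helper name set_perf_level is made up for illustration, the card0 path is an assumption that may differ per system, and the helper only accepts three levels even though the kernel supports more.

```shell
#!/bin/sh
# Hypothetical helper: write a performance level to the given sysfs file.
# Only auto, low and high are accepted here; the kernel knows more levels.
set_perf_level() {
    file="$1"; level="$2"
    case "$level" in
        auto|low|high) ;;
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
    printf '%s\n' "$level" > "$file"
}

# Example (needs root; card0 is an assumption and may differ per system):
# set_perf_level /sys/class/drm/card0/device/power_dpm_force_performance_level low
```

Writing 'auto' with the same helper should hand control back to the driver.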
Comment 189 tkdestroyer2+bugs-freedesktop 2018-05-04 03:08:59 UTC
I tried what chris suggested, and it worked up until the point at which some X session closes (right after login or after exiting basic xterm test session), where the system locks up. They seem to have patched 4.16.*, which was totally broken before for me on 4.16.2. I guess I'll keep doing my dpm=0 thing for now.
Comment 190 notog 2018-05-06 18:29:40 UTC
With 4.16.6, it seems to be fixed!
With a fresh installation of Antergos 18.4, tested with different kernel arguments, `radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1` is all that's needed to work fine.
amdgpu.modeset=1 does not actually do anything according to logs.
I ran different tests from glxgears to unigine valley and heaven, they seemed to work fine without any issues.

hardware: MSI R9 390 Gaming 8G
kernel: 4.16.6
mesa: 18.0.2
xorg-server: 1.19.6
xf86-video-amdgpu: 18.0.1
kernel arguments: radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1
Comment 191 notog 2018-05-07 15:02:31 UTC
I did some more testing, this time with Arch Linux. I started with kernel 4.16.6-1 and went back two months, one month at a time, using the Arch Linux Archive. I never ran into an issue.
It would seem that this issue might have been fixed a few months ago.

The solution is to switch from Radeon to AMDGPU. It seems that AMDGPU DPM is needed to fix the issue entirely, at least in my case.

device: MSI R9 390 Gaming 8G
distro: Arch Linux
kernel: 4.16.6-1, 4.15.13-1 and 4.15.5-1
mesa: 18.0.2-1, 17.3.7-1 and 17.3.5-1
xorg-server: 1.19.6+13+gd0d1a694f-1
xf86-video-amdgpu: 18.0.1-1 and 1.4.0-1

In all cases I appended the following kernel arguments: radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1

radeon.cik_support=0: disables Radeon CIK (Sea Islands) support
amdgpu.cik_support=1: enables AMDGPU CIK (Sea Islands) support
amdgpu.dpm=1: enables DPM support
amdgpu.dc=1: enables the Display Core driver
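
A misspelled parameter name may be ignored without any obvious error, so it can be worth checking what the running kernel actually booted with. Below is a minimal sketch that prints which of the four arguments listed above are absent from a command line string. The helper name missing_params and the WANTED variable are made up for illustration; /proc/cmdline is the standard place to read the booted arguments.

```shell
#!/bin/sh
# Hypothetical helper: print each wanted parameter that is absent from a
# kernel command line string. The list is copied from the comment above.
WANTED="radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1"

missing_params() {
    cmdline=" $1 "
    for p in $WANTED; do
        case "$cmdline" in
            *" $p "*) ;;              # present as a whole word
            *) printf '%s\n' "$p" ;;  # absent (or misspelled)
        esac
    done
}

# Check the running kernel; /proc/cmdline holds the arguments it booted with.
if [ -r /proc/cmdline ]; then
    missing_params "$(cat /proc/cmdline)"
fi
```

No output means all four arguments are present exactly as written.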
Comment 192 Sandeep 2018-05-07 18:23:35 UTC
I tested with the APItrace file that I uploaded; it is still broken on 4.16.7.
Comment 193 notog 2018-05-07 22:31:41 UTC
(In reply to Sandeep from comment #192)
> I tested with the APItrace file that I uploaded, still broken on 4.16.7 .

I updated to 4.16.7 and replayed the APItrace file uploaded in comment #181
I also tried setting the env variable vblank_mode=0, so there was no vsync/fps cap.
I then tried suspending and resuming, and rerunning the APItrace, and it was fine.
I did not experience any artifacting or corruptions in either test.

So there are still issues with the XFX branded R9 390. :(
Comment 194 Sandeep 2018-05-11 21:36:48 UTC
So, I had never used amdgpu.dc=1 and amdgpu.dpm=1 kernel parameters when testing earlier.

I tried using them now, with the 4.16.7 kernel, and replayed the APItrace file. No crashes! Finally.

Well, this is still a workaround, but at least it works. I doubt it has anything to do with DC, although it's possible...
Comment 195 Alex Deucher 2018-05-11 21:38:37 UTC
(In reply to Sandeep from comment #194)
> So, I had never used amdgpu.dc=1 and amdgpu.dpm=1 kernel parameters when
> testing earlier.
> 
> I tried using them now, with the 4.16.7 kernel, and replayed the APItrace
> file. No crashes! Finally.
> 
> Well, this is still a workaround, but atleast it works. I doubt it has
> anything to do with DC, although it's possible......

setting dpm=1 uses the new powerplay code rather than the old dpm code for power management on CI parts.
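
To see which values the loaded module is actually using, the parameters can be read back from sysfs. This is a minimal sketch, assuming the amdgpu module is loaded and exposes dpm, dc and cik_support under /sys/module/amdgpu/parameters. The helper name show_params is made up for illustration, and the directory is passed as an argument so it can be pointed elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of each named module
# parameter found in the given directory.
show_params() {
    dir="$1"; shift
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            # Module not loaded, or this kernel does not expose the parameter.
            printf '%s=<not available>\n' "$name"
        fi
    done
}

show_params /sys/module/amdgpu/parameters dpm dc cik_support
```

A value of -1 for dpm should mean the driver picks its own default rather than the explicit setting discussed here.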
Comment 196 Sandeep 2018-05-11 22:13:07 UTC
Created attachment 139505 [details]
attachment-9725-0.html

Ah, I didn't know that. I thought it just disabled/enabled dpm.......well,
it works so that's good.

It would be great if it worked out of the box though, without having to add
kernel parameters.....

Comment 197 Sandeep 2018-05-11 22:15:18 UTC
Anyway, thanks for fixing the bug, AMD devs! (or whoever else did it).
Comment 198 Jon Doane 2018-05-12 13:11:12 UTC
I would like to add that the problem appears to be resolved on my installation of Ubuntu 18.04 with the MSI R9 390 8GB GAMING GPU using the current mainline kernel (4.17-rc4,) with the kernel flags of "amdgpu.dc=1" and "amdgpu.dpm=1". Clock scaling appears to be working as expected and I haven't had any visual artifacts or crashes to speak of yet. Using "amdgpu.dc=1" alone didn't make a difference but, "amdgpu.dpm=1" made all the difference.

Good work to everyone involved!
Comment 199 heavyjoe 2018-05-13 20:21:24 UTC
Can someone please help the linux noob?

I am on fedora 28. my "GRUB_CMDLINE_LINUX=" looks like this:

BOOT_IMAGE=/boot/vmlinuz-4.16.7-300.fc28.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.dc=1 amdgpu.dpm=1

i added the "amdgpu.dc=1 amdgpu.dpm=1"
when i start gnome there are 3 errors like this:

WARNING: CPU: 2 PID: 369 at drivers/gpu/drm/amd/amdgpu/../display/dc/dm_services.h:132 generic_reg_update_ex+0x12c/0x160 [amdgpu]

what parts of the grub-line do i need to add or delete to get things working, fixed as suggested in this thread? do i still have the wrong kernel?

Thanks very much!!! I really try to make the switch to linux but the problems with my ASUS R9 390 DirectCU III OC (and the bad performance) are a bit annoying.
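
As a rough illustration of the grub question above, here is a minimal sketch that appends the two arguments to an existing command line only if they are not already there. The helper name add_params is made up for illustration, and the starting string is a shortened version of the line quoted in this comment.

```shell
#!/bin/sh
# Hypothetical helper: append each parameter to a command line string unless
# it is already present as a whole word, then print the result.
add_params() {
    line="$1"; shift
    for p in "$@"; do
        case " $line " in
            *" $p "*) ;;             # already there, leave it alone
            *) line="$line $p" ;;    # not there yet, append it
        esac
    done
    printf '%s\n' "$line"
}

add_params "rhgb quiet radeon.cik_support=0 amdgpu.cik_support=1" \
    amdgpu.dc=1 amdgpu.dpm=1
# prints: rhgb quiet radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dc=1 amdgpu.dpm=1
```

The printed string would go into GRUB_CMDLINE_LINUX in /etc/default/grub, after which the grub configuration has to be regenerated (on Fedora typically with grub2-mkconfig; the output path depends on whether the system boots via BIOS or UEFI, so treat that part as an assumption to check for your install).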
Comment 200 Sandeep 2018-05-13 20:40:27 UTC
You can probably ignore the warnings, I get them too and nothing bad has happened so far. As long as GPU hang doesn't occur, it's all good.
Comment 201 heavyjoe 2018-05-13 20:49:40 UTC
(In reply to Sandeep from comment #200)
> You can probably ignore the warnings, I get them too and nothing bad has
> happened so far. As long as GPU hang doesn't occur, it's all good.

Thanks for the reply.
In the problem reporting app it is written as
"Unexpected System Error" The system has encountered a problem and recovered.

The reason is as i wrote:
WARNING: CPU: 1 PID: 369 at drivers/gpu/drm/amd/amdgpu/../display/dc/dm_services.h:132 generic_reg_update_ex+0x12c/0x160 [amdgpu]

so the system errors can occur as warnings? then i can live with it but i wasn't sure because it was labeled as system error...

thanks again. i will go on with that and hope no freezes appear.
Comment 202 Ioannis Panagiotopoulos 2018-05-14 18:54:03 UTC
I can confirm that this worked with kernel 4.16.7-300 on Fedora 28 GNOME with Wayland and the right boot arguments. However, it only worked when the R9 390 was the only card installed. When I used both the R9 390 and an RX 550 in the same system, the display plugged into the R9 390 did not work correctly and produced broken UI elements from time to time. Furthermore, the UI was very slow to respond.
When I tried Fedora 28 KDE, SDDM constantly crashed on startup and got stuck in a try-to-start -> crash -> try-to-start loop.
Kubuntu 18.04 KDE worked, but had the same issues as Fedora 28 GNOME.
So it seems the dpm bug is at last solved, although other problems remain that might be unrelated to dpm.

(In reply to heavyjoe from comment #201)
> (In reply to Sandeep from comment #200)
> > You can probably ignore the warnings, I get them too and nothing bad has
> > happened so far. As long as GPU hang doesn't occur, it's all good.
> 
> Thanks for the reply.
> In the problem reporting app it is written as
> "Unexpeted System Error" The system has encountered a problem and recovered.
> 
> The reason is as i wrote:
> WARNING: CPU: 1 PID: 369 at
> drivers/gpu/drm/amd/amdgpu/../display/dc/dm_services.h:132
> generic_reg_update_ex+0x12c/0x160 [amdgpu]
> 
> so the system errors can occur as warnings? then i can live with it but i
> wasn't sure because it was labeled as system error...
> 
> thanks again. i will go on with that and hope no freezes appear.

If you installed the kernel manually, try installing the kernel-modules-extra package for that kernel version as well.
Comment 203 iburnth3playb00k 2018-06-01 23:11:42 UTC
(In reply to chris from comment #188)
> I suffered pretty much all of the issue listed in this thread for many weeks
> since upgrading to 3 1440p monitors.
> 
> I have an Club 3D R9 390 Royal Queen.
> 
> Comment #182 best describes the problem I faced and what I had to do to work
> around it.
> 
> I am happy to announce that kernel 4.16.7 has solved this issue for me.
> 
> My system has no more issues booting. KDE is stable. No flickering at all.
> Even when forcing power_dpm_force_performance_level to 'low'.
> 
> I'm running Gentoo with:
> mesa-18.1.0
> libdrm-2.4.91
> xf86-video-amdgpu-18.0.1
> xorg-drivers-1.19
> 
> kernel params include: radeon.cik_support=0 amdgpu.cik_support=1
> amdgpu.modeset=1 amdgpu.dc=1 amdgpu.dpm=1
> 
> If anyone needs more info please ask.

I'm kind of new to Linux and I have this problem.
Could you write a step-by-step guide to help me fix it?

