Bug 105021

Summary: suspend / rx550 / extremely slow after 2nd thaw
Product: DRI
Reporter: arne_woerner
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
Severity: major
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments (description, flags):
137241: 4.15.0-1-MANJARO / journalctl -r (flags: none)
137242: 4.15.1-1-MANJARO / journalctl -r (flags: none)
137247: amdgpu related interrupt count (flags: none)
137492: dmesg of boot/suspend/resume with polaris12_uvd.bin from 20180104 (flags: none)
137493: journalctl -r|tac of a resume failure with polaris12_uvd.bin from 20180119-2 (flags: none)

Description arne_woerner 2018-02-09 07:58:11 UTC
Created attachment 137241
4.15.0-1-MANJARO / journalctl -r

Hi!

After a reboot, only the first thaw works fine,
but the second thaw results in a box that is extremely slow (keypresses take 10 seconds to show up in gnome-terminal).

In the journal i found some suspicious messages.

Thx.

Bye
Arne
Comment 1 arne_woerner 2018-02-09 08:00:04 UTC
Created attachment 137242
4.15.1-1-MANJARO / journalctl -r
Comment 2 arne_woerner 2018-02-09 08:01:42 UTC
i reported it already here:
1. https://bugzilla.kernel.org/show_bug.cgi?id=198619
2. https://forum.manjaro.org/t/issue-with-resume-and-kernel-4-15/39802

and my CPU is an AMD A10-5800K...
Comment 3 Michel Dänzer 2018-02-09 09:14:46 UTC
When it's slow, do the numbers on the amdgpu line in /proc/interrupts increase over time?
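A minimal sketch of how that check could be scripted, assuming the usual /proc/interrupts layout (an IRQ label, one count column per CPU, then the controller and device names). The function names here are made up for illustration and are not part of any tool:

```python
import time


def amdgpu_irq_counts(text):
    """Return {irq_label: count summed over all CPUs} for amdgpu lines."""
    counts = {}
    for line in text.splitlines():
        if "amdgpu" not in line:
            continue
        label, _, rest = line.partition(":")
        total = 0
        for field in rest.split():
            # The per-CPU counters come first; stop at the first
            # non-numeric column (for example "PCI-MSI").
            if not field.isdigit():
                break
            total += int(field)
        counts[label.strip()] = total
    return counts


def sample_deltas(interval=5.0, path="/proc/interrupts"):
    """Read the counters twice and return how much each amdgpu IRQ grew."""
    with open(path) as f:
        before = amdgpu_irq_counts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = amdgpu_irq_counts(f.read())
    return {irq: after[irq] - before.get(irq, 0) for irq in after}
```

Calling sample_deltas() a few times while the box is slow shows whether the counters still move. A delta that stays at 0 would suggest that the interrupt is no longer being delivered.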
Comment 4 arne_woerner 2018-02-09 15:37:44 UTC
Created attachment 137247
amdgpu related interrupt count

Michel Dänzer asked:
> When it's slow, do the numbers on the amdgpu line in
> /proc/interrupts increase over time?
>
in the first few seconds yes, but then no...
and the CPU number changes from 1 to 0...
-arne
Comment 5 arne_woerner 2018-02-12 06:02:53 UTC
the new kernel 4.15.2-2-MANJARO still says during the second resume:
Feb 11 14:07:21 vaako.intern.wgboome.org kernel: [drm:uvd_v6_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed (0xCAFEDEAD)

it cannot even "halt -p" after the first thaw (it just reboots, instead of turning off the power)...

and the command
> echo schedutil > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
fails after the second resume ("no such file" IIRC)...
is it possibly not just related to amdgpu?

-arne
Comment 6 arne_woerner 2018-02-12 11:08:39 UTC
FYI:
neither
(A) a new BIOS (Gigabyte F2A88XM-D3H, BIOS F10a 02/23/2016 (before it was F9))
nor
(B) turning off IOMMU support in the BIOS
nor
(C) kernel parameters "intremap=off amd_iommu=off"
helped...

-arne
Comment 7 arne_woerner 2018-02-13 07:32:05 UTC
4.15.3-1-ARCH has that bug, too... -arne
Comment 8 Michel Dänzer 2018-02-13 09:08:18 UTC
(In reply to arne_woerner from comment #4)
> in the first few seconds yes, but then no...

So the slowness is probably due to the GPU's IRQ no longer working. The question is why that happens...

Assuming you're running with DC enabled, does amdgpu.dc=0 make a difference? (Or amdgpu.dc=1 otherwise)

If not, since this seems to be a regression, bisecting between 4.14 and 4.15 would probably be the best way forward.
Comment 9 arne_woerner 2018-02-13 18:55:35 UTC
weird news:
the bug seems to be not or not only in the kernel...
because:
i had the same problem at thaw, when i downgraded like this:
linux414 (4.14.10-2 <- 4.14.16-1)
linux414-headers (4.14.10-2 <- 4.14.16-1)
mesa (17.3.1-0 <- 17.3.3-2)
xorg-server-common (1.19.6-0.1 <- 1.19.6+13+gd0d1a694f-1)
xorg-server (1.19.6-0.1 <- 1.19.6+13+gd0d1a694f-1)

i also tried linux414 4.14.13 and 4.14.16... they all do not change that bug...

i thought, that i saw this bug in linux415 only, which i test since February 2018... but now i see it everywhere...

after the first thaw, switching between tty2(text mode) and tty1(Xorg) back and forth (with ALT+CTRL+F2 and ALT+F1) provokes that bug, too...

funnily even without Xorg, the second thaw fails...

does anybody have an idea, which package i should bisect?

is linux413 a good idea?
i mean: i often heard of linux414-drm-next-stage or so?
maybe something happened with linux414?
but then i did not c the bug from 2017-11-24 to 2018-01-31...

or did a package change in January/February 2018, that could cause this?

-arne
Comment 10 arne_woerner 2018-02-13 20:47:43 UTC
could it be the linux-firmware update?
[2018-01-27 05:38] [ALPM] upgraded linux-firmware (20180104.65b1c68-1 -> 20180119.2a713be-1)
[2018-01-30 06:22] [ALPM] upgraded linux-firmware (20180119.2a713be-1 -> 20180119.2a713be-2)

-arne
Comment 11 Alex Deucher 2018-02-13 21:26:21 UTC
(In reply to arne_woerner from comment #10)
> could it be the linux-firmware update?
> [2018-01-27 05:38] [ALPM] upgraded linux-firmware (20180104.65b1c68-1 ->
> 20180119.2a713be-1)
> [2018-01-30 06:22] [ALPM] upgraded linux-firmware (20180119.2a713be-1 ->
> 20180119.2a713be-2)

Were any amdgpu firmwares included in that update?
Comment 12 arne_woerner 2018-02-13 22:10:24 UTC
looks like:
[a: linux-firmware-20180119.2a713be-2-any.pkg.tar.xz]
[b: linux-firmware-20180104.65b1c68-1-any.pkg.tar.xz]
# diff -r a/usr/lib/firmware/amdgpu b/usr/lib/firmware/amdgpu
Binary files a/usr/lib/firmware/amdgpu/fiji_vce.bin and b/usr/lib/firmware/amdgpu/fiji_vce.bin differ
Binary files a/usr/lib/firmware/amdgpu/polaris10_uvd.bin and b/usr/lib/firmware/amdgpu/polaris10_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/polaris11_uvd.bin and b/usr/lib/firmware/amdgpu/polaris11_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/polaris12_uvd.bin and b/usr/lib/firmware/amdgpu/polaris12_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/raven_vcn.bin and b/usr/lib/firmware/amdgpu/raven_vcn.bin differ
Binary files a/usr/lib/firmware/amdgpu/vega10_uvd.bin and b/usr/lib/firmware/amdgpu/vega10_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/vega10_vce.bin and b/usr/lib/firmware/amdgpu/vega10_vce.bin differ

even polaris12 seems to be different (RX550 uses polaris12, right?)...
i will test it tomorrow... :)

-arne
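The same comparison can be sketched in Python, assuming both packages are unpacked into two directories as in the diff above. The helper name differing_files is hypothetical:

```python
import filecmp
import os


def differing_files(dir_a, dir_b):
    """Return sorted names of files present in both dirs whose bytes differ."""
    common = sorted(set(os.listdir(dir_a)) & set(os.listdir(dir_b)))
    # Only compare regular files; subdirectories are skipped.
    common = [n for n in common if os.path.isfile(os.path.join(dir_a, n))]
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting size and mtime.
    _match, mismatch, _errors = filecmp.cmpfiles(
        dir_a, dir_b, common, shallow=False
    )
    return sorted(mismatch)
```

For example, differing_files("a/usr/lib/firmware/amdgpu", "b/usr/lib/firmware/amdgpu") should name the same files that diff -r reports as different, but it does not report files that exist on only one side.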
Comment 13 arne_woerner 2018-02-14 07:06:48 UTC
1. (on Manjaro Linux) it is not enough to downgrade linux-firmware... u also need to do a mkinitcpio... :)
2. downgrading the directory /usr/lib/firmware/amdgpu from linux-firmware-20180119.2a713be-2 to linux-firmware-20180104.65b1c68-1 fixes this bug...
3. linux-firmware-20180119.2a713be-1 has this bug, too...
4. downgrading only the file polaris12_uvd.bin is not enough (but it looks slightly better)...
5. how does my box use the other new files in /usr/lib/firmware/amdgpu (fiji_vce.bin, polaris10_uvd.bin, polaris11_uvd.bin, raven_vcn.bin, vega10_uvd.bin, vega10_vce.bin)? i thought my box would not use these...
6. does this suffice to find the bug? or do u want me to try certain other combinations of version 20180104 and 20180119.2a713be-2?

-arne
Comment 14 arne_woerner 2018-02-14 08:11:26 UTC
i forgot to mention:
both kernels (4.15.2-2-MANJARO (60Hz vert freq) and 4.14.16-1-MANJARO (40Hz vert freq)) work fine,
_but_ they both need at least one of these kernel parameters in the grub.cfg:
amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.audio=0

-arne
Comment 15 Alex Deucher 2018-02-14 14:11:19 UTC
(In reply to arne_woerner from comment #14)
> i forgot to mention:
> both kernels (4.15.2-2-MANJARO (60Hz vert freq) and 4.14.16-1-MANJARO (40Hz
> vert freq)) work fine,
> _but_ they both need at least one of these kernel parameters in the grub.cfg:
> amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.audio=0

Can you clarify this last statement?  The first two options are irrelevant for your board since it's not SI or CIK.
Comment 16 Alex Deucher 2018-02-14 14:12:13 UTC
Does disabling MSIs help?  Append amdgpu.msi=0 to the kernel command line in grub?
Comment 17 arne_woerner 2018-02-14 17:57:02 UTC
(In reply to Alex Deucher from comment #15)
> (In reply to arne_woerner from comment #14)
> > _but_ they both need at least one of these kernel parameters in the grub.cfg:
> > amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.audio=0
> 
> Can you clarify this last statement?  The first two options are irrelevant
> for your board since it's not SI or CIK.
>
yup:
i mean: i did not test, which one is necessary...
so i cannot say, if all three r necessary, or if just one/two of them suffice...

tomorrow i will test "amdgpu.audio=0 amdgpu.msi=0"...
with the stable linux-firmware package... right?

or do u mean just "amdgpu.msi=0"?
maybe hdmi audio works already, if amdgpu.msi is 0?

-arne
Comment 18 Alex Deucher 2018-02-14 18:14:21 UTC
(In reply to arne_woerner from comment #17)
> yup:
> i mean: i did not test, which one is necessary...
> so i cannot say, if all three r necessary, or if just one/two of them
> suffice...
> 
> tomorrow i will test "amdgpu.audio=0 amdgpu.msi=0"...
> with the stable linux-firmware package... right?
> 
> or do u mean just "amdgpu.msi=0"?
> maybe hdmi audio works already, if amdgpu.msi is 0?

I doubt the audio parameter would affect anything in this regard, but you said it was required for the board to work properly?  Is that the case or am I misinterpreting you?  For audio on your board, you need to enable DC which is only available on kernel 4.15 and newer.  You can boot with amdgpu.dc=1 to enable DC.  The msi option may help narrow down the interrupt issue.  If it helps, it seems MSIs are not working properly on your system after resume.  This would probably be something outside of the driver if it's a regression.
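To keep track of which amdgpu options a given boot actually used, a small sketch like the following could help. It only splits a kernel command line string (for example the content of /proc/cmdline). The function name is an assumption for illustration:

```python
def amdgpu_params(cmdline):
    """Return {name: value} for every amdgpu.<name>=<value> token."""
    params = {}
    prefix = "amdgpu."
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, _, value = token[len(prefix):].partition("=")
            params[name] = value
    return params
```

With open("/proc/cmdline").read() as input, options such as dc, msi and audio show up only if they were really passed at boot. Anything missing from the result is still at the driver default.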
Comment 19 arne_woerner 2018-02-14 21:45:57 UTC
ok...

but why is it necessary to use that old amdgpu firmware,
if it was some other part of the kernel?

i cant find the change log of those firmware files...
could there be something, that does not like to suspend/resume?

-arne
Comment 20 Alex Deucher 2018-02-14 21:56:04 UTC
(In reply to arne_woerner from comment #19)
> ok...
> 
> but why is it necessary to use that old amdgpu firmware,
> if it was some other part of the kernel?
> 
> i cant find the change log of those firmware files...
> could there be something, that does not like to suspend/resume?

Is it the firmware that caused the regression?  Does everything work with the old firmware?
Comment 21 Alex Deucher 2018-02-14 22:01:28 UTC
(In reply to arne_woerner from comment #13)
> 1. (on Manjaro Linux) it is not enough to downgrade linux-firmware... u also
> need to do a mkinitcpio... :)
> 2. downgrading the directory /usr/lib/firmware/amdgpu from
> linux-firmware-20180119.2a713be-2 to linux-firmware-20180104.65b1c68-1 fixes
> this bug...
> 3. linux-firmware-20180119.2a713be-1 has this bug, too...
> 4. downgrading only the file polaris12_uvd.bin is not enough (but it looks
> slightly better)...

Sorry, I missed this part.

> 5. how does my box use the other new files in /usr/lib/firmware/amdgpu
> (fiji_vce.bin, polaris10_uvd.bin, polaris11_uvd.bin, raven_vcn.bin,
> vega10_uvd.bin, vega10_vce.bin)? i thought my box would not use these...

Your system doesn't use those additional firmwares.  They are for other chips.
Comment 22 arne_woerner 2018-02-15 06:51:45 UTC
(In reply to Alex Deucher from comment #20)
> Is it the firmware that caused the regression?  Does everything work with
> the old firmware?

i did the following experiments:

1.
via kernel parameters in grub.cfg i changed
amdgpu.dc from -1 to 1
amdgpu.msi from -1 to 0
amdgpu.audio from -1 to 0.
i installed version 20180119.2a713be-2 of linux-firmware.
result:
during the second suspend the screen went black (no signal) and the following thaw produced the slow box syndrome...

2.
via kernel parameters in grub.cfg i changed
amdgpu.dc from -1 to 1
and left amdgpu.msi and amdgpu.audio untouched.
i installed version 20180119.2a713be-2 of linux-firmware.
and version 20180104.65b1c68-1 of /usr/lib/firmware/amdgpu/polaris12_uvd.bin (package linux-firmware).
now it works three times in a row...
the 3rd time there even was a running mplayer, that uses HDMI Audio... :)

what is that new polaris12_uvd.bin good for?

-arne
Comment 23 arne_woerner 2018-02-17 07:55:02 UTC
i did another experiment:

i set amdgpu.dc=1 and left amdgpu.msi and amdgpu.audio unchanged and
installed the new firmware (180119-2).
result:
i still get this slowness error during second thaw:
PM: Device 0000:01:00.0 failed to thaw async: error -22
dpm_run_callback(): pci_pm_thaw+0x0/0x80 returns -22
[drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-22)

with amdgpu.dc=1 and amdgpu.msi==amdgpu.audio==-1 and
the old firmware (180104) i get this not so bad error during suspend but no slowness even after the 5th thaw in a row:
amdgpu 0000:01:00.0: GPU pci config reset
[drm:amdgpu_suspend [amdgpu]] *ERROR* suspend of IP block <uvd_v6_0> failed -12

-arne
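A minimal sketch for pulling such amdgpu error lines out of a saved journal or dmesg dump, assuming the "[drm:<function> [amdgpu]] *ERROR* <message>" shape seen in the messages above. The names ERROR_RE and amdgpu_errors are hypothetical:

```python
import re

# Matches e.g. "[drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-22)"
ERROR_RE = re.compile(
    r"\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* (?P<msg>.*)"
)


def amdgpu_errors(lines):
    """Return (function, message) pairs for every amdgpu *ERROR* line."""
    found = []
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            found.append((m.group("func"), m.group("msg").strip()))
    return found
```

Feeding it the lines of a journalctl dump from a good run and from a bad run makes it easier to compare which functions report errors in each case.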
Comment 24 arne_woerner 2018-02-20 20:09:56 UTC
with
kernel 4.15.4-2-MANJARO
and
mesa 17.3.5-0
and
linux-firmware 20180119.2a713be-2
the bug is still there...
-arne
Comment 25 Alex Deucher 2018-02-20 21:03:55 UTC
Please attach your full dmesg output from boot.
Comment 26 arne_woerner 2018-02-21 05:49:06 UTC
Created attachment 137492
dmesg of boot/suspend/resume with polaris12_uvd.bin from 20180104
Comment 27 arne_woerner 2018-02-21 05:58:12 UTC
Created attachment 137493
journalctl -r|tac of a resume failure with polaris12_uvd.bin from 20180119-2
Comment 28 arne_woerner 2018-03-06 08:15:41 UTC
linux kernel 4.15.7-1-MANJARO
with mesa 17.3.6-1
still does not like 20180119-2 polaris12 firmware...
-arne
Comment 29 Alex Deucher 2018-03-06 16:15:39 UTC
We are actively debugging this with the firmware teams.  Will update once we get to the root cause.
Comment 30 jamesz@amd.com 2018-03-07 19:09:47 UTC
https://lists.freedesktop.org/archives/amd-gfx/2018-March/019801.html
The above patch should solve this suspend/resume issue.
