Summary: | suspend / rx550 / extremely slow after 2nd thaw | ||
---|---|---|---|
Product: | DRI | Reporter: | arne_woerner |
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Created attachment 137242 [details]
4.15.1-1-MANJARO / journalctl -r
i reported it already here:
1. https://bugzilla.kernel.org/show_bug.cgi?id=198619
2. https://forum.manjaro.org/t/issue-with-resume-and-kernel-4-15/39802

and my CPU is an AMD A10-5800K...

When it's slow, do the numbers on the amdgpu line in /proc/interrupts increase over time?

Created attachment 137247 [details]
amdgpu related interrupt count

Michel Dänzer asked:
> When it's slow, do the numbers on the amdgpu line in
> /proc/interrupts increase over time?
>
in the first few seconds yes, but then no...
and the CPU number changes from 1 to 0...
-arne

the new kernel 4.15.2-2-MANJARO still says during the second resume:
Feb 11 14:07:21 vaako.intern.wgboome.org kernel: [drm:uvd_v6_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed (0xCAFEDEAD)
it cannot even "halt -p" after the first thaw (it just reboots, instead of turning off the power)...
and the command
> echo schedutil > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
fails after the second resume ("no such file" IIRC)...
is it possibly not just related to amdgpu?
-arne
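
The /proc/interrupts check asked for above can be scripted. What follows is only a minimal sketch: it sums the per-CPU counters on the amdgpu line, waits, and sums them again. The function names `sum_irq_counts` and `amdgpu_irq_delta` and the 5-second default interval are assumptions made for illustration; they do not come from this report.

```shell
#!/bin/sh
# Sum the per-CPU counters on every /proc/interrupts-style line read from
# stdin that mentions "amdgpu". Field 1 is the IRQ label (e.g. "42:"); the
# numeric fields after it are the per-CPU counts, and the first non-numeric
# field ends them.
sum_irq_counts() {
    awk '/amdgpu/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-9]+$/) total += $i
            else break
        }
    }
    END { print total + 0 }'
}

# Print how much the amdgpu interrupt count grew over an interval (seconds).
# Hypothetical helper: a delta of 0 while the desktop is in use would be
# consistent with the GPU interrupt no longer firing.
amdgpu_irq_delta() {
    interval="${1:-5}"
    before=$(sum_irq_counts < /proc/interrupts)
    sleep "$interval"
    after=$(sum_irq_counts < /proc/interrupts)
    echo $((after - before))
}
```

Running `amdgpu_irq_delta 10` once right after the second thaw and once again a minute later would show whether the count keeps increasing or stalls.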
FYI: neither
(A) a new BIOS (Gigabyte F2A88XM-D3H, BIOS F10a 02/23/2016 (before it was F9)) nor
(B) turning off IOMMU support in the BIOS nor
(C) kernel parameters "intremap=off amd_iommu=off"
helped...
-arne

4.15.3-1-ARCH has that bug, too...
-arne

(In reply to arne_woerner from comment #4)
> in the first few seconds yes, but then no...

So the slowness is probably due to the GPU's IRQ no longer working, the question is why that happens...

Assuming you're running with DC enabled, does amdgpu.dc=0 make a difference? (Or amdgpu.dc=1 otherwise)

If not, since this seems to be a regression, bisecting between 4.14 and 4.15 would probably be the best way forward.

weird news:
the bug seems to be not, or not only, in the kernel...
because:
i had the same problem at thaw, when i downgraded like this:
linux414 (4.14.10-2 <- 4.14.16-1)
linux414-headers (4.14.10-2 <- 4.14.16-1)
mesa (17.3.1-0 <- 17.3.3-2)
xorg-server-common (1.19.6-0.1 <- 1.19.6+13+gd0d1a694f-1)
xorg-server (1.19.6-0.1 <- 1.19.6+13+gd0d1a694f-1)

i also tried linux414 4.14.13 and 4.14.16...
they all do not change that bug...

i thought, that i saw this bug in linux415 only, which i test since February 2018...
but now i see it everywhere...

after the first thaw, switching between tty2(text mode) and tty1(Xorg) back and forth (with ALT+CTRL+F2 and ALT+F1) provokes that bug, too...

funnily even without Xorg, the second thaw fails...

does anybody have an idea, which package i should bisect?
is linux413 a good idea?
i mean: i often heard of linux414-drm-next-stage or so?
maybe something happened with linux414?
but then i did not c the bug from 2017-11-24 to 2018-01-31...
or did a package change in January/February 2018, that could cause this?
-arne

could it be the linux-firmware update?
[2018-01-27 05:38] [ALPM] upgraded linux-firmware (20180104.65b1c68-1 -> 20180119.2a713be-1)
[2018-01-30 06:22] [ALPM] upgraded linux-firmware (20180119.2a713be-1 -> 20180119.2a713be-2)
-arne

(In reply to arne_woerner from comment #10)
> could it be the linux-firmware update?
> [2018-01-27 05:38] [ALPM] upgraded linux-firmware (20180104.65b1c68-1 ->
> 20180119.2a713be-1)
> [2018-01-30 06:22] [ALPM] upgraded linux-firmware (20180119.2a713be-1 ->
> 20180119.2a713be-2)

Were any amdgpu firmwares included in that update?

looks like:
[a: linux-firmware-20180119.2a713be-2-any.pkg.tar.xz]
[b: linux-firmware-20180104.65b1c68-1-any.pkg.tar.xz]
# diff -r a/usr/lib/firmware/amdgpu b/usr/lib/firmware/amdgpu
Binary files a/usr/lib/firmware/amdgpu/fiji_vce.bin and b/usr/lib/firmware/amdgpu/fiji_vce.bin differ
Binary files a/usr/lib/firmware/amdgpu/polaris10_uvd.bin and b/usr/lib/firmware/amdgpu/polaris10_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/polaris11_uvd.bin and b/usr/lib/firmware/amdgpu/polaris11_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/polaris12_uvd.bin and b/usr/lib/firmware/amdgpu/polaris12_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/raven_vcn.bin and b/usr/lib/firmware/amdgpu/raven_vcn.bin differ
Binary files a/usr/lib/firmware/amdgpu/vega10_uvd.bin and b/usr/lib/firmware/amdgpu/vega10_uvd.bin differ
Binary files a/usr/lib/firmware/amdgpu/vega10_vce.bin and b/usr/lib/firmware/amdgpu/vega10_vce.bin differ

even polaris12 seems to be different (RX550 uses polaris12, right?)...
i will test it tomorrow... :)
-arne

1. (on Manjaro Linux) it is not enough to downgrade linux-firmware... u also need to do a mkinitcpio... :)
2. downgrading the directory /usr/lib/firmware/amdgpu from linux-firmware-20180119.2a713be-2 to linux-firmware-20180104.65b1c68-1 fixes this bug...
3. linux-firmware-20180119.2a713be-1 has this bug, too...
4. downgrading only the file polaris12_uvd.bin is not enough (but it looks slightly better)...
5. how does my box use the other new files in /usr/lib/firmware/amdgpu (fiji_vce.bin, polaris10_uvd.bin, polaris11_uvd.bin, raven_vcn.bin, vega10_uvd.bin, vega10_vce.bin)? i thought my box would not use these...
6. does this suffice to find the bug? or do u want me to try certain other combinations of version 20180104 and 20180119.2a713be-2?
-arne

i forgot to mention:
both kernels (4.15.2-2-MANJARO (60Hz vert freq) and 4.14.16-1-MANJARO (40Hz vert freq)) work fine,
_but_ they both need at least one of these kernel parameters in the grub.cfg:
amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.audio=0
-arne

(In reply to arne_woerner from comment #14)
> i forgot to mention:
> both kernels (4.15.2-2-MANJARO (60Hz vert freq) and 4.14.16-1-MANJARO (40Hz
> vert freq)) work fine,
> _but_ they both need at least one of these kernel parameters in the grub.cfg:
> amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.audio=0

Can you clarify this last statement? The first two options are irrelevant for your board since it's not SI or CIK.

Does disabling MSIs help? Append amdgpu.msi=0 to the kernel command line in grub?

(In reply to Alex Deucher from comment #15)
> (In reply to arne_woerner from comment #14)
> > _but_ they both need at least one of these kernel parameters in the grub.cfg:
> > amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.audio=0
>
> Can you clarify this last statement? The first two options are irrelevant
> for your board since it's not SI or CIK.
>
yup:
i mean: i did not test, which one is necessary...
so i cannot say, if all three r necessary, or if just one/two of them suffice...

tomorrow i will test "amdgpu.audio=0 amdgpu.msi=0"...
with the stable linux-firmware package... right?

or do u mean just "amdgpu.msi=0"?
maybe hdmi audio works already, if amdgpu.msi is 0?
-arne

(In reply to arne_woerner from comment #17)
> yup:
> i mean: i did not test, which one is necessary...
> so i cannot say, if all three r necessary, or if just one/two of them
> suffice...
>
> tomorrow i will test "amdgpu.audio=0 amdgpu.msi=0"...
> with the stable linux-firmware package... right?
>
> or do u mean just "amdgpu.msi=0"?
> maybe hdmi audio works already, if amdgpu.msi is 0?

I doubt the audio parameter would affect anything in this regard, but you said it was required for the board to work properly? Is that the case or am I misinterpreting you? For audio on your board, you need to enable DC which is only available on kernel 4.15 and newer. You can boot with amdgpu.dc=1 to enable DC.

The msi option may help narrow down the interrupt issue. If it helps, it seems MSIs are not working properly on your system after resume. This would probably be something outside of the driver if it's a regression.

ok...

but why is it necessary to use that old amdgpu firmware,
if it was some other part of the kernel?

i can't find the change log of those firmware files...
could there be something, that does not like to suspend/resume?
-arne

(In reply to arne_woerner from comment #19)
> ok...
>
> but why is it necessary to use that old amdgpu firmware,
> if it was some other part of the kernel?
>
> i cant find the change log of those firmware files...
> could there be something, that does not like to suspend/resume?

Is it the firmware that caused the regression? Does everything work with the old firmware?

(In reply to arne_woerner from comment #13)
> 1. (on Manjaro Linux) it is not enough to downgrade linux-firmware... u also
> need to do a mkinitcpio... :)
> 2. downgrading the directory /usr/lib/firmware/amdgpu from
> linux-firmware-20180119.2a713be-2 to linux-firmware-20180104.65b1c68-1 fixes
> this bug...
> 3. linux-firmware-20180119.2a713be-1 has this bug, too...
> 4. downgrading only the file polaris12_uvd.bin is not enough (but it looks
> slightly better)...

Sorry, I missed this part.

> 5. how does my box use the other new files in /usr/lib/firmware/amdgpu
> (fiji_vce.bin, polaris10_uvd.bin, polaris11_uvd.bin, raven_vcn.bin,
> vega10_uvd.bin, vega10_vce.bin)? i thought my box would not use these...

Your system doesn't use those additional firmwares. They are for other chips.

(In reply to Alex Deucher from comment #20)
> Is it the firmware that caused the regression? Does everything work with
> the old firmware?

i did the following experiments:

1.
via kernel parameters in grub.cfg i changed
amdgpu.dc from -1 to 1
amdgpu.msi from -1 to 0
amdgpu.audio from -1 to 0.
i installed version 20180119.2a713be-2 of linux-firmware.
result:
during the second suspend the screen went black (no signal) and the following thaw produced the slow box syndrome...

2.
via kernel parameters in grub.cfg i changed
amdgpu.dc from -1 to 1
and left amdgpu.msi and amdgpu.audio untouched.
i installed version 20180119.2a713be-2 of linux-firmware.
and version 20180104.65b1c68-1 of /usr/lib/firmware/amdgpu/polaris12_uvd.bin (package linux-firmware).
now it works three times in a row...
the 3rd time there even was a running mplayer, that uses HDMI Audio... :)

what is that new polaris12_uvd.bin good for?
-arne

i did another experiment:
i set amdgpu.dc=1 and left amdgpu.msi and amdgpu.audio unchanged and installed the new firmware (180119-2).
result:
i still get this slowness error during second thaw:
PM: Device 0000:01:00.0 failed to thaw async: error -22
dpm_run_callback(): pci_pm_thaw+0x0/0x80 returns -22
[drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-22)

with amdgpu.dc=1 and amdgpu.msi==amdgpu.audio==-1 and the old firmware (180104) i get this not so bad error during suspend but no slowness even after the 5th thaw in a row:
amdgpu 0000:01:00.0: GPU pci config reset
[drm:amdgpu_suspend [amdgpu]] *ERROR* suspend of IP block <uvd_v6_0> failed -12
-arne

with kernel 4.15.4-2-MANJARO and mesa 17.3.5-0 and linux-firmware 20180119.2a713be-2 the bug is still there...
-arne

Please attach your full dmesg output from boot.

Created attachment 137492 [details]
dmesg of boot/suspend/resume with polaris12_uvd.bin from 20180104
Created attachment 137493 [details]
journalctl -r|tac of a resume failure with polaris12_uvd.bin from 20180119-2
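
To compare logs like the two attached above, it can help to reduce each one to the failure signatures quoted in this report. This is a rough sketch only: the function name `filter_amdgpu_errors` is hypothetical, and the pattern list is just the four messages that appear in this thread, not a complete list of amdgpu errors.

```shell
#!/bin/sh
# Keep only the log lines that match failure messages quoted in this report.
# Reads a kernel log from stdin and writes the matching lines to stdout.
filter_amdgpu_errors() {
    grep -E 'ring [0-9]+ test failed|amdgpu_resume failed|failed to thaw|suspend of IP block'
}
```

For example, `journalctl -k -b | filter_amdgpu_errors` would show the matching kernel messages from the current boot, and `dmesg | filter_amdgpu_errors` does the same for the kernel ring buffer.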
linux kernel 4.15.7-1-MANJARO with mesa 17.3.6-1 still does not like 20180119-2 polaris12 firmware...
-arne

We are actively debugging this with the firmware teams. Will update once we get to the root cause.

https://lists.freedesktop.org/archives/amd-gfx/2018-March/019801.html

the above patch should solve this suspend/resume issue.
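
For anyone retracing the firmware comparison reported earlier in this thread (the `diff -r` of two extracted linux-firmware packages), here is a minimal sketch that prints only the names of files whose contents differ between two directories. The function name `changed_firmware` and the example paths in the usage note are assumptions for illustration.

```shell
#!/bin/sh
# List files that exist in both directories but whose contents differ.
# $1 and $2 are two extracted firmware directories (hypothetical paths).
changed_firmware() {
    old_dir="$1"
    new_dir="$2"
    for new_file in "$new_dir"/*; do
        name=$(basename "$new_file")
        old_file="$old_dir/$name"
        # Skip anything that is not a regular file present on both sides.
        [ -f "$new_file" ] && [ -f "$old_file" ] || continue
        # cmp -s is silent and returns non-zero when the files differ.
        cmp -s "$old_file" "$new_file" || echo "$name"
    done
}
```

With the packages unpacked into `a/` and `b/` as in the report, `changed_firmware b/usr/lib/firmware/amdgpu a/usr/lib/firmware/amdgpu` would print one file name per line, which makes it easier to spot whether a file for the installed GPU (here polaris12) is among the changes.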
Created attachment 137241 [details]
4.15.0-1-MANJARO / journalctl -r

Hi!

After a reboot only the first thaw works fine, but the second thaw results in a box, that is extremely slow (keypresses take 10 seconds to show up in the gnome-terminal).
In the journal i found some suspicious messages.

Thx.

Bye
Arne