Summary: | Raven Ridge: constant lockups since latest pull from Linus
---|---
Product: | DRI
Component: | DRM/AMDgpu
Status: | RESOLVED MOVED
Severity: | normal
Priority: | medium
Version: | XOrg git
Hardware: | Other
OS: | All
Reporter: | Samantha McVey <samantham>
Assignee: | Default DRI bug account <dri-devel>
CC: | ernstp, marvin.damschen
Created attachment 142288 [details]
Xorg log
My system is a Lenovo A485 with the AMD PRO 2700U. While the system would have freezes previously during certain games, it was almost always stable using KDE and browsing the web. So this seems to be a regression. Attaching xorg log to this post.
Did you also update mesa or X? Does reverting https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf help? If so, can you try the latest firmware for your GPU? If not, can you bisect?
X and mesa were not upgraded. It is repeatable: if I boot into an older kernel things work, and if I boot into the latest Linus master my system will freeze within the next 30 minutes on average. Firmware is the latest from linux-firmware.git (thus it includes the recent Raven firmware bump). I will revert that commit and see if I still get the GPU lockup.
Alex, I can report that reverting commit a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf fixes the GPU freezes I was getting.
Does the latest firmware from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git fix the issue as well?
Yeah I have the latest linux-firmware installed. I also have raven_dmcu.bin, which was just added, so I'm certain I have the latest, since I installed the git version 3 days ago (when I read that Raven Ridge had its firmware updated).
(In reply to Samantha McVey from comment #6)
> Yeah I have the latest linux-firmware installed. I also have raven_dmcu.bin
> which just was added so I'm certain I have the latest, since I installed the
> git version 3 days ago (when I read that Raven Ridge had its firmware
> updated).
So to confirm, the newest firmware did not help? Please provide the output of /sys/kernel/debug/dri/0/amdgpu_firmware_info
On the newest firmware, the issue is present, yes.
Output of /sys/kernel/debug/dri/0/amdgpu_firmware_info:
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 42, firmware version: 0x0000009c
PFP feature version: 42, firmware version: 0x000000b1
CE feature version: 42, firmware version: 0x0000004d
RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 42, firmware version: 0x00000192
MEC2 feature version: 42, firmware version: 0x00000192
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x0017ba78
SMC feature version: 0, firmware version: 0x00001e44
SDMA0 feature version: 41, firmware version: 0x000000a8
VCN feature version: 0, firmware version: 0x01004912
DMCU feature version: 0, firmware version: 0x00000001
VBIOS version: 113-RAVEN-110
Created attachment 142315 [details] [review]
possible fix
Does this patch fix the issue?
Can you determine whether it's stutter or gfxoff that causes the problem? Please try booting with amdgpu.ppfeaturemask=0xfffdbfff or amdgpu.ppfeaturemask=0xffff3fff on the kernel command line in grub and see which one fixes the issue.
I am testing with amdgpu.ppfeaturemask=0xfffdbfff right now. I still have that commit reverted, is this test still valid? What I mean is I have amdgpu.ppfeaturemask=0xfffdbfff in my cmdline at the same time as commit a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf is reverted.
(In reply to Samantha McVey from comment #11)
> I am testing with amdgpu.ppfeaturemask=0xfffdbfff right now. I still have
> that commit reverted, is this test still valid? What I mean is I have
> amdgpu.ppfeaturemask=0xfffdbfff in my cmdline at the same time as commit
> a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf is reverted.
Yes, that is fine. That commit just changes the default value of the parameter. If you specify it directly, you override the default.
Created attachment 142319 [details]
amdgpu.ppfeaturemask=0xffff3fff resume from suspend lockup
Alex,
So far it seems that I am not getting the gpu lockup I had previously with either amdgpu.ppfeaturemask=0xfffdbfff or amdgpu.ppfeaturemask=0xffff3fff. I tested again and I still get the gpu lockup without these options (this is without having any commit reverted).
I will try your patch and test that now.
NOTE: While using amdgpu.ppfeaturemask=0xffff3fff I had a graphics lockup on resume from suspend, with messages I don't remember seeing before. May or may not be related to the current issue.
NOTE2: I didn't mention it before, but I have a systemd hook which disables C6 states prior to suspend and enables them again after suspend. The reason for this is it fixes GPU lockups I have gotten on resume. (May be relevant to this current bug or could be unrelated, I'm not sure how the state of the CPU affects the GPU…).
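For readers following the two ppfeaturemask values, a small sketch of which bits each one clears may help. The arithmetic is exact; the bit names are an assumption, recalled from the PP_FEATURE_MASK enum in drivers/gpu/drm/amd/include/amd_shared.h for kernels of this era, and should be checked against the tree actually being tested. The helper names are hypothetical.

```python
# ASSUMPTION: bit names recalled from amd_shared.h (PP_FEATURE_MASK);
# verify against the kernel source you are running.
ASSUMED_PP_BITS = {
    0x4000: "PP_OVERDRIVE_MASK",
    0x8000: "PP_GFXOFF_MASK",
    0x20000: "PP_STUTTER_MODE",
}


def cleared_bits(mask, width=32):
    """Return the single-bit values that are 0 in a width-bit mask."""
    full = (1 << width) - 1
    off = full & ~mask
    return [1 << i for i in range(width) if off & (1 << i)]


def describe(mask):
    """Map each cleared bit to its assumed name, or to its hex value if unknown."""
    return [ASSUMED_PP_BITS.get(bit, hex(bit)) for bit in cleared_bits(mask)]


for mask in (0xFFFDBFFF, 0xFFFF3FFF):
    print(hex(mask), "clears", describe(mask))
```

Under those assumed names, 0xfffdbfff turns off stutter mode while leaving gfxoff enabled, and 0xffff3fff turns off gfxoff while leaving stutter mode enabled. Both values also clear bit 0x4000.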
Wanted to make sure a few things were clear from my last message:
"I tested again and I still get the gpu lockup without these options (this is without having any commit reverted)."
- Since I was not getting a lockup under either of the cmdline options, I went back and made sure I could still easily trigger the lockup with no command line options. I was able to trigger the lockup with no command line options.
The systemd hook:
- This hook is not related to any regression to my knowledge as I've used it since shortly after I got this laptop, so all of my testing has always had this suspend hook.
Alex,
The conditions in your patch never get triggered. I added some print statements in there, and gfx_v9_0_init_rlc_ext_microcode(adev) runs, but `adev->powerplay.pp_feature &= ~PP_GFXOFF_MASK` never runs, because my rlc_firmware_version is 531 but rlc_feature_version is 1, so `((adev->gfx.rlc_fw_version == 531) && (adev->gfx.rlc_feature_version < 1)) {` is false.
Created attachment 142330 [details]
amdgpu.ppfeaturemask=0xfffdbfff lockup during normal usage
I did more testing on amdgpu.ppfeaturemask=0xfffdbfff, and I just got the gpu lockup while browsing the web. I am going to do further testing of the other command line option to see if that will trigger any freezes during more extended usage. But it looks like the bug is confirmed to still occur with 0xfffdbfff.
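The version check quoted earlier in the thread can be reproduced outside the kernel to see why it never fires on this machine. The sketch below parses the RLC line from the amdgpu_firmware_info output pasted above and evaluates the same comparison. The function names are hypothetical, and the condition is copied from the comment, not from the attached patch itself.

```python
import re

# Matches one line of amdgpu_firmware_info as pasted in this thread, e.g.
# "RLC feature version: 1, firmware version: 0x00000213"
FW_LINE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_fw_line(line):
    """Return (name, feature_version, firmware_version) as (str, int, int)."""
    match = FW_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    return match.group("name"), int(match.group("feature")), int(match.group("fw"), 16)


def patch_condition(fw_version, feature_version):
    """Hypothetical mirror of the condition quoted in the thread:
    (rlc_fw_version == 531) && (rlc_feature_version < 1)."""
    return fw_version == 531 and feature_version < 1


name, feature, fw = parse_fw_line("RLC feature version: 1, firmware version: 0x00000213")
print(name, feature, fw, patch_condition(fw, feature))  # prints: RLC 1 531 False
```

Firmware version 0x213 is 531 in decimal, so the first half of the condition holds, but feature version 1 is not less than 1, so the whole expression is false and gfxoff stays enabled.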
I am not sure whether this is the same issue, but I get freezes within ~1 hour of uptime with the latest firmware (18.40) on a Ryzen 2500U (Lenovo E485) with 4.20-rc3. I will attach a dmesg log of such a freeze. I had not seen freezes for weeks, but they reappeared after upgrading the firmware from 18.30 to 18.40. Moving between different kernel/mesa/xorg driver versions did not help. I am now back on the 18.30 firmware and did not see a freeze within a full work day.
Created attachment 142527 [details]
dmesg 4.20-rc3 freeze on firmware 18.40
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/579.
Created attachment 142287 [details]
dmesg log
My system was stable on Linus's branch until he pulled in a bunch of DRM changes with this merge commit 53b3b6bbfde6. This pull causes the system to freeze just by using the desktop.
Oct 30 15:31:11 kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Oct 30 15:31:11 kernel: [drm] GPU recovery disabled.
Oct 30 15:31:22 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=61066, emitted seq=61068
Oct 30 15:31:22 kernel: [drm] GPU recovery disabled.
Full dmesg attached.